Skip to content

Tuning Guide

Practical solutions for common speech detection issues. Each section describes a symptom, explains why it happens, and gives specific settings to fix it.

Start with defaults

The default settings work well for a typical desktop microphone in a quiet room. Only adjust these if you are experiencing a specific problem.


Too much background noise triggers detection

Symptom: The system fires transcription events when nobody is speaking — fans, music, keyboard clicks, or ambient noise cause false positives.

Why: The VAD threshold is too low for the noise floor in the environment.

Fix:

  • Increase threshold from 0.5 to 0.6 -- 0.7. This requires higher neural network confidence before onset is triggered.
  • Increase min_speech_duration_ms from 250 to 400 -- 600. This forces the VAD to see sustained speech-like audio before committing, filtering out brief noise bursts.
UAbolethSTTSubsystem* STT = UAbolethSTTSubsystem::GetSTTSubsystem(this);
STT->SetVADThreshold(0.7f);

Call Set VAD Threshold on the subsystem with a value of 0.7.

[/Script/AbolethSTT.WhisperDeveloperSettings]
threshold=0.7
min_speech_duration_ms=500

It cuts off my sentences

Symptom: The system triggers offset (end of speech) too early. Words at the end of sentences are lost, especially during natural pauses mid-thought.

Why: The min_silence_duration_ms is too short. Even a brief pause between words or clauses causes the VAD to finalize the segment.

Fix:

  • Increase min_silence_duration_ms from 100 to 300 -- 500. This gives the speaker more room to pause without ending the segment.
UAbolethSTTSubsystem* STT = UAbolethSTTSubsystem::GetSTTSubsystem(this);
// Allow up to 400ms of silence before finalizing
STT->ReloadVADSettings(); // after updating the developer settings object
[/Script/AbolethSTT.WhisperDeveloperSettings]
min_silence_duration_ms=400

Warning

Setting min_silence_duration_ms above 800ms can make the system feel sluggish — the player has to wait a noticeable amount of time after finishing a sentence before the transcription appears. Values between 300 and 500 are a good balance.


Short words like "yes" or "no" get dropped

Symptom: Very short utterances are ignored. The player says "yes" and nothing happens.

Why: The min_speech_duration_ms is longer than the utterance itself. A quick "yes" might only be 150 -- 200ms of voiced audio, which falls below the default 250ms onset threshold.

Fix:

  • Lower min_speech_duration_ms from 250 to 100 -- 150. This allows shorter bursts of speech to register.
  • Optionally lower threshold to 0.4 if the short words are also quiet.
[/Script/AbolethSTT.WhisperDeveloperSettings]
threshold=0.4
min_speech_duration_ms=100

Trade-off

Lowering both threshold and min_speech_duration_ms makes the system more sensitive to all sounds, not just speech. If you are in a noisy environment, you may need to compensate by increasing InputGainDb instead of lowering the threshold — a louder speech signal is easier for Silero to distinguish from noise.


My microphone is too quiet

Symptom: The system detects speech inconsistently or not at all. The audio level meter shows very low values. Quiet speakers or low-gain microphones are particularly affected.

Why: The raw microphone signal is too weak for the VAD to confidently distinguish speech from silence.

Fix:

  • Increase InputGainDb. Start with 6.0 (approximately 2x amplification). If still too quiet, try 12.0 (approximately 4x).
UAbolethSTTSubsystem* STT = UAbolethSTTSubsystem::GetSTTSubsystem(this);
STT->SetMicGainDb(6.0f);

Call Set Mic Gain Db on the subsystem with a value of 6.0.

[/Script/AbolethSTT.WhisperDeveloperSettings]
InputGainDb=6.0

Gain before threshold

Always try increasing gain before lowering the VAD threshold. Amplifying the signal raises speech above the noise floor, giving the VAD a cleaner decision boundary. Lowering the threshold instead makes the VAD more sensitive to everything, including noise.


Streaming text appears too slowly

Symptom: When streaming is enabled, the live transcription updates feel laggy. There is a noticeable delay between speaking and seeing partial results.

Why: The StreamingChunkIntervalMs determines how often a streaming inference pass runs. The default of 1000ms means partial results update at most once per second.

Fix:

  • Lower StreamingChunkIntervalMs from 1000 to 500. This doubles the update rate.
  • Going below 300ms is possible but increases GPU utilization. On lower-end hardware, this can cause frame drops.
UAbolethSTTSubsystem* STT = UAbolethSTTSubsystem::GetSTTSubsystem(this);
STT->SetStreamingChunkIntervalMs(500);

Call Set Streaming Chunk Interval Ms on the subsystem with a value of 500.

[/Script/AbolethSTT.WhisperDeveloperSettings]
StreamingChunkIntervalMs=500

GPU budget

Each streaming pass runs a full Whisper encoder + decoder inference on the accumulated audio. At 500ms intervals, you get two passes per second. Monitor your frame times with Debug & Profiling to ensure the GPU budget stays comfortable for your target hardware.


These are starting points. Adjust to taste based on your hardware and environment.

Setting Value
threshold 0.5
min_speech_duration_ms 250
min_silence_duration_ms 100
InputGainDb 0.0
StreamingChunkIntervalMs 1000
Setting Value
threshold 0.7
min_speech_duration_ms 500
min_silence_duration_ms 200
InputGainDb 0.0
StreamingChunkIntervalMs 1000
Setting Value
threshold 0.4
min_speech_duration_ms 100
min_silence_duration_ms 300
InputGainDb 6.0
StreamingChunkIntervalMs 500