Architecture¶

Pipeline Overview¶

Microphone (48kHz stereo)
    |
    v
MicCaptureComponent
    |  (float32 → mono downmix → resample 48k→16k → int16)
    |  (software mic gain applied here)
    v
UnifiedAudioBuffer (30-second ring buffer, int16, 16kHz)
    |
    +──> AbolethVADComponent (Silero LSTM streaming inference)
    |        |
    |        |  512-sample windows (32ms at 16kHz)
    |        |  LSTM state persists across calls
    |        |
    |        |  Onset:  probability > threshold for min_speech_duration_ms
    |        |  Offset: probability < threshold for min_silence_duration_ms
    |        |
    |        v
    |    [Speech segment boundaries identified]
    |
    +──> ExtractRange() pulls audio from ring buffer
    |    (with speech_pad_ms reach-back for context)
    |
    +──> [If Streaming Enabled] Growing-window passes every N ms
    |    Local Agreement (n=2) confirms words across consecutive passes
    |    OnTranscriptionUpdated fires with committed + tentative text
    |
    v
WhisperWorker (background thread)
    |
    |  whisper_full() with integrated Silero VAD
    |  GPU inference via CUDA or Vulkan
    |  Optional beam search decoding (final pass)
    |
    v
Transcribed text → Game thread queue → OnUtteranceProcessed delegate

Silero VAD¶

Voice activity detection is handled entirely by Silero VAD, a neural network (LSTM) that runs through whisper.cpp's GGML inference engine. There is no RMS pre-filter for speech detection — Silero is the sole speech detector.

RMS is for display only

RMS levels are computed in MicCaptureComponent and exposed via FAudioLevelInfo for display and metering purposes only (e.g. driving a voice activity meter in your UI). They do not gate or influence VAD decisions.

The Silero model uses a streaming LSTM architecture:

512-sample windows (32ms at 16kHz), ~31 probability readings per second
LSTM hidden/cell state persists across calls, building temporal context
State reset on speech offset so the next utterance starts clean
SoftReset() clears duration counters but preserves LSTM state for warm onset detection between utterances

Pipeline State Machine¶

Uninitialized  ──>  Idle              (system loaded)
Idle           ──>  Accumulating      (Silero onset confirmed)
Accumulating   ──>  Processing        (silence detected, segment extracted)
Accumulating   ──>  Idle              (segment discarded — ring buffer overrun)
Processing     ──>  Idle              (transcription complete)
Processing     ──>  Error             (timeout / exception)
Error          ──>  Idle              (recovery)
Any            ──>  Uninitialized     (system unload)

State	Description
`Uninitialized`	System not yet initialized or model not loaded
`Idle`	System loaded, microphone active, listening for speech
`SpeechDetected`	Silero VAD has detected potential speech onset, accumulating evidence
`Accumulating`	Confirmed speech, actively accumulating audio into current segment
`Processing`	Audio segment being processed by Whisper on a background thread
`Error`	Recoverable error; can be reset to Idle

Transitions fire OnPipelineStateChanged for UI/logic binding. The state machine validates all transitions — invalid transitions are rejected and logged.