Architecture

Pipeline Overview

Microphone (48kHz stereo)
    |
    v
MicCaptureComponent
    |  (float32 → mono downmix → resample 48k→16k → int16)
    |  (software mic gain applied here)
    v
UnifiedAudioBuffer (30-second ring buffer, int16, 16kHz)
    |
    +──> AbolethVADComponent (Silero LSTM streaming inference)
    |        |
    |        |  512-sample windows (32ms at 16kHz)
    |        |  LSTM state persists across calls
    |        |
    |        |  Onset:  probability > threshold for min_speech_duration_ms
    |        |  Offset: probability < threshold for min_silence_duration_ms
    |        |
    |        v
    |    [Speech segment boundaries identified]
    |
    +──> ExtractRange() pulls audio from ring buffer
    |    (with speech_pad_ms reach-back for context)
    |
    +──> [If Streaming Enabled] Growing-window passes every N ms
    |    Local Agreement (n=2) confirms words across consecutive passes
    |    OnTranscriptionUpdated fires with committed + tentative text
    |
    v
WhisperWorker (background thread)
    |
    |  whisper_full() with integrated Silero VAD
    |  GPU inference via CUDA or Vulkan
    |  Optional beam search decoding (final pass)
    |
    v
Transcribed text → Game thread queue → OnUtteranceProcessed delegate
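
The Local Agreement (n=2) step in the streaming branch can be sketched as follows. This is a minimal illustration of the general technique (commit the longest common prefix of two consecutive passes), not the plugin's implementation. `LocalAgreement2` and `StreamingTranscript` are hypothetical names, and the sketch assumes each pass yields a plain list of words.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the committed + tentative split that an
// OnTranscriptionUpdated-style callback would report.
struct StreamingTranscript {
    std::vector<std::string> committed;  // confirmed by two consecutive passes
    std::vector<std::string> tentative;  // tail of the latest pass, may still change
};

// Hypothetical helper illustrating Local Agreement with n = 2.
class LocalAgreement2 {
public:
    // Feed the full word list produced by the latest growing-window pass.
    StreamingTranscript Update(const std::vector<std::string>& current) {
        // Length of the longest common prefix of the previous and current pass.
        std::size_t agree = 0;
        while (agree < previous_.size() && agree < current.size() &&
               previous_[agree] == current[agree]) {
            ++agree;
        }

        // Commit any newly agreed words. Committed text never shrinks,
        // even if a later pass disagrees with it.
        for (std::size_t i = committed_.size(); i < agree; ++i) {
            committed_.push_back(current[i]);
        }

        StreamingTranscript out;
        out.committed = committed_;
        // Everything past the committed prefix is reported as tentative.
        for (std::size_t i = committed_.size(); i < current.size(); ++i) {
            out.tentative.push_back(current[i]);
        }

        previous_ = current;
        return out;
    }

private:
    std::vector<std::string> previous_;   // words from the previous pass
    std::vector<std::string> committed_;  // words confirmed so far
};
```

Feeding the passes `{"open", "the"}` and then `{"open", "the", "door"}` commits "open the" on the second call and leaves "door" tentative until a further pass repeats it.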

Silero VAD

Voice activity detection is handled entirely by Silero VAD, a neural network (LSTM) that runs through whisper.cpp's GGML inference engine. There is no RMS pre-filter for speech detection — Silero is the sole speech detector.

RMS is for display only

RMS levels are computed in MicCaptureComponent and exposed via FAudioLevelInfo for display and metering purposes only (e.g. driving a voice activity meter in your UI). They do not gate or influence VAD decisions.
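
As an illustration of the metering path, the sketch below computes an RMS level over one block of int16 samples and converts it to dBFS. `LevelInfoSketch`, `ComputeLevel`, and the -96 dB floor are assumptions made for this example. The actual fields of FAudioLevelInfo, and the point in the capture chain where the plugin measures levels, are not described here.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the level data exposed to the UI.
struct LevelInfoSketch {
    float rms;   // linear level, 0.0 to 1.0
    float dbfs;  // decibels relative to full scale (0 dBFS = maximum)
};

// Assumed display floor so that digital silence maps to a finite value.
constexpr float kSilenceFloorDb = -96.0f;

// Computes the RMS of one block of 16-bit PCM samples.
LevelInfoSketch ComputeLevel(const std::int16_t* samples, std::size_t count) {
    if (samples == nullptr || count == 0) {
        return {0.0f, kSilenceFloorDb};
    }

    double sumSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        const double s = samples[i] / 32768.0;  // normalize to [-1, 1)
        sumSquares += s * s;
    }

    const float rms =
        static_cast<float>(std::sqrt(sumSquares / static_cast<double>(count)));
    float dbfs = rms > 0.0f ? 20.0f * std::log10(rms) : kSilenceFloorDb;
    if (dbfs < kSilenceFloorDb) {
        dbfs = kSilenceFloorDb;
    }
    return {rms, dbfs};
}
```

A value like this is suitable for driving a meter widget. Consistent with the note above, it plays no part in deciding whether speech is present.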

The Silero model uses a streaming LSTM architecture:

  • 512-sample windows (32ms at 16kHz), ~31 probability readings per second
  • LSTM hidden/cell state persists across calls, building temporal context
  • LSTM state is reset on speech offset so the next utterance starts clean
  • SoftReset() clears duration counters but preserves LSTM state for warm onset detection between utterances
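
The onset/offset rules from the pipeline diagram, combined with the 32 ms window size, can be sketched as a small duration-counting gate. This is a hedged illustration only: `VadGate`, `VadConfig`, `VadEvent`, and the default values are hypothetical, and the per-window probabilities are assumed to come from the Silero model, which is not modelled here.

```cpp
// Window arithmetic from the text: 512 samples at 16 kHz = 32 ms per reading.
constexpr int kSampleRateHz = 16000;
constexpr int kWindowSamples = 512;
constexpr int kWindowMs = kWindowSamples * 1000 / kSampleRateHz;  // 32

// Hypothetical configuration; the default values are assumptions.
struct VadConfig {
    float threshold = 0.5f;
    int minSpeechDurationMs = 64;
    int minSilenceDurationMs = 320;
};

enum class VadEvent { None, SpeechStart, SpeechEnd };

// Hypothetical gate that turns per-window probabilities into segment events.
class VadGate {
public:
    explicit VadGate(const VadConfig& cfg) : cfg_(cfg) {}

    // Call once per 512-sample window with the model's speech probability.
    VadEvent Push(float probability) {
        if (!inSpeech_) {
            if (probability > cfg_.threshold) {
                speechMs_ += kWindowMs;
                if (speechMs_ >= cfg_.minSpeechDurationMs) {
                    inSpeech_ = true;
                    silenceMs_ = 0;
                    return VadEvent::SpeechStart;  // onset confirmed
                }
            } else {
                speechMs_ = 0;  // run of speech windows was interrupted
            }
        } else {
            if (probability < cfg_.threshold) {
                silenceMs_ += kWindowMs;
                if (silenceMs_ >= cfg_.minSilenceDurationMs) {
                    inSpeech_ = false;
                    speechMs_ = 0;
                    silenceMs_ = 0;
                    return VadEvent::SpeechEnd;  // offset confirmed
                }
            } else {
                silenceMs_ = 0;  // speech resumed before the silence timeout
            }
        }
        return VadEvent::None;
    }

    // Mirrors the SoftReset() description: only the duration counters are
    // cleared. The LSTM state lives in the model and is left untouched.
    void SoftReset() {
        speechMs_ = 0;
        silenceMs_ = 0;
    }

    bool InSpeech() const { return inSpeech_; }

private:
    VadConfig cfg_;
    bool inSpeech_ = false;
    int speechMs_ = 0;
    int silenceMs_ = 0;
};
```

With a 64 ms minimum speech duration, two consecutive windows above the threshold are enough to confirm onset. A single high-probability window in the middle of a pause restarts the silence counter, which is what keeps short gaps inside one utterance.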

Pipeline State Machine

Uninitialized  ──>  Idle              (system loaded)
Idle           ──>  Accumulating      (Silero onset confirmed)
Accumulating   ──>  Processing        (silence detected, segment extracted)
Accumulating   ──>  Idle              (segment discarded — ring buffer overrun)
Processing     ──>  Idle              (transcription complete)
Processing     ──>  Error             (timeout / exception)
Error          ──>  Idle              (recovery)
Any            ──>  Uninitialized     (system unload)

State           Description
Uninitialized   System not yet initialized or model not loaded
Idle            System loaded, microphone active, listening for speech
SpeechDetected  Silero VAD has detected potential speech onset, accumulating evidence
Accumulating    Confirmed speech, actively accumulating audio into current segment
Processing      Audio segment being processed by Whisper on a background thread
Error           Recoverable error; can be reset to Idle

Transitions fire OnPipelineStateChanged for UI/logic binding. The state machine validates all transitions — invalid transitions are rejected and logged.
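
A transition validator of this kind can be sketched in plain C++ as shown below. The names `PipelineStateMachine`, `IsValidTransition`, and `SetListener` are hypothetical, and the listener stands in for the OnPipelineStateChanged delegate. The sketch encodes only the transitions listed above. The list shows no edges for SpeechDetected, so none are assumed here beyond the "Any" rule.

```cpp
#include <functional>
#include <iostream>
#include <utility>

enum class PipelineState {
    Uninitialized, Idle, SpeechDetected, Accumulating, Processing, Error
};

// Returns true only for transitions that appear in the list above.
bool IsValidTransition(PipelineState from, PipelineState to) {
    if (to == PipelineState::Uninitialized) {
        return true;  // Any -> Uninitialized (system unload)
    }
    switch (from) {
        case PipelineState::Uninitialized:
            return to == PipelineState::Idle;
        case PipelineState::Idle:
            return to == PipelineState::Accumulating;
        case PipelineState::Accumulating:
            return to == PipelineState::Processing || to == PipelineState::Idle;
        case PipelineState::Processing:
            return to == PipelineState::Idle || to == PipelineState::Error;
        case PipelineState::Error:
            return to == PipelineState::Idle;
        case PipelineState::SpeechDetected:
            return false;  // edges not listed in this document; none assumed
    }
    return false;
}

// Hypothetical state holder that rejects and logs invalid transitions.
class PipelineStateMachine {
public:
    using Listener = std::function<void(PipelineState, PipelineState)>;

    void SetListener(Listener listener) { listener_ = std::move(listener); }

    bool TransitionTo(PipelineState next) {
        if (!IsValidTransition(state_, next)) {
            std::cerr << "Rejected transition " << static_cast<int>(state_)
                      << " -> " << static_cast<int>(next) << "\n";
            return false;
        }
        const PipelineState previous = state_;
        state_ = next;
        if (listener_) {
            listener_(previous, next);  // stand-in for OnPipelineStateChanged
        }
        return true;
    }

    PipelineState State() const { return state_; }

private:
    PipelineState state_ = PipelineState::Uninitialized;
    Listener listener_;
};
```

Keeping the rules in one pure function makes the transition table easy to unit test independently of the audio pipeline.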