Orchestration Models
On top of STT, LLM, and TTS, VoiceHub runs real-time orchestration models that enhance conversation quality, making interactions feel fast, human, and responsive.
These models are part of what we call the orchestration layer.
Overview of Active Models
Endpointing – Detects when the user has stopped speaking using an audio-text fusion model
Interruptions – Lets users interrupt agents mid-speech in a natural way
Background Noise Filtering – Removes ambient sounds like traffic or music
Background Voice Filtering – Ignores irrelevant speech like TVs, echoes, or other people nearby
Backchanneling – Adds cues like “yeah” or “got it” to keep the user engaged
Emotion Detection – Detects the user's emotion and passes it to the LLM for empathetic handling
Filler Injection – Makes responses more natural with human-like fillers (“um”, “so”, “like”)
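As a rough illustration of how these models fit together, the sketch below shows one way they might be toggled in a single agent configuration. This is an assumption made for the example, not VoiceHub's actual configuration surface; the OrchestrationConfig type and every field name are illustrative.

```typescript
// Hypothetical shape of an orchestration-layer configuration.
// Field names are illustrative, not VoiceHub's actual API.
interface OrchestrationConfig {
  endpointing: { mode: "audio-text-fusion" | "silence-timeout"; silenceFallbackMs: number };
  interruptions: { enabled: boolean; ignoreAffirmations: boolean };
  backgroundNoiseFiltering: boolean;
  backgroundVoiceFiltering: boolean;
  backchanneling: { enabled: boolean; cues: string[] };
  emotionDetection: boolean;
  fillerInjection: { enabled: boolean; intensity: "low" | "medium" | "high" };
}

const orchestration: OrchestrationConfig = {
  endpointing: { mode: "audio-text-fusion", silenceFallbackMs: 1500 },
  interruptions: { enabled: true, ignoreAffirmations: true },
  backgroundNoiseFiltering: true,
  backgroundVoiceFiltering: true,
  backchanneling: { enabled: true, cues: ["yeah", "got it", "uh-huh"] },
  emotionDetection: true,
  fillerInjection: { enabled: true, intensity: "low" },
};
```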
Endpointing
Most systems rely on silence timeouts to decide when to respond, but that creates delay. VoiceHub uses an audio-text fusion model that weighs both the user's tone and their words to determine when to respond. This enables sub-second handoffs without cutting people off.
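To make the idea concrete, here is a minimal sketch of fusing an acoustic cue with a lexical one to decide end-of-turn. The score names, weights, thresholds, and fallback timeout are assumptions for illustration, not VoiceHub internals.

```typescript
interface TurnSignals {
  silenceMs: number;          // how long the user has currently been silent
  acousticEndScore: number;   // 0..1: prosody suggests the user is done speaking
  lexicalEndScore: number;    // 0..1: transcript reads like a complete utterance
}

// Fuse audio and text cues instead of waiting out a fixed silence timeout.
function shouldRespond(s: TurnSignals): boolean {
  const fused = 0.5 * s.acousticEndScore + 0.5 * s.lexicalEndScore;
  if (fused > 0.85 && s.silenceMs > 200) return true; // confident end-of-turn: respond fast
  return s.silenceMs > 1500;                          // fallback: classic silence timeout
}

console.log(shouldRespond({ silenceMs: 250, acousticEndScore: 0.9, lexicalEndScore: 0.95 })); // true
```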
Interruptions (Barge-In)
Our interruption model determines whether a user interjection is intended to cancel the assistant ("stop, wait, that’s not right") or simply affirm it ("yeah, ok"). When the user does interrupt, the model updates the conversation context so the LLM knows which part of its response went unheard.
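The sketch below shows one way this decision could be wired up, assuming a classifier that labels an interjection as a barge-in or an affirmation and a context update that records how much of the reply was actually spoken. The function names, word list, and message shapes are illustrative assumptions; a production classifier would be a trained model, not keyword matching.

```typescript
type InterjectionKind = "barge-in" | "affirmation";

// Toy classifier: stands in for a trained interruption model.
function classifyInterjection(text: string): InterjectionKind {
  const affirmations = ["yeah", "ok", "uh-huh", "right", "got it"];
  return affirmations.includes(text.trim().toLowerCase()) ? "affirmation" : "barge-in";
}

interface Turn { role: "assistant" | "user"; content: string }

// On a barge-in, stop speaking and record what the user actually heard.
function handleInterjection(
  history: Turn[],
  spokenSoFar: string,
  plannedReply: string,
  userText: string,
): Turn[] {
  if (classifyInterjection(userText) === "affirmation") {
    return history; // let the assistant keep talking
  }
  const unspoken = plannedReply.slice(spokenSoFar.length);
  return [
    ...history,
    { role: "assistant", content: `${spokenSoFar} [interrupted; unspoken: ${unspoken}]` },
    { role: "user", content: userText },
  ];
}
```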
Background Noise Filtering
Environmental sounds like horns, music, or machinery can confuse voice models. VoiceHub uses real-time filtering to clean the audio stream before it reaches the STT engine — preserving accuracy without adding latency.
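As a sketch of where the filter sits in the pipeline, the code below passes each audio frame through a denoising stage before transcription. The denoise and transcribe functions are placeholders for whatever noise-suppression model and STT engine are actually in use.

```typescript
type AudioFrame = Float32Array;

// Placeholder denoiser: a real implementation would run a trained suppression model.
function denoise(frame: AudioFrame): AudioFrame {
  return frame.map((sample) => (Math.abs(sample) < 0.02 ? 0 : sample));
}

// Placeholder STT call standing in for the real engine.
async function transcribe(frame: AudioFrame): Promise<string> {
  return `[transcript of ${frame.length} samples]`;
}

// Clean the stream before it ever reaches the STT engine.
async function processFrame(frame: AudioFrame): Promise<string> {
  return transcribe(denoise(frame));
}
```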
Background Voice Filtering
Unlike ambient noise, background speech can corrupt the conversation itself, for example a nearby TV, an echo of the agent, or another person talking. VoiceHub isolates the primary speaker and filters out overlapping voices to maintain transcription integrity.
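A minimal sketch of speaker isolation, assuming an upstream stage that emits speech segments with per-speaker voice embeddings. The segment shape, the cosine-similarity comparison, and the threshold are assumptions for illustration.

```typescript
interface SpeechSegment {
  text: string;
  speakerEmbedding: number[]; // voice fingerprint produced upstream (assumed)
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Keep only segments that sound like the primary speaker; drop TVs, bystanders, echoes.
function filterToPrimarySpeaker(
  segments: SpeechSegment[],
  primaryEmbedding: number[],
  threshold = 0.8,
): SpeechSegment[] {
  return segments.filter(
    (seg) => cosineSimilarity(seg.speakerEmbedding, primaryEmbedding) >= threshold,
  );
}
```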
Backchanneling
Small affirmations like “uh-huh,” “yep,” or “oh no” make the agent sound more attentive. VoiceHub injects these cues at the right moment without interrupting the user's train of thought.
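One way to think about the timing decision: emit a cue only when the user pauses briefly but is predicted to keep talking. The continuation probability, pause window, and cooldown below are illustrative assumptions, not VoiceHub's actual logic.

```typescript
interface BackchannelSignals {
  userPauseMs: number;             // length of a short pause inside the user's turn
  continuationProbability: number; // 0..1: model predicts the user will keep talking
  msSinceLastCue: number;          // cooldown so cues do not become repetitive
}

const CUES = ["uh-huh", "yeah", "got it"];

// Emit a short cue only mid-thought, never at a likely end-of-turn.
function maybeBackchannel(s: BackchannelSignals): string | null {
  const midThoughtPause = s.userPauseMs > 300 && s.userPauseMs < 900;
  if (midThoughtPause && s.continuationProbability > 0.7 && s.msSinceLastCue > 5000) {
    return CUES[Math.floor(Math.random() * CUES.length)];
  }
  return null;
}
```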
Emotion Detection
VoiceHub analyzes vocal tone and inflection to detect emotions like frustration or confusion. That data is passed to the LLM so it can respond with more empathy and adjust behavior dynamically.
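The sketch below shows one way a detected emotion could be surfaced to the LLM: fold it into the conversation context as a system-level hint before the user's message. The emotion labels and message shape are assumptions made for this example.

```typescript
type Emotion = "neutral" | "frustrated" | "confused" | "happy";

interface ChatMessage { role: "system" | "user" | "assistant"; content: string }

// Attach the detected emotion to the context so the LLM can adapt its tone.
function withEmotionHint(history: ChatMessage[], userText: string, emotion: Emotion): ChatMessage[] {
  const hint: ChatMessage = {
    role: "system",
    content: `The caller currently sounds ${emotion}. Acknowledge this and adjust your tone accordingly.`,
  };
  return [...history, hint, { role: "user", content: userText }];
}
```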
Filler Injection
LLMs often sound too robotic when read aloud. Instead of relying on prompt instructions to force an informal tone, we use a filler injection model that adds subtle interjections in real time, making the output sound more human.
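As a rough sketch of the idea, the function below occasionally prepends a filler to a reply before it is sent to TTS. The filler list, probability, and placement rule are illustrative; a real injection model would choose fillers contextually rather than at random.

```typescript
const FILLERS = ["um", "so", "well", "right"];

// Occasionally prepend a filler so the spoken reply sounds less scripted.
function injectFiller(reply: string, probability = 0.3): string {
  if (Math.random() >= probability) return reply;
  const filler = FILLERS[Math.floor(Math.random() * FILLERS.length)];
  return `${filler.charAt(0).toUpperCase() + filler.slice(1)}, ${reply}`;
}

console.log(injectFiller("I can move that appointment to Thursday."));
// e.g. "Well, I can move that appointment to Thursday." or the original reply unchanged
```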
Together, these orchestration models make VoiceHub agents sound not just smart — but human.