# Orchestration Models

On top of STT, LLM, and TTS, VoiceHub runs real-time orchestration models that enhance the conversation quality — making it feel fast, human, and responsive.

These models are part of what we call the orchestration layer.

### **Overview of Active Models**

* Endpointing – Detects when the user has stopped speaking using an audio-text fusion model
* Interruptions – Lets users interrupt agents mid-speech in a natural way
* Background Noise Filtering – Removes ambient sounds like traffic or music
* Background Voice Filtering – Ignores irrelevant speech like TVs, echoes, or other people nearby
* Backchanneling – Adds cues like “yeah” or “got it” to keep the user engaged
* Emotion Detection – Detects the user's emotion and passes it to the LLM for empathetic handling
* Filler Injection – Makes responses more natural with human-like fillers (“um”, “so”, “like”)

### **Endpointing**

Most systems rely on silence timeouts to know when to respond — but that creates delay. VoiceHub uses a fusion audio-text model that listens for the user’s tone and words to determine when to respond. This enables sub-second handoffs without cutting people off.

### **Interruptions (Barge-In)**

Our interruption model determines whether a user interjection is intended to cancel the assistant ("stop, wait, that’s not right") versus affirming the assistant ("yeah, ok"). It updates the context so the LLM knows what was missed.

### **Background Noise Filtering**

Environmental sounds like horns, music, or machinery can confuse voice models. VoiceHub uses real-time filtering to clean the audio stream before it reaches the STT engine — preserving accuracy without adding latency.

### **Background Voice Filtering**

Unlike noise, background speech can be dangerous — such as a nearby TV or someone else talking. VoiceHub isolates the primary speaker and filters out overlapping voices to maintain transcription integrity.

### **Backchanneling**

Small affirmations like “uh-huh,” “yep,” or “oh no” help make AI sound more attentive. VoiceHub injects these at the right moment without interrupting the user's thought.

### **Emotion Detection**

VoiceHub analyzes vocal tone and inflection to detect emotions like frustration or confusion. That data is passed to the LLM so it can respond with more empathy and adjust behavior dynamically.

### **Filler Injection**

LLMs often sound too robotic. Instead of asking users to prompt for informal tone, we use a filler injection model to make output sound more human by adding subtle interjections in real time.

Together, these orchestration models make VoiceHub agents sound not just smart — but human.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://dataqueue.gitbook.io/voicehub-docs/getting-started/orchestration-models.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
