Minimal working config
Copyconfig.example.json to config.json, then replace its contents with the minimal config below. The example file contains every available option — useful as a reference, but noisy when learning. Start here instead.
config.json
Replace
YOUR_OPENAI_API_KEY with a real key before starting the server. Alternatively, set the OPENAI_API_KEY environment variable and omit the api_keys block entirely — Voxray falls back to environment variables automatically.Config field walkthrough
Server: host and port
host controls which network interface Voxray binds to. localhost means only processes on the same machine can connect — correct for local development. Change to 0.0.0.0 when you deploy to a server and need external traffic.
port defaults to 8080 in Voxray’s built-in defaults, but 3042 is the value in config.example.json and is used throughout this documentation. The browser client at web/index.html points to this port by default; if you change it, update the client’s server URL to match.
Transport: websocket to start simple
| Value | What it enables |
|---|---|
"websocket" | WebSocket only (/ws) — no CGO, no C compiler needed |
"smallwebrtc" | WebRTC only (/webrtc/offer) — requires CGO + gcc for Opus |
"both" | Both transports on the same HTTP server |
"websocket". It has no build dependencies, works in every browser, and makes debugging straightforward: you can inspect WebSocket frames directly in Chrome DevTools. Switch to "both" or "smallwebrtc" once your pipeline logic is correct and you want lower-latency audio delivery.
Provider selection: provider, stt_provider, llm_provider, tts_provider
provider is a fallback default. When Voxray resolves which provider to use for each pipeline stage, it checks the task-specific key first (stt_provider, llm_provider, tts_provider) and falls back to provider only if the task-specific key is absent or empty. This means you can set "provider": "openai" as a catch-all and then override individual stages selectively.
For example, to use Groq for STT (faster Whisper) but keep OpenAI for LLM and TTS:
provider → error.
Model selection: model, stt_model, tts_voice
model sets the LLM model. stt_model sets the transcription model; when omitted, Voxray uses the provider’s default. tts_voice selects the speaker voice — valid values depend on your TTS provider:
| Provider | Example voices |
|---|---|
| OpenAI | alloy, nova, echo, fable, onyx, shimmer |
| ElevenLabs | Use the voice ID from ElevenLabs dashboard |
| Sarvam | anushka, arvind, and other Sarvam voice names |
Voice names from Cloud TTS (e.g. en-US-Neural2-A) |
tts_model in this minimal config because OpenAI’s TTS API selects the model based on the voice. For providers that distinguish model from voice (Sarvam, ElevenLabs), add "tts_model": "..." alongside tts_voice.
API keys: api_keys
api_keys is a flat map from provider name to key. Voxray checks this map first, then falls back to environment variables using standard names (OPENAI_API_KEY, GROQ_API_KEY, etc.). You never need both; the map is convenient for local dev and environment variables are better for CI and production.
For multi-provider configs, list each provider’s key under its name:
Turn detection: turn_detection and turn_stop_secs
turn_detection controls when Voxray decides the user has finished speaking and sends the accumulated audio to STT. There are two values:
"none"— no automatic turn detection; the client signals turn end explicitly. Useful when building custom UI."silence"— silence gate: Voxray listens forturn_stop_secsseconds of continuous silence after speech activity, then closes the turn and flushes to STT.
turn_stop_secs: 3.0 means the agent waits for 3 seconds of silence before it considers the user done speaking. This is the most important tuning knob for conversational feel. Lower values (1.0–1.5s) feel snappier but cut off users who pause mid-sentence. Higher values (4–5s) feel more patient but add latency. Start at 3.0 and adjust based on your use case and user feedback.
VAD: vad_min_volume
vad_min_volume sets the minimum RMS amplitude a chunk must exceed before VAD considers it “speech” — chunks below this threshold are treated as silence regardless of content.
The practical effect: vad_min_volume is your noise floor gate. If the agent is triggering on air conditioning, keyboard clicks, or ambient room noise, raise vad_min_volume (try 0.3–0.4). If the agent is missing quiet speech or the user’s microphone has low gain, lower it (try 0.1–0.15).
The default in config.example.json is 0.25; 0.2 in this minimal config is slightly more sensitive, which works well for most laptop microphones in a quiet room.
Plugins: plugins
echo, frame_filter, wake_check_filter, stt_mute_filter, audio_filter, interruption_controller, external_chain, and rtvi. Each has its own options block under plugin_options. See the Extensions documentation for details.
Adding a system prompt
Voxray’s config file controls pipeline behavior — provider selection, transport, VAD parameters — but the system prompt lives in the LLM context, not inconfig.json. The LLM context is assembled per-session from the first message in the conversation context passed by the client.
When a client connects over WebSocket, it sends an initial context payload before audio begins. That payload includes a messages array. The first message in that array with role: "system" becomes the system prompt the LLM sees on every turn.
A minimal context payload sent by the client looks like:
web/index.html (or tests/frontend/webrtc-voice.html for WebRTC) sends this context automatically when you configure it. For custom clients, send the context message immediately after the WebSocket connection is established, before sending any audio frames.
Starting and testing your agent
Build the binary
From the repository root, build Voxray for WebSocket-only use (no CGO required):On Windows:If you have
make available, make build does the same thing.Start the server
listening on :3042. If the port is already in use, change port in config.json to another value (e.g. 3043) and restart.Open the browser client
Open Alternatively, use the WebRTC test client (requires a voice build):Set the server URL field in the client to
web/index.html directly in your browser (drag the file into a browser tab, or use a local file server):ws://localhost:3042/ws.Say hello
Click the microphone button to begin a session. Wait for the connection indicator to turn green, then say “Hello” clearly.Pause for at least
turn_stop_secs (3 seconds) after speaking. Voxray’s silence gate needs that pause to detect end of turn and dispatch your audio to STT.The agent should respond with audio within 2–3 seconds of the silence gate triggering.Verification checklist
Common first-time issues
Agent responds with a robotic or wrong voice Yourtts_voice value does not match a valid voice ID for your TTS provider. Each provider has its own voice name format — OpenAI uses short names like alloy and nova, ElevenLabs uses UUIDs or slug names from your dashboard, and Sarvam uses names like anushka. Check the provider’s documentation for the exact list of valid voice identifiers and update tts_voice to match.
Agent does not respond after you speak
The most common cause is the silence gate not firing. Make sure turn_detection is set to "silence" and that you paused speaking for the full turn_stop_secs duration (3 seconds by default). The gate requires a continuous unbroken silence — if your environment has frequent ambient noise bursts (fans, typing), the gate keeps resetting. Try raising vad_min_volume to 0.3 or 0.4 so low-amplitude background noise does not count as speech.
If you are certain you paused long enough, check that the WebSocket connection is established (green indicator in the client) and that the server is receiving audio frames (look for vad lines in server logs).
No audio from the agent
Two possible causes:
- Browser speaker permission denied — open browser Settings and verify microphone and speaker permissions are granted for the page. Some browsers also require a user gesture (a click) before playing audio; the microphone button in the client satisfies this.
-
TTS provider API key is invalid or missing — check the server log for a TTS error line. If the API key in
api_keys.openaiis wrong, the TTS call will return a 401 and Voxray logs the error. Fix the key and restart the server.
config.json has a JSON syntax error. Run cat config.json | python3 -m json.tool or paste it into jsonlint.com to find the offending line.
Port 3042 already in use
Another process is bound to that port. Either stop it (lsof -i :3042 on macOS/Linux) or change port in your config to a free port.
Next steps
Core Concepts
Understand how the STT → LLM → TTS pipeline works, what frames are, and how turn detection and VAD fit together.
Tutorials
Step-by-step tutorials from a zero-dependency echo bot to a production multi-provider pipeline with telephony.
Deployment
Run Voxray in production: Docker, environment variables, TLS termination, Prometheus metrics, and horizontal scaling.
Integrations
Connect Voxray to telephony (Twilio, Telnyx, Plivo), WebRTC, Daily.co, and external LLM chains.