Overview
This tutorial walks through standing up a production-quality voice agent with the best available providers for each pipeline stage:

| Stage | Provider | Model | Why |
|---|---|---|---|
| Speech-to-Text | Groq | whisper-large-v3-turbo | Lowest latency STT on the market; generous free tier |
| Language Model | Anthropic | claude-haiku-4-5-20251001 | Best reasoning quality; fast enough for real-time interaction |
| Text-to-Speech | ElevenLabs | Custom voice ID | Most natural, expressive voice synthesis available |
All of these settings live in config.json. No code changes are required to switch providers or tune VAD parameters.
Prerequisites
Collect your API keys
You need three API keys before proceeding:
- Groq (console.groq.com): create an API key (gsk_...)
- Anthropic (console.anthropic.com): create an API key (sk-ant-...)
- ElevenLabs (elevenlabs.io): create an API key and note a Voice ID from your voice library
Voxray checks api_keys in config first, then falls back to environment variables (GROQ_API_KEY, ANTHROPIC_API_KEY, ELEVENLABS_API_KEY). Using environment variables is recommended for production deployments.
Full Production Config
Create config.json at the repository root (or pass --config /path/to/config.json). The file below is the complete production configuration with all relevant fields:
Replace <elevenlabs-voice-id> with the ID from your ElevenLabs voice library (e.g. 21m00Tcm4TlvDq8ikWAM). In production, move all secret values out of config and into environment variables or a secrets manager.
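The lookup order described under Prerequisites (api_keys in config first, then environment variables) can be sketched as follows. This is an illustration only, not Voxray's code: the function name resolve_api_key is hypothetical, and the shape of api_keys (a flat object keyed by lowercase provider name) is an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variable names as listed in the Prerequisites section.
ENV_VARS = {
    "groq": "GROQ_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
}


def resolve_api_key(
    provider: str,
    config: Mapping,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the key for a provider: config api_keys first, then the environment."""
    env = os.environ if env is None else env
    # Assumption: api_keys is a flat object keyed by lowercase provider name.
    key = (config.get("api_keys") or {}).get(provider)
    if key:
        return key
    # Fall back to the environment; an unset or empty variable yields None.
    return env.get(ENV_VARS[provider]) or None
```

With this ordering, a production config can leave api_keys empty and let every key come from the environment.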
Configuration Deep Dive
Provider Strategy
model sets the LLM model; stt_model sets the STT model. The TTS voice is set by tts_voice.
To switch to Claude Sonnet for higher reasoning quality at the cost of slightly more latency, change "model" to "claude-sonnet-4-5-20250929". No other config changes are required.
Interruptions
When allow_interruptions is true, the user can barge in while the bot is speaking. The interruption_strategy controls when an interruption is declared:
| Strategy | Behaviour |
|---|---|
| min_words | User must say at least min_words words before the interruption fires |
| keyword | Interruption fires only when a specific keyword is detected |
min_words: 3 is the recommended production value. It prevents transient background noise or brief vocalizations (“uh”, “mm”) from cutting off the bot mid-sentence. The user must produce a coherent multi-word utterance before Voxray cancels the current TTS and routes the new transcription to the LLM.
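The two strategies can be sketched as a single decision function. This is a minimal illustration, not Voxray's implementation: the function should_interrupt and its keyword parameter are hypothetical, and counting words by whitespace splitting is an assumption.

```python
import string


def should_interrupt(transcript, strategy="min_words", min_words=3, keyword=None):
    """Decide whether a partial user transcript should cut off the bot."""
    # Assumption: words are whitespace-separated, compared without punctuation or case.
    words = [w.strip(string.punctuation).lower() for w in transcript.split()]
    words = [w for w in words if w]
    if strategy == "min_words":
        return len(words) >= min_words
    if strategy == "keyword":
        return keyword is not None and keyword.lower() in words
    raise ValueError(f"unknown interruption_strategy: {strategy!r}")
```

With min_words set to 3, a lone "uh" or "mm hmm" never reaches the threshold, while "wait, stop that" does.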
Turn Detection and VAD Tuning
| Parameter | Production Value | Effect |
|---|---|---|
| turn_detection | "silence" | Full utterance is collected before LLM is invoked |
| turn_stop_secs | 2.5 | Silence of 2.5 s after speech ends the turn |
| vad_min_volume | 0.25 | Minimum RMS volume to treat audio as speech |
| vad_confidence | 0.75 | VAD model confidence threshold |
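To see how the three numeric parameters interact, here is a simplified silence-based turn detector. It is a sketch under stated assumptions, not Voxray's VAD: it assumes each audio frame arrives with an RMS volume and a VAD confidence score, and the names VadParams and find_turn_end are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class VadParams:
    # Defaults mirror the production values in the table above.
    turn_stop_secs: float = 2.5
    vad_min_volume: float = 0.25
    vad_confidence: float = 0.75


def find_turn_end(
    frames: Iterable[Tuple[float, float]],
    frame_secs: float,
    params: VadParams = VadParams(),
) -> Optional[int]:
    """Return the index of the frame where the turn ends, or None.

    Each frame is a (rms_volume, confidence) pair.
    """
    heard_speech = False
    silence = 0.0
    for i, (volume, confidence) in enumerate(frames):
        # A frame counts as speech only if it clears both thresholds.
        is_speech = (
            volume >= params.vad_min_volume
            and confidence >= params.vad_confidence
        )
        if is_speech:
            heard_speech = True
            silence = 0.0
        elif heard_speech:
            # Silence only counts toward ending a turn after speech was heard.
            silence += frame_secs
            if silence >= params.turn_stop_secs:
                return i
    return None
```

Raising either threshold makes fewer frames count as speech, while turn_stop_secs alone decides how long the detector waits before handing the utterance on.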
- Noisy environments (open offices, call centres): raise vad_min_volume to 0.3 and vad_confidence to 0.8. This makes the VAD more conservative and prevents HVAC, keyboard clicks, or adjacent conversations from triggering false speech starts.
- Quiet environments (headsets, recording booths): lower vad_min_volume to 0.15–0.2 so soft-spoken users are not missed.
- Responsive feel: lower turn_stop_secs to 1.5–2.0 to reduce perceived latency between the user stopping and the bot responding. Trade-off: users who pause mid-sentence may get cut off.
- Deliberate speakers: raise turn_stop_secs to 3.0–4.0 to give more room for natural pauses.
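The tuning guidance above can be captured as named override sets layered on the production defaults. This is a sketch: the profile names and the tuned helper are hypothetical, and where the text gives a range the sketch picks one value from it.

```python
import json

# Production values from the table in this section.
BASE = {
    "turn_detection": "silence",
    "turn_stop_secs": 2.5,
    "vad_min_volume": 0.25,
    "vad_confidence": 0.75,
}

# Hypothetical profile names; values come from the guidance above.
PROFILES = {
    "noisy": {"vad_min_volume": 0.3, "vad_confidence": 0.8},
    "quiet": {"vad_min_volume": 0.2},        # suggested range: 0.15 to 0.2
    "responsive": {"turn_stop_secs": 1.5},   # suggested range: 1.5 to 2.0
    "deliberate": {"turn_stop_secs": 3.0},   # suggested range: 3.0 to 4.0
}


def tuned(*profiles: str) -> dict:
    """Return a copy of BASE with the named profiles applied in order."""
    settings = dict(BASE)
    for name in profiles:
        settings.update(PROFILES[name])
    return settings


if __name__ == "__main__":
    # Print the fields to merge into config.json for a noisy call centre.
    print(json.dumps(tuned("noisy"), indent=2))
```

Profiles that touch different fields, such as noisy and responsive, can be combined in one call.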
Session Store
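Configure a shared session_store when more than one instance serves traffic, because the requests that make up a session can land on different instances (/start creates a session and /sessions/{id}/api/offer retrieves it). For single-instance development, omit session_store to use the in-memory store.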
Recording
When recording.enable is true, Voxray writes each session’s audio to a temporary file and uploads it to the configured S3-compatible bucket after the session ends. Key details:
- worker_count controls the upload worker pool. With 4 workers, up to 4 uploads proceed concurrently. Tune upward on high-traffic instances.
- Workers stream from the temp file to S3; the full WAV is never held in memory during upload.
- S3 uploads retry with exponential backoff. The temp file is deleted only after a successful upload.
- Set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION (or equivalent) in the environment; Voxray uses the AWS SDK default credential chain.
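The retry and cleanup behaviour described above follows a common pattern, sketched here in Python. This is not Voxray's code: upload is a hypothetical callable standing in for the S3 client, and the attempt count and base delay are assumed values.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def upload_with_retry(
    path: str,
    upload: Callable[[str], None],
    max_attempts: int = 5,       # assumed value
    base_delay: float = 0.5,     # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Upload a temp file, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            upload(path)
        except Exception:
            if attempt == max_attempts - 1:
                return False  # give up; the temp file stays on disk
            sleep(base_delay * 2 ** attempt)
        else:
            # Delete the temp file only after a successful upload.
            os.remove(path)
            return True
    return False


def upload_all(
    paths: Iterable[str],
    upload: Callable[[str], None],
    worker_count: int = 4,
) -> List[bool]:
    """Run uploads on a pool of worker_count threads."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        return list(pool.map(lambda p: upload_with_retry(p, upload), paths))
```

Keeping the file on disk after the final failure leaves room for a later retry or manual recovery.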
Transcripts
Use sslmode=require in production. The dsn supports the standard libpq connection string format.
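A deployment script can check the DSN before the server starts. The sketch below accepts both libpq forms (URI and keyword/value) and treats require, verify-ca, and verify-full as acceptable. The helper name dsn_requires_tls and the example hosts are hypothetical, and the keyword/value parsing is deliberately simple: it does not handle quoted values containing spaces.

```python
from urllib.parse import parse_qs, urlsplit

# libpq sslmode values that refuse an unencrypted connection.
TLS_MODES = {"require", "verify-ca", "verify-full"}


def dsn_requires_tls(dsn: str) -> bool:
    """Return True if the DSN sets sslmode to require or stricter."""
    if "://" in dsn:
        # URI form: postgresql://user:pass@host:5432/db?sslmode=require
        values = parse_qs(urlsplit(dsn).query).get("sslmode", [])
        mode = values[-1] if values else ""
    else:
        # Keyword/value form: host=... dbname=... sslmode=require
        pairs = dict(
            token.split("=", 1) for token in dsn.split() if "=" in token
        )
        mode = pairs.get("sslmode", "")
    return mode in TLS_MODES
```

A DSN with no sslmode at all fails the check, which is intended: libpq's default mode allows unencrypted connections.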
Security
Setting server_api_key requires every HTTP request to carry the key in an Authorization: Bearer <key> header. Set this to a long random string in production.
cors_allowed_origins restricts which browser origins can connect. Pass an exact list of your frontend origins; wildcard ["*"] is not recommended for production.
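The sketch below shows one way to generate a suitable key, build the client header, and express both checks. It illustrates the general pattern only and does not describe Voxray's internals; all four function names are hypothetical.

```python
import hmac
import secrets
from typing import Mapping, Sequence


def generate_server_api_key() -> str:
    """Produce a long random string suitable for server_api_key."""
    return secrets.token_urlsafe(32)


def auth_headers(server_api_key: str) -> dict:
    """Headers a client sends with every HTTP request."""
    return {"Authorization": f"Bearer {server_api_key}"}


def is_authorized(headers: Mapping[str, str], server_api_key: str) -> bool:
    """Server-side check using a constant-time comparison."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    return scheme == "Bearer" and hmac.compare_digest(
        token.encode(), server_api_key.encode()
    )


def origin_allowed(origin: str, cors_allowed_origins: Sequence[str]) -> bool:
    """Exact-match origin check; no wildcard handling."""
    return origin in cors_allowed_origins
```

An exact-match list means a look-alike origin such as https://app.example.com.evil.test is rejected even though it shares a prefix with an allowed origin.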
Starting the Server
With json_logs: true and log_level: "info", Voxray emits structured JSON to stdout, compatible with any log aggregation stack (Datadog, Loki, CloudWatch).
Connecting a Client
WebSocket: to use RTVI, append ?rtvi=1 to the WebSocket URL and add "rtvi" to plugins. See the Plugin System reference for details.
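When the client builds its URL in code, the query parameter can be appended without disturbing any parameters already present. A small sketch: with_rtvi is a hypothetical helper, and the example host and /ws path in the usage note are placeholders, not Voxray's documented endpoint.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_rtvi(ws_url: str) -> str:
    """Return ws_url with rtvi=1 set in its query string."""
    parts = urlsplit(ws_url)
    query = dict(parse_qsl(parts.query))
    query["rtvi"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, with_rtvi("wss://voice.example.com/ws") returns "wss://voice.example.com/ws?rtvi=1".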
Production Checklist
Secrets out of config
Move all values under api_keys, server_api_key, and transcripts.dsn to environment variables or a secrets manager (AWS Secrets Manager, Vault, Doppler). The config file should contain only non-secret settings.
TLS termination
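Run Voxray behind a reverse proxy (nginx, Caddy, AWS ALB) that handles TLS. Browsers only expose the microphone to pages on secure origins, and a page served over HTTPS cannot open a plain ws:// connection, so WebRTC and WebSocket clients both need HTTPS/WSS in production.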
Redis persistence
Enable Redis persistence (appendonly yes) so session state survives Redis restarts. Use Redis Sentinel or Cluster for HA.
Horizontal scaling
Multiple Voxray replicas can run behind a load balancer as long as they share the same Redis instance. Each replica builds its own pipeline per transport connection — no shared in-process state.
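The "Secrets out of config" item in this checklist can be enforced in CI with a small lint over config.json. This is a sketch under assumptions: find_inline_secrets is a hypothetical helper, api_keys is assumed to be a flat object, and transcripts.dsn is assumed to mean a dsn field nested under transcripts.

```python
import json
import sys


def find_inline_secrets(config: dict) -> list:
    """Return the dotted names of secret fields that hold a value."""
    found = []
    # Assumption: api_keys is a flat object of provider name to key.
    for provider, value in (config.get("api_keys") or {}).items():
        if value:
            found.append(f"api_keys.{provider}")
    if config.get("server_api_key"):
        found.append("server_api_key")
    # Assumption: transcripts.dsn is a nested field.
    if (config.get("transcripts") or {}).get("dsn"):
        found.append("transcripts.dsn")
    return found


if __name__ == "__main__":
    # Usage: python check_config.py config.json
    with open(sys.argv[1]) as fh:
        leaked = find_inline_secrets(json.load(fh))
    if leaked:
        sys.exit("secrets found in config: " + ", ".join(leaked))
```

A non-zero exit status fails the pipeline whenever a secret is committed in the config file.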