Do I need CGO to run Voxray?
It depends on which transports you need.

No CGO required (WebSocket only):
The default build (`CGO_ENABLED=0`) produces a fully static binary and supports WebSocket (`/ws`) and HTTP transports. This is the right choice for most deployments and Docker images.

CGO required (WebRTC with Opus audio):
WebRTC TTS output requires the Opus encoder, which is a CGO dependency. If you use `transport: "smallwebrtc"` or `transport: "both"` and need TTS audio over WebRTC, build with CGO enabled. You will need `gcc` in `PATH`.

The Docker image in the repo uses a multi-stage build. If you need WebRTC + TTS, extend the Dockerfile with `CGO_ENABLED=1` and a GCC layer, or use `make build-voice` for local development.

Can I use different providers for STT, LLM, and TTS?
Yes. STT, LLM, and TTS providers are resolved independently, so you can freely mix any combination of supported providers.

The resolution order is:
1. `stt_provider` / `llm_provider` / `tts_provider`: task-specific overrides
2. `provider`: global fallback for any unset task provider
3. `"openai"`: final default when nothing is set

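The resolution order above can be sketched in a few lines of Go. This is an illustration only, not Voxray's actual code: the `Config` struct and the `resolveProvider` helper are hypothetical names, and only the config keys and the `"openai"` default come from the text.

```go
package main

import "fmt"

// Config mirrors the provider keys described above.
// The struct itself is hypothetical, not Voxray's real config type.
type Config struct {
	Provider    string `json:"provider"`
	STTProvider string `json:"stt_provider"`
	LLMProvider string `json:"llm_provider"`
	TTSProvider string `json:"tts_provider"`
}

const defaultProvider = "openai"

// resolveProvider applies the three-step order:
// task-specific override, then global fallback, then the final default.
func resolveProvider(taskOverride, global string) string {
	if taskOverride != "" {
		return taskOverride
	}
	if global != "" {
		return global
	}
	return defaultProvider
}

func main() {
	cfg := Config{Provider: "groq", TTSProvider: "elevenlabs"}
	fmt.Println("stt:", resolveProvider(cfg.STTProvider, cfg.Provider))
	fmt.Println("llm:", resolveProvider(cfg.LLMProvider, cfg.Provider))
	fmt.Println("tts:", resolveProvider(cfg.TTSProvider, cfg.Provider))
}
```

With this config, STT and LLM fall back to the global `groq` value while TTS uses its own override.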
How do I add a system prompt or persona?
System prompts and personas are not set in `config.json`. They live in the conversation context that your client sends to the server with each session start or LLM call.

When your client connects via WebSocket or WebRTC, include a system message at the top of the `messages` array in the LLM context payload.

This approach keeps persona logic in your application layer, where it belongs. It lets you swap personas dynamically per session without restarting the server or changing config.

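As a rough illustration, a client could build the context like this in Go. The `role`/`content` field names and the top-level `messages` key follow the common chat-completion shape and are assumptions here; check the Voxray protocol documentation for the exact payload your transport expects. `Message`, `LLMContext`, and `withSystemPrompt` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message and LLMContext are assumed shapes for illustration,
// not types taken from the Voxray codebase.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type LLMContext struct {
	Messages []Message `json:"messages"`
}

// withSystemPrompt places the persona as the first message,
// followed by any existing conversation history.
func withSystemPrompt(persona string, history []Message) LLMContext {
	msgs := make([]Message, 0, len(history)+1)
	msgs = append(msgs, Message{Role: "system", Content: persona})
	msgs = append(msgs, history...)
	return LLMContext{Messages: msgs}
}

func main() {
	ctx := withSystemPrompt(
		"You are a concise, friendly support agent.",
		[]Message{{Role: "user", Content: "Hi, I need help with my order."}},
	)
	payload, err := json.MarshalIndent(ctx, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

Because the persona is just the first entry in the array, switching personas per session only requires passing a different string.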
My voice agent isn't responding — troubleshooting checklist
Work through this checklist in order:

1. VAD thresholds
The default VAD (`vad_type: "energy"`) uses RMS energy detection. If your microphone is quiet or noisy, the agent may not detect speech.
- Increase sensitivity: lower `vad_threshold` (default `0.02`); try `0.01`.
- Check `turn_stop_secs` (default `3`): if it’s too long, the agent waits for a long pause before transcribing.
- Try `vad_type: "silero"` for more robust voice activity detection at the cost of higher CPU.

2. Microphone access
Look for `NotAllowedError` or `NotFoundError` from the WebRTC or WebSocket audio capture. These errors mean the browser was denied microphone permission or found no input device.

3. PCM audio format
Voxray expects 16-bit signed PCM, mono channel, 16 kHz from the client. Non-standard sample rates or channel counts will cause STT to fail silently or produce garbled transcripts.

4. API key validity
Confirm the relevant API keys are set and not expired. Check server logs (`VOXRAY_LOG_LEVEL=debug`) for authentication errors from provider clients.

5. Network connectivity
Ensure the Voxray server can reach the configured provider API endpoints. In firewalled or air-gapped environments, outbound HTTPS to provider domains must be allowed.

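To see why lowering `vad_threshold` helps with quiet microphones, here is a minimal sketch of RMS energy detection in Go. It assumes samples are normalized by 32768 before comparison with the threshold; Voxray's actual normalization and smoothing may differ, and `rmsEnergy` and `isSpeech` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// rmsEnergy returns the root-mean-square level of 16-bit samples,
// normalized to the range [0, 1]. The normalization is an assumption.
func rmsEnergy(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s) / 32768.0
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// isSpeech reports whether a frame's energy reaches the threshold.
func isSpeech(samples []int16, threshold float64) bool {
	return rmsEnergy(samples) >= threshold
}

func main() {
	// A quiet frame with an RMS level of roughly 0.015.
	quiet := []int16{500, -500, 500, -500}
	fmt.Printf("rms=%.4f\n", rmsEnergy(quiet))
	fmt.Println("detected at 0.02:", isSpeech(quiet, 0.02))
	fmt.Println("detected at 0.01:", isSpeech(quiet, 0.01))
}
```

The quiet frame falls below the default threshold of `0.02` but is detected once the threshold is lowered to `0.01`.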
How do I scale Voxray horizontally?
Voxray supports horizontal scaling via a shared Redis session store.

Single instance (default):
No extra setup required. The in-memory session store keeps all session state in the process. Suitable for vertical scaling and single-server deployments.

Multiple instances behind a load balancer:
- Set `session_store: "redis"` in config.
- Set `redis_url` to your Redis connection string, e.g. `redis://redis:6379/0`.
- Optionally tune `session_ttl_secs` (default `3600`).

When the session store is Redis, the `/ready` endpoint returns 503 if Redis is unreachable. Use this as your load balancer readiness probe.

Load balancer requirements:
WebSocket connections are long-lived. Configure your load balancer to:
- Enable WebSocket upgrade support (most modern LBs do this automatically).
- Use sticky sessions (IP hash or cookie affinity) if your LB does not forward session IDs in headers, or ensure all session API calls (`/sessions/{id}/...`) are routed by session ID. With Redis, any instance can serve them because state is shared.

The `GET /health` endpoint is a pure liveness check with no dependencies. The `GET /ready` endpoint checks Redis connectivity when Redis is configured. Wire them to separate liveness and readiness probes in Kubernetes.

Can I run Voxray fully locally without any cloud services?
Yes. You can run a fully local, offline-capable voice agent using:

| Task | Local Provider | Notes |
|---|---|---|
| LLM | `ollama` | Run `ollama serve` and `ollama pull llama3.2` first |
| STT | `sarvam` or `whisper` | Sarvam has a cloud API; for fully offline STT use a local Whisper server (`whisper` provider + `WHISPER_BASE_URL`) |
| TTS | `xtts` | Run a local Coqui XTTS server and set `XTTS_BASE_URL` |

Set the base URL environment variables (such as `WHISPER_BASE_URL` and `XTTS_BASE_URL`) to point to your local services.

What audio format does Voxray expect from the client?
Voxray expects raw audio from the client in the following format:

| Property | Value |
|---|---|
| Encoding | 16-bit signed PCM (little-endian) |
| Channels | Mono (1 channel) |
| Sample rate | 16,000 Hz (16 kHz) |
| Container | None: raw bytes, no WAV/MP3 header |

Send audio as binary WebSocket frames (or the equivalent for your transport) containing consecutive raw PCM samples. The pipeline’s VAD and turn detection operate directly on this byte stream.

Common mistakes:
- Sending float32 PCM instead of int16: causes extremely loud or silent audio.
- Sending stereo (2-channel) audio: channel mixing is not performed, and STT accuracy degrades significantly.
- Sending at 44.1 kHz or 48 kHz without resampling: the pipeline does not resample, and STT providers expect 16 kHz input.

Some telephony adapters (Twilio, Telnyx, Vonage) send audio in their own codec (e.g. µ-law 8 kHz). Voxray includes serialization helpers for these formats under `pkg/frames/serialize/`; the pipeline converts automatically when the correct transport adapter is used.

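Two of the common mistakes above (float32 samples and stereo input) can be corrected on the client before sending. The Go sketch below shows one way to do it; `floatToPCM16LE` and `stereoToMono` are hypothetical helpers, not part of Voxray, and resampling to 16 kHz is a separate step that is not covered here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// floatToPCM16LE converts float32 samples in [-1, 1] to
// 16-bit signed little-endian PCM bytes, clamping out-of-range values.
func floatToPCM16LE(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		v := math.Max(-1, math.Min(1, float64(s)))
		pcm := int16(math.Round(v * 32767))
		binary.LittleEndian.PutUint16(out[2*i:], uint16(pcm))
	}
	return out
}

// stereoToMono averages interleaved left/right int16 samples
// into a single mono channel.
func stereoToMono(interleaved []int16) []int16 {
	mono := make([]int16, len(interleaved)/2)
	for i := range mono {
		l := int32(interleaved[2*i])
		r := int32(interleaved[2*i+1])
		mono[i] = int16((l + r) / 2)
	}
	return mono
}

func main() {
	frame := floatToPCM16LE([]float32{0, 1, -1})
	fmt.Printf("% x\n", frame)
	fmt.Println(stereoToMono([]int16{100, 200, -100, -300}))
}
```

The resulting byte slice can be sent as a binary WebSocket frame with no header.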
How do I reduce latency?
Voice agent latency is the sum of STT time + LLM time to first token + TTS time to first audio chunk. Each leg can be optimized independently.

STT latency
Use `groq` as your STT provider. Groq’s hardware accelerates Whisper large-v3 to typically under 300 ms on short utterances.

LLM latency
- `groq` with `llama-3.1-8b-instant` delivers the lowest time-to-first-token (~50–150 ms).
- `cerebras` with `llama3.1-8b` is an alternative with competitive latency.
- Avoid large models (70B+) in latency-sensitive paths unless you have dedicated infrastructure.

TTS latency
- `elevenlabs` with `eleven_turbo_v2` or `eleven_flash_v2_5` models minimizes streaming TTS latency.
- `openai` with `tts-1` (not `tts-1-hd`) is faster at the cost of some audio quality.

Infrastructure
- Co-locate your Voxray server in the same cloud region as your provider APIs.
- Enable WebSocket write coalescing (`ws_write_coalesce_ms`) carefully. It reduces syscalls but adds a small fixed latency budget per flush window, so only enable it if you are syscall-bound.
- For the absolute lowest end-to-end latency, use the OpenAI Realtime API (`realtime` provider), which eliminates the STT → LLM → TTS pipeline altogether in favor of a single real-time model.

Does Voxray support recording calls?
Yes. Voxray has built-in support for recording conversations and logging transcripts.

Audio recording to S3:
Enable recording in `config.json` or through environment variables. Recordings are streamed asynchronously to S3 from temp files (no full WAV in memory) via a worker pool. Failed uploads are retried with exponential backoff up to `max_retries` times.

Transcript logging to a database:
Enable transcript logging in `config.json` or via environment variables.

Both features can be enabled simultaneously; recording and transcript logging are independent.

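For readers unfamiliar with the retry behavior described for failed uploads, the following Go sketch shows a generic exponential backoff loop. It is not Voxray's uploader: the 100 ms base delay, the doubling factor, and the names `retryWithBackoff` and `simulate` are assumptions for illustration; only the idea of retrying up to `max_retries` times comes from the text.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff runs op once, then retries up to maxRetries times,
// doubling the delay after each failure. sleep is injectable for tests.
func retryWithBackoff(op func() error, maxRetries int, base time.Duration, sleep func(time.Duration)) error {
	var err error
	delay := base
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break
		}
		sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

// simulate runs an upload that fails a fixed number of times and
// records the delays (in milliseconds) instead of actually sleeping.
func simulate(failures, maxRetries int) ([]int64, error) {
	var delaysMs []int64
	calls := 0
	op := func() error {
		calls++
		if calls <= failures {
			return errors.New("upload failed")
		}
		return nil
	}
	sleep := func(d time.Duration) { delaysMs = append(delaysMs, d.Milliseconds()) }
	err := retryWithBackoff(op, maxRetries, 100*time.Millisecond, sleep)
	return delaysMs, err
}

func main() {
	delays, err := simulate(2, 3)
	fmt.Println("delays (ms):", delays, "error:", err)

	delays, err = simulate(5, 3)
	fmt.Println("delays (ms):", delays, "error:", err)
}
```

Injecting the sleep function keeps the example fast to run and makes the delay schedule easy to inspect.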
How do I add a new provider?
Adding a new STT, LLM, or TTS provider involves four steps:
1. Create a package under `pkg/services/<provider>/` implementing the relevant interface(s) from `pkg/services/interfaces.go` (`LLMService`, `STTService`, or `TTSService`).
2. Add a provider constant in `pkg/services/factory.go`.
3. Register the provider in the appropriate `Supported*Providers` slice and add a case to `apiKeyForProvider`, `NewLLMFromConfig`, `NewSTTFromConfig`, or `NewTTSFromConfig` in `factory.go`.
4. Add tests under `tests/pkg/` covering the new service.
When opening a PR for a new provider, include at least one integration test and document the required environment variables and config keys.
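The overall shape of these steps can be sketched as follows. Everything in this example is hypothetical: the `ttsService` interface, its `Synthesize` method, the `acme` provider, and the `newTTS` factory are stand-ins chosen for illustration and do not reflect the real signatures in `pkg/services/interfaces.go` or `factory.go`. Consult those files for the actual method sets.

```go
package main

import (
	"errors"
	"fmt"
)

// ttsService is a simplified, hypothetical stand-in for a TTS interface.
type ttsService interface {
	Synthesize(text string) ([]byte, error)
}

// providerAcme is a hypothetical provider constant.
const providerAcme = "acme"

// supportedTTSProviders mimics a registration slice (illustrative entries).
var supportedTTSProviders = []string{"openai", "elevenlabs", "xtts", providerAcme}

// acmeTTS is a placeholder implementation of the new provider.
type acmeTTS struct{ apiKey string }

func (a *acmeTTS) Synthesize(text string) ([]byte, error) {
	if text == "" {
		return nil, errors.New("acme: empty text")
	}
	// A real implementation would call the provider API here.
	return []byte("pcm:" + text), nil
}

// newTTS mimics the factory switch that maps a provider name to a service.
func newTTS(provider, apiKey string) (ttsService, error) {
	switch provider {
	case providerAcme:
		return &acmeTTS{apiKey: apiKey}, nil
	default:
		return nil, fmt.Errorf("unsupported TTS provider %q", provider)
	}
}

func main() {
	svc, err := newTTS(providerAcme, "example-key")
	if err != nil {
		panic(err)
	}
	audio, err := svc.Synthesize("hello")
	fmt.Println(string(audio), err)
	fmt.Println("registered:", supportedTTSProviders)
}
```

A unit test for the new service would exercise the factory case and the error path for empty or invalid input in the same way.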