What you’ll build
A Voxray server that handles end-to-end voice conversations entirely through OpenAI’s APIs:
- STT: OpenAI Whisper (gpt-4o-mini-transcribe), streaming transcription
- LLM: GPT-4o Mini, fast and capable chat completions with streaming
- TTS: OpenAI TTS (nova voice), natural-sounding speech synthesis
Prerequisites
- Voxray binary built (go build -o voxray ./cmd/voxray) or downloaded
- An OpenAI API key with active billing credits (platform.openai.com/api-keys)
- No other accounts or keys required
Steps
Voxray reads the OpenAI key from config.json or from the environment variable OPENAI_API_KEY. Using the environment variable keeps secrets out of your config file.
Environment variable (recommended)
Export OPENAI_API_KEY in your shell before starting Voxray. Once OPENAI_API_KEY is set, you can omit api_keys.openai from config.json entirely.
config.json
Place the key directly in your config (see the next step). This is suitable for local development; do not commit this file to source control.
{
  "host": "localhost",
  "port": 8080,
  "transport": "websocket",
  "stt_provider": "openai",
  "llm_provider": "openai",
  "tts_provider": "openai",
  "model": "gpt-4o-mini",
  "stt_model": "gpt-4o-mini-transcribe",
  "tts_voice": "nova",
  "api_keys": {
    "openai": "sk-..."
  }
}
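The lookup order this guide describes (the OPENAI_API_KEY environment variable first, then api_keys.openai in config.json) can be sketched in a few lines of Python. This is only an illustration of the precedence, not Voxray's own code; the function name resolve_openai_key and its parameters are hypothetical.

```python
import json
import os


def resolve_openai_key(config_path="config.json", environ=os.environ):
    """Return the OpenAI key, preferring the environment over config.json.

    Hypothetical helper: it mirrors the precedence described in this guide
    and is not part of Voxray.
    """
    # 1. The environment variable wins when it is set and non-empty.
    key = environ.get("OPENAI_API_KEY")
    if key:
        return key

    # 2. Otherwise fall back to api_keys.openai in the config file.
    try:
        with open(config_path, encoding="utf-8") as fh:
            config = json.load(fh)
    except FileNotFoundError:
        return None
    return config.get("api_keys", {}).get("openai") or None
```

Running a check like this before launching the server is one way to confirm that a key is visible from at least one of the two locations.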
If you exported OPENAI_API_KEY in the previous step, you can remove the api_keys block. Voxray checks the environment before the config file.
| Field | Value |
|---|---|
| stt_provider | "openai" |
| llm_provider | "openai" |
| tts_provider | "openai" |
| model | "gpt-4o-mini" |
| stt_model | "gpt-4o-mini-transcribe" |
| tts_voice | "nova" |
| transport | "websocket" (/ws WebSocket endpoint) |
Open a WebSocket connection to ws://localhost:8080/ws. Any client that sends raw PCM audio frames (16-bit, 16 kHz, mono) and receives audio frames back will work.
Voxray web client
Open http://localhost:8080 in your browser if you have the bundled web client enabled. Click Connect, then Start speaking.
wscat (quick test)
Run wscat -c ws://localhost:8080/ws to confirm that the endpoint accepts connections.
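If wscat is not installed, a comparable connectivity check can be sketched in Python using only the standard library. The sketch performs the WebSocket opening handshake from RFC 6455 against the /ws endpoint and verifies the server's Sec-WebSocket-Accept header. The helper names accept_for and check_ws are hypothetical, and the sketch assumes the server does not require extra headers such as Origin or authentication. It only tests the upgrade; it does not send audio.

```python
import base64
import hashlib
import os
import socket

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_for(key: str) -> str:
    """Return the Sec-WebSocket-Accept value a compliant server sends for `key`."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def check_ws(host="localhost", port=8080, path="/ws", timeout=5.0) -> bool:
    """Attempt a WebSocket upgrade and report whether the server accepted it."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n"
    )
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii"))
        response = b""
        # Read until the end of the HTTP response headers.
        while b"\r\n\r\n" not in response:
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk

    head = response.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    status_ok = lines[0].split(" ")[1:2] == ["101"]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_ok and headers.get("sec-websocket-accept") == accept_for(key)
```

With Voxray running on the host and port from the config above, calling check_ws() should return True if the upgrade succeeds; a ConnectionRefusedError usually means the server is not listening on that port.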
RTVI client
Connect with the ?rtvi=1 query parameter to use the RTVI protocol. Compatible with any RTVI-spec client SDK.
Once you are connected and speaking, you should see:
- A TranscriptionFrame JSON message appears within ~300ms of you finishing speaking (Whisper STT)
- A TextFrame starts streaming back within ~500ms of the transcript arriving (GPT-4o Mini)
Available models
Use the tables below to tune the trade-off between cost, speed, and quality.
LLM models (model)
| Model | Speed | Quality | Best for |
|---|---|---|---|
| gpt-4o-mini | Fast | High | Default; best cost/quality ratio |
| gpt-4o | Medium | Highest | Complex reasoning, nuanced conversations |
| gpt-4.1-mini | Fast | High | Latest mini-class, improved instruction following |
| gpt-4.1 | Medium | Highest | Latest full model, strong tool use |
STT models (stt_model)
| Model | Latency | Notes |
|---|---|---|
| gpt-4o-mini-transcribe | ~300ms | Recommended; good accuracy, low cost |
| whisper-1 | ~400ms | Classic Whisper; higher cost, similar accuracy |
TTS voices (tts_voice)
| Voice | Character |
|---|---|
| nova | Warm, conversational (recommended default) |
| alloy | Neutral, balanced |
| echo | Measured, authoritative |
| fable | British, expressive |
| onyx | Deep, confident |
| shimmer | Clear, optimistic |
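A typo in model, stt_model, or tts_voice is easy to make when switching between the options above. The following is a minimal sketch of a pre-flight check, assuming only the values listed in this guide's tables; Voxray or OpenAI may accept other values, so treat a mismatch as a warning rather than an error. The names KNOWN_VALUES and check_choices are hypothetical.

```python
# Values copied from the tables in this guide; not an exhaustive list
# of what Voxray or OpenAI accepts.
KNOWN_VALUES = {
    "model": {"gpt-4o-mini", "gpt-4o", "gpt-4.1-mini", "gpt-4.1"},
    "stt_model": {"gpt-4o-mini-transcribe", "whisper-1"},
    "tts_voice": {"nova", "alloy", "echo", "fable", "onyx", "shimmer"},
}


def check_choices(config: dict) -> list[str]:
    """Return one message per field whose value is missing or not listed here."""
    problems = []
    for field, allowed in KNOWN_VALUES.items():
        value = config.get(field)
        if value not in allowed:
            problems.append(
                f"{field}={value!r} is not listed in this guide "
                f"(listed values: {sorted(allowed)})"
            )
    return problems
```

Passing the parsed contents of config.json to check_choices returns an empty list when all three fields match a table entry.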
Cost estimate
Upgrading to GPT-4o Realtime
OpenAI’s Realtime API replaces the separate STT → LLM → TTS chain with a single WebSocket, cutting latency significantly (typically under 500ms TTFR) at a higher per-minute cost. Voxray supports it via runner_transport and the realtime integration. See OpenAI Realtime integration for setup instructions.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| 401 Unauthorized from OpenAI | Invalid or missing API key | Verify OPENAI_API_KEY is set and the key is active |
| No transcription appearing | Audio not reaching STT | Confirm your client is sending 16kHz PCM audio |
| Long pauses before LLM responds | Turn detection holding | Lower turn_stop_secs in config (default 3.0s); try 1.5 |
| TTS audio choppy | Network jitter | Check your connection; TTS streams in chunks and may buffer |
| insufficient_quota error | OpenAI account has no credits | Add billing at platform.openai.com/settings/billing |
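For the "No transcription appearing" row, one way to rule out a format problem is to test with a WAV file that is known to be 16-bit, 16 kHz, mono. The sketch below uses Python's standard wave module to verify those three properties and to split the raw PCM into fixed-size chunks. The helper names are hypothetical, and the 20 ms chunk duration is an assumption; this guide does not state a frame size that Voxray requires.

```python
import wave

SAMPLE_RATE = 16_000  # Hz, as stated in this guide
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def load_pcm(path: str) -> bytes:
    """Return raw PCM from a WAV file, or raise ValueError if the format differs."""
    with wave.open(path, "rb") as wav:
        actual = (wav.getframerate(), wav.getsampwidth(), wav.getnchannels())
        expected = (SAMPLE_RATE, SAMPLE_WIDTH, CHANNELS)
        if actual != expected:
            raise ValueError(
                f"expected (rate, width, channels)={expected}, got {actual}"
            )
        return wav.readframes(wav.getnframes())


def split_frames(pcm: bytes, frame_ms: int = 20) -> list[bytes]:
    """Split PCM into fixed-duration chunks; 20 ms is an assumed frame size."""
    size = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS * frame_ms // 1000
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

If load_pcm raises ValueError for the audio your client records, resample or downmix it before sending; each chunk from split_frames could then be sent as one binary WebSocket message.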