
What you’ll build

A Voxray server that handles end-to-end voice conversations entirely through OpenAI’s APIs:
  • STT: OpenAI speech-to-text (gpt-4o-mini-transcribe), streaming transcription
  • LLM: GPT-4o Mini, fast and capable chat completions with streaming
  • TTS: OpenAI TTS (nova voice), natural-sounding speech synthesis
Expected latency from end of user speech to first audio byte: ~1.1 seconds (STT ~300ms + LLM first token ~500ms + TTS ~300ms).
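The latency figure is a sum of sequential stages, so it can be recomputed when a stage changes (for example after swapping the STT model). A minimal sketch using only the per-stage estimates quoted above; the dictionary keys are illustrative names, not Voxray identifiers:

```python
# Per-stage latency estimates in milliseconds, taken from the figures above.
# The key names are illustrative only.
STAGE_LATENCY_MS = {
    "stt": 300,              # end of speech -> transcript
    "llm_first_token": 500,  # transcript -> first LLM token
    "tts_first_audio": 300,  # first token -> first audio byte
}


def time_to_first_audio_ms(stages: dict) -> int:
    """Sum the stage latencies; the stages run one after another."""
    return sum(stages.values())


total_ms = time_to_first_audio_ms(STAGE_LATENCY_MS)
print(f"Estimated time to first audio: {total_ms / 1000:.1f} s")
```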

Prerequisites

  • Voxray binary built (go build -o voxray ./cmd/voxray) or downloaded
  • An OpenAI API key with active billing credits (platform.openai.com/api-keys)
  • No other accounts or keys required

Steps

1. Set your API key
Voxray reads the OpenAI key from config.json or from the environment variable OPENAI_API_KEY. Using the environment variable keeps secrets out of your config file:
Environment variable (recommended)
export OPENAI_API_KEY="sk-..."
When OPENAI_API_KEY is set, you can omit api_keys.openai from config.json entirely.
config.json
Place the key directly in your config (see the next step). Suitable for local development; do not commit this file to source control.
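The two options can be pictured as a simple lookup order. The sketch below is not Voxray's code; it is a hypothetical helper (`resolve_openai_key`) that assumes the environment variable wins when both sources are set, which is how this guide describes the behaviour:

```python
import os
from typing import Mapping, Optional


def resolve_openai_key(config: Mapping, env: Mapping = os.environ) -> Optional[str]:
    """Return the key from the environment if set, else from config, else None.

    Hypothetical helper: illustrates the lookup order this guide describes,
    not Voxray's actual implementation.
    """
    env_key = env.get("OPENAI_API_KEY", "").strip()
    if env_key:
        return env_key
    return config.get("api_keys", {}).get("openai") or None


# Both sources set: the environment value is used.
print(resolve_openai_key({"api_keys": {"openai": "sk-config"}},
                         {"OPENAI_API_KEY": "sk-env"}))
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.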
2. Write your config
Create config.json in your working directory:
{
  "host": "localhost",
  "port": 8080,
  "transport": "websocket",
  "stt_provider": "openai",
  "llm_provider": "openai",
  "tts_provider": "openai",
  "model": "gpt-4o-mini",
  "stt_model": "gpt-4o-mini-transcribe",
  "tts_voice": "nova",
  "api_keys": {
    "openai": "sk-..."
  }
}
If you exported OPENAI_API_KEY in the previous step, you can remove the api_keys block. Voxray checks the environment before the config file.
Key fields explained:
| Field | Value | Purpose |
| --- | --- | --- |
| stt_provider | "openai" | Routes audio through OpenAI’s transcription endpoint |
| llm_provider | "openai" | Routes transcripts to OpenAI chat completions |
| tts_provider | "openai" | Synthesises LLM output via OpenAI TTS |
| model | "gpt-4o-mini" | LLM model for chat completions |
| stt_model | "gpt-4o-mini-transcribe" | Transcription model |
| tts_voice | "nova" | Voice character for synthesised speech |
| transport | "websocket" | Enables the /ws WebSocket endpoint |
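If you generate config.json from a script, a quick check that the fields above are present can catch typos before Voxray starts. This is a sketch under one assumption: it treats every field from the table as required, whereas Voxray itself may apply defaults for some of them. `missing_fields` is a hypothetical helper, not part of Voxray.

```python
import json

# Assumption: this sketch treats every field from the table as required.
# Voxray itself may fall back to defaults for some of them.
EXPECTED_FIELDS = (
    "transport", "stt_provider", "llm_provider", "tts_provider",
    "model", "stt_model", "tts_voice",
)


def missing_fields(config: dict) -> list:
    """Return the expected field names that are absent or empty."""
    return [name for name in EXPECTED_FIELDS if not config.get(name)]


# Same values as the config above, without the api_keys block
# (relies on OPENAI_API_KEY being exported).
config = {
    "host": "localhost",
    "port": 8080,
    "transport": "websocket",
    "stt_provider": "openai",
    "llm_provider": "openai",
    "tts_provider": "openai",
    "model": "gpt-4o-mini",
    "stt_model": "gpt-4o-mini-transcribe",
    "tts_voice": "nova",
}

problems = missing_fields(config)
if problems:
    raise SystemExit(f"config is missing: {', '.join(problems)}")
print(json.dumps(config, indent=2))
```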
3. Start Voxray
./voxray -config config.json
You should see output like:
Voxray listening on localhost:8080
transport: websocket  stt: openai  llm: openai  tts: openai
4. Connect a client
Open a WebSocket connection to ws://localhost:8080/ws. Any client that sends raw PCM audio frames (16-bit, 16kHz, mono) and receives audio frames back will work.
Voxray web client
Open http://localhost:8080 in your browser if you have the bundled web client enabled. Click Connect, then Start speaking.
wscat (quick test)
npm install -g wscat
wscat -c ws://localhost:8080/ws
The server will respond with JSON frames for transcriptions and text, and binary frames for audio.
RTVI client
Connect with ?rtvi=1 query parameter to use the RTVI protocol:
ws://localhost:8080/ws?rtvi=1
Compatible with any RTVI-spec client SDK.
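For a custom client, the main job on the sending side is slicing raw PCM into frames. The sketch below assumes 20 ms frames sent as binary WebSocket messages; the frame duration and the zero-padding of the final frame are assumptions of this example, not documented Voxray requirements. It uses the third-party `websockets` package for the network part.

```python
import asyncio

SAMPLE_RATE_HZ = 16_000  # 16kHz, as stated in this guide
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHANNELS = 1             # mono
FRAME_MS = 20            # assumption: 20 ms of audio per WebSocket message

# 16000 samples/s * 2 bytes * 1 channel * 0.020 s = 640 bytes per frame
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * FRAME_MS // 1000


def pcm_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES):
    """Yield fixed-size chunks of raw PCM; the last chunk is zero-padded."""
    for start in range(0, len(pcm), frame_bytes):
        chunk = pcm[start:start + frame_bytes]
        yield chunk.ljust(frame_bytes, b"\x00")


async def stream_pcm(url: str, pcm: bytes) -> None:
    """Send PCM frames as binary messages at roughly real-time pace."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(url) as ws:
        for frame in pcm_frames(pcm):
            await ws.send(frame)  # bytes are sent as a binary message
            await asyncio.sleep(FRAME_MS / 1000)


# Usage (needs a running Voxray server and a raw 16-bit/16kHz/mono file):
# asyncio.run(stream_pcm("ws://localhost:8080/ws", open("speech.raw", "rb").read()))
```

Pacing the sends with `asyncio.sleep` approximates a live microphone; sending a whole file at once may behave differently with turn detection.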
5. Test the pipeline
Speak a sentence. You should observe:
  • A TranscriptionFrame JSON message appears within ~300ms of you finishing speaking (OpenAI STT)
  • A TextFrame starts streaming back within ~500ms of the transcript arriving (GPT-4o Mini)
  • Audio bytes begin arriving within ~300ms of the first text token (OpenAI TTS)
Total time from silence to first audio: ~1.1 seconds under normal network conditions.
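To check these timings from a client, it is enough to separate text (JSON) messages from binary (audio) messages and note when the first of each arrives. A minimal sketch: `TurnLog` is a hypothetical helper, and the JSON payload in the usage lines is made up for illustration, since the exact frame schema is not shown in this guide.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnLog:
    """Collects the messages of one turn and when each kind first arrived."""
    json_messages: list = field(default_factory=list)
    audio_bytes: int = 0
    first_seen: dict = field(default_factory=dict)

    def handle(self, message, now: Optional[float] = None) -> str:
        """Route one WebSocket message: bytes -> audio, str -> JSON frame."""
        now = time.monotonic() if now is None else now
        if isinstance(message, (bytes, bytearray)):
            kind = "audio"
            self.audio_bytes += len(message)
        else:
            kind = "json"
            self.json_messages.append(json.loads(message))
        self.first_seen.setdefault(kind, now)  # keep only the first arrival
        return kind


log = TurnLog()
# Hypothetical payload: the real field names may differ.
log.handle('{"type": "TranscriptionFrame", "text": "hello"}', now=1.0)
log.handle(b"\x00" * 640, now=1.8)
gap = log.first_seen["audio"] - log.first_seen["json"]
print(f"first audio arrived {gap:.1f} s after the first JSON frame")
```

In a real client, call `log.handle(message)` for every message received from the WebSocket and omit `now` so the monotonic clock is used.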

Available models

Use the tables below to tune the trade-off between cost, speed, and quality.

LLM models (model)

| Model | Speed | Quality | Best for |
| --- | --- | --- | --- |
| gpt-4o-mini | Fast | High | Default; best cost/quality ratio |
| gpt-4o | Medium | Highest | Complex reasoning, nuanced conversations |
| gpt-4.1-mini | Fast | High | Latest mini-class, improved instruction following |
| gpt-4.1 | Medium | Highest | Latest full model, strong tool use |

STT models (stt_model)

| Model | Latency | Notes |
| --- | --- | --- |
| gpt-4o-mini-transcribe | ~300ms | Recommended; good accuracy, low cost |
| whisper-1 | ~400ms | Classic Whisper; higher cost, similar accuracy |

TTS voices (tts_voice)

| Voice | Character |
| --- | --- |
| nova | Warm, conversational (recommended default) |
| alloy | Neutral, balanced |
| echo | Measured, authoritative |
| fable | British, expressive |
| onyx | Deep, confident |
| shimmer | Clear, optimistic |

Cost estimate

Rough cost for one hour of continuous conversation (OpenAI list prices as of May 2026):
  • STT (gpt-4o-mini-transcribe): ~$0.003/min × 60 min = ~$0.18/hour
  • LLM (gpt-4o-mini): depends heavily on conversation length; typical voice session ~2k tokens/min → ~$0.12–0.30/hour
  • TTS: ~$0.015/1k chars × ~300 chars/min × 60 min = ~$0.27/hour
Total estimate: $0.57–0.75 per hour of conversation. Switch to gpt-4o for best quality at roughly 5–8× the LLM cost. These are rough estimates; check platform.openai.com/pricing for current rates.
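The arithmetic behind these figures is easy to rerun with your own rates or session length. A small sketch that uses only the numbers quoted in this section; the constant and function names are illustrative, and the rates should be replaced with current pricing before relying on the result:

```python
# Rates quoted in this section; replace with current pricing before use.
STT_USD_PER_MIN = 0.003
TTS_USD_PER_1K_CHARS = 0.015
TTS_CHARS_PER_MIN = 300              # rough speaking rate assumed by the guide
LLM_USD_PER_HOUR = (0.12, 0.30)      # low and high estimate for gpt-4o-mini


def conversation_cost(minutes: float = 60) -> dict:
    """Return per-component and total cost (USD) for a session of `minutes`."""
    stt = STT_USD_PER_MIN * minutes
    tts = TTS_USD_PER_1K_CHARS * (TTS_CHARS_PER_MIN / 1000) * minutes
    llm_low, llm_high = (rate * minutes / 60 for rate in LLM_USD_PER_HOUR)
    return {
        "stt": stt,
        "tts": tts,
        "total_low": stt + tts + llm_low,
        "total_high": stt + tts + llm_high,
    }


cost = conversation_cost(60)
print(f"${cost['total_low']:.2f} to ${cost['total_high']:.2f} per hour")
```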

Upgrading to GPT-4o Realtime

OpenAI’s Realtime API replaces the separate STT → LLM → TTS chain with a single WebSocket, cutting latency significantly (typically under 500ms TTFR) at a higher per-minute cost. Voxray supports it via runner_transport + the realtime integration. See OpenAI Realtime integration for setup instructions.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized from OpenAI | Invalid or missing API key | Verify OPENAI_API_KEY is set and the key is active |
| No transcription appearing | Audio not reaching STT | Confirm your client is sending 16kHz PCM audio |
| Long pauses before LLM responds | Turn detection holding | Lower turn_stop_secs in config (default 3.0s); try 1.5 |
| TTS audio choppy | Network jitter | Check your connection; TTS streams in chunks and may buffer |
| insufficient_quota error | OpenAI account has no credits | Add billing at platform.openai.com/settings/billing |
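For the turn-detection row, the change is a single config value. The sketch below assumes turn_stop_secs is a top-level key in config.json, which this guide does not state explicitly; `with_turn_stop_secs` is a hypothetical helper that returns a modified copy so the original config stays untouched.

```python
import json


def with_turn_stop_secs(config: dict, secs: float = 1.5) -> dict:
    """Return a copy of config with turn_stop_secs set to `secs`.

    Assumption: turn_stop_secs is a top-level key in config.json.
    """
    if secs <= 0:
        raise ValueError("turn_stop_secs must be positive")
    updated = dict(config)  # shallow copy; the input dict is not modified
    updated["turn_stop_secs"] = secs
    return updated


base = {"transport": "websocket", "model": "gpt-4o-mini"}
tuned = with_turn_stop_secs(base, 1.5)
print(json.dumps(tuned, indent=2))
```

Shorter values make Voxray respond sooner after you stop speaking but are more likely to cut in during natural pauses, so adjust in small steps.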