Capabilities
STT
Whisper and GPT-4o transcription models
LLM
GPT-4o, GPT-4.1, and GPT-3.5 family with full tool-calling support
TTS
Six neural voices via the OpenAI audio API
Realtime
Bidirectional audio session via gpt-4o-realtime-preview
API Key
Set OPENAI_API_KEY as an environment variable, or pass the key inline under api_keys in config.json.
Get your key at platform.openai.com/api-keys.
Quick Config
- config.json
- Environment variable
Available LLM Models
| Model | Context | Description |
|---|---|---|
| gpt-4o | 128k | Flagship multimodal model; highest reasoning quality in the GPT-4o family |
| gpt-4o-mini | 128k | Fast and cost-efficient; recommended default for most voice agent workloads |
| gpt-4.1 | 1M | Long-context reasoning model with improved instruction following |
| gpt-4.1-mini | 1M | Smaller variant of GPT-4.1; better throughput than gpt-4.1 at lower cost |
| gpt-4.1-nano | 1M | Ultra-light; optimized for latency-sensitive tasks with simple instructions |
The default when model is empty and llm_provider is "openai" is gpt-3.5-turbo.
Available STT Models
| Model | Description |
|---|---|
| whisper-1 | Original Whisper model hosted by OpenAI; broad language support |
| gpt-4o-mini-transcribe | GPT-4o Mini transcription; faster and cheaper than gpt-4o-transcribe; recommended default |
| gpt-4o-transcribe | Highest-accuracy transcription; use when word error rate matters more than latency |
Set the model via stt_model in config. The default STT service (stt.NewOpenAI) uses whisper-1 when stt_model is not specified.
TTS Voices
| Voice | Character |
|---|---|
| alloy | Neutral, balanced; works well for informational agents |
| echo | Deep and resonant; suits formal or authoritative personas |
| fable | Warm and expressive; good for storytelling or friendly agents |
| onyx | Clear and confident; popular for customer-service workloads |
| nova | Bright and energetic; recommended default for most voice agents |
| shimmer | Soft and calm; suits wellness, meditation, or low-stress contexts |
Set the voice via tts_voice in config. The TTS model (e.g. tts-1, tts-1-hd) can be set via tts_model.
Tool Calling (MCP)
OpenAI is one of two providers in Voxray that implement LLMServiceWithTools. When an MCP server is configured (mcp.command in config), Voxray registers discovered tools with the OpenAI service, and the model can invoke them during a turn.
Tool calls are streamed, accumulated across chunks, executed in index order, and the results are appended to the conversation before a recursive Chat call completes the response. This is fully transparent to the rest of the pipeline.
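The accumulate-then-order step can be sketched as follows. This is a minimal illustration under stated assumptions, not Voxray's implementation: toolCallDelta, toolCall, and accumulate are hypothetical names, and the fragment shape (an index, a name that arrives once, and partial argument text) is assumed from how streamed tool calls are commonly delivered.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toolCallDelta is a hypothetical shape for one streamed fragment.
type toolCallDelta struct {
	Index     int    // which tool call this fragment belongs to
	Name      string // assumed to arrive only on the first fragment
	Arguments string // partial JSON text to be concatenated
}

// toolCall is the fully assembled call, ready for execution.
type toolCall struct {
	Index     int
	Name      string
	Arguments string
}

// accumulate merges fragments per index and returns the calls in index order.
func accumulate(deltas []toolCallDelta) []toolCall {
	type partial struct {
		name string
		args strings.Builder
	}
	byIndex := map[int]*partial{}
	for _, d := range deltas {
		p, ok := byIndex[d.Index]
		if !ok {
			p = &partial{}
			byIndex[d.Index] = p
		}
		if d.Name != "" {
			p.name = d.Name
		}
		p.args.WriteString(d.Arguments)
	}
	calls := make([]toolCall, 0, len(byIndex))
	for i, p := range byIndex {
		calls = append(calls, toolCall{Index: i, Name: p.name, Arguments: p.args.String()})
	}
	// Map iteration order is random, so sort explicitly by index.
	sort.Slice(calls, func(a, b int) bool { return calls[a].Index < calls[b].Index })
	return calls
}

func main() {
	// Fragments for two calls arrive interleaved and out of order.
	deltas := []toolCallDelta{
		{Index: 1, Name: "get_time", Arguments: `{"tz":`},
		{Index: 0, Name: "get_weather", Arguments: `{"city":`},
		{Index: 0, Arguments: `"Oslo"}`},
		{Index: 1, Arguments: `"UTC"}`},
	}
	for _, c := range accumulate(deltas) {
		fmt.Printf("%d %s %s\n", c.Index, c.Name, c.Arguments)
	}
}
```

In a full loop, each assembled call would be executed, its result appended to the conversation, and Chat invoked again; that part is omitted here because it depends on Voxray's internal types.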
Realtime Mode
OpenAI Realtime runs a persistent bidirectional audio session using gpt-4o-realtime-preview. Instead of passing through the STT → LLM → TTS pipeline, audio is streamed directly to OpenAI and responses arrive as audio, which significantly reduces first-audio latency.
Use realtime.NewFromConfig(cfg, "openai") in code; avoid importing the realtime package directly from services, which would create an import cycle. Realtime mode requires a voice-enabled build (CGO_ENABLED=1) when using WebRTC transport.
Configuration Reference
| Key | Type | Description |
|---|---|---|
| stt_provider | string | Set to "openai" |
| stt_model | string | STT model (e.g. "gpt-4o-mini-transcribe") |
| llm_provider | string | Set to "openai" |
| model | string | LLM chat model (e.g. "gpt-4o-mini") |
| tts_provider | string | Set to "openai" |
| tts_voice | string | TTS voice name (e.g. "nova") |
| tts_model | string | TTS model (e.g. "tts-1" or "tts-1-hd") |
| api_keys.openai | string | OpenAI API key (falls back to OPENAI_API_KEY) |
| provider | string | Set to "openai" for Realtime mode (replaces llm_provider) |
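To show how these keys might fit together, here is a minimal sketch that parses a pipeline-mode config.json into a Go struct. The Config type, parseConfig, and the sample values are hypothetical; the key names come from the table, and the assumption that api_keys.openai is a nested JSON object is inferred from the dotted key name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical struct mirroring the documented keys.
// It is not Voxray's actual config type.
type Config struct {
	STTProvider string            `json:"stt_provider"`
	STTModel    string            `json:"stt_model"`
	LLMProvider string            `json:"llm_provider"`
	Model       string            `json:"model"`
	TTSProvider string            `json:"tts_provider"`
	TTSVoice    string            `json:"tts_voice"`
	TTSModel    string            `json:"tts_model"`
	APIKeys     map[string]string `json:"api_keys"` // assumed nested object
	Provider    string            `json:"provider"` // set only for Realtime mode
}

// sampleConfig is an illustrative config.json for the STT -> LLM -> TTS
// pipeline; the key value is a placeholder, not a real credential.
const sampleConfig = `{
  "stt_provider": "openai",
  "stt_model": "gpt-4o-mini-transcribe",
  "llm_provider": "openai",
  "model": "gpt-4o-mini",
  "tts_provider": "openai",
  "tts_voice": "nova",
  "tts_model": "tts-1",
  "api_keys": { "openai": "sk-example" }
}`

// parseConfig decodes raw JSON into a Config.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("LLM:", cfg.LLMProvider, cfg.Model)
	fmt.Println("STT:", cfg.STTProvider, cfg.STTModel)
	fmt.Println("TTS:", cfg.TTSProvider, cfg.TTSVoice, cfg.TTSModel)
}
```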
Notes and Limitations
- The factory treats "openai" as the default LLM and STT provider. If provider, stt_provider, llm_provider, and tts_provider are all unset, Voxray falls back to OpenAI for all three stages.
- tts-1-hd produces higher-quality audio at the cost of higher latency; tts-1 is recommended for real-time voice.
- Tool calling requires llm_provider: "openai"; it is not available when using the Realtime session path.
- Realtime mode and the STT→LLM→TTS pipeline are mutually exclusive per session. Choose one via config.
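The fallback and exclusivity rules in these notes can be sketched as follows. This is a minimal illustration, not Voxray's factory code: settings, applyDefaults, and openAIToolCallingAvailable are hypothetical names. The sketch assumes a non-empty provider value signals Realtime mode, and it models only the documented all-unset fallback plus the gpt-3.5-turbo and whisper-1 defaults.

```go
package main

import "fmt"

// settings is a hypothetical subset of the documented config keys.
type settings struct {
	Provider    string // assumed: non-empty means Realtime mode
	STTProvider string
	LLMProvider string
	TTSProvider string
	Model       string
	STTModel    string
}

// applyDefaults models only the fallbacks the documentation states.
// Any per-stage fallback beyond the all-unset case is not modeled.
func applyDefaults(s settings) settings {
	allUnset := s.Provider == "" && s.STTProvider == "" &&
		s.LLMProvider == "" && s.TTSProvider == ""
	if allUnset {
		s.STTProvider, s.LLMProvider, s.TTSProvider = "openai", "openai", "openai"
	}
	if s.LLMProvider == "openai" && s.Model == "" {
		s.Model = "gpt-3.5-turbo"
	}
	if s.STTProvider == "openai" && s.STTModel == "" {
		s.STTModel = "whisper-1"
	}
	return s
}

// openAIToolCallingAvailable reports whether the OpenAI tool-calling path
// applies: pipeline mode with llm_provider set to "openai".
func openAIToolCallingAvailable(s settings) bool {
	return s.Provider == "" && s.LLMProvider == "openai"
}

func main() {
	pipeline := applyDefaults(settings{})
	fmt.Printf("%+v tools=%v\n", pipeline, openAIToolCallingAvailable(pipeline))

	realtime := applyDefaults(settings{Provider: "openai"})
	fmt.Printf("%+v tools=%v\n", realtime, openAIToolCallingAvailable(realtime))
}
```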