
Capabilities

STT

Whisper and GPT-4o transcription models

LLM

GPT-4o, GPT-4.1, and GPT-3.5 family with full tool-calling support

TTS

Six neural voices via the OpenAI audio API

Realtime

Bidirectional audio session via gpt-4o-realtime-preview

API Key

Set OPENAI_API_KEY as an environment variable or pass it inline under api_keys in config.json. Get your key at platform.openai.com/api-keys.
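
The lookup order described here (inline key first, environment variable as the fallback) can be sketched in Go as follows. This is a minimal illustration and not Voxray's actual code: the helper name resolveOpenAIKey and its signature are hypothetical, and the environment lookup is passed in as a function so the logic is easy to test.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveOpenAIKey is a hypothetical helper (not part of Voxray's API).
// It prefers the inline key under api_keys.openai and falls back to the
// OPENAI_API_KEY environment variable, looked up through getenv.
func resolveOpenAIKey(apiKeys map[string]string, getenv func(string) string) (string, error) {
	// Reading from a nil map is safe in Go and yields the empty string.
	if key := apiKeys["openai"]; key != "" {
		return key, nil
	}
	if key := getenv("OPENAI_API_KEY"); key != "" {
		return key, nil
	}
	return "", errors.New("no OpenAI API key: set api_keys.openai or OPENAI_API_KEY")
}

func main() {
	// No inline key here, so the result depends on the environment.
	key, err := resolveOpenAIKey(map[string]string{}, os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("resolved a key of length", len(key))
}
```

Keeping the key in the environment avoids committing secrets alongside config.json; the inline form is mainly convenient for local experiments.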

Quick Config

{
  "stt_provider": "openai",
  "stt_model": "gpt-4o-mini-transcribe",
  "llm_provider": "openai",
  "model": "gpt-4o-mini",
  "tts_provider": "openai",
  "tts_voice": "nova",
  "api_keys": {
    "openai": "sk-..."
  }
}

Available LLM Models

| Model | Context | Description |
|---|---|---|
| gpt-4o | 128k | Flagship multimodal model; highest reasoning quality in the GPT-4o family |
| gpt-4o-mini | 128k | Fast and cost-efficient; recommended default for most voice agent workloads |
| gpt-4.1 | 1M | Long-context reasoning model with improved instruction following |
| gpt-4.1-mini | 1M | Smaller variant of GPT-4.1; better throughput than gpt-4.1 at lower cost |
| gpt-4.1-nano | 1M | Ultra-light; optimized for latency-sensitive tasks with simple instructions |

The factory default when model is empty and llm_provider is "openai" is gpt-3.5-turbo.

Available STT Models

| Model | Description |
|---|---|
| whisper-1 | Original Whisper model hosted by OpenAI; broad language support |
| gpt-4o-mini-transcribe | GPT-4o Mini transcription; faster and cheaper than gpt-4o-transcribe (recommended default) |
| gpt-4o-transcribe | Highest accuracy transcription; use when word-error-rate matters more than latency |

Set via stt_model in config. The default STT service (stt.NewOpenAI) uses whisper-1 when stt_model is not specified.
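
The two documented fallbacks (gpt-3.5-turbo for the LLM and whisper-1 for STT) can be sketched as below. This is an illustration only: the modelConfig struct and both helper functions are hypothetical names and do not come from Voxray's source; only the JSON keys and default model names are taken from this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelConfig is a hypothetical subset of config.json, not Voxray's own type.
type modelConfig struct {
	STTModel    string `json:"stt_model"`
	LLMProvider string `json:"llm_provider"`
	Model       string `json:"model"`
}

// effectiveLLMModel applies the documented fallback: an empty model with
// llm_provider "openai" resolves to gpt-3.5-turbo.
func effectiveLLMModel(c modelConfig) string {
	if c.Model == "" && c.LLMProvider == "openai" {
		return "gpt-3.5-turbo"
	}
	return c.Model
}

// effectiveSTTModel applies the documented fallback: an empty stt_model
// resolves to whisper-1.
func effectiveSTTModel(c modelConfig) string {
	if c.STTModel == "" {
		return "whisper-1"
	}
	return c.STTModel
}

func main() {
	// A config that names the provider but leaves both models unset.
	raw := []byte(`{"llm_provider": "openai"}`)
	var c modelConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("LLM model:", effectiveLLMModel(c))
	fmt.Println("STT model:", effectiveSTTModel(c))
}
```

Because the fallbacks are older models than the recommended gpt-4o-mini and gpt-4o-mini-transcribe, setting model and stt_model explicitly is usually the safer choice.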

TTS Voices

| Voice | Character |
|---|---|
| alloy | Neutral, balanced; works well for informational agents |
| echo | Deep and resonant; suits formal or authoritative personas |
| fable | Warm and expressive; good for storytelling or friendly agents |
| onyx | Clear and confident; popular for customer-service workloads |
| nova | Bright and energetic; recommended default for most voice agents |
| shimmer | Soft and calm; suits wellness, meditation, or low-stress contexts |

Set via tts_voice in config. The TTS model (e.g. tts-1, tts-1-hd) can be set via tts_model.

Tool Calling (MCP)

OpenAI is one of two providers in Voxray that implement LLMServiceWithTools. When an MCP server is configured (mcp.command in config), Voxray registers discovered tools with the OpenAI service, and the model can invoke them during a turn. Tool calls are streamed, accumulated across chunks, executed in index order, and the results are appended to the conversation before a recursive Chat call completes the response. This is fully transparent to the rest of the pipeline.
{
  "llm_provider": "openai",
  "model": "gpt-4o-mini",
  "mcp": {
    "command": "npx",
    "args": ["-y", "@modelcontextprotocol/server-everything"],
    "tools_filter": ["read_file", "list_directory"]
  }
}
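
The accumulate-then-execute step described above can be sketched in Go. This is not Voxray's implementation: the toolCallDelta and toolCall types and the accumulate function are hypothetical, modeled on the way OpenAI streams tool calls as fragments that carry an index, with the argument JSON split across chunks.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toolCallDelta is a hypothetical shape for one streamed fragment.
// ID and Name typically arrive once; Args arrives in pieces.
type toolCallDelta struct {
	Index int
	ID    string
	Name  string
	Args  string
}

// toolCall is a fully assembled call, ready to execute.
type toolCall struct {
	Index int
	ID    string
	Name  string
	Args  string
}

// accumulate merges fragments that share an index and returns the
// assembled calls sorted by index, the order in which they would run.
func accumulate(deltas []toolCallDelta) []toolCall {
	type partial struct {
		id, name string
		args     strings.Builder
	}
	byIndex := map[int]*partial{}
	for _, d := range deltas {
		p, ok := byIndex[d.Index]
		if !ok {
			p = &partial{}
			byIndex[d.Index] = p
		}
		if d.ID != "" {
			p.id = d.ID
		}
		if d.Name != "" {
			p.name = d.Name
		}
		p.args.WriteString(d.Args) // argument JSON is concatenated in arrival order
	}
	calls := make([]toolCall, 0, len(byIndex))
	for i, p := range byIndex {
		calls = append(calls, toolCall{Index: i, ID: p.id, Name: p.name, Args: p.args.String()})
	}
	sort.Slice(calls, func(a, b int) bool { return calls[a].Index < calls[b].Index })
	return calls
}

func main() {
	calls := accumulate([]toolCallDelta{
		{Index: 0, ID: "call_a", Name: "read_file", Args: `{"pa`},
		{Index: 0, Args: `th":"a.txt"}`},
		{Index: 1, ID: "call_b", Name: "list_directory", Args: `{"path":"."}`},
	})
	for _, c := range calls {
		fmt.Printf("%d %s %s\n", c.Index, c.Name, c.Args)
	}
}
```

Arguments are only valid JSON once the stream has finished, so a sketch like this would parse and execute them after accumulation rather than per chunk.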

Realtime Mode

OpenAI Realtime runs a persistent bidirectional audio session using gpt-4o-realtime-preview. Instead of passing through the STT → LLM → TTS pipeline, audio is streamed directly to OpenAI and responses arrive as audio, which removes two network hops and reduces first-audio latency.
{
  "provider": "openai",
  "model": "gpt-4o-realtime-preview",
  "api_keys": {
    "openai": "sk-..."
  }
}
Use realtime.NewFromConfig(cfg, "openai") in code; avoid importing the realtime package directly from services to prevent an import cycle. Realtime mode requires a voice-enabled build (CGO_ENABLED=1) when using WebRTC transport.

Configuration Reference

| Key | Type | Description |
|---|---|---|
| stt_provider | string | Set to "openai" |
| stt_model | string | STT model (e.g. "gpt-4o-mini-transcribe") |
| llm_provider | string | Set to "openai" |
| model | string | LLM chat model (e.g. "gpt-4o-mini") |
| tts_provider | string | Set to "openai" |
| tts_voice | string | TTS voice name (e.g. "nova") |
| tts_model | string | TTS model (e.g. "tts-1" or "tts-1-hd") |
| api_keys.openai | string | OpenAI API key (falls back to OPENAI_API_KEY) |
| provider | string | Set to "openai" for Realtime mode (replaces llm_provider) |

Notes and Limitations

  • The factory treats "openai" as the default LLM and STT provider. If provider, stt_provider, llm_provider, and tts_provider are all unset, Voxray falls back to OpenAI for all three stages.
  • tts-1-hd produces higher-quality audio at the cost of higher latency; tts-1 is recommended for real-time voice.
  • Tool calling requires llm_provider: "openai". It is not available when using the Realtime session path.
  • Realtime mode and the STT→LLM→TTS pipeline are mutually exclusive per session. Choose one via config.
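
The last two notes can be illustrated with a small Go sketch that classifies a config as Realtime or pipeline and reports whether MCP tools would be usable. It rests on an assumption that this page does not state outright: that a non-empty provider key is what selects Realtime mode. The sessionConfig and mcpConfig types and both functions are hypothetical and are not Voxray APIs.

```go
package main

import "fmt"

// mcpConfig and sessionConfig are hypothetical subsets of config.json.
type mcpConfig struct {
	Command string `json:"command"`
}

type sessionConfig struct {
	Provider    string     `json:"provider"`
	LLMProvider string     `json:"llm_provider"`
	MCP         *mcpConfig `json:"mcp"`
}

// sessionMode assumes that a non-empty provider selects the Realtime path
// and that anything else runs the STT -> LLM -> TTS pipeline.
func sessionMode(c sessionConfig) string {
	if c.Provider != "" {
		return "realtime"
	}
	return "pipeline"
}

// toolsUsable reports whether MCP tools could be invoked: only on the
// pipeline path, and only when an MCP command is configured.
func toolsUsable(c sessionConfig) bool {
	return sessionMode(c) == "pipeline" && c.MCP != nil && c.MCP.Command != ""
}

func main() {
	pipeline := sessionConfig{LLMProvider: "openai", MCP: &mcpConfig{Command: "npx"}}
	realtime := sessionConfig{Provider: "openai", MCP: &mcpConfig{Command: "npx"}}
	fmt.Println(sessionMode(pipeline), toolsUsable(pipeline))
	fmt.Println(sessionMode(realtime), toolsUsable(realtime))
}
```

A check like this at startup would surface a config that sets both provider and mcp, where the tools would otherwise be silently unused.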