Overview

Voxray supports a large and growing set of AI providers for speech-to-text (STT), large language model (LLM), text-to-speech (TTS), and real-time pipeline tasks. Each task is configured independently, so you can mix and match providers freely (e.g. Groq for STT, Anthropic for LLM, ElevenLabs for TTS). Select a provider by setting its provider key in provider, stt_provider, llm_provider, or tts_provider in your config.json. Supply its API key through the api_keys map or the corresponding environment variable listed below.
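
As an illustration of mixing providers per task, the sketch below builds such a config in Go and prints it as JSON. The JSON key names (provider, stt_provider, llm_provider, tts_provider, api_keys) come from this page. The Config struct, the shape of the api_keys map (provider key to key string), the placeholder key values, and the rule that a task-specific field overrides the global provider field are assumptions made for the example, not Voxray's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical stand-in for part of config.json.
// Only the JSON key names are taken from the documentation.
type Config struct {
	Provider    string            `json:"provider,omitempty"`
	STTProvider string            `json:"stt_provider,omitempty"`
	LLMProvider string            `json:"llm_provider,omitempty"`
	TTSProvider string            `json:"tts_provider,omitempty"`
	APIKeys     map[string]string `json:"api_keys,omitempty"`
}

// providerFor returns the provider key for a task ("stt", "llm", or "tts").
// It prefers the task-specific field and falls back to the global provider
// field; that precedence is an assumption of this sketch.
func (c Config) providerFor(task string) string {
	var specific string
	switch task {
	case "stt":
		specific = c.STTProvider
	case "llm":
		specific = c.LLMProvider
	case "tts":
		specific = c.TTSProvider
	}
	if specific != "" {
		return specific
	}
	return c.Provider
}

// mixedConfig mirrors the mix-and-match example from the overview:
// Groq for STT, Anthropic for LLM, ElevenLabs for TTS.
func mixedConfig() Config {
	return Config{
		STTProvider: "groq",
		LLMProvider: "anthropic",
		TTSProvider: "elevenlabs",
		APIKeys: map[string]string{ // placeholder values, not real keys
			"groq":       "YOUR_GROQ_API_KEY",
			"anthropic":  "YOUR_ANTHROPIC_API_KEY",
			"elevenlabs": "YOUR_ELEVENLABS_API_KEY",
		},
	}
}

func main() {
	out, err := json.MarshalIndent(mixedConfig(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Running the program prints a JSON object that could serve as a starting point for the provider-related part of config.json.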

Capability Matrix

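A provider is marked Yes for a task when it appears in the corresponding provider section below.

| Provider | Provider Key | STT | LLM | TTS | Realtime | API Key Env Var |
|---|---|---|---|---|---|---|
| OpenAI | openai | Yes | Yes | Yes | Yes | OPENAI_API_KEY |
| Groq | groq | Yes | Yes | Yes | - | GROQ_API_KEY |
| Sarvam | sarvam | Yes | - | Yes | - | SARVAM_API_KEY |
| ElevenLabs | elevenlabs | Yes | - | Yes | - | ELEVENLABS_API_KEY |
| AWS (Bedrock + Transcribe + Polly) | aws | Yes | Yes | Yes | - | AWS_SECRET_ACCESS_KEY |
| Google (Gemini + Cloud Speech + Cloud TTS) | google | Yes | Yes | Yes | - | GOOGLE_API_KEY |
| Google Vertex AI | google_vertex | - | Yes | - | - | ADC (no key) |
| Anthropic | anthropic | - | Yes | - | - | ANTHROPIC_API_KEY |
| Grok (xAI) | grok | - | Yes | - | - | XAI_API_KEY |
| Cerebras | cerebras | - | Yes | - | - | CEREBRAS_API_KEY |
| Mistral | mistral | - | Yes | - | - | MISTRAL_API_KEY |
| DeepSeek | deepseek | - | Yes | - | - | DEEPSEEK_API_KEY |
| Ollama | ollama | - | Yes | - | - | OLLAMA_API_KEY (optional) |
| Qwen (Alibaba) | qwen | - | Yes | - | - | DASHSCOPE_API_KEY / QWEN_API_KEY |
| Whisper (self-hosted) | whisper | Yes | - | - | - | WHISPER_API_KEY / OPENAI_API_KEY |
| Hume | hume | - | - | Yes | Yes | HUME_API_KEY |
| Inworld | inworld | - | Yes | Yes | Yes | INWORLD_API_KEY |
| Minimax | minimax | - | Yes | Yes | - | MINIMAX_API_KEY |
| Neuphonic | neuphonic | - | - | Yes | - | NEUPHONIC_API_KEY |
| XTTS (self-hosted) | xtts | - | - | Yes | - | XTTS_API_KEY (optional) |
| AsyncAI | asyncai | - | Yes | - | - | ASYNC_AI_API_KEY |
| Camb | camb | Yes | - | - | - | CAMB_API_KEY |
| Fish | fish | - | Yes | - | - | FISH_API_KEY |
| Gradium | gradium | Yes | - | - | - | GRADIUM_API_KEY |
| Moondream | moondream | - | Yes | - | - | MOONDREAM_API_KEY |
| OpenPipe | openpipe | - | Yes | - | - | OPENPIPE_API_KEY |
| Soniox | soniox | Yes | - | - | - | SONIOX_API_KEY |

Google Vertex AI uses Application Default Credentials (ADC), so no API key field is required. Set GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION instead.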
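
The two ways of supplying an API key (the api_keys map and an environment variable) can be sketched as a small lookup. The environment variable names below are copied from the matrix for a handful of providers. The function name resolveAPIKey, the choice to check api_keys before the environment, and the order in which alternative variables are tried are assumptions of this sketch rather than documented Voxray behaviour.

```go
package main

import (
	"fmt"
	"os"
)

// envVars maps a provider key to its environment variable name(s), as listed
// in the capability matrix. google_vertex is absent because it uses ADC.
var envVars = map[string][]string{
	"openai":    {"OPENAI_API_KEY"},
	"groq":      {"GROQ_API_KEY"},
	"anthropic": {"ANTHROPIC_API_KEY"},
	"grok":      {"XAI_API_KEY"},
	"qwen":      {"DASHSCOPE_API_KEY", "QWEN_API_KEY"},
	"whisper":   {"WHISPER_API_KEY", "OPENAI_API_KEY"},
}

// resolveAPIKey looks in the api_keys map first and then in the environment.
// That precedence is an assumption made for illustration. getenv is injected
// so the function can be exercised without touching the real environment.
func resolveAPIKey(provider string, apiKeys map[string]string, getenv func(string) string) (string, bool) {
	if k := apiKeys[provider]; k != "" {
		return k, true
	}
	for _, name := range envVars[provider] {
		if v := getenv(name); v != "" {
			return v, true
		}
	}
	return "", false
}

func main() {
	key, ok := resolveAPIKey("qwen", nil, os.Getenv)
	fmt.Println("qwen key found:", ok, "length:", len(key))
}
```

Passing os.Getenv reads the real environment; a map-backed function can be substituted when checking the lookup in isolation.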

LLM Providers

All LLM providers are configured via llm_provider (or the global provider field). The model field selects the specific chat model. When model is empty, the default shown below is used.

| Provider | Provider Key | Default Model | Notes |
|---|---|---|---|
| OpenAI | openai | gpt-3.5-turbo | Supports any OpenAI chat model (GPT-4o, GPT-4 Turbo, etc.) |
| Groq | groq | llama-3.1-8b-instant | Groq-hosted Llama and Mixtral models; very low latency |
| Anthropic | anthropic | claude-3-sonnet-20240229 | All Claude 3/3.5 models supported |
| Google | google | gemini-2.5-flash | Gemini Flash and Pro families |
| Google Vertex AI | google_vertex | gemini-2.5-flash | Same Gemini models via Vertex; requires GOOGLE_CLOUD_PROJECT |
| Grok (xAI) | grok | grok-2 | xAI Grok models |
| Cerebras | cerebras | llama3.1-8b | Cerebras Wafer-Scale Engine; Llama-based models |
| Mistral | mistral | mistral-small-latest | Mistral Small, Medium, Large, Codestral |
| DeepSeek | deepseek | deepseek-chat | DeepSeek Chat and Reasoner models |
| AWS Bedrock | aws | anthropic.claude-3-haiku-20240307-v1:0 | Any Bedrock-supported model ID; uses Converse API |
| Ollama | ollama | llama3.2 | Local inference; any model pulled via ollama pull |
| Qwen | qwen | qwen-plus | Alibaba Cloud DashScope; Qwen-Plus, Turbo, Max families |
| AsyncAI | asyncai | (provider default) | Async AI platform |
| Fish | fish | (provider default) | Fish Audio models |
| Inworld | inworld | (provider default) | Inworld character AI |
| Minimax | minimax | (provider default) | Minimax conversation models |
| Moondream | moondream | (provider default) | Moondream vision-language model |
| OpenPipe | openpipe | (provider default) | OpenPipe fine-tuned model routing |

For the lowest LLM latency in production, use groq with llama-3.1-8b-instant or cerebras with llama3.1-8b. For the highest quality, use anthropic with Claude 3.5 Sonnet or openai with GPT-4o.
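
The empty-model fallback described above can be sketched as a lookup over the defaults in the table. The function name effectiveModel and the map are hypothetical. Providers listed with "(provider default)" are left out of the map, so the sketch returns an empty string for them and leaves the choice to the provider.

```go
package main

import "fmt"

// defaultLLMModels copies the explicit defaults from the LLM providers table.
var defaultLLMModels = map[string]string{
	"openai":        "gpt-3.5-turbo",
	"groq":          "llama-3.1-8b-instant",
	"anthropic":     "claude-3-sonnet-20240229",
	"google":        "gemini-2.5-flash",
	"google_vertex": "gemini-2.5-flash",
	"grok":          "grok-2",
	"cerebras":      "llama3.1-8b",
	"mistral":       "mistral-small-latest",
	"deepseek":      "deepseek-chat",
	"aws":           "anthropic.claude-3-haiku-20240307-v1:0",
	"ollama":        "llama3.2",
	"qwen":          "qwen-plus",
}

// effectiveModel returns the configured model, or the table default when the
// model field is empty. An empty result means "let the provider decide".
func effectiveModel(provider, model string) string {
	if model != "" {
		return model
	}
	return defaultLLMModels[provider]
}

func main() {
	fmt.Println(effectiveModel("groq", ""))         // llama-3.1-8b-instant
	fmt.Println(effectiveModel("openai", "gpt-4o")) // gpt-4o
}
```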

STT Providers

All STT providers are configured via stt_provider (or the global provider field). Use stt_model to select a specific transcription model and stt_language to pin a language code.

| Provider | Provider Key | Default Model | Streaming | Notes |
|---|---|---|---|---|
| OpenAI | openai | whisper-1 | No | File-based upload; returns full transcript |
| Groq | groq | whisper-large-v3 | No | Groq-accelerated Whisper; very fast batch |
| Sarvam | sarvam | saarika:v2.5 | Yes | Streaming STT (stt_streaming.go); Indian language support; saaras:v3 also available |
| ElevenLabs | elevenlabs | (provider default) | No | ElevenLabs speech recognition endpoint |
| AWS | aws | (region default) | No | AWS Transcribe Streaming; region from AWS_REGION or aws_region config |
| Google | google | (project default) | No | Google Cloud Speech-to-Text v2; requires GOOGLE_CLOUD_PROJECT |
| Whisper (self-hosted) | whisper | (server default) | No | Compatible with OpenAI Whisper server; set WHISPER_BASE_URL |
| Camb | camb | (model config) | No | Camb.ai STT; set CAMB_BASE_URL |
| Gradium | gradium | (model config) | No | Gradium STT; set GRADIUM_BASE_URL |
| Soniox | soniox | stt-rt-v4 | Yes | WebSocket-based real-time transcription via wss://stt-rt.soniox.com; set SONIOX_WS_URL to override |

Sarvam and Soniox use WebSocket-based streaming internally to reduce first-word latency. All other STT providers use single-call batch transcription, in which audio is buffered per turn before it is sent.
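
The streaming versus batch split, and the Soniox endpoint override, can be sketched as follows. The provider keys, the default URL, and the SONIOX_WS_URL variable come from the table above. The function names sttMode and sonioxURL are hypothetical and only illustrate the decision, not how Voxray implements it.

```go
package main

import (
	"fmt"
	"os"
)

// streamingSTT holds the provider keys that this page describes as using
// WebSocket-based streaming; every other STT provider is treated as batch.
var streamingSTT = map[string]bool{"sarvam": true, "soniox": true}

// sttMode reports "streaming" or "batch" for an STT provider key.
func sttMode(provider string) string {
	if streamingSTT[provider] {
		return "streaming"
	}
	return "batch"
}

// defaultSonioxURL is the endpoint given in the STT providers table.
const defaultSonioxURL = "wss://stt-rt.soniox.com"

// sonioxURL returns SONIOX_WS_URL when it is set, otherwise the default.
// getenv is injected so the lookup can be checked without the real environment.
func sonioxURL(getenv func(string) string) string {
	if u := getenv("SONIOX_WS_URL"); u != "" {
		return u
	}
	return defaultSonioxURL
}

func main() {
	for _, p := range []string{"sarvam", "soniox", "groq", "openai"} {
		fmt.Printf("%s: %s\n", p, sttMode(p))
	}
	fmt.Println("soniox endpoint:", sonioxURL(os.Getenv))
}
```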

TTS Providers

All TTS providers are configured via tts_provider (or the global provider field). Use tts_model to select a TTS model and tts_voice to select a voice/speaker ID.

| Provider | Provider Key | Default Model | Default Voice | Notes |
|---|---|---|---|---|
| OpenAI | openai | tts-1 | (API default) | Voices: alloy, echo, fable, nova, onyx, shimmer; tts-1-hd for higher quality |
| Groq | groq | canopylabs/orpheus-v1-english | alloy | Groq TTS returns WAV at 48 kHz; decoded to raw PCM |
| Sarvam | sarvam | bulbul:v2 | anushka | Supports streaming (tts_streaming.go); Indian language voices; set stt_language for locale |
| ElevenLabs | elevenlabs | eleven_multilingual_v2 | (required voice ID) | tts_voice must be a valid ElevenLabs voice ID; multilingual v2 recommended |
| AWS Polly | aws | (region default) | Joanna | Any Polly voice ID; region from AWS_REGION |
| Google Cloud TTS | google | (project default) | (API default) | Language set via stt_language (default en-US) |
| Hume | hume | | (voice config) | Hume AI expressive TTS |
| Inworld | inworld | | (voice config) | Inworld character voices |
| Minimax | minimax | speech-01 | (voice config) | Set MINIMAX_BASE_URL for custom endpoint |
| Neuphonic | neuphonic | | (voice config) | Multilingual; stt_language sets locale (default en); set NEUPHONIC_BASE_URL |
| XTTS (self-hosted) | xtts | | (voice config) | Coqui XTTS self-hosted; set XTTS_BASE_URL; stt_language sets language (default en) |
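
The voice defaults in the table suggest a simple resolution step: use tts_voice when it is set, reject an empty voice for ElevenLabs, and otherwise fall back to the listed default. The sketch below assumes exactly that. The names resolveTTSVoice and errVoiceRequired are hypothetical, and whether Voxray reports a missing ElevenLabs voice in this way is not stated on this page.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultTTSVoices copies the explicit default voices from the TTS table.
// Providers whose default is "(API default)" or "(voice config)" are omitted.
var defaultTTSVoices = map[string]string{
	"groq":   "alloy",
	"sarvam": "anushka",
	"aws":    "Joanna",
}

// errVoiceRequired signals that ElevenLabs needs an explicit voice ID.
var errVoiceRequired = errors.New("tts_voice is required for elevenlabs")

// resolveTTSVoice returns the voice to use for a provider. An empty result
// with a nil error means the provider's own default applies.
func resolveTTSVoice(provider, voice string) (string, error) {
	if voice != "" {
		return voice, nil
	}
	if provider == "elevenlabs" {
		return "", errVoiceRequired
	}
	return defaultTTSVoices[provider], nil
}

func main() {
	for _, p := range []string{"groq", "aws", "openai", "elevenlabs"} {
		voice, err := resolveTTSVoice(p, "")
		fmt.Printf("%s: voice=%q err=%v\n", p, voice, err)
	}
}
```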

Realtime Providers

Realtime providers bypass the standard STT → LLM → TTS pipeline and handle the full voice session end-to-end, including VAD, turn detection, and audio I/O. Construct them with realtime.NewFromConfig(cfg, provider), not NewServicesFromConfig.

OpenAI Realtime

  • Provider key: openai
  • API key env var: OPENAI_API_KEY
  • Full duplex real-time audio via the OpenAI Realtime API (WebSocket).
  • Handles VAD, interruptions, and function calling natively.
  • Recommended for the lowest end-to-end latency when using OpenAI models.

Hume

  • Provider key: hume
  • API key env var: HUME_API_KEY
  • Hume AI empathic voice interface; real-time emotional intelligence in the voice pipeline.
  • Also available as a TTS-only provider in the standard pipeline.

Inworld

  • Provider key: inworld
  • API key env var: INWORLD_API_KEY
  • Inworld AI character engine; real-time character dialogue with LLM and TTS bundled.
  • Also available as LLM and TTS providers in the standard pipeline.

Daily.co and LiveKit are supported as transport providers via runner_transport ("daily", "livekit") rather than as realtime AI providers. They handle WebRTC room management and media routing, while the STT/LLM/TTS pipeline still runs inside Voxray.
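
The distinction between realtime providers, the standard pipeline, and transports can be sketched as a small decision helper. It returns constructor names as strings because this page does not show their full signatures. The helper names chooseConstructor and isTransport and the wantRealtime flag are hypothetical, and how a realtime session is actually requested in Voxray's config is not covered here.

```go
package main

import "fmt"

// realtimeProviders lists the provider keys from the Realtime Providers section.
var realtimeProviders = map[string]bool{"openai": true, "hume": true, "inworld": true}

// transports lists the runner_transport values named on this page. They are
// transports, not AI providers.
var transports = map[string]bool{"daily": true, "livekit": true}

// chooseConstructor names the constructor to call for a provider. For a
// realtime session the provider must be one of the realtime providers.
func chooseConstructor(provider string, wantRealtime bool) (string, error) {
	if !wantRealtime {
		return "NewServicesFromConfig", nil
	}
	if !realtimeProviders[provider] {
		return "", fmt.Errorf("provider %q is not listed as a realtime provider", provider)
	}
	return "realtime.NewFromConfig", nil
}

// isTransport reports whether a name is one of the listed runner_transport values.
func isTransport(name string) bool {
	return transports[name]
}

func main() {
	for _, p := range []string{"openai", "groq"} {
		name, err := chooseConstructor(p, true)
		fmt.Printf("%s realtime: %q err=%v\n", p, name, err)
	}
	fmt.Println("livekit is a transport:", isTransport("livekit"))
}
```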