Providers & Services

What are services?

Voxray organizes all AI calls — speech recognition, language model inference, and speech synthesis — behind three Go interfaces: STTService, LLMService, and TTSService. The rest of the pipeline (VAD, turn detection, transport, recording) never imports a provider SDK directly. It only calls methods on these interfaces. This means you can swap from OpenAI Whisper to Groq’s transcription endpoint, or from OpenAI GPT-4o to Anthropic Claude, by changing one or two keys in config.json. No code changes, no recompilation.

Provider selection, model names, and voices are all resolved at startup from config.json. After the factory constructs the service objects, every pipeline stage works identically regardless of which provider is underneath.

Service interfaces

These are the actual Go interfaces every provider implementation must satisfy. They live in pkg/services/interfaces.go and pkg/services/llmapi/api.go.

// STTService — batch transcription. The pipeline calls this for each audio segment.
type STTService interface {
    Transcribe(ctx context.Context, audio []byte, sampleRate, numChannels int) ([]*frames.TranscriptionFrame, error)
}

// STTStreamingService — extends STTService with streaming transcription.
// Providers that implement this send interim and final TranscriptionFrames as
// audio arrives on audioCh, without waiting for the full segment.
type STTStreamingService interface {
    STTService
    TranscribeStream(
        ctx context.Context,
        audioCh <-chan []byte,
        sampleRate, numChannels int,
        outCh chan<- frames.Frame,
    )
}

// TTSService — batch synthesis. Returns raw PCM audio frames from a complete text string.
type TTSService interface {
    Speak(ctx context.Context, text string, sampleRate int) ([]*frames.TTSAudioRawFrame, error)
}

// TTSStreamingService — extends TTSService with incremental audio output.
// Providers that implement this begin writing TTSAudioRawFrames to outCh
// before the full response is received, reducing time-to-first-audio.
type TTSStreamingService interface {
    TTSService
    SpeakStream(ctx context.Context, text string, sampleRate int, outCh chan<- frames.Frame)
}

// LLMService — chat completion with optional token streaming.
// The onToken callback receives each LLMTextFrame as tokens arrive.
type LLMService interface {
    Chat(ctx context.Context, messages []map[string]any, onToken func(*frames.LLMTextFrame)) error
}

// LLMServiceWithTools — extends LLMService with function/tool calling.
// Registered tools are forwarded to the LLM in its next chat request.
// Tool handlers are invoked synchronously during streaming.
type LLMServiceWithTools interface {
    LLMService
    RegisterTool(schema schemas.FunctionSchema, handler ToolHandler)
    ToolsSchema() *schemas.ToolsSchema
}

// ToolHandler — called when the LLM emits a tool call.
// Returns the result string to feed back into the conversation.
type ToolHandler func(ctx context.Context, toolName string, arguments map[string]any) (string, error)

Realtime sessions

Some providers (OpenAI Realtime, Hume, Inworld) expose a single bidirectional session that merges STT, LLM, and TTS into one persistent connection, rather than three separate request/response calls. Voxray models this with a separate RealtimeSession interface.

// RealtimeSession — long-lived bidirectional AI conversation.
// Audio and text flow in; LLM text and TTS audio flow out over Events().
type RealtimeSession interface {
    SendText(ctx context.Context, text string) error
    SendAudio(ctx context.Context, audio []byte, sampleRate, numChannels int) error
    Events() <-chan RealtimeEvent
    Close(ctx context.Context) error
}

// RealtimeService — factory for realtime sessions.
type RealtimeService interface {
    NewSession(ctx context.Context, cfg RealtimeConfig) (RealtimeSession, error)
}

Realtime providers bypass the independent STT → LLM → TTS chain. Instead, RealtimeSession.Events() emits RealtimeEvent values that carry either an LLMTextFrame or a TTSAudioRawFrame, in whatever order the provider sends them. This lowers latency but removes the ability to mix providers (e.g. OpenAI Realtime STT + Anthropic LLM). Use realtime.NewFromConfig(cfg, provider) to construct realtime sessions; never call NewLLMFromConfig / NewSTTFromConfig / NewTTSFromConfig for realtime providers.

Factory functions

pkg/services/factory.go wires provider constants to concrete implementations. You never construct provider clients directly.

Function	Purpose
`NewLLMFromConfig(cfg, provider, model)`	Returns an `LLMService` for `provider` using `model`. Falls back to `gpt-3.5-turbo` if model is empty and provider is OpenAI.
`NewSTTFromConfig(cfg, provider)`	Returns an `STTService` for `provider`. Uses `cfg.STTModel` and `cfg.STTLanguage` where the provider supports them.
`NewTTSFromConfig(cfg, provider, model, voice)`	Returns a `TTSService` for `provider`. Voice defaults apply per-provider (e.g. `"Joanna"` for AWS Polly).
`NewServicesFromConfig(cfg)`	Convenience wrapper. Resolves `stt_provider`, `llm_provider`, `tts_provider` (or falls back to `provider`) and returns all three services at once.

NewServicesFromConfig is what the pipeline runner calls at startup. It applies the provider precedence rules so you don’t have to call the individual functions yourself unless you need fine-grained control.
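The precedence rule (a stage-specific key wins, otherwise the shared provider key applies) can be sketched in a few lines. The JSON keys come from the table above; the config struct, its Go field names, and the pick and resolve helpers are hypothetical and are not Voxray's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config holds only the four provider keys named above. The Go field names
// are assumptions made for this sketch.
type config struct {
	Provider    string `json:"provider"`
	STTProvider string `json:"stt_provider"`
	LLMProvider string `json:"llm_provider"`
	TTSProvider string `json:"tts_provider"`
}

// pick returns the stage-specific provider, or the shared fallback if unset.
func pick(specific, fallback string) string {
	if specific != "" {
		return specific
	}
	return fallback
}

// resolve returns the effective STT, LLM, and TTS provider keys.
func resolve(c config) (stt, llm, tts string) {
	return pick(c.STTProvider, c.Provider),
		pick(c.LLMProvider, c.Provider),
		pick(c.TTSProvider, c.Provider)
}

func main() {
	raw := []byte(`{"provider": "openai", "llm_provider": "anthropic"}`)
	var c config
	if err := json.Unmarshal(raw, &c); err != nil {
		panic(err)
	}
	stt, llm, tts := resolve(c)
	fmt.Println(stt, llm, tts)
}
```

With this input, only the LLM stage is overridden; STT and TTS inherit the shared provider value.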

API key resolution

For every provider, the factory calls cfg.GetAPIKey(serviceName, envVarName). The resolution order is:

1. config.api_keys[serviceName]: the value in the JSON config’s api_keys map.
2. Environment variable: the provider-specific env var (e.g. OPENAI_API_KEY).
3. Empty string: the service is constructed with an empty key. Most providers will return authentication errors at the first API call.

Never commit API keys to source control. In production, set them via environment variables or a secrets manager and leave api_keys values empty (or omit the keys entirely) in config.json.

Special cases:

Provider	Notes
`google_vertex`	Uses Application Default Credentials (ADC). No API key is looked up — configure via `gcloud auth application-default login` or a service account.
`qwen`	Checks `DASHSCOPE_API_KEY` first, then `QWEN_API_KEY`.
`whisper`	Checks `WHISPER_API_KEY` first, then falls back to `OPENAI_API_KEY`.
`aws`	Secret key is resolved for auth; region is read separately from `api_keys.aws_region` or `AWS_REGION` (defaults to `us-east-1`).
`google`	Project and location for Cloud Speech/TTS are resolved from `api_keys.google_cloud_project` / `GOOGLE_CLOUD_PROJECT` and `api_keys.google_cloud_location` / `GOOGLE_CLOUD_LOCATION` (defaults to `us-central1`).

Provider support matrix

The table below covers every provider registered in factory.go. The API Key Env Var column is the fallback environment variable if the key is not in api_keys.

Provider	Config Key	STT	LLM	TTS	Realtime	API Key Env Var
OpenAI	`openai`	Yes	Yes	Yes	Yes	`OPENAI_API_KEY`
Groq	`groq`	Yes	Yes	Yes	No	`GROQ_API_KEY`
Anthropic	`anthropic`	No	Yes	No	No	`ANTHROPIC_API_KEY`
Grok (xAI)	`grok`	No	Yes	No	No	`XAI_API_KEY`
Cerebras	`cerebras`	No	Yes	No	No	`CEREBRAS_API_KEY`
Mistral	`mistral`	No	Yes	No	No	`MISTRAL_API_KEY`
DeepSeek	`deepseek`	No	Yes	No	No	`DEEPSEEK_API_KEY`
AWS	`aws`	Yes	Yes	Yes	No	`AWS_SECRET_ACCESS_KEY`
Google	`google`	Yes	Yes	Yes	No	`GOOGLE_API_KEY`
Google Vertex	`google_vertex`	No	Yes	No	No	ADC (no key)
Ollama	`ollama`	No	Yes	No	No	`OLLAMA_API_KEY`
Qwen	`qwen`	No	Yes	No	No	`DASHSCOPE_API_KEY`
AsyncAI	`asyncai`	No	Yes	No	No	`ASYNC_AI_API_KEY`
Fish	`fish`	No	Yes	No	No	`FISH_API_KEY`
Inworld	`inworld`	No	Yes	Yes	Yes	`INWORLD_API_KEY`
Minimax	`minimax`	No	Yes	Yes	No	`MINIMAX_API_KEY`
Moondream	`moondream`	No	Yes	No	No	`MOONDREAM_API_KEY`
OpenPipe	`openpipe`	No	Yes	No	No	`OPENPIPE_API_KEY`
ElevenLabs	`elevenlabs`	Yes	No	Yes	No	`ELEVENLABS_API_KEY`
Sarvam	`sarvam`	Yes	No	Yes	No	`SARVAM_API_KEY`
Hume	`hume`	No	No	Yes	Yes	`HUME_API_KEY`
Neuphonic	`neuphonic`	No	No	Yes	No	`NEUPHONIC_API_KEY`
XTTS	`xtts`	No	No	Yes	No	`XTTS_API_KEY`
Whisper	`whisper`	Yes	No	No	No	`WHISPER_API_KEY`
Camb	`camb`	Yes	No	No	No	`CAMB_API_KEY`
Gradium	`gradium`	Yes	No	No	No	`GRADIUM_API_KEY`
Soniox	`soniox`	Yes	No	No	No	`SONIOX_API_KEY`

AWS STT uses Amazon Transcribe. AWS LLM uses Amazon Bedrock. AWS TTS uses Amazon Polly. All three share the same aws config key but use different regional API surfaces; the AWS SDK credential chain (env vars, ~/.aws/credentials, IAM role) applies independently of the aws entry in api_keys.

Realtime providers in depth

Three providers offer a RealtimeSession implementation, in addition to any standalone STT/LLM/TTS services they provide. Each session maintains a single persistent WebSocket or streaming RPC connection to the provider.

Provider	Session type	Notes
OpenAI Realtime	WebSocket (`gpt-4o-realtime-*`)	Low-latency bidirectional audio and text; supports function calling mid-session.
Hume	Streaming API	Emotion-aware voice; TTS is also available as a standalone service for non-realtime pipelines.
Inworld	Streaming RPC	Game/character-oriented; LLM and TTS are also independently available for standard pipelines.

When provider is set to one of these in the config, construct sessions with realtime.NewFromConfig(cfg, provider), not via the NewLLMFromConfig/NewSTTFromConfig/NewTTSFromConfig factories. The runner selects the realtime path automatically when a realtime provider is detected.

Request/response pipeline (standard):

AudioIn → VAD → STTService.Transcribe() → LLMService.Chat() → TTSService.Speak() → AudioOut

Realtime session pipeline:

AudioIn → RealtimeSession.SendAudio() ──► RealtimeSession.Events() → AudioOut/TextOut

The realtime path eliminates two round-trip boundaries (STT result → LLM, LLM token → TTS); removing them is the primary source of its latency reduction.
