Sarvam STT and TTS integration for Indian language voice agents
LLM: Not supported
STT: Supported (saarika:v2.5 and higher)
TTS: Supported (bulbul:v2 with multiple Indian voices)
Sarvam AI is an Indian AI company specializing in models built for Indian languages. Voxray supports Sarvam for both speech-to-text ("stt_provider": "sarvam") and text-to-speech ("tts_provider": "sarvam"). You can use either one on its own or both together. For example, pair Sarvam STT with Sarvam TTS for a fully Indian-language pipeline, or combine Sarvam TTS with a Groq LLM for a low-latency Hindi voice agent.
Sarvam does not provide LLM functionality. You must pair it with a supported LLM provider such as openai, anthropic, groq, or ollama.
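As a rough illustration of this pairing, the sketch below builds a configuration as a Python dictionary and prints it as JSON. Only field names mentioned on this page are used. The overall shape of config.json and the check_llm_provider helper are assumptions made for illustration, not Voxray's actual schema or API.

```python
import json

# Hypothetical minimal configuration. The keys are the field names quoted on
# this page; any other fields a real config.json needs are not shown here.
config = {
    "stt_provider": "sarvam",
    "tts_provider": "sarvam",
    "llm_provider": "groq",   # Sarvam has no LLM, so another provider fills this slot
    "stt_language": "hi-IN",
    "tts_voice": "anushka",
}


def check_llm_provider(cfg: dict) -> None:
    """Hypothetical helper: reject a config that tries to use Sarvam as the LLM."""
    if cfg.get("llm_provider") == "sarvam":
        raise ValueError("Sarvam does not provide an LLM; choose another llm_provider")


check_llm_provider(config)
print(json.dumps(config, indent=2))
```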
Voxray reads SARVAM_API_KEY automatically when no inline key is present under api_keys.sarvam.
Never commit your Sarvam API key to source control. Use the SARVAM_API_KEY environment variable in production and CI, and keep inline keys in config.json only for local development.
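The lookup order described above (inline key first, environment variable as the fallback) can be sketched as follows. The function name resolve_sarvam_key is hypothetical, and treating an empty inline string as "no key present" is an assumption, not confirmed Voxray behavior.

```python
import os
from typing import Mapping, Optional


def resolve_sarvam_key(
    config: Mapping, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the inline api_keys.sarvam value if set, else SARVAM_API_KEY.

    Hypothetical helper: it mirrors the documented precedence but is not
    Voxray's own code.
    """
    inline = (config.get("api_keys") or {}).get("sarvam")
    if inline:  # assumption: an empty string counts as "no inline key"
        return inline
    return env.get("SARVAM_API_KEY") or None
```

Passing the environment as a parameter keeps the helper easy to exercise without touching real credentials.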
Sarvam does not provide an LLM. The example configuration sets llm_provider to "groq"; swap in any other supported LLM provider (openai, anthropic, ollama, etc.) based on your needs.
Sarvam’s STT engine (saarika) is purpose-built for Indian languages and handles code-switching (mixing Hindi and English in the same utterance) better than general-purpose English-first models.
Set "stt_language" to a BCP-47 language code to pin the transcription language. When the field is empty or omitted, Sarvam performs automatic language detection.
Config Value       Language
hi-IN              Hindi (India), default in config.example.json
en-IN              English (India)
ta-IN              Tamil
te-IN              Telugu
kn-IN              Kannada
ml-IN              Malayalam
mr-IN              Marathi
gu-IN              Gujarati
bn-IN              Bengali
od-IN              Odia
pa-IN              Punjabi
(omit or empty)    Auto-detect
For the best accuracy, always set stt_language explicitly when you know the user’s language. Auto-detect adds a small amount of latency and may misidentify short utterances in closely related languages (e.g. Hindi vs Maithili).
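A small pre-flight check on stt_language can catch typos before a session starts. The sketch below uses the codes from the table above. The function name normalize_stt_language is hypothetical, and rejecting unlisted codes is a design choice of this example rather than documented Voxray behavior.

```python
from typing import Optional

# Language codes copied from the table on this page.
SARVAM_STT_LANGUAGES = {
    "hi-IN", "en-IN", "ta-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "bn-IN", "od-IN", "pa-IN",
}


def normalize_stt_language(value: Optional[str]) -> Optional[str]:
    """Return a pinned language code, or None to request auto-detection."""
    if value is None or value.strip() == "":
        return None  # omitted or empty: Sarvam auto-detects the language
    code = value.strip()
    if code not in SARVAM_STT_LANGUAGES:
        raise ValueError(f"unsupported stt_language: {code!r}")
    return code
```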
Sarvam implements Voxray’s STTStreamingService interface via a WebSocket-based streaming endpoint. When turn detection is active, audio chunks are streamed to Sarvam’s API in real time and partial transcripts are surfaced as they arrive, reducing the latency between end-of-speech and LLM handoff.
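To make the chunked streaming idea concrete, the sketch below slices a PCM buffer into fixed-duration frames of the kind a streaming client might send over a WebSocket. The 16 kHz sample rate, 20 ms frame size, and 16-bit samples are assumptions chosen for illustration; the page does not state what Voxray or Sarvam actually use, and no network calls are made here.

```python
from typing import Iterator


def iter_pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed sample rate
    chunk_ms: int = 20,        # assumed frame duration
    sample_width: int = 2,     # bytes per sample (16-bit PCM)
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a mono PCM buffer."""
    chunk_bytes = (sample_rate * chunk_ms // 1000) * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        # The final slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]
```

Smaller frames lower the delay before the first partial transcript at the cost of more messages per second.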
Sarvam’s TTS engine (bulbul) produces natural-sounding speech for Indian languages. It returns audio as base64-encoded WAV, which Voxray decodes and strips of WAV headers before routing raw PCM into the pipeline.
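The decode-and-strip step can be approximated with the Python standard library. This is a minimal sketch of the idea, not Voxray's implementation: the function name wav_base64_to_pcm is hypothetical, and it assumes the payload is a well-formed PCM WAV file.

```python
import base64
import io
import wave
from typing import Tuple


def wav_base64_to_pcm(audio_b64: str) -> Tuple[bytes, int, int]:
    """Decode base64 WAV data and return (raw PCM, sample rate, channels)."""
    wav_bytes = base64.b64decode(audio_b64)
    # The wave module parses the RIFF/WAV header, so readframes() returns
    # only the sample data with the header removed.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
        return pcm, wav.getframerate(), wav.getnchannels()
```

Returning the sample rate and channel count alongside the PCM lets downstream stages resample if the pipeline expects a different format.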
Select the TTS voice with "tts_voice". When omitted, Voxray defaults to anushka.
Voice quality may improve and the set of available voices may expand as Sarvam releases new model versions. Check the Sarvam API documentation for the current voice catalog of each model generation.
Sarvam TTS implements Voxray’s TTSStreamingService interface via a WebSocket streaming API. Audio chunks are delivered to the pipeline as they are synthesized, allowing TTS output to begin playing before the full text response has been generated by the LLM.
When stt_language is omitted, Sarvam auto-detects the spoken language. This is useful for deployments that serve callers in multiple Indian languages from a single agent instance.