LLM: Not supported
STT: Supported (saarika:v2.5 and higher)
TTS: Supported (bulbul:v2 with multiple Indian voices)

Sarvam AI is an Indian AI company specializing in models built for Indian languages. Voxray supports Sarvam for both speech-to-text ("stt_provider": "sarvam") and text-to-speech ("tts_provider": "sarvam"). You can use one or both independently — for example, pair Sarvam STT with Sarvam TTS for a fully Indian-language pipeline, or mix Sarvam TTS with Groq LLM for a low-latency Hindi voice agent. Sarvam does not provide LLM functionality. You must pair it with a supported LLM provider such as openai, anthropic, groq, or ollama.

API Key

Get your API key from the Sarvam AI dashboard under your account settings.
{
  "api_keys": {
    "sarvam": "your_sarvam_api_key"
  }
}
Never commit your Sarvam API key to source control. Use the SARVAM_API_KEY environment variable in production and CI, and keep inline keys in config.json only for local development.
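
The lookup order described on this page (an inline api_keys.sarvam value first, then the SARVAM_API_KEY environment variable) can be sketched as follows. This is a minimal illustration, not Voxray's actual implementation; the function name resolve_sarvam_key is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_sarvam_key(config: Mapping, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Sarvam API key, preferring config over the environment.

    Hypothetical helper: the precedence (api_keys.sarvam, then
    SARVAM_API_KEY) follows this page, not Voxray's source code.
    """
    env = os.environ if env is None else env
    # Tolerate a missing or null "api_keys" section and a null key value.
    inline = (config.get("api_keys") or {}).get("sarvam") or ""
    key = inline.strip() or (env.get("SARVAM_API_KEY") or "").strip()
    if not key:
        raise ValueError("no Sarvam API key: set SARVAM_API_KEY or api_keys.sarvam")
    return key
```

Passing env explicitly keeps the helper easy to test; in normal use it reads os.environ.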

Quick Start Config

The example below mirrors the defaults in config.example.json — a Hindi voice agent using Sarvam for both STT and TTS with Groq for LLM inference.
{
  "stt_provider": "sarvam",
  "llm_provider": "groq",
  "tts_provider": "sarvam",

  "model": "llama-3.1-8b-instant",
  "stt_model": "saarika:v2.5",
  "stt_language": "hi-IN",
  "tts_model": "bulbul:v2",
  "tts_voice": "anushka",

  "api_keys": {
    "sarvam": "your_sarvam_api_key",
    "groq": "gsk_..."
  }
}
Sarvam does not provide an LLM. The llm_provider above is set to "groq"; swap in any other supported LLM provider (openai, anthropic, ollama, etc.) based on your needs.

Speech-to-Text

Sarvam’s STT engine (saarika) is purpose-built for Indian languages and handles code-switching (mixing Hindi and English in the same utterance) better than general-purpose English-first models.

STT Models

| Model | Description |
| --- | --- |
| saarika:v2.5 | Default. Recommended for most Indian language workloads. Accurate across all supported languages. |
| saaras:v3 | Higher-accuracy variant; larger model with slightly higher latency. Suitable when word-error-rate is more important than speed. |
Set the model via "stt_model" in your config. When omitted, Voxray uses saarika:v2.5.

Language Detection

Set "stt_language" to a BCP-47 language code to pin the transcription language. When the field is empty or omitted, Sarvam performs automatic language detection.
| Config Value | Language |
| --- | --- |
| hi-IN | Hindi (India), default in config.example.json |
| en-IN | English (India) |
| ta-IN | Tamil |
| te-IN | Telugu |
| kn-IN | Kannada |
| ml-IN | Malayalam |
| mr-IN | Marathi |
| gu-IN | Gujarati |
| bn-IN | Bengali |
| od-IN | Odia |
| pa-IN | Punjabi |
| (omit or empty) | Auto-detect |
For the best accuracy, always set stt_language explicitly when you know the user’s language. Auto-detect adds a small amount of latency and may misidentify short utterances in closely related languages (e.g. Hindi vs Maithili).
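
A config loader could reject unknown language codes before any audio is sent. The sketch below assumes the set of codes is exactly the one listed on this page, which may lag behind what Sarvam actually accepts; normalize_stt_language is a hypothetical name, not a Voxray function.

```python
from typing import Optional

# Codes copied from the language table on this page (an assumption:
# Sarvam's live catalog may contain more).
SARVAM_STT_LANGUAGES = {
    "hi-IN", "en-IN", "ta-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "bn-IN", "od-IN", "pa-IN",
}


def normalize_stt_language(value: Optional[str]) -> Optional[str]:
    """Return a pinned language code, or None to request auto-detect."""
    if value is None or not value.strip():
        return None  # empty or omitted means auto-detect
    code = value.strip()
    if code not in SARVAM_STT_LANGUAGES:
        raise ValueError(f"unsupported stt_language: {code!r}")
    return code
```

For example, normalize_stt_language("hi-IN") returns the code unchanged, while an empty string maps to None so the caller can omit the field.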

Streaming STT

Sarvam implements Voxray’s STTStreamingService interface via a WebSocket-based streaming endpoint. When turn detection is active, audio chunks are streamed to Sarvam’s API in real time and partial transcripts are surfaced as they arrive, reducing the latency between end-of-speech and LLM handoff.
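
To make "audio chunks" concrete, the sketch below slices a buffer of 16-bit mono PCM into fixed-duration frames of the kind a streaming client would send one at a time. The 16 kHz sample rate and 20 ms frame length are illustrative assumptions, not values taken from Sarvam's or Voxray's documentation, and the WebSocket send itself is omitted because the wire protocol is not described here.

```python
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 20,
               sample_width: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono PCM; the last slice may be shorter.

    Hypothetical helper: the default rate and frame length are assumptions.
    """
    chunk_bytes = (sample_rate * chunk_ms // 1000) * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With the defaults, each full frame is 640 bytes (320 samples of 2 bytes each).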

Text-to-Speech

Sarvam’s TTS engine (bulbul) produces natural-sounding speech for Indian languages. It returns audio as base64-encoded WAV, which Voxray decodes and strips of WAV headers before routing raw PCM into the pipeline.
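
The decode-and-strip step can be illustrated with Python's standard library. This is a sketch of the general technique under the assumption that the payload is a plain PCM WAV file; it is not Voxray's own decoder, and wav_base64_to_pcm is a hypothetical name.

```python
import base64
import io
import wave
from typing import Tuple


def wav_base64_to_pcm(payload: str) -> Tuple[bytes, int, int]:
    """Decode a base64 WAV string and return (pcm_bytes, sample_rate, channels)."""
    wav_bytes = base64.b64decode(payload)
    # wave parses the RIFF/WAVE header, so readframes() returns only
    # the raw sample data with the header removed.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
        return pcm, wav.getframerate(), wav.getnchannels()
```

Returning the sample rate alongside the PCM lets the caller confirm it matches what the rest of the pipeline expects.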

TTS Models

| Model | Sample Rate | Description |
| --- | --- | --- |
| bulbul:v2 | 22050 Hz | Default. Stable production model; good naturalness across all supported voices. |
| bulbul:v3 | 24000 Hz | Higher-quality variant at 24 kHz. |
| bulbul:v3-beta | 24000 Hz | Beta channel for bulbul:v3; may change without notice. |
Set via "tts_model". When omitted, Voxray defaults to bulbul:v2.
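
Because the output sample rate depends on the model, a client can derive the rate it should expect from the config. The mapping below is copied from the model list on this page; the helper name tts_sample_rate is hypothetical and this is only a sketch.

```python
# Sample rates as listed on this page; treat them as documentation values,
# not as something queried from the Sarvam API.
SARVAM_TTS_SAMPLE_RATES = {
    "bulbul:v2": 22050,
    "bulbul:v3": 24000,
    "bulbul:v3-beta": 24000,
}
DEFAULT_TTS_MODEL = "bulbul:v2"


def tts_sample_rate(config: dict) -> int:
    """Return the expected output sample rate in Hz for the configured model."""
    model = config.get("tts_model") or DEFAULT_TTS_MODEL
    try:
        return SARVAM_TTS_SAMPLE_RATES[model]
    except KeyError:
        raise ValueError(f"unknown Sarvam TTS model: {model!r}") from None
```

An empty config resolves to the default model, so tts_sample_rate({}) yields 22050.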

Available Voices

| Voice | Gender | Character |
| --- | --- | --- |
| anushka | Female | Default. Clear, professional Hindi voice. |
| manisha | Female | Warm, conversational tone. |
| vidya | Female | Formal, news-reader style. |
| arjun | Male | Natural male voice with neutral accent. |
| abhilash | Male | Expressive, energetic delivery. |
| karun | Male | Deep, authoritative tone. |
| hitesh | Male | Youthful, casual delivery. |
Set via "tts_voice". When omitted, Voxray defaults to anushka.
Voice quality may improve and the set of available voices may grow as Sarvam releases new model versions. Check the Sarvam API documentation for the current voice catalog of each model generation.

Streaming TTS

Sarvam TTS implements Voxray’s TTSStreamingService interface via a WebSocket streaming API. Audio chunks are delivered to the pipeline as they are synthesized, allowing TTS output to begin playing before the full text response has been generated by the LLM.

Full Pipeline Examples

Hindi Voice Agent (Sarvam STT + Groq LLM + Sarvam TTS)

{
  "transport": "both",
  "host": "0.0.0.0",
  "port": 8080,

  "stt_provider": "sarvam",
  "stt_model": "saarika:v2.5",
  "stt_language": "hi-IN",

  "llm_provider": "groq",
  "model": "llama-3.1-8b-instant",

  "tts_provider": "sarvam",
  "tts_model": "bulbul:v2",
  "tts_voice": "anushka",

  "api_keys": {
    "sarvam": "your_sarvam_api_key",
    "groq": "gsk_..."
  },

  "webrtc_ice_servers": [
    "stun:stun.l.google.com:19302"
  ],

  "turn_detection": "silence",
  "turn_stop_secs": 3.0
}

Sarvam TTS with OpenAI STT and LLM (mixed providers)

{
  "stt_provider": "openai",

  "llm_provider": "openai",
  "model": "gpt-4.1-mini",

  "tts_provider": "sarvam",
  "tts_model": "bulbul:v2",
  "tts_voice": "arjun",

  "api_keys": {
    "openai": "sk-...",
    "sarvam": "your_sarvam_api_key"
  }
}

Auto-Detect Language (multilingual support)

{
  "stt_provider": "sarvam",
  "stt_model": "saarika:v2.5",

  "llm_provider": "anthropic",
  "model": "claude-3-5-haiku-20241022",

  "tts_provider": "sarvam",
  "tts_model": "bulbul:v2",
  "tts_voice": "manisha",

  "api_keys": {
    "sarvam": "your_sarvam_api_key",
    "anthropic": "sk-ant-..."
  }
}
When stt_language is omitted, Sarvam auto-detects the spoken language. This is useful for deployments that serve callers in multiple Indian languages from a single agent instance.

Configuration Reference

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| stt_provider | string | | Set to "sarvam" to use Sarvam STT |
| stt_model | string | "saarika:v2.5" | Sarvam STT model identifier |
| stt_language | string | (auto-detect) | BCP-47 language code, e.g. "hi-IN". Empty means auto-detect. |
| tts_provider | string | | Set to "sarvam" to use Sarvam TTS |
| tts_model | string | "bulbul:v2" | Sarvam TTS model identifier |
| tts_voice | string | "anushka" | Speaker voice name |
| api_keys.sarvam | string | | Sarvam API key (falls back to SARVAM_API_KEY) |
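
The defaults in this reference can be summarized as a small resolution step. The sketch below is an illustration of the documented behavior, not Voxray's code: with_sarvam_defaults is a hypothetical name, and rejecting llm_provider "sarvam" is an assumption drawn from the statement that Sarvam offers no LLM.

```python
# Defaults copied from the configuration reference on this page.
SARVAM_STT_DEFAULTS = {"stt_model": "saarika:v2.5"}
SARVAM_TTS_DEFAULTS = {"tts_model": "bulbul:v2", "tts_voice": "anushka"}


def with_sarvam_defaults(config: dict) -> dict:
    """Return a copy of config with Sarvam defaults filled in where unset."""
    if config.get("llm_provider") == "sarvam":
        raise ValueError("Sarvam has no LLM; choose another llm_provider")
    resolved = dict(config)
    if resolved.get("stt_provider") == "sarvam":
        for key, value in SARVAM_STT_DEFAULTS.items():
            if not resolved.get(key):
                resolved[key] = value
    if resolved.get("tts_provider") == "sarvam":
        for key, value in SARVAM_TTS_DEFAULTS.items():
            if not resolved.get(key):
                resolved[key] = value
    # stt_language is deliberately left alone: empty means auto-detect.
    return resolved
```

Defaults are applied only for the side (STT or TTS) that actually uses Sarvam, so a mixed-provider config keeps the other provider's settings untouched.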

Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| 401 Unauthorized from Sarvam | Invalid or missing API key | Verify SARVAM_API_KEY or api_keys.sarvam |
| Poor transcription accuracy | Wrong language code set | Set stt_language to the correct BCP-47 code, or omit for auto-detect |
| No audio returned from TTS | Empty text input or network error | Check server logs; confirm SARVAM_API_KEY is set and valid |
| TTS audio sounds distorted | Sample rate mismatch | Ensure the client handles 22050 Hz (bulbul:v2) or 24000 Hz (bulbul:v3) audio |
| Code-switched utterances not transcribed | Model version | Upgrade to saaras:v3 for better handling of Hindi-English code-switching |