- Privacy-sensitive applications where conversation content must not leave your infrastructure
- Offline or air-gapped environments that need voice interaction
- Development and experimentation with open-weight models like Llama 3.1, Mistral, and Qwen
- Cost control when running high-volume or long-session voice agents
How it works
Voxray’sllm_provider: "ollama" connects to the Ollama HTTP API at http://localhost:11434/v1 (OpenAI-compatible). Ollama manages model loading, quantization, and GPU/CPU inference. Voxray sends the conversation transcript to Ollama, streams tokens back, and passes them to the TTS provider for synthesis.
Hardware requirements
| Model | VRAM / RAM required | Notes |
|---|---|---|
llama3.2:3b | ~4 GB | Fast on CPU; good for testing |
llama3.1:8b | ~8 GB | Best quality-to-speed ratio for most hardware |
mistral:7b | ~8 GB | Strong instruction following; good alternative |
llama3.1:70b | ~40 GB | Near-GPT-4 quality; requires GPU with sufficient VRAM or large RAM |
qwen2.5:14b | ~14 GB | Strong multilingual support |
Ollama automatically uses your GPU if CUDA (NVIDIA) or Metal (Apple Silicon) drivers are installed. Without GPU acceleration, inference runs on CPU — usable for development but noticeably slower for real-time voice interaction. See GPU acceleration below.
Prerequisites
- Voxray built and working (see the Echo Bot tutorial to verify your setup)
- A Groq API key for STT and TTS — sign up at console.groq.com (free tier, no credit card required), or an OpenAI API key if you prefer
- A machine with at least 8 GB of RAM for
llama3.1:8b(more is better; 16 GB recommended for a comfortable experience)
Step-by-step
Install Ollama
Install Ollama using the official one-line installer:On macOS, you can also use Homebrew:On Windows, download the installer from ollama.ai/download.After installation, start the Ollama server:Verify Ollama is running:
On macOS, Ollama runs as a background menu-bar app after installation and starts automatically. You do not need to run
ollama serve manually — it is already listening on port 11434.Pull a model
Pull This downloads approximately 4.7 GB. Progress is shown in the terminal. Once downloaded, the model is cached locally and subsequent startups are immediate.Alternative models worth trying:Test that the model responds correctly:If you get a response, Ollama is ready.
llama3.1:8b — a well-balanced open-weight model that works well for voice agents:Choose your STT and TTS providers
The LLM runs locally, but you still need STT and TTS. Two practical options:Option A: Groq (recommended — free with generous limits)Groq provides free API access to Whisper STT and its own TTS. This is the lowest-friction option: one API key, no credit card for the free tier.
- STT:
whisper-large-v3-turbo— fast, accurate, multilingual - TTS: Groq’s TTS voices (English)
- Sign up: console.groq.com
- Environment variable:
GROQ_API_KEY
- STT:
gpt-4o-mini-transcribeorwhisper-1 - TTS: voices like
alloy,nova,shimmer - Environment variable:
OPENAI_API_KEY
- STT model:
saarika:v2.5, language:hi-IN(or other BCP-47 codes) - TTS model:
bulbul:v2, voice:anushka - Environment variable:
SARVAM_API_KEY
Create the config file
Create For Option B (OpenAI STT + TTS):You can also supply API keys as environment variables instead of putting them in the config file:
ollama-config.json. The example below uses Groq for STT and TTS with Ollama for LLM (Option A):ollama-config.json
ollama-config.json
Start the server
Make sure Ollama is already running (Expected startup output:If the server starts without errors, all three providers initialized correctly.
ollama serve or the macOS menu-bar app), then start Voxray:Connect and test
Connect using the example WebSocket client or any WebSocket tool. The simplest test is to send a text frame directly:Type a message and press Enter. Voxray will:
- Pass your text to Groq Whisper (or, in a voice session, transcribe your audio first)
- Send the transcript to Ollama’s local
llama3.1:8b - Stream the LLM response tokens to the Groq TTS provider
- Stream synthesized audio back to your client
The WebRTC client requires a voice build with CGO and Opus support. If you built with
go build (no CGO), use the WebSocket client instead. See the installation guide for the voice build instructions.GPU acceleration
Ollama automatically detects and uses your GPU when the appropriate drivers are installed. You do not need to change any Voxray config — GPU acceleration is entirely managed by Ollama. NVIDIA (CUDA): Ollama requires CUDA 11.3 or higher. Ifnvidia-smi shows your GPU, Ollama will use it automatically.
llama3.1:8b fully in GPU memory.
CPU-only:
On CPU-only machines, inference is slower but functional. For real-time voice interaction, llama3.2:3b responds fast enough on modern CPUs. Expect 2–5 second response latency on a mid-range CPU for llama3.1:8b.
Troubleshooting
Ollama not reachable If Voxray logsconnection refused or dial tcp 127.0.0.1:11434: connect: connection refused, Ollama is not running or is listening on a different address.
Check if Ollama is running:
OLLAMA_BASE_URL environment variable before starting Voxray:
"model": "llama3.1:8b" in your config exactly matches the name shown by ollama list.
Response latency too high
Slow responses from Ollama add directly to voice agent round-trip time. Options to improve speed:
- Use a smaller model: switch from
llama3.1:8btollama3.2:3b - Enable GPU acceleration (see above)
- Increase Ollama’s GPU memory allocation:
OLLAMA_GPU_MEMORY_FRACTION=0.9 ollama serve - Run Ollama on a remote GPU server and point
OLLAMA_BASE_URLto it
- Lower
vad_min_volumein your config (e.g."vad_min_volume": 0.2) if quiet speech is being missed - Switch to
whisper-large-v3(slightly more accurate but slower thanwhisper-large-v3-turbo) - Check your microphone input levels
- Key not set: add
GROQ_API_KEYto your environment or"groq": "gsk_..."toapi_keysin config - Invalid key: copy the key fresh from console.groq.com
- Rate limit: Groq free tier has generous limits; if you hit them, add a brief
user_idle_timeout_secsto reduce idle connections
Complete config reference for this tutorial
turn_detection, turn_stop_secs, and vad_min_volume fields tune how Voxray detects when you have finished speaking before passing audio to STT. The defaults work for most microphones in quiet environments.
What to do next
Echo Bot
If you haven’t verified your base setup, start here — zero API keys required.
Configuration reference
Every config field, its default value, and the environment variable that overrides it.
Plugins and processors
Add wake-word detection, interruption control, frame filtering, and custom processors to your pipeline.
WebRTC transport
Switch from WebSocket to WebRTC for browser-based voice interaction with real-time audio.