| Capability | Status |
|---|---|
| LLM | Supported |
| STT | Not supported |
| TTS | Not supported |

Select Ollama by setting "llm_provider": "ollama". Because Ollama runs locally and provides only the LLM, you pair it with a cloud STT and TTS provider (Groq, ElevenLabs, OpenAI, etc.) for a full voice pipeline.
Prerequisites
Install Ollama
Start the Ollama server
The server listens on http://localhost:11434 by default and exposes an OpenAI-compatible API at /v1. Voxray connects to http://localhost:11434/v1 unless you override OLLAMA_BASE_URL.
Pull a model
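Once the server is running and a model is pulled, both can be confirmed from a script. The sketch below queries Ollama's native /api/tags endpoint, which lists locally available models, and checks that a given name (tag included) is present. It is an illustrative helper rather than part of Voxray, and the function names are made up for this example.

```python
import json
import urllib.request

# Ollama's native API root (the OpenAI-compatible API lives under /v1).
OLLAMA_HOST = "http://localhost:11434"


def parse_model_names(payload):
    """Extract model names, tags included, from an /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]


def pulled_models(host=OLLAMA_HOST, timeout=2.0):
    """Return pulled model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # Refused connections, timeouts, and URLError are all OSError;
        # ValueError covers a response body that is not valid JSON.
        return None


if __name__ == "__main__":
    wanted = "llama3.1:8b"  # example model name taken from this page
    names = pulled_models()
    if names is None:
        print("Ollama is not reachable; is `ollama serve` running?")
    elif wanted in names:
        print(f"{wanted} is pulled and ready")
    else:
        print(f"{wanted} not found; pulled models: {names}")
```

Because the comparison is an exact string match, a model pulled without an explicit tag will not match a name that includes one.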
Quick Start Config
- config.json
- Environment Variables
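As a rough illustration of the config.json side, the sketch below builds a minimal LLM-only fragment from the two keys this page names ("llm_provider" and "model") and prints it as JSON. The model value is just one of the examples listed on this page, the helper name is invented, and any other keys a real Voxray config requires are not shown here.

```python
import json

# Only keys named on this page are used. A complete Voxray config
# will likely need more (STT/TTS settings, for instance).
config = {
    "llm_provider": "ollama",
    "model": "llama3.1:8b",  # must match a pulled model, tag included
}


def render_config(cfg):
    """Serialize the config dict as pretty-printed JSON."""
    return json.dumps(cfg, indent=2)


if __name__ == "__main__":
    print(render_config(config))
```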
Ollama does not require an API key. Voxray accepts the OLLAMA_API_KEY environment variable and the "ollama" key in api_keys, but they have no effect because the Ollama server ignores them.
Custom Ollama URL
By default, Voxray connects to http://localhost:11434/v1. Override this with the OLLAMA_BASE_URL environment variable when running Ollama on a remote host or a non-default port:
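The override rule can be sketched as a small lookup: use OLLAMA_BASE_URL when it is set, otherwise fall back to the local default. This mirrors the behavior described here and is not Voxray's actual implementation. The function name, and the choices to ignore blank values and trim a trailing slash, are assumptions made for the example.

```python
import os

DEFAULT_OLLAMA_BASE_URL = "http://localhost:11434/v1"


def resolve_ollama_base_url(env=None):
    """Return OLLAMA_BASE_URL if set and non-blank, else the local default."""
    env = os.environ if env is None else env
    value = env.get("OLLAMA_BASE_URL", "").strip()
    # Trailing-slash trimming is a convenience added for this sketch.
    return value.rstrip("/") if value else DEFAULT_OLLAMA_BASE_URL


if __name__ == "__main__":
    print(resolve_ollama_base_url())
```

For example, with OLLAMA_BASE_URL set to http://192.168.1.50:11434/v1 (a made-up address), the function returns that URL unchanged. With the variable unset, it returns the localhost default.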
Popular Models
| Model | Size on Disk | Recommended VRAM | Best For |
|---|---|---|---|
| llama3.2:3b | ~2 GB | 4 GB | Lowest latency, resource-constrained |
| llama3.1:8b | ~4.7 GB | 8 GB | Balanced quality and speed |
| mistral:7b | ~4.1 GB | 8 GB | Strong reasoning, fast inference |
| llama3.1:70b | ~40 GB | 48 GB | High quality, production-grade |
| llama3.1:405b | ~230 GB | 256+ GB | Maximum quality, multi-GPU setups |

Set the "model" key in config.json to match the pulled model name exactly (including the tag).
GPU Acceleration
Ollama automatically detects and uses available GPU hardware; no additional configuration is needed in Voxray:

| Hardware | Requirement |
|---|---|
| NVIDIA GPU | CUDA drivers installed (nvidia-smi should work) |
| Apple Silicon (M1/M2/M3/M4) | Metal is used automatically |
| AMD GPU (Linux) | ROCm drivers installed |
| CPU fallback | Always available; inference is slower |
To confirm that a GPU was detected, check the startup output of ollama serve.
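For NVIDIA hardware, the "nvidia-smi should work" requirement can be checked from a script before starting Ollama. The helper below is a hypothetical convenience, not something Voxray or Ollama ships. It only reports whether nvidia-smi is on PATH and exits successfully, and it says nothing about Metal or ROCm setups.

```python
import shutil
import subprocess


def nvidia_driver_ok(timeout=5.0):
    """Return True if nvidia-smi is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if nvidia_driver_ok():
        print("NVIDIA driver looks usable")
    else:
        print("nvidia-smi unavailable; Ollama may fall back to CPU")
```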
Latency Guidance
Voice agents are latency-sensitive. With local inference, the LLM is typically the largest contributor to end-to-end response time.

| Priority | Model | Notes |
|---|---|---|
| Lowest latency | llama3.2:3b | Fits comfortably in 4 GB VRAM; very fast on modern GPUs |
| Balanced | llama3.1:8b | Good default; fits in 8 GB VRAM with room to spare |
| Best quality | llama3.1:70b | Needs 48 GB+ VRAM; use multi-GPU or quantized variant |
| Maximum quality | llama3.1:405b | Requires 256+ GB VRAM across multiple GPUs |
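The table can be read as a simple rule: choose the largest model whose VRAM requirement fits your hardware, then step down if latency matters more than quality. A minimal sketch of that rule follows, using the VRAM figures from the table. The function name is invented for illustration, and real headroom depends on context length and quantization, which this ignores.

```python
# (model, VRAM in GB) from the latency table, smallest to largest.
MODELS_BY_VRAM = [
    ("llama3.2:3b", 4),
    ("llama3.1:8b", 8),
    ("llama3.1:70b", 48),
    ("llama3.1:405b", 256),
]


def largest_model_that_fits(vram_gb):
    """Return the largest listed model that fits in vram_gb, or None."""
    fitting = [name for name, need in MODELS_BY_VRAM if need <= vram_gb]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for vram in (2, 8, 24, 80):
        print(f"{vram} GB -> {largest_model_that_fits(vram)}")
```

A 24 GB card, for instance, maps to llama3.1:8b, since the 70B model needs 48 GB. For the lowest latency, llama3.2:3b remains the safer choice even when a larger model would fit.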
Full Example: Local LLM with Groq STT and TTS
Connect a client to /ws or /webrtc/offer. All LLM inference runs locally; only STT and TTS calls leave your machine.
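A configuration expressing that split might look roughly like the sketch below, which prints a JSON document combining a local Ollama LLM with Groq for speech. Only "llm_provider", "model", and "api_keys" are named on this page. The "stt_provider" and "tts_provider" keys and the "groq" entry under api_keys are assumptions chosen by analogy, so check them against the Voxray configuration reference before relying on them.

```python
import json
import os


def build_config(groq_key):
    """Build a hybrid config: local LLM via Ollama, cloud STT/TTS via Groq."""
    return {
        "llm_provider": "ollama",        # from this page; runs locally
        "model": "llama3.1:8b",          # must match the pulled model and tag
        "stt_provider": "groq",          # ASSUMED key name (cloud call)
        "tts_provider": "groq",          # ASSUMED key name (cloud call)
        "api_keys": {"groq": groq_key},  # "groq" entry is ASSUMED; Ollama needs none
    }


if __name__ == "__main__":
    # The placeholder keeps the sketch runnable without a real key.
    key = os.environ.get("GROQ_API_KEY", "<your-groq-key>")
    print(json.dumps(build_config(key), indent=2))
```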
Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| connection refused on LLM calls | Ollama not running | Run ollama serve |
| Model not found error | Model not pulled | Run ollama pull <model> |
| Slow first response | Model loading from disk | Wait for first load; subsequent calls use cached model |
| Out of memory | Model too large for VRAM | Use a smaller model or a quantized variant |
| High latency on CPU | No GPU available | Install GPU drivers or switch to a 3B model |