All tutorials assume you have completed the Quickstart and have a working Voxray binary. Each tutorial lists its required providers in the table below. Most providers offer a free tier or trial credits — links are included in each tutorial’s prerequisites section.
Tutorial progression
| Tutorial | Goal | Difficulty | Providers required | Estimated time |
|---|---|---|---|---|
| Echo Bot | Verify the full audio round-trip with no external API calls | Beginner | None | 5 min |
| Local LLM with Ollama | Run a fully self-hosted voice agent — STT via Whisper, LLM via Ollama, no cloud | Beginner | Ollama (local) | 10 min |
| Cloud LLM with OpenAI | Full cloud pipeline — OpenAI for STT, LLM, and TTS | Intermediate | OpenAI | 10 min |
| Telephony with Twilio | Accept real phone calls and route them into a voice agent | Advanced | Twilio + Groq | 20 min |
| Production Pipeline | Multi-provider pipeline with S3 recording and database transcripts | Advanced | Groq + Anthropic + ElevenLabs | 30 min |
Echo Bot
Goal: Confirm your Voxray installation is correct and that audio flows end-to-end without calling any external provider. The echo plugin receives audio frames from the client and plays them back immediately. No API keys required.
What you will learn:
- How to enable the `echo` plugin
- How to read pipeline frame logs to confirm audio is moving through each stage
- How to identify transport-layer problems before adding provider complexity

Set `"plugins": ["echo"]` and `"transport": "websocket"`. Leave all provider fields empty. Connect with the browser client and speak; you should hear your own voice played back within 200–300 ms.
This tutorial is also the fastest way to confirm a fresh installation is working before spending any API credits.
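As a rough illustration of the echo setup described above, the sketch below builds the two settings named in this section and prints them as JSON. Only the `plugins` and `transport` keys come from the text. The provider fields are omitted because their names are not listed here, and how Voxray loads its config file is not covered by this overview.

```python
import json

# Only these two keys are taken from the tutorial text. Provider fields are
# left out of this sketch; in a real config they would stay empty.
echo_config = {
    "plugins": ["echo"],
    "transport": "websocket",
}


def render_config(config: dict) -> str:
    """Serialize a config dict as pretty-printed JSON."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_config(echo_config))
```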
Go to Echo Bot tutorial →
Local LLM with Ollama
Goal: Build a fully self-hosted voice agent where no audio or text leaves your machine. Ollama serves an LLM locally; the pipeline uses Whisper for STT and a local TTS model (or OpenAI TTS if you want audio quality without cloud LLM costs).
What you will learn:
- How to configure `"llm_provider": "ollama"` and point it at a local Ollama endpoint
- How Voxray resolves the `provider` fallback versus task-specific provider keys
- How to test that latency is acceptable on local hardware (and which model sizes are practical for real-time voice)
- How to combine a cloud STT with a local LLM when local Whisper is too slow

Prerequisites: Ollama running locally (`ollama serve`) and a pulled model (`ollama pull llama3.2`). Optionally, an OpenAI key for cloud STT/TTS.
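The fallback idea mentioned above (a generic `provider` key versus task-specific keys such as `llm_provider`) can be pictured with a small lookup function. This is a sketch of the general pattern, not Voxray's actual resolution code: the `stt_provider` and `tts_provider` key names are assumed by analogy with `llm_provider`, and the precedence order shown is an assumption to check against the tutorial.

```python
from typing import Optional


def resolve_provider(config: dict, task: str) -> Optional[str]:
    """Return the provider for a task ("stt", "llm" or "tts").

    Assumed precedence: a non-empty task-specific key such as
    "llm_provider" wins; otherwise the generic "provider" key is used.
    Returns None when neither is set.
    """
    specific = config.get(f"{task}_provider")
    if specific:
        return specific
    return config.get("provider") or None


# Hypothetical mixed setup: cloud STT/TTS through the fallback, local LLM.
config = {
    "provider": "openai",      # fallback for every stage
    "llm_provider": "ollama",  # override for the LLM stage only
}

if __name__ == "__main__":
    for task in ("stt", "llm", "tts"):
        print(task, "->", resolve_provider(config, task))
```

With this config, only the LLM stage resolves to Ollama, which mirrors the "cloud STT with a local LLM" combination listed above.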
Go to Local LLM with Ollama tutorial →
Cloud LLM with OpenAI
Goal: Wire up a complete cloud pipeline using OpenAI for all three stages: `gpt-4o-mini-transcribe` for STT, `gpt-4o-mini` for the LLM, and OpenAI TTS for speech synthesis. This is the fastest path to a production-quality voice experience with a single API key.
What you will learn:
- How to set per-stage provider and model fields (`stt_model`, `model`, `tts_voice`)
- How to configure a system prompt via the client context payload
- How `turn_stop_secs` and `vad_min_volume` interact with OpenAI’s transcription latency
- How to read the Prometheus metrics at `/metrics` to measure STT, LLM, and TTS latency independently

Prerequisites: An OpenAI API key with access to `gpt-4o-mini` and `gpt-4o-mini-transcribe`.
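To make the per-stage latency idea concrete, the sketch below computes a mean latency per stage from Prometheus text-format output by dividing each `_sum` by its `_count`. The metric name `voxray_stage_latency_seconds` and the `stage` label are hypothetical placeholders; check the real `/metrics` output for the names Voxray exposes. The parser only handles lines with a single `stage` label, which is enough for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample; the real metric names on /metrics may differ.
SAMPLE = """\
# TYPE voxray_stage_latency_seconds summary
voxray_stage_latency_seconds_sum{stage="stt"} 12.5
voxray_stage_latency_seconds_count{stage="stt"} 50
voxray_stage_latency_seconds_sum{stage="llm"} 30.0
voxray_stage_latency_seconds_count{stage="llm"} 50
voxray_stage_latency_seconds_sum{stage="tts"} 10.0
voxray_stage_latency_seconds_count{stage="tts"} 40
"""

# Matches: metric_name{stage="value"} number
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'\{stage="(?P<stage>[^"]+)"\}\s+(?P<value>\S+)$'
)


def mean_latency_by_stage(
    text: str, metric: str = "voxray_stage_latency_seconds"
) -> dict:
    """Return {stage: mean seconds} from <metric>_sum and <metric>_count."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue  # ignore metrics this sketch does not understand
        value = float(match["value"])
        if match["name"] == metric + "_sum":
            sums[match["stage"]] += value
        elif match["name"] == metric + "_count":
            counts[match["stage"]] += value
    # Skip stages with no observations to avoid dividing by zero.
    return {s: sums[s] / counts[s] for s in sums if counts.get(s)}


if __name__ == "__main__":
    for stage, seconds in mean_latency_by_stage(SAMPLE).items():
        print(f"{stage}: {seconds * 1000:.0f} ms")
```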
Go to Cloud LLM with OpenAI tutorial →
Telephony with Twilio
Goal: Accept real inbound phone calls via Twilio and route the caller’s audio into a Voxray voice agent. The caller hears the agent’s TTS responses over the phone. This tutorial uses Groq for low-latency STT and LLM, which matters for telephony because callers expect a faster response than browser users do.
What you will learn:
- How to set `"runner_transport": "twilio"` and configure the Twilio webhook
- How to expose your local Voxray server to the internet using ngrok (or a similar tunnel) and set `proxy_host`
- How telephony audio differs from browser audio (8 kHz µ-law vs 16 kHz PCM) and what Voxray handles automatically
- How to handle dropped calls, reconnects, and telephony-specific error codes in the server logs
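To show what the 8 kHz µ-law versus 16 kHz PCM difference means in practice, here is a small standalone sketch that decodes G.711 µ-law bytes to 16-bit samples and doubles the sample rate by linear interpolation. It is not Voxray's implementation (the text says Voxray handles the conversion automatically), and a production resampler would normally use a proper low-pass filter rather than simple interpolation. The function names are made up for this example.

```python
import struct
from typing import List


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit sample."""
    u = ~byte & 0xFF               # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def upsample_2x(samples: List[int]) -> List[int]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    out: List[int] = []
    for i, current in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + nxt) // 2)
    return out


def ulaw_8k_to_pcm16_16k(payload: bytes) -> bytes:
    """Convert 8 kHz mu-law bytes to 16 kHz little-endian 16-bit PCM."""
    decoded = [ulaw_byte_to_pcm16(b) for b in payload]
    upsampled = upsample_2x(decoded)
    return struct.pack(f"<{len(upsampled)}h", *upsampled)


if __name__ == "__main__":
    frame = bytes([0xFF] * 160)    # 20 ms of mu-law silence at 8 kHz
    pcm = ulaw_8k_to_pcm16_16k(frame)
    print(len(frame), "mu-law bytes ->", len(pcm), "PCM bytes")
```

A 20 ms frame grows from 160 bytes to 640 bytes: twice as many samples, each two bytes wide instead of one.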
Production Pipeline
Goal: Deploy a production-grade voice agent with a multi-provider pipeline, S3 conversation recording, Postgres transcript logging, Prometheus monitoring, and CORS configuration for a real front-end domain. This tutorial assembles all the pieces you have learned individually.
What you will learn:
- How to mix providers for best performance and cost: Groq (fast STT), Anthropic Claude (reasoning quality), ElevenLabs (high-quality voice)
- How to enable and configure S3 recording (`recording` block) with async upload workers
- How to enable Postgres transcript logging (`transcripts` block) and the expected schema
- How to set `cors_allowed_origins` for a production front-end and `server_api_key` to protect the WebSocket endpoint
- How to tune `user_idle_timeout_secs` and `rtc_max_duration_secs` for session lifecycle management
- How to read Prometheus metrics and which counters matter most for production alerting
Choosing a starting point
If you are evaluating Voxray with no accounts set up, start with the Echo Bot. It proves your binary and browser client work in under 5 minutes.

If you are building a prototype and have an OpenAI key, jump to Cloud LLM with OpenAI. One key, one config block, full pipeline.

If you have data residency or cost requirements, do the Local LLM with Ollama tutorial next. It shows how to mix local and cloud components and when each trade-off makes sense.

If you are building a phone product, Telephony with Twilio is your path. Read it before the Production Pipeline tutorial because telephony introduces constraints (audio codec, session lifecycle, webhook reliability) that affect your architecture decisions.

Echo Bot
No providers needed. Verify your pipeline in 5 minutes.
Local LLM with Ollama
Fully self-hosted. No cloud API calls.
Cloud LLM with OpenAI
One API key. Full production-quality pipeline.
Telephony with Twilio
Accept real phone calls into your voice agent.
Production Pipeline
Multi-provider + recording + transcripts + monitoring.
Core Concepts
Understand the pipeline before diving into tutorials.