Single-instance vs multi-instance

Voxray follows a one-goroutine-per-active-connection model. Each Runner goroutine is fully isolated, so vertical scaling (more CPU cores and memory) is the first lever to pull.
The default session store is in-memory. No external dependencies are needed.
{
  "session_store": "memory"
}
All session state lives in the process. Restarting the process loses in-flight sessions. This mode is appropriate for development, single-region deployments, and workloads where one instance can handle peak load.
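To make the restart behaviour concrete, here is a minimal sketch of a process-local session store. The Session and MemoryStore types and their fields are illustrative assumptions, not Voxray’s actual types. The point is only that state held in a map inside one process is invisible to any other process and disappears when that process exits.

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a hypothetical stand-in for the state tracked per call.
type Session struct {
	ID    string
	State string
}

// MemoryStore keeps sessions in a process-local map guarded by a mutex.
type MemoryStore struct {
	mu       sync.RWMutex
	sessions map[string]Session
}

// NewMemoryStore returns an empty store, as a freshly started process would have.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{sessions: make(map[string]Session)}
}

// Put stores or overwrites a session by ID.
func (s *MemoryStore) Put(sess Session) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[sess.ID] = sess
}

// Get looks up a session by ID.
func (s *MemoryStore) Get(id string) (Session, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	sess, ok := s.sessions[id]
	return sess, ok
}

func main() {
	store := NewMemoryStore()
	store.Put(Session{ID: "sess_abc123", State: "active"})

	// A second store models another instance, or the same process after a
	// restart: it starts empty.
	other := NewMemoryStore()

	_, okLocal := store.Get("sess_abc123")
	_, okOther := other.Get("sess_abc123")
	fmt.Println(okLocal, okOther) // prints "true false"
}
```

This is why a shared external store is needed before requests for the same session can land on different instances.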
Vertical scaling (giving the instance more CPU cores and RAM) is the simplest path to higher throughput. The Go runtime schedules goroutines across all available cores by default, so additional cores are used without extra configuration.

Load balancer requirements

WebSocket upgrade passthrough

All load balancers in front of Voxray must pass the Upgrade: websocket and Connection: Upgrade headers through without stripping them. Without this, WebSocket handshakes fail at the proxy.
# nginx example
location /ws {
    proxy_pass         http://voxray_upstream;
    proxy_http_version 1.1;
    proxy_set_header   Upgrade $http_upgrade;
    proxy_set_header   Connection "upgrade";
    proxy_read_timeout 3600s;
}
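Raise proxy_read_timeout (nginx) or the equivalent idle-connection timeout in your load balancer to at least the longest expected call duration. The 60-second nginx default closes the connection whenever the upstream sends nothing for 60 seconds, which can cut off a live voice session during a long pause.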
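When debugging a proxy that appears to break handshakes, it can help to check what actually reaches the backend. The sketch below is a hypothetical helper (isWebSocketUpgrade is not a Voxray function) that reports whether a request’s headers still carry both values a WebSocket handshake needs.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isWebSocketUpgrade reports whether the headers still contain
// "Upgrade: websocket" and a Connection header that includes "upgrade".
func isWebSocketUpgrade(h http.Header) bool {
	if !strings.EqualFold(strings.TrimSpace(h.Get("Upgrade")), "websocket") {
		return false
	}
	// Connection may hold several comma-separated tokens,
	// e.g. "keep-alive, Upgrade", and may appear more than once.
	for _, v := range h.Values("Connection") {
		for _, tok := range strings.Split(v, ",") {
			if strings.EqualFold(strings.TrimSpace(tok), "upgrade") {
				return true
			}
		}
	}
	return false
}

func main() {
	intact := http.Header{}
	intact.Set("Upgrade", "websocket")
	intact.Set("Connection", "keep-alive, Upgrade")

	// What a backend might see if a proxy dropped the upgrade headers.
	stripped := http.Header{}
	stripped.Set("Connection", "close")

	fmt.Println(isWebSocketUpgrade(intact), isWebSocketUpgrade(stripped)) // prints "true false"
}
```

Logging the result of such a check on a test endpoint behind the load balancer is one way to confirm that the headers survive every hop.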

WebRTC signaling

WebRTC media is peer-to-peer: once ICE negotiation completes, audio flows directly between the client and the provider (or via TURN). The Voxray server only handles signaling.
  • POST /webrtc/offer — stateless; can land on any instance.
  • POST /start and PATCH /sessions/{id}/api/offer — session lookup is backed by Redis in multi-instance mode, so these are also instance-agnostic.
No sticky sessions (IP affinity) are required when Redis is the session store. You may enable session affinity as an optimization to reduce Redis round-trips, but it is not mandatory for correctness.

Telephony WebSocket backhaul

Telephony providers (Twilio, Telnyx, Plivo, Exotel) open a persistent WebSocket to /telephony/ws. This connection must remain on the same instance for the call’s lifetime. Use cookie-based or IP-based session affinity for telephony traffic, or terminate telephony on a dedicated instance pool.
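IP-based affinity is normally a load balancer setting rather than application code, but the idea behind it is small enough to sketch. The example below assumes a fixed list of instances and uses a hypothetical pickInstance helper: hashing the client address means the same caller maps to the same instance for as long as the list is unchanged.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickInstance maps a client IP to one of the given instances by hashing.
// It returns "" when no instances are available.
func pickInstance(clientIP string, instances []string) string {
	if len(instances) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(clientIP)) // Write on a hash.Hash never returns an error
	return instances[h.Sum32()%uint32(len(instances))]
}

func main() {
	// Hypothetical instance names, for illustration only.
	pool := []string{"voxray-0", "voxray-1", "voxray-2"}

	first := pickInstance("203.0.113.7", pool)
	second := pickInstance("203.0.113.7", pool)
	fmt.Println(first == second) // prints "true"
}
```

Plain modulo hashing remaps most clients when the instance list changes, which is one reason a dedicated telephony pool with a stable size can be simpler to operate.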

Performance tuning

The following config keys control buffering and write behavior on the hot path between transport and pipeline.
| Config Key | Env Variable | Default | Description |
|---|---|---|---|
| pipeline_input_queue_cap | VOXRAY_PIPELINE_INPUT_QUEUE_CAP | 256 | Channel buffer between transport read and pipeline push. When full, the reader blocks, so back-pressure propagates to the transport rather than causing unbounded memory growth. Increase under bursty audio input. |
| ws_write_coalesce_ms | VOXRAY_WS_WRITE_COALESCE_MS | 0 (disabled) | When greater than zero, the WebSocket writer batches frames within this window (milliseconds) before flushing. Reduces syscall count; adds a small fixed latency. |
| ws_write_coalesce_max_frames | VOXRAY_WS_WRITE_COALESCE_MAX_FRAMES | | Maximum frames per coalesce batch. Only active when ws_write_coalesce_ms > 0. |
Enable write coalescing only when syscall overhead is measurably impacting throughput (e.g., many concurrent low-bandwidth sessions). For latency-sensitive voice agents, keep ws_write_coalesce_ms=0.
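The trade-off behind the two coalescing keys can be illustrated with a small sketch. The coalesce function below is an assumption about how such a writer could work, not Voxray’s implementation: it flushes a batch when the batch reaches maxFrames, when window has elapsed since the first frame in the batch, or when the input closes.

```go
package main

import (
	"fmt"
	"time"
)

// coalesce reads frames from in and hands them to flush in batches.
// A batch is flushed when it holds maxFrames frames, when window has passed
// since its first frame arrived, or when in is closed.
func coalesce(in <-chan []byte, window time.Duration, maxFrames int, flush func([][]byte)) {
	var batch [][]byte
	var deadline <-chan time.Time // nil while the batch is empty, so it never fires

	emit := func() {
		if len(batch) > 0 {
			flush(batch)
			batch = nil
		}
		deadline = nil
	}

	for {
		select {
		case f, ok := <-in:
			if !ok {
				emit() // flush whatever is left, then stop
				return
			}
			if len(batch) == 0 {
				deadline = time.After(window) // window starts at the first frame
			}
			batch = append(batch, f)
			if len(batch) >= maxFrames {
				emit()
			}
		case <-deadline:
			emit()
		}
	}
}

func main() {
	// Seven frames are queued up front, so only the frame cap triggers flushes here.
	in := make(chan []byte, 7)
	for i := 0; i < 7; i++ {
		in <- []byte{byte(i)}
	}
	close(in)

	var sizes []int
	coalesce(in, time.Minute, 3, func(b [][]byte) { sizes = append(sizes, len(b)) })
	fmt.Println(sizes) // prints "[3 3 1]"
}
```

Each flush stands in for one write syscall. The first frame of a batch can wait up to the full window before it is sent, which is the fixed latency the table refers to.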

S3 recording at scale

Recording uploads run through an async worker pool so they never block the voice pipeline.

Worker pool configuration

| Config Key | Env Variable | Default | Description |
|---|---|---|---|
| recording.worker_count | VOXRAY_RECORDING_WORKER_COUNT | 2 | Number of goroutines processing the upload queue concurrently. |
| recording.queue_cap | VOXRAY_RECORDING_QUEUE_CAP | 32 | Job queue capacity. When full, new upload jobs are dropped (with a log warning). Size this to absorb bursts between worker completions. |
| recording.max_retries | VOXRAY_RECORDING_MAX_RETRIES | 3 | Retry attempts per upload on S3 error, with exponential backoff. |
Each worker is a lightweight goroutine. Memory usage is bounded because uploads stream from a temporary file to S3 — the full WAV is never buffered in memory. Scale worker_count proportionally to your calls-per-minute volume and typical recording duration.
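The interaction of the three settings can be sketched as a small worker pool. Everything here (Pool, Job, NewPool, the baseDelay parameter, and the errTransient value) is hypothetical and only mirrors the behaviour described above: a bounded queue that drops when full, a fixed number of workers, and retries with a doubling delay.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errTransient stands in for a retryable S3 error in this sketch.
var errTransient = errors.New("transient upload error")

// Job is a hypothetical upload request.
type Job struct{ SessionID string }

// Pool runs a fixed number of workers over a bounded job queue.
type Pool struct {
	jobs chan Job
	wg   sync.WaitGroup
}

// NewPool starts workerCount goroutines. A failed upload is retried up to
// maxRetries times, and the delay doubles after each failure.
func NewPool(workerCount, queueCap, maxRetries int, baseDelay time.Duration, upload func(Job) error) *Pool {
	p := &Pool{jobs: make(chan Job, queueCap)}
	for i := 0; i < workerCount; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				delay := baseDelay
				for attempt := 0; ; attempt++ {
					if err := upload(job); err == nil || attempt == maxRetries {
						break
					}
					time.Sleep(delay)
					delay *= 2
				}
			}
		}()
	}
	return p
}

// Enqueue never blocks. It returns false when the queue is full so the
// caller can log a warning and carry on with the call.
func (p *Pool) Enqueue(j Job) bool {
	select {
	case p.jobs <- j:
		return true
	default:
		return false
	}
}

// Close stops accepting jobs and waits for queued uploads to finish.
func (p *Pool) Close() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	attempts := 0
	p := NewPool(1, 4, 3, 10*time.Millisecond, func(j Job) error {
		attempts++
		if attempts < 3 {
			return errTransient // fail the first two attempts
		}
		return nil
	})
	p.Enqueue(Job{SessionID: "sess_abc123"})
	p.Close()
	fmt.Println("attempts:", attempts) // prints "attempts: 3"
}
```

Because Enqueue never blocks, a slow S3 endpoint costs dropped recordings rather than stalled calls. That is the reason to size the queue for the largest expected burst.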

Storage layout

Recordings are stored under the configured S3 bucket and base path using a date-partitioned key:
s3://{bucket}/{base_path}/YYYY/MM/DD/{sessionId}.{format}
For example, with recording.base_path = "recordings/" and recording.format = "wav":
s3://my-bucket/recordings/2026/05/15/sess_abc123.wav
This layout makes prefix-scoped lifecycle rules (e.g., transitioning objects under a prefix to S3 Intelligent-Tiering) and date-range listings straightforward without a separate index.
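As an illustration of the key format, the sketch below builds the object key for a given session and time. The recordingKey helper and the exampleTime value are hypothetical, and the use of UTC for the date partition is an assumption, since the text does not say which time zone determines the date.

```go
package main

import (
	"fmt"
	"path"
	"time"
)

// exampleTime is a fixed timestamp used only for the demonstration below.
var exampleTime = time.Date(2026, time.May, 15, 9, 30, 0, 0, time.UTC)

// recordingKey builds the object key (without the s3://bucket/ part) as
// {base_path}/YYYY/MM/DD/{sessionId}.{format}. The date is taken in UTC,
// which is an assumption of this sketch.
func recordingKey(basePath string, t time.Time, sessionID, format string) string {
	return path.Join(basePath, t.UTC().Format("2006/01/02"), sessionID+"."+format)
}

func main() {
	key := recordingKey("recordings/", exampleTime, "sess_abc123", "wav")
	fmt.Println("s3://my-bucket/" + key)
	// prints "s3://my-bucket/recordings/2026/05/15/sess_abc123.wav"
}
```

path.Join removes the duplicate slash that a trailing "/" in base_path would otherwise produce.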
Enable recording by setting "recording.enable": true (or VOXRAY_RECORDING_ENABLE=true) alongside the bucket and credentials. AWS credentials must be available via the standard chain: env vars, instance role, or shared credentials file.

Transcript database at scale

When transcript logging is enabled, Voxray writes each turn (speaker role, text, sequence number, timestamp) to a SQL database. The table is created automatically on startup if it does not exist.

Supported drivers

{
  "transcripts.enable": true,
  "transcripts.driver": "postgres",
  "transcripts.dsn": "postgres://user:pass@host:5432/db?sslmode=require",
  "transcripts.table": "call_transcripts"
}
Use sslmode=require in production. For connection pooling, set pool_max_conns (pgx DSN) or use PgBouncer in transaction mode in front of Postgres.

Table schema

The following schema is auto-created on startup:
| Column | Type | Description |
|---|---|---|
| session_id | TEXT / VARCHAR | Voxray session identifier |
| role | TEXT / VARCHAR | Speaker role (user or assistant) |
| text | TEXT | Transcript text for this turn |
| seq | INTEGER | Turn sequence number within the session |
| created_at | TIMESTAMP | Wall-clock time of the turn |
Connection pooling is handled by Go’s database/sql package. To tune pool size, append the DSN parameters your driver supports (e.g., max_open_conns and max_idle_conns via the pgx extended DSN), or put a proxy such as PgBouncer in front of the database.

Kubernetes deployment

The following example deploys three Voxray replicas backed by a Redis session store, with secrets injected via Kubernetes Secrets.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voxray
  labels:
    app: voxray
spec:
  replicas: 3
  selector:
    matchLabels:
      app: voxray
  template:
    metadata:
      labels:
        app: voxray
    spec:
      containers:
        - name: voxray
          image: your-registry/voxray-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: VOXRAY_SESSION_STORE
              value: "redis"
            - name: VOXRAY_REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: voxray-secrets
                  key: redis-url
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: voxray-secrets
                  key: openai-api-key
            - name: VOXRAY_JSON_LOGS
              value: "true"
            - name: VOXRAY_LOG_LEVEL
              value: "info"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 5
          resources:
            requests:
              cpu: "500m"
              memory: "256Mi"
            limits:
              cpu: "2"
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: voxray
spec:
  selector:
    app: voxray
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP

Horizontal Pod Autoscaler

Voxray exposes http_active_connections as a Prometheus gauge. Use the Prometheus Adapter to surface this as a custom metric, then configure an HPA to scale based on connections per pod:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voxray-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voxray
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_active_connections
        target:
          type: AverageValue
          averageValue: "50"
Set minReplicas: 2 for production to maintain availability during a single pod restart or rolling update.
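To see how the averageValue target translates into replica counts, the sketch below applies the core rule from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), and clamps the result to the configured bounds. The desiredReplicas function is illustrative only: the real controller also applies a tolerance band and stabilization windows, which are omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas computes ceil(current * avgPerPod / target) and clamps the
// result to [minReplicas, maxReplicas]. Tolerance and stabilization, which
// the real HPA controller applies, are left out of this sketch.
func desiredReplicas(current int, avgPerPod, target float64, minReplicas, maxReplicas int) int {
	want := int(math.Ceil(float64(current) * avgPerPod / target))
	if want < minReplicas {
		return minReplicas
	}
	if want > maxReplicas {
		return maxReplicas
	}
	return want
}

func main() {
	// 3 pods averaging 80 connections against a target of 50:
	// 3 * 80 / 50 = 4.8, rounded up to 5.
	fmt.Println(desiredReplicas(3, 80, 50, 2, 20)) // prints "5"

	// Light load would ask for 1 pod, but minReplicas keeps 2.
	fmt.Println(desiredReplicas(3, 10, 50, 2, 20)) // prints "2"
}
```

With averageValue: "50", roughly every 50 additional concurrent connections across the deployment adds one pod, up to maxReplicas.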