Why Voice Agents Are the Next Interface
Text-based chatbots had their moment. But in 2026, the businesses winning customer engagement are the ones letting users talk to their AI — literally. Voice agents combine the power of large language models with natural-sounding speech synthesis to create experiences that feel like speaking with a real person.
ElevenLabs has emerged as a leading platform for building these agents. Its Conversational AI platform handles the hard parts (ultra-low-latency speech synthesis, WebRTC audio transport, voice activity detection, and turn-taking) so developers can focus on business logic.
This guide walks you through building a production-grade AI voice agent from scratch: architecture decisions, implementation patterns, prompt engineering for voice, and deployment best practices.
🎯 What You'll Build
- A real-time voice agent with WebRTC audio streaming
- Server-side token authentication (no exposed API keys)
- Client-side tool execution (book appointments, look up data)
- Custom voice selection and personality tuning
Architecture Overview
A voice agent system has three layers:
- Client (React): Captures microphone audio, plays agent responses, handles UI state. Uses the @elevenlabs/react SDK.
- Auth Server (Edge Function): Generates short-lived conversation tokens so your API key never touches the browser.
- ElevenLabs Platform: Handles speech-to-text, LLM reasoning, text-to-speech, and audio transport via WebRTC.
Client (mic audio) → WebRTC → ElevenLabs STT → LLM → TTS → WebRTC → Client (speaker)
The entire round-trip typically takes 500ms–1.2s, making conversations feel natural.
Step 1: Server-Side Token Generation
Never expose your ElevenLabs API key in client-side code. Instead, create a server endpoint that generates short-lived conversation tokens:
Edge Function: elevenlabs-conversation-token
- Receive the agentId from the client request
- Call ElevenLabs' token endpoint with your server-side API key
- Return the short-lived token to the client
- Optionally fetch a WebSocket signed URL as fallback
This pattern is critical for production deployments. The token expires quickly, limiting exposure even if intercepted.
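The steps above can be sketched roughly as follows. This is a minimal sketch, not the definitive implementation: it assumes the token endpoint is a GET to /v1/convai/conversation/token with an agent_id query parameter, authenticated with an xi-api-key header and returning a JSON body with a token field. Verify the path and response shape against the current ElevenLabs API reference. The names getConversationToken and FetchLike are hypothetical.

```typescript
// Minimal fetch-like type so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// ASSUMPTION: endpoint path and response shape; check the ElevenLabs API reference.
const TOKEN_URL = "https://api.elevenlabs.io/v1/convai/conversation/token";

// Exchanges the server-side API key for a short-lived conversation token.
async function getConversationToken(
  agentId: string,
  apiKey: string,
  fetchImpl: FetchLike
): Promise<string> {
  if (!agentId) throw new Error("agentId is required");

  const url = `${TOKEN_URL}?agent_id=${encodeURIComponent(agentId)}`;
  const res = await fetchImpl(url, { headers: { "xi-api-key": apiKey } });
  if (!res.ok) {
    throw new Error(`Token request failed with status ${res.status}`);
  }

  const body = (await res.json()) as { token?: unknown };
  if (typeof body.token !== "string") {
    throw new Error("Response did not contain a token");
  }
  return body.token;
}
```

In an edge function you would likely pass the runtime's global fetch as fetchImpl and read the API key from an environment variable, returning only the token to the browser. Injecting fetchImpl also makes the handler easy to test without network access.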
Step 2: React Client Implementation
The useConversation hook from @elevenlabs/react manages the entire WebRTC connection lifecycle:
- Connection management: Handles WebRTC negotiation, ICE candidates, and reconnection
- Audio capture: Requests microphone access and streams audio to ElevenLabs
- Playback: Receives and plays synthesized speech through the browser
- State tracking: Exposes status, isSpeaking, and volume levels
The basic flow: request mic permission → fetch token from your server → call startSession() with the token → the user starts talking.
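The flow above can be sketched independently of React, which keeps the ordering and error handling easy to test. This is a sketch under assumptions: startVoiceSession and SessionDeps are hypothetical names, and the options object passed to startSession (a conversationToken plus connectionType "webrtc") is an assumption about what the SDK accepts, so confirm it against the @elevenlabs/react documentation.

```typescript
// Hypothetical dependency bundle so the flow can run without a browser.
interface SessionDeps {
  // e.g. wraps navigator.mediaDevices.getUserMedia({ audio: true })
  requestMic: () => Promise<void>;
  // Calls your own token endpoint, never ElevenLabs directly
  fetchToken: () => Promise<string>;
  // ASSUMPTION: mirrors the options accepted by the SDK's startSession()
  startSession: (opts: {
    conversationToken: string;
    connectionType: "webrtc";
  }) => Promise<unknown>;
}

type Stage = "mic" | "token" | "session";
type StartResult =
  | { ok: true }
  | { ok: false; stage: Stage; error: string };

// Runs mic permission -> token fetch -> session start, reporting which stage failed.
async function startVoiceSession(deps: SessionDeps): Promise<StartResult> {
  let stage: Stage = "mic";
  try {
    await deps.requestMic();
    stage = "token";
    const conversationToken = await deps.fetchToken();
    stage = "session";
    await deps.startSession({ conversationToken, connectionType: "webrtc" });
    return { ok: true };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { ok: false, stage, error };
  }
}
```

Inside a component, startSession would presumably be the function returned by the useConversation hook. Knowing the failing stage lets the UI show a specific message, for example a microphone prompt rather than a generic connection error.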
Step 3: Client Tools — Making Your Agent Do Things
Voice agents become powerful when they can take actions, not just talk. ElevenLabs supports "client tools" — functions the agent can invoke during conversation:
- Book an appointment: Agent collects date/time preferences, calls your scheduling API
- Look up order status: Agent asks for order number, queries your database
- Navigate the user: Agent directs to a specific page based on conversation context
- Submit a lead form: Agent gathers name, email, and needs, then submits them to your CRM
⚠️ Important
Client tools must be configured in the ElevenLabs web UI before they'll work in your code. Define the tool name, description, and parameter schema in the agent settings, then register a handler with the same name in your client code; the SDK routes the agent's calls to it.
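A client tool handler might look like the sketch below. It assumes handlers are plain functions keyed by the tool name from the agent settings, and that the returned string is passed back to the agent as the tool result. The tool name bookAppointment, its date and time parameters, the booked array, and the invokeTool helper are all hypothetical and only illustrate the pattern.

```typescript
// Handlers are keyed by the tool name configured in the agent settings.
type ToolParams = Record<string, unknown>;
type ToolHandler = (params: ToolParams) => Promise<string> | string;

// Hypothetical stand-in for a real scheduling backend.
const booked: string[] = [];

const clientTools: Record<string, ToolHandler> = {
  // ASSUMPTION: tool and parameter names are illustrative only.
  bookAppointment: (params) => {
    const { date, time } = params;
    // Parameters come from the LLM, so validate before acting on them.
    if (typeof date !== "string" || typeof time !== "string") {
      return "Missing date or time. Ask the user for both.";
    }
    booked.push(`${date} ${time}`);
    return `Booked for ${date} at ${time}.`;
  },
};

// Local dispatcher for exercising handlers outside a live conversation.
async function invokeTool(name: string, params: ToolParams): Promise<string> {
  const handler = clientTools[name];
  if (!handler) return `Unknown tool: ${name}`;
  return handler(params);
}
```

Returning a short, speakable message on validation failure gives the agent something useful to say, instead of failing silently mid-conversation.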
Step 4: Choosing and Customizing Voices
Voice selection is a brand decision, not just a technical one. ElevenLabs offers 30+ pre-built voices and the ability to clone custom voices.
Pre-Built Voice Selection Guide
| Use Case | Recommended Voice | Why |
|---|---|---|
| Sales Agent | Chris / Sarah | Warm, conversational tone that builds trust |
| Tech Support | Daniel / Alice | Clear, authoritative, patient delivery |
| Customer Service | Laura / Liam | Friendly, empathetic, natural cadence |
| Executive Briefing | George / Matilda | Professional, polished, confident |
Step 5: Prompt Engineering for Voice
Writing prompts for voice agents is fundamentally different from text chatbots:
- Keep responses short — Aim for 1-3 sentences. Users can't "scan" voice like they scan text.
- Use conversational language — "Got it!" beats "I understand your request."
- Handle interruptions — Instruct the agent to gracefully yield when interrupted.
- Confirm actions verbally — "I've booked that for 3pm Tuesday. Sound good?"
- Avoid lists — Don't read off 5 options. Offer 2-3 and ask which direction to go.
Example System Prompt Structure:
You are [Name], a [role] for [Company]. Your personality is [traits]. Keep responses under 3 sentences unless the user asks for detail. When you need information, ask one question at a time. Always confirm before taking actions.
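If you manage several agents, the template above can be filled in programmatically. The sketch below is one possible approach; buildSystemPrompt and VoiceAgentPersona are hypothetical names, and the fixed sentences are copied from the example structure.

```typescript
// Fields correspond to the bracketed placeholders in the example prompt.
interface VoiceAgentPersona {
  name: string;
  role: string;
  company: string;
  traits: string[];
}

// Fills the template and appends the voice-specific behavior rules.
function buildSystemPrompt(p: VoiceAgentPersona): string {
  return [
    `You are ${p.name}, a ${p.role} for ${p.company}.`,
    `Your personality is ${p.traits.join(", ")}.`,
    "Keep responses under 3 sentences unless the user asks for detail.",
    "When you need information, ask one question at a time.",
    "Always confirm before taking actions.",
  ].join(" ");
}

// Example persona; all values are illustrative.
const prompt = buildSystemPrompt({
  name: "Ava",
  role: "scheduling assistant",
  company: "Acme Dental",
  traits: ["warm", "concise"],
});
```

Keeping the behavior rules in one function means every agent gets the same brevity and confirmation constraints, while the persona fields vary per deployment.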
Step 6: Expressive Mode (New in 2026)
ElevenLabs launched Expressive Mode for ElevenAgents in February 2026. This isn't just better TTS — it's a fundamentally different approach to agent voice:
- Emotional awareness — The agent adapts tone based on conversation context (empathetic when a customer is frustrated, enthusiastic when closing a deal)
- Natural disfluencies — Subtle "um"s and breath patterns that make the voice feel human
- Dynamic pacing — Speeds up for excitement, slows down for important information
To enable Expressive Mode, toggle it in the ElevenLabs agent configuration panel. No code changes required — it enhances the existing voice pipeline.
Production Deployment Checklist
- ☐ API key stored as server-side environment variable (never in client code)
- ☐ Token generation endpoint rate-limited
- ☐ Microphone permission requested with clear UX explanation
- ☐ Graceful fallback for browsers without WebRTC support
- ☐ Error handling for network drops and reconnection
- ☐ Analytics tracking for conversation starts, duration, and completion
- ☐ Volume controls accessible to users
- ☐ Mobile-responsive agent UI tested on iOS and Android
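For the rate-limiting item in the checklist, a fixed-window counter is one simple option. The sketch below is illustrative only: FixedWindowLimiter is a hypothetical name, state is held in memory, and a deployment with multiple edge instances would likely need a shared store instead.

```typescript
// Allows at most `limit` requests per key within each window of `windowMs`.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the limiter can be tested without real timers.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    // Start a fresh window on first sight or once the old window has elapsed.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Example: at most 5 token requests per client per minute (values are illustrative).
const tokenLimiter = new FixedWindowLimiter(5, 60_000);
```

In the token endpoint you would call tokenLimiter.allow(clientKey) before contacting ElevenLabs and return HTTP 429 when it is false. The key could be an IP address or a session identifier, depending on what your platform exposes.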
Common Pitfalls and How to Avoid Them
| Pitfall | Solution |
|---|---|
| Agent talks too much | Add "keep responses under 3 sentences" to system prompt |
| Echo/feedback loops | Enable echo cancellation in microphone config |
| High latency on mobile | Use WebRTC (not WebSocket) and a turbo TTS model |
| Agent ignores interruptions | Tune VAD sensitivity; add interruption handling to prompt |
| Exposing API keys | Always use server-side token generation |
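For the echo/feedback row in the table above, browsers expose audio-processing flags through the standard getUserMedia constraints. The sketch below builds such a constraints object; buildMicConstraints is a hypothetical helper, and whether the ElevenLabs SDK lets you supply your own constraints is an assumption to check in its documentation.

```typescript
// Shape follows the standard MediaTrackConstraints audio-processing flags.
interface MicConstraints {
  audio: {
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
  };
}

// Enables all three processing flags by default; individual flags can be overridden.
function buildMicConstraints(
  overrides: Partial<MicConstraints["audio"]> = {}
): MicConstraints {
  return {
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      ...overrides,
    },
  };
}

// In a browser:
//   const stream = await navigator.mediaDevices.getUserMedia(buildMicConstraints());
```

Browsers treat these flags as preferences rather than guarantees, so it is still worth testing with laptop speakers and with a phone on speakerphone.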
Real-World Use Cases We're Seeing
- After-hours receptionist — Voice agent handles calls when the office is closed, books callbacks for morning
- Website sales concierge — Embedded voice widget qualifies leads through natural conversation
- IT help desk tier-1 — Agent troubleshoots common issues (password resets, connectivity) before escalating
- Appointment scheduling — Patients or clients book time slots through voice instead of clicking through calendars
- Multilingual support — Single agent handles conversations in 29 languages using ElevenLabs' multilingual models
Ready to Build Your Voice Agent?
Whether you need a sales agent, support bot, or custom voice interface — our team builds production-ready voice agents that integrate with your existing systems.
From Concept to Deployed Voice Agent
We handle the architecture, voice selection, prompt engineering, and deployment — you get an AI agent that sounds like your brand.
