What is ElevenLabs Conversational AI?

ElevenLabs Conversational AI is a platform for building real-time voice agents that combine speech-to-text, LLM reasoning, and text-to-speech into a single WebRTC-powered pipeline. It handles turn-taking, voice activity detection, and audio transport so developers can focus on business logic.

How much does it cost to build a voice agent with ElevenLabs?

ElevenLabs plans start at $5/month (Starter) for hobbyist use. Business-grade voice agents typically require the Pro plan ($99/month) or the Scale plan ($330/month), depending on usage volume. Agency implementation costs for a production-ready agent range from $2,000 to $5,000.

Do I need to expose my API key to use ElevenLabs in a web app?

No. You should never expose your API key in client-side code. Instead, create a server-side endpoint (like a Supabase Edge Function) that generates short-lived conversation tokens. The client uses these tokens to authenticate with ElevenLabs.

What is Expressive Mode in ElevenLabs?

Expressive Mode, launched in February 2026, enables voice agents to adapt their emotional tone in real-time based on conversation context. It includes natural disfluencies, dynamic pacing, and context-aware emphasis — making AI voices nearly indistinguishable from human speakers.

Build AI Voice Agents With ElevenLabs

Why Voice Agents Are the Next Interface

Text-based chatbots had their moment. But in 2026, the businesses winning customer engagement are the ones letting users talk to their AI — literally. Voice agents combine the power of large language models with natural-sounding speech synthesis to create experiences that feel like speaking with a real person.

ElevenLabs has emerged as one of the leading platforms for building these agents. Its Conversational AI platform handles the hard parts (ultra-low-latency speech synthesis, WebRTC audio transport, voice activity detection, and turn-taking) so developers can focus on business logic.

This guide walks you through building a production-grade AI voice agent from scratch: architecture decisions, implementation patterns, prompt engineering for voice, and deployment best practices.

🎯 What You'll Build

A real-time voice agent with WebRTC audio streaming
Server-side token authentication (no exposed API keys)
Client-side tool execution (book appointments, look up data)
Custom voice selection and personality tuning

Architecture Overview

A voice agent system has three layers:

Client (React) — Captures microphone audio, plays agent responses, handles UI state. Uses the @elevenlabs/react SDK.
Auth Server (Edge Function) — Generates short-lived conversation tokens so your API key never touches the browser.
ElevenLabs Platform — Handles speech-to-text, LLM reasoning, text-to-speech, and audio transport via WebRTC.

Client (mic audio) → WebRTC → ElevenLabs STT → LLM → TTS → WebRTC → Client (speaker)

The entire round-trip typically takes 500ms–1.2s, making conversations feel natural.

Step 1: Server-Side Token Generation

Never expose your ElevenLabs API key in client-side code. Instead, create a server endpoint that generates short-lived conversation tokens:

The Edge Function (named elevenlabs-conversation-token here) should:

Receive the agentId from the client request
Call ElevenLabs' token endpoint with your server-side API key
Return the short-lived token to the client
Optionally fetch a WebSocket signed URL as fallback

This pattern is critical for production deployments. The token expires quickly, limiting exposure even if intercepted.
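As a rough illustration, the steps above might look like this in TypeScript. This is a minimal sketch, assuming a runtime with the standard Fetch API (`Request`, `Response`, `fetch`), which Supabase Edge Functions (Deno) provide. The endpoint paths `/v1/convai/conversation/token` and `/v1/convai/conversation/get-signed-url` and the `xi-api-key` header follow ElevenLabs' documentation at the time of writing and should be checked against the current API reference. `handleTokenRequest`, the `{ agentId }` request body, and the injectable `fetchImpl` parameter are hypothetical choices that keep the handler testable without network access.

```typescript
// Minimal token-endpoint sketch (hypothetical names; verify URLs against current ElevenLabs docs).
const CONVAI_BASE = "https://api.elevenlabs.io/v1/convai/conversation";

// Injectable fetch so the handler can be tested offline.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<Response>;

// Small helper for JSON responses.
function jsonResponse(body: unknown, status: number): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

async function handleTokenRequest(
  req: Request,
  apiKey: string,
  fetchImpl: FetchLike = fetch,
): Promise<Response> {
  // Receive the agentId from the client request body.
  let agentId: unknown;
  try {
    agentId = ((await req.json()) as { agentId?: unknown }).agentId;
  } catch {
    return jsonResponse({ error: "Request body must be JSON" }, 400);
  }
  if (typeof agentId !== "string" || agentId === "") {
    return jsonResponse({ error: "agentId is required" }, 400);
  }

  // The API key is only ever used here, on the server.
  const init = { headers: { "xi-api-key": apiKey } };
  const query = `agent_id=${encodeURIComponent(agentId)}`;

  // Request a short-lived WebRTC conversation token.
  const tokenRes = await fetchImpl(`${CONVAI_BASE}/token?${query}`, init);
  if (!tokenRes.ok) {
    return jsonResponse({ error: "Token request failed" }, 502);
  }
  const { token } = (await tokenRes.json()) as { token?: string };

  // Optionally fetch a WebSocket signed URL as a fallback; failure here is non-fatal.
  let signedUrl: string | undefined;
  const urlRes = await fetchImpl(`${CONVAI_BASE}/get-signed-url?${query}`, init);
  if (urlRes.ok) {
    signedUrl = ((await urlRes.json()) as { signed_url?: string }).signed_url;
  }

  // Return only the short-lived credentials to the client.
  return jsonResponse({ token, signedUrl }, 200);
}
```

To deploy it, a Supabase Edge Function could call `Deno.serve((req) => handleTokenRequest(req, Deno.env.get("ELEVENLABS_API_KEY") ?? ""))`. CORS headers, which a browser client needs, are omitted for brevity.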

Step 2: React Client Implementation

The useConversation hook from @elevenlabs/react manages the entire WebRTC connection lifecycle:

Connection management — Handles WebRTC negotiation, ICE candidates, and reconnection
Audio capture — Requests microphone access and streams audio to ElevenLabs
Playback — Receives and plays synthesized speech through the browser
State tracking — Exposes status, isSpeaking, and volume levels

The basic flow: request mic permission → fetch token from your server → call startSession() with the token → the user starts talking.
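The flow above can be sketched as a small framework-agnostic function. This is a minimal sketch: the SDK pieces are passed in as dependencies so the ordering can be tested without a browser, and `startVoiceSession`, `VoiceSessionDeps`, and `fetchToken` are hypothetical names. The `conversationToken` and `connectionType` options reflect the `@elevenlabs/react` README at the time of writing; confirm them against the SDK version you install.

```typescript
// Client start flow sketch: microphone permission, then token, then startSession.
// StartSessionOptions mirrors the @elevenlabs/react startSession options
// (conversationToken, connectionType) as documented at the time of writing.
interface StartSessionOptions {
  conversationToken: string;
  connectionType: "webrtc" | "websocket";
}

interface VoiceSessionDeps {
  // In the browser, e.g.:
  // () => navigator.mediaDevices.getUserMedia({ audio: { echoCancellation: true } })
  requestMicrophone: () => Promise<unknown>;
  // Calls your own token endpoint (hypothetical helper).
  fetchToken: (agentId: string) => Promise<string>;
  // In a component: (opts) => conversation.startSession(opts), where
  // conversation = useConversation(). Assumed to resolve to a conversation ID.
  startSession: (options: StartSessionOptions) => Promise<string>;
}

async function startVoiceSession(agentId: string, deps: VoiceSessionDeps): Promise<string> {
  // 1. Ask for the mic first, so a denial fails before any token is minted.
  try {
    await deps.requestMicrophone();
  } catch {
    throw new Error("Microphone access denied");
  }
  // 2. Fetch a short-lived token from your server; the API key never reaches the browser.
  const conversationToken = await deps.fetchToken(agentId);
  // 3. Open the WebRTC session; the user can start talking once it connects.
  return deps.startSession({ conversationToken, connectionType: "webrtc" });
}
```

Requesting the microphone with the standard `echoCancellation` constraint, as in the comment above, is also one way to address the echo/feedback pitfall listed later.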

Step 3: Client Tools — Making Your Agent Do Things

Voice agents become powerful when they can take actions, not just talk. ElevenLabs supports "client tools" — functions the agent can invoke during conversation:

Book an appointment — Agent collects date/time preferences, calls your scheduling API
Look up order status — Agent asks for order number, queries your database
Navigate the user — Agent directs to a specific page based on conversation context
Submit a lead form — Agent gathers name, email, needs — submits to your CRM

⚠️ Important

Client tools must be configured in the ElevenLabs web UI before they'll work in your code. Define the tool name, description, and parameter schema in the agent settings, then register a handler in your client code under the same tool name; the SDK routes the agent's tool calls to that handler.
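A minimal sketch of the handler side, assuming two tools named `bookAppointment` and `lookupOrderStatus` have been defined in the agent settings (both names, their parameters, and the `Backend` interface are hypothetical). Each handler returns a string, which we assume is passed back to the agent as the tool result; returning a readable error instead of throwing lets the agent explain the problem out loud.

```typescript
// Hypothetical client tools. Each key must match a tool name configured in the
// agent settings; parameter names here are illustrative assumptions.
type ToolHandler = (params: Record<string, unknown>) => Promise<string>;

type ClientTools = {
  bookAppointment: ToolHandler;
  lookupOrderStatus: ToolHandler;
};

// Stand-in for your own scheduling and order APIs (hypothetical).
interface Backend {
  bookSlot: (isoDateTime: string, customerName: string) => Promise<string>;
  getOrderStatus: (orderNumber: string) => Promise<string | null>;
}

function createClientTools(backend: Backend): ClientTools {
  return {
    // Called after the agent has collected a date/time and the caller's name.
    bookAppointment: async (params) => {
      const { dateTime, name } = params;
      if (typeof dateTime !== "string" || Number.isNaN(Date.parse(dateTime))) {
        return "Error: dateTime must be an ISO 8601 date-time.";
      }
      if (typeof name !== "string" || name.trim() === "") {
        return "Error: name is required.";
      }
      const confirmation = await backend.bookSlot(dateTime, name.trim());
      return `Booked. Confirmation number ${confirmation}.`;
    },
    // Called after the agent has asked for the order number.
    lookupOrderStatus: async (params) => {
      const { orderNumber } = params;
      if (typeof orderNumber !== "string" || orderNumber.trim() === "") {
        return "Error: orderNumber is required.";
      }
      const status = await backend.getOrderStatus(orderNumber.trim());
      return status ?? `No order found with number ${orderNumber.trim()}.`;
    },
  };
}
```

Per the `@elevenlabs/react` README at the time of writing, the object would be passed as `useConversation({ clientTools: createClientTools(backend) })`; check the option name against your SDK version.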

Step 4: Choosing and Customizing Voices

Voice selection is a brand decision, not just a technical one. ElevenLabs offers 30+ pre-built voices and the ability to clone custom voices.

Pre-Built Voice Selection Guide

Use Case	Recommended Voice	Why
Sales Agent	Chris / Sarah	Warm, conversational tone that builds trust
Tech Support	Daniel / Alice	Clear, authoritative, patient delivery
Customer Service	Laura / Liam	Friendly, empathetic, natural cadence
Executive Briefing	George / Matilda	Professional, polished, confident
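To apply the table in code, one option is to keep the recommended names in a lookup and resolve them to voice IDs from your account's voice list. A minimal sketch, assuming the `{ voice_id, name }` fields returned by ElevenLabs' list-voices endpoint (`GET /v1/voices`); `UseCase`, `RECOMMENDED_VOICES`, and `resolveVoiceId` are hypothetical names, and exact (case-insensitive) name matching is a simplification.

```typescript
type UseCase = "sales" | "techSupport" | "customerService" | "executiveBriefing";

// Preferred voices per use case, in order, taken from the table above.
const RECOMMENDED_VOICES: Record<UseCase, string[]> = {
  sales: ["Chris", "Sarah"],
  techSupport: ["Daniel", "Alice"],
  customerService: ["Laura", "Liam"],
  executiveBriefing: ["George", "Matilda"],
};

// Subset of the fields returned per voice by the list-voices endpoint (assumed shape).
interface Voice {
  voice_id: string;
  name: string;
}

// Returns the ID of the first recommended voice present in the account, if any.
function resolveVoiceId(useCase: UseCase, voices: Voice[]): string | undefined {
  for (const preferred of RECOMMENDED_VOICES[useCase]) {
    const match = voices.find((v) => v.name.trim().toLowerCase() === preferred.toLowerCase());
    if (match) return match.voice_id;
  }
  return undefined;
}
```

Resolving by name at setup time, rather than hard-coding IDs, keeps the brand decision (which voice) separate from account-specific details (which ID).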

Step 5: Prompt Engineering for Voice

Writing prompts for voice agents is fundamentally different from text chatbots:

Keep responses short — Aim for 1-3 sentences. Users can't "scan" voice like they scan text.
Use conversational language — "Got it!" beats "I understand your request."
Handle interruptions — Instruct the agent to gracefully yield when interrupted.
Confirm actions verbally — "I've booked that for 3pm Tuesday. Sound good?"
Avoid lists — Don't read off 5 options. Offer 2-3 and ask which direction to go.

Example System Prompt Structure:

You are [Name], a [role] for [Company]. Your personality is [traits]. Keep responses under 3 sentences unless the user asks for detail. When you need information, ask one question at a time. Always confirm before taking actions.
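The structure above can also be filled from a config object, which keeps persona details consistent across agents. A minimal sketch; `PersonaConfig` and `buildSystemPrompt` are hypothetical names, and the final interruption sentence is added from the guidance list above rather than from the template itself.

```typescript
interface PersonaConfig {
  name: string;
  role: string;
  company: string;
  traits: string[];
  maxSentences?: number; // defaults to 3, per the guidance above
}

// Fills the template one sentence per line of guidance, joined into a single prompt.
function buildSystemPrompt(p: PersonaConfig): string {
  const maxSentences = p.maxSentences ?? 3;
  return [
    `You are ${p.name}, a ${p.role} for ${p.company}.`,
    `Your personality is ${p.traits.join(", ")}.`,
    `Keep responses under ${maxSentences} sentences unless the user asks for detail.`,
    "When you need information, ask one question at a time.",
    "Always confirm before taking actions.",
    "If the user interrupts, stop and respond to what they just said.",
  ].join(" ");
}
```

The resulting string can be pasted into the agent's system prompt in the ElevenLabs agent settings.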

Step 6: Expressive Mode (New in 2026)

ElevenLabs launched Expressive Mode for ElevenAgents in February 2026. This isn't just better TTS — it's a fundamentally different approach to agent voice:

Emotional awareness — The agent adapts tone based on conversation context (empathetic when a customer is frustrated, enthusiastic when closing a deal)
Natural disfluencies — Subtle "um"s and breath patterns that make the voice feel human
Dynamic pacing — Speeds up for excitement, slows down for important information

To enable Expressive Mode, toggle it in the ElevenLabs agent configuration panel. No code changes required — it enhances the existing voice pipeline.

Production Deployment Checklist

☐ API key stored as server-side environment variable (never in client code)
☐ Token generation endpoint rate-limited
☐ Microphone permission requested with clear UX explanation
☐ Graceful fallback for browsers without WebRTC support
☐ Error handling for network drops and reconnection
☐ Analytics tracking for conversation starts, duration, and completion
☐ Volume controls accessible to users
☐ Mobile-responsive agent UI tested on iOS and Android
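For the rate-limiting item, here is a minimal sketch of a fixed-window limiter that the token endpoint could consult before minting a token. It keeps counts in memory, so it only limits within one instance; because edge functions may run on many short-lived instances, a production deployment would typically move the counters to a shared store. `FixedWindowRateLimiter` is a hypothetical name, and old keys are never evicted in this sketch.

```typescript
// Fixed-window rate limiter keyed by client identifier (e.g. IP address or user ID).
class FixedWindowRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  // key -> start of the current window and requests seen in it (never evicted here)
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the key is over its limit.
  allow(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (w === undefined || now - w.start >= this.windowMs) {
      // First request, or the previous window expired: start a new window.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count < this.limit) {
      w.count += 1;
      return true;
    }
    return false;
  }
}
```

In the token handler, respond with HTTP 429 when `limiter.allow(clientId)` returns false.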

Common Pitfalls and How to Avoid Them

Pitfall	Solution
Agent talks too much	Add "keep responses under 3 sentences" to system prompt
Echo/feedback loops	Enable echo cancellation in microphone config
High latency on mobile	Use WebRTC (not WebSocket) and turbo model
Agent ignores interruptions	Tune VAD sensitivity; add interruption handling to prompt
Exposing API keys	Always use server-side token generation
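The WebRTC-fallback checklist item and the mobile-latency row can be combined in one decision: prefer WebRTC when the browser supports it, and fall back to a WebSocket signed URL otherwise. A minimal sketch, assuming your token endpoint returns both a conversation token and a signed URL; the `conversationToken`, `signedUrl`, and `connectionType` option names follow the ElevenLabs SDK README at the time of writing and should be verified, while `chooseSessionOptions` and `browserSupportsWebRTC` are hypothetical helpers.

```typescript
type SessionOptions =
  | { conversationToken: string; connectionType: "webrtc" }
  | { signedUrl: string; connectionType: "websocket" };

// Credentials returned by your token endpoint (assumed shape).
interface Credentials {
  token?: string;
  signedUrl?: string;
}

// Feature detection: RTCPeerConnection is the core browser WebRTC API.
function browserSupportsWebRTC(): boolean {
  return "RTCPeerConnection" in globalThis;
}

// Prefer WebRTC (recommended above for latency); otherwise use the WebSocket signed URL.
function chooseSessionOptions(creds: Credentials, hasWebRTC: boolean): SessionOptions {
  if (hasWebRTC && creds.token) {
    return { conversationToken: creds.token, connectionType: "webrtc" };
  }
  if (creds.signedUrl) {
    return { signedUrl: creds.signedUrl, connectionType: "websocket" };
  }
  throw new Error("No usable credentials for this browser");
}
```

The result of `chooseSessionOptions(creds, browserSupportsWebRTC())` would then be passed to `startSession`.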

Real-World Use Cases We're Seeing

After-hours receptionist — Voice agent handles calls when the office is closed, books callbacks for morning
Website sales concierge — Embedded voice widget qualifies leads through natural conversation
IT help desk tier-1 — Agent troubleshoots common issues (password resets, connectivity) before escalating
Appointment scheduling — Patients or clients book time slots through voice instead of clicking through calendars
Multilingual support — Single agent handles conversations in 29 languages using ElevenLabs' multilingual models

Ready to Build Your Voice Agent?

Whether you need a sales agent, support bot, or custom voice interface — our team builds production-ready voice agents that integrate with your existing systems.

From Concept to Deployed Voice Agent

We handle the architecture, voice selection, prompt engineering, and deployment — you get an AI agent that sounds like your brand.
