Voice AI is the breakthrough of 2025-2026
Call Dutch Railways customer service at 9:30pm and, as has been the case for a few years now, you get a voice that does a decent job. Until you ask whether you can take your bike on tomorrow's train instead of your dog. Then you fall back on a human. That moment is the real test for voice AI: not the demo where the voice sounds natural, but the production call where the caller asks something off-script. In 2025-2026 that test is passable for the first time, for a specific set of use cases. Speech recognition, language models and speech synthesis have all matured. For Dutch SMBs that opens up out-of-hours reception, high-volume intake, first-line customer service, appointment confirmations, and in some sectors sales discovery.
This is not a shallow intro. It is the full explanation: how voice AI works technically, which tools you use, which use cases are truly production-ready, how to integrate it with your telephony, what it costs, how it relates to the AI Act, and (importantly) when you should not use it.
What is voice AI?
Voice AI is software that holds a spoken conversation. Under the hood voice agents have three layers that either work separately (classic pipeline) or as one integrated model (next-gen voice models like OpenAI Realtime or Gemini Live).
1. ASR (Automatic Speech Recognition). Speech to text. Tools: OpenAI Whisper, AssemblyAI, Deepgram. Dutch quality on clear phone audio in 2025-2026 is very good; word error rate (WER) sits below 5%. Whisper-Large-v3 handles Dutch dictation well. NVIDIA's Parakeet is faster on benchmarks but makes markedly more errors on Dutch. Test on your own audio before you pick.
2. LLM (Large Language Model). Understanding text input, planning and formulating an answer. Tools: GPT-4o, Claude, Gemini. The brain of the agent. It understands the question, optionally consults a knowledge base or API, decides what action to take.
3. TTS (Text To Speech). Text to natural-sounding speech. Tools: ElevenLabs, OpenAI Voice, PlayHT, Cartesia. Dutch voices broke through in 2025-2026; with top tools the difference from a human voice is only audible if you listen closely.
In the classic pipeline these three steps run in series: user speaks, ASR, LLM, TTS, reply. Latency: 1.5 to 3 seconds, sometimes noticeable in a phone call.
In next-gen integrated voice models (OpenAI Realtime, Gemini Live, some Anthropic experiments) it all sits in one model: speech in, speech out. Latency drops to 300-700 ms, comparable to natural human conversation. For most production use cases a hybrid approach (classic for robustness, integrated where the conversational flow needs speed) is the most practical.
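The serial nature of the classic pipeline is easy to see in code. The sketch below is an illustration only, not a real integration: `run_turn` is a name invented here, and `fake_asr`, `fake_llm` and `fake_tts` are hypothetical stand-ins for whichever vendors you pick. It shows one thing: the stage latencies add up, which is why a classic pipeline answers more slowly per turn than an integrated model.

```python
import time
from typing import Callable, Dict, Tuple


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run one conversational turn through the three stages in series."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = asr(audio)  # speech -> text
    timings["asr"] = time.perf_counter() - start

    start = time.perf_counter()
    answer = llm(transcript)  # text -> answer text
    timings["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    speech = tts(answer)  # answer text -> audio
    timings["tts"] = time.perf_counter() - start

    # The stages run one after another, so the turn latency is their sum.
    timings["total"] = timings["asr"] + timings["llm"] + timings["tts"]
    return speech, timings


# Hypothetical stand-ins; real ones would call an ASR, LLM and TTS service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Logging the per-stage timings on real calls tells you which layer to swap first when the total creeps past what a phone conversation tolerates.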
How does it work in production?
A production voice agent has six components. First the telephony integration: the agent has to be reachable on a phone number. For the Netherlands this works via SIP trunking (Twilio, Telnyx, Vonage), via cloud PBX integration (RingCentral, Aircall, 3CX), or via WebRTC for browser calls. Then a voice platform or orchestrator that glues telephony, ASR, LLM and TTS together and runs the conversational flow: Vapi, Retell AI, Bland.ai or ElevenLabs Conversational AI. If you want to go deeper, build on LiveKit Agents or Pipecat (open source).
Underneath sits the knowledge base or RAG: a searchable vector database (Pinecone, Weaviate, Postgres+pgvector) filled with your SOPs, FAQs, product info or website content. The agent reads from here. Alongside that the tool integrations: API calls the agent makes during a conversation, like booking an appointment in calendar, fetching customer data from CRM, creating a ticket, looking up an invoice.
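To make the tool integrations concrete, here is a minimal sketch of how an agent's tool request could be executed safely. Everything in it is an assumption for illustration: the tool names, their arguments and the `{"name": ..., "arguments": {...}}` shape of the request are hypothetical, and the tool bodies are stubs where a real system would call your calendar, CRM or ticketing API.

```python
from typing import Any, Callable, Dict


# Hypothetical tools; real ones would call your calendar or invoicing API.
def book_appointment(date: str, slot: str) -> Dict[str, Any]:
    return {"status": "booked", "date": date, "slot": slot}


def lookup_invoice(invoice_id: str) -> Dict[str, Any]:
    return {"status": "found", "invoice_id": invoice_id}


# Allow-list: the agent can only trigger what is registered here.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "book_appointment": book_appointment,
    "lookup_invoice": lookup_invoice,
}


def dispatch(tool_call: Dict[str, Any]) -> Dict[str, Any]:
    """Execute a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    name = tool_call.get("name")
    if name not in TOOLS:
        return {"status": "error", "reason": f"unknown tool: {name}"}
    try:
        return TOOLS[name](**tool_call.get("arguments", {}))
    except TypeError as exc:
        # Missing or unexpected arguments coming from the model.
        return {"status": "error", "reason": str(exc)}
```

The allow-list is the design choice that matters: the model proposes an action, but only registered functions with matching arguments ever run.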
Two things you forget in every pilot and need badly in production. One: human escalation. When the agent doubts or runs outside scope, it must transfer smoothly to a human, with context handover. Not a luxury, a must. Two: monitoring and transcripts. Every call is recorded, transcribed, scored and on issues automatically flagged for human review. Without this you cannot guard quality and cannot deliver an audit trail.
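A rough sketch of the escalation rule with context handover could look like this. The field names, the 0.7 confidence threshold and the three-turn context window are assumptions made for the example, not values from any platform; tune them on your own calls.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

MIN_CONFIDENCE = 0.7  # assumed threshold, not a vendor default


@dataclass
class CallState:
    caller_number: str
    intent: str              # what the agent thinks the caller wants
    confidence: float        # agent's own 0..1 confidence in that intent
    in_scope: bool           # is this an intent the agent may handle?
    asked_for_human: bool = False
    transcript: List[str] = field(default_factory=list)


def should_escalate(state: CallState) -> bool:
    """Escalate on an explicit request, an out-of-scope intent, or doubt."""
    return (
        state.asked_for_human
        or not state.in_scope
        or state.confidence < MIN_CONFIDENCE
    )


def build_handover(state: CallState) -> Dict[str, Any]:
    """Context the human sees before picking up, so the caller need not repeat it."""
    return {
        "caller_number": state.caller_number,
        "intent": state.intent,
        "last_turns": state.transcript[-3:],
        "flag_for_review": True,  # escalated calls go into the review queue
    }
```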
For the broader agent context see /en/ai-agents.
Use cases that work in production
Concrete examples running in production at Dutch SMBs in 2025-2026.
Out-of-hours reception is the clearest win: an AI voice picks up the phone after 17:00, answers first-line questions (opening hours, prices, availability), books appointments and routes urgent items to an emergency number. Practices (GP, dentist, physio), hotels and service organisations typically see a 30 to 60% volume reduction during day hours.
Intake calls are second. An agent takes new leads or patients through a structured intake (contact info, complaint, urgency, preferred provider), writes the answers into the system, and probes further on multi-faceted questions. This works well for clinics, legal advice practices and other intake-heavy sectors.
First-line customer service covers the first two minutes of the call: identification, complaint categorisation and direct answers to simple FAQs, with more complex questions routed to a human with full context. That takes 40 to 60% of the volume away from human agents.
Appointment confirmations and reminders can take over 80% of the calling work assistants currently do at care practices.
Sales discovery works selectively: for simple-tech B2B products an agent can ask inbound leads qualification questions and book them with the right account manager. It does not work for consultative selling.
Booking and reservation for restaurants, hairdressers, car rental or kayak rental rounds out the list: high volumes with simple variation are the sweet spot.
For customer-service-specific see /en/ai-customer-service.
Which tools do you use?
The Dutch 2026 voice AI landscape has roughly four tiers.
Tier 1: voice platforms (orchestrators). Vapi.ai is most used by technical teams, programmable via API, supports any ASR-LLM-TTS combination. Strong for custom, thinner on no-code. Retell AI is similar, slightly more plug-and-play for enterprise, with good call recording and analytics built in. Bland.ai targets outbound (sales and service calls) with lower latency, stronger for big outbound campaigns. ElevenLabs Conversational AI is their own orchestrator with their voices built in, strong on voice quality, more limited on tooling flexibility.
Tier 2: TTS engines. ElevenLabs is dominant for Dutch voices. Voice cloning, multispeaker, emotions, regional accents. OpenAI Voice via the Realtime API gives low latency and decent NL quality. Cartesia / Sonic is a newcomer, low latency, suitable for real-time. Google and Azure TTS are enterprise-grade with EU hosting, slightly less natural than ElevenLabs in NL.
Tier 3: ASR engines. Deepgram is fast, accurate for Dutch, EU deployment available. AssemblyAI is comparable, strong on real-time. OpenAI Whisper delivers good quality with higher latency than Deepgram or AssemblyAI, fine for non-real-time transcripts. For Dutch dialects (West Flemish, broad Limburgish, Twents) WER quickly climbs to 10-15%, where standard Dutch keeps it under 5%. Always test on your own caller population.
Tier 4: open source / build your own. LiveKit Agents (WebRTC + agent framework, popular for those wanting full control), Pipecat (Python framework from Daily.co, strong for multi-modal agents), or Whisper plus LLM of choice plus open source TTS for those wanting everything on own infrastructure (healthcare, defence, financial).
In practice: for 80% of SMB use cases Vapi or Retell + ElevenLabs + Deepgram + GPT-4o or Claude is the working combination. For strict GDPR or compliance requirements you go to EU-hosted or open source on your own stack.
Evaluation criteria: how do you know a voice agent works well?
Eight metrics that count in production.
1. WER (Word Error Rate). What percentage of spoken words does ASR misinterpret? Good: <5% for clear phone audio, NL.
2. End-to-end latency. How long between the caller finishing a sentence and the agent starting its reply? Good: <1 sec integrated, <2 sec classic pipeline.
3. Task success rate. What percentage of calls reach the conversational goal (appointment booked, question answered, complaint correctly routed)? Good: >85% for defined use cases.
4. Escalation rate. What percentage transfers to a human? No fixed norm; first-line averages 20-40%, fine with smooth transfer.
5. Sentiment score per call. Did the caller experience the call as positive, neutral or negative? Track trends, not individual calls.
6. CSAT (Customer Satisfaction Score). Short post-call survey (3 questions via SMS). Production quality: minimum 4.0 out of 5.0.
7. Hallucination rate. What percentage of calls contain a factually incorrect agent answer? Target: <2%; requires tight RAG and monitoring.
8. AHT (Average Handle Time). Average call length? Comparing it with human handle time gives the real ROI.
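WER is the one metric in this list with a fixed definition: substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. A minimal sketch of that calculation, using a word-level edit distance, is below. The function name is made up here, and the normalisation (lowercasing, splitting on whitespace) is a simplifying assumption; production scoring usually also strips punctuation and normalises numbers.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Run it on a sample of human-corrected transcripts from your own callers: one dropped word in an eight-word sentence already gives a WER of 0.125 for that sentence, which is why the figure has to be averaged over many calls.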
AI Act and voice AI
For voice AI at least three AI Act provisions touch your organisation.
Art. 4 (AI literacy) requires that anyone working with the voice agent or assessing its output knows how it works and what can go wrong. Document training and policy. See /en/ai-training.
Art. 50 (Transparency) is the one that matters most in production. AI interacting with humans must make clear it is AI. For voice that means: "Hello, you are speaking with our AI assistant" as opening, or a comparable indication within ten seconds. Not negotiable. Concealment can mean fines.
High-risk classification applies for voice AI in HR (telephone pre-screening of candidates), in healthcare (medical triage or diagnostic questions) and in critical infrastructure. There you likely fall in high risk, which makes risk management, dataset control, technical documentation and human escalation mandatory. For the broader AI Act context see /en/ai-act; for self-scan see /en/ai-act-checker.
On top of that GDPR still applies. Voice recordings and transcripts are personal data. Informed consent, retention policy and EU hosting are standard, not luxury.
Integration with your telephony
For most Dutch organisations voice AI sits alongside, not instead of, existing telephony. Three routes.
A direct number for the agent is the simplest setup. A new 088 or geographic number routing 24/7 or out-of-hours to the AI agent. Works with Twilio, Telnyx or Vonage SIP trunks.
Routing inside your PBX is cleaner if you already run cloud PBX. Aircall, RingCentral, 3CX and most vendors now support an AI reception integration. Inbound calls go to the agent first, the agent transfers to humans based on conversation. Works with your existing numbers and routing rules.
Hybrid with human-first transfer fits sectors where the first voice has to be a human (GP practices, lawyers). During office hours the human picks up and hits a key to transfer to the agent for specific tasks (e.g. "book this appointment"). In conservative organisations this is often the most accepted way to introduce voice AI.
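The routing decision for the first route (a dedicated number that sends calls to the agent either 24/7 or only out of hours) can be sketched in a few lines. The office hours, the weekday rule and the mode names are assumptions for the example; in practice this logic usually lives in your SIP provider's or PBX's routing rules rather than in your own code.

```python
from datetime import datetime, time

# Assumed office hours; the 17:00 closing time follows the reception example.
OFFICE_OPEN = time(9, 0)
OFFICE_CLOSE = time(17, 0)


def is_office_hours(now: datetime) -> bool:
    """Monday to Friday between opening and closing time."""
    return now.weekday() < 5 and OFFICE_OPEN <= now.time() < OFFICE_CLOSE


def route_inbound(now: datetime, mode: str) -> str:
    """Return "agent" or "human" for an inbound call.

    mode is "always" (agent answers 24/7) or "out_of_hours"
    (agent answers only outside office hours).
    """
    if mode == "always":
        return "agent"
    if mode == "out_of_hours":
        return "human" if is_office_hours(now) else "agent"
    raise ValueError(f"unknown routing mode: {mode}")
```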
Voice choice and brand identity
A small detail that matters more than people think: which voice does your agent use? Three routes.
Pre-set voices from Vapi, Retell or ElevenLabs are fast to deploy and offer dozens of Dutch options. No extra work, but the same voice that other companies use. Fine for reception. For brand-distinctive applications it feels generic.
Voice cloning on a voice actor is the logical pick for brands with an audio identity (radio spots, podcast intro). With ElevenLabs Professional Voice Cloning you record a voice actor and use that voice as your agent. Cost: 250 to 1,000 euro for an actor recording 30 minutes, plus an ElevenLabs Pro subscription. Result: a unique voice only your company uses.
Voice cloning on an own employee is technically the same but uses a colleague as voice source. Works for internal agents (HR bot, IT helpdesk bot) where voice familiarity gives comfort. Do not forget explicit consent and a written agreement for what happens if the employee leaves.
In all three routes a short audio transparency line ("you are speaking with our AI assistant") is mandatory under the AI Act. A cloned voice on a live employee without that indication is misleading per Art. 50.
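Because the transparency line is easy to lose during prompt tuning, it is worth checking automatically in your call monitoring. The sketch below assumes a transcript stored as (speaker, start time in seconds, text) tuples and a hypothetical list of accepted phrases; both are illustrative choices, and the ten-second window is the working rule used in this article, not a number quoted from the regulation.

```python
from typing import List, Tuple

# (speaker, start time in seconds from pickup, text)
Turn = Tuple[str, float, str]

DISCLOSURE_WINDOW_S = 10.0  # working rule from this article
DISCLOSURE_PHRASES = ("ai assistant", "ai-assistent")  # assumed wording; adapt to your script


def discloses_ai(transcript: List[Turn]) -> bool:
    """True if the agent identifies itself as AI within the window."""
    for speaker, start_s, text in transcript:
        if speaker != "agent" or start_s > DISCLOSURE_WINDOW_S:
            continue
        if any(phrase in text.lower() for phrase in DISCLOSURE_PHRASES):
            return True
    return False
```

Calls that fail the check belong in the same human review queue as other flagged calls.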
An honest word on quality issues
Voice agents in 2025-2026 break consistently on four points. Interruptions are number one. If the caller speaks over your sentence midway, the agent has to stop, listen and replan. The good orchestrators (Vapi, Retell) handle this out of the box. In a self-built stack you are almost guaranteed to forget this in the first pilot, and the production system then feels annoying.
Background noise is number two. Cars, construction, kids in the background: ASR stumbles and the agent replies oddly. Invest in good VAD (Voice Activity Detection) and a silence-aware fallback. Code switching is number three. Callers alternating Dutch and English, or standard Dutch and regional dialect, cause more errors. Pick an ASR that is explicitly multilingual and test on your real caller population. For a call centre that gets a lot of callers from Limburg or Brabant this is not a detail.
Regional data is number four and the most underestimated. Postcodes, Dutch street names with diacritics, numbers in spoken form ("twenty seven" vs "27"): the agent must normalise these before posting to API or CRM. Forget this and you come back to a human to manually correct what the agent misheard. That is the kind of cleanup that quietly drains the business case.
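A minimal sketch of that normalisation step, assuming the ASR returns raw strings: one helper for Dutch postcodes (four digits, not starting with 0, followed by two letters) and one for spoken Dutch numbers up to 99. The function names are invented for this example and the coverage is deliberately small; a production system would also handle house-number suffixes, larger numbers and street-name lookups against an address register.

```python
import re
import unicodedata
from typing import Optional

SIMPLE = {
    "nul": 0, "een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
    "zes": 6, "zeven": 7, "acht": 8, "negen": 9, "tien": 10, "elf": 11,
    "twaalf": 12, "dertien": 13, "veertien": 14, "vijftien": 15,
    "zestien": 16, "zeventien": 17, "achttien": 18, "negentien": 19,
}
TENS = {
    "twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50,
    "zestig": 60, "zeventig": 70, "tachtig": 80, "negentig": 90,
}


def _strip_accents(text: str) -> str:
    """Drop diacritics so 'tweeëntwintig' and 'één' match the tables above."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalise_postcode(raw: str) -> Optional[str]:
    """Turn '1234ab' or '12 34 a b' into '1234 AB'; None if it is no postcode."""
    compact = re.sub(r"\s+", "", raw)
    match = re.fullmatch(r"([1-9]\d{3})([A-Za-z]{2})", compact)
    if not match:
        return None
    return f"{match.group(1)} {match.group(2).upper()}"


def dutch_number_to_int(word: str) -> Optional[int]:
    """Parse a spoken Dutch number from 0 to 99, e.g. 'zevenentwintig' -> 27."""
    w = _strip_accents(word.strip().lower())
    if w in SIMPLE:
        return SIMPLE[w]
    if w in TENS:
        return TENS[w]
    # Compounds are built as unit + "en" + tens: zeven-en-twintig.
    for tens_word, tens_value in TENS.items():
        suffix = "en" + tens_word
        if w.endswith(suffix):
            unit = SIMPLE.get(w[: -len(suffix)])
            if unit is not None and 1 <= unit <= 9:
                return tens_value + unit
    return None
```

Returning `None` instead of guessing is deliberate: an unparseable value is a cue for the agent to ask the caller to repeat it, which is cheaper than a wrong CRM record.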
What does voice AI cost?
Three cost components.
Per-minute cost during calls: ASR (typically $0.005-0.01/min), LLM tokens (per call $0.05-0.20), TTS (typically $0.05-0.15/min for ElevenLabs), telephony (Twilio inbound about $0.01/min). Total: €0.20-0.50 per call minute. For 10,000 minutes/month that puts you at €2,000-5,000.
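The monthly figure is simply minutes times the all-in per-minute rate. A small helper makes it easy to rerun the estimate for your own volume; the default rates are the €0.20-0.50 range quoted above, and the function name is invented for this sketch.

```python
from typing import Tuple


def monthly_cost_range(
    minutes_per_month: int,
    low_rate_eur: float = 0.20,
    high_rate_eur: float = 0.50,
) -> Tuple[float, float]:
    """Return the (low, high) monthly usage cost in euro."""
    low = round(minutes_per_month * low_rate_eur, 2)
    high = round(minutes_per_month * high_rate_eur, 2)
    return low, high


low, high = monthly_cost_range(10_000)
print(f"10,000 minutes: EUR {low:,.0f} to {high:,.0f} per month")
```

This covers usage only; build cost and the maintenance retainer come on top.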
Build cost is a defined project; ask for a fixed price after a discovery call. Maintenance usually runs as a retainer, because a voice agent is not set-and-forget: prompt tuning and knowledge-base updates stay on the to-do list.
ROI math: an agent handling 5,000 minutes/month of routine volume replaces well over one FTE in customer-service salary+overhead. Break-even within three months for SMBs with enough call volume. Too little volume and you mostly pay retainer for an agent sitting idle.
When NOT voice AI?
Three scenarios where I steer clients away from voice AI.
Low volumes: building a voice agent for 100 calls a month is overengineering. A good human reception or an async channel (chatbot, mail form) is then cheaper and more effective.
High-emotional content: for grief care, heavy legal matters, mental-health helplines and dismissal calls, do not use an AI voice as first contact. The social and ethical damage outweighs the efficiency gain. Everyone knows the feeling of getting an IVR at the moment you need a human.
Complex consultative conversation: strategic sales, complex technical helpdesk that needs deep probing, custom legal advice. A voice agent can qualify and route but not replace. Do not try. The Dutch Railways comparison at the start of this piece is exactly that: as soon as the question lies outside the script, you lose the conversation.
Voice AI vs chatbot vs receptionist: an honest comparison
For many SMBs the real question is not "which voice tool", but "which channel". Three alternatives and when they win.
Classic human reception wins on quality for low volumes and complex conversations. Up to 50 calls a day, or for calls averaging more than 5 minutes, a good human reception is cost-comparable and qualitatively stronger. Above that crossover the human becomes the bottleneck.
A text chatbot or WhatsApp bot wins on asynchronous channel traffic and international reach. Clients sending a message do not expect an instant reply. Callers do. For B2B organisations a chatbot is often enough. For B2C service voice is usually needed because the audience prefers calling.
The voice AI agent wins on high volumes with routine conversations, and on 24/7 reach without night shift. ROI is highest exactly where human reception is expensive and unscalable.
In practice: combine. Voice AI for the first 60-90 seconds, human for escalation, chatbot for asynchronous questions. A good architecture has no one-size-fits-all channel.
How do you start?
Three steps for SMBs considering voice AI.
Inventory use cases. Not "we want voice AI", but "our reception gets 200 calls a day, of which 60% goes to voicemail out of hours". Per use case: volume, call duration, percentage of routine questions, escalation patterns. A free Quickscan via /en/ai-scan maps this in an hour.
Start defined. First use case in six weeks in production with limited scope (only book appointments + opening hours + transfer, for example). Measure all eight criteria (WER, latency, task success, escalation, sentiment, CSAT, hallucination, AHT) two weeks post go-live. Iterate.
Scale only what works. If the first use case shows stable numbers, expand to the next (intake, reminders, confirmations). If the first does not work, you have invested six weeks instead of six months, and you know where the limit sits. Stopping, learning and picking a different use case is not failure; it is steering.
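The "measure, then scale" decision can be made mechanical. The sketch below checks measured values against the targets given in the evaluation criteria: WER under 5%, latency under 2 seconds for a classic pipeline, task success above 85%, CSAT of at least 4.0 and hallucination rate under 2%. Escalation rate, sentiment and AHT are left out because the article gives no fixed norm for them. The metric keys and the function name are assumptions for the example.

```python
from typing import Callable, Dict, List

# Targets taken from the evaluation criteria in this article.
TARGETS: Dict[str, Callable[[float], bool]] = {
    "wer": lambda v: v < 0.05,
    "latency_s": lambda v: v < 2.0,  # classic pipeline target
    "task_success": lambda v: v > 0.85,
    "csat": lambda v: v >= 4.0,
    "hallucination_rate": lambda v: v < 0.02,
}


def failing_metrics(measured: Dict[str, float]) -> List[str]:
    """Return the metrics that are missing or miss their target."""
    return [
        name
        for name, meets_target in TARGETS.items()
        if name not in measured or not meets_target(measured[name])
    ]
```

An empty list two weeks after go-live is the signal to expand; anything else tells you what to iterate on first.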
Conclusion
In 2026 voice AI is production-ready for a specific set of use cases at Dutch SMBs: out-of-hours reception, intake at high volume, first-line customer service, appointment confirmations, simple bookings. For those use cases the tech stack is mature, the compliance path clear, and ROI achievable within three to six months.
For other use cases (high-emotional, complex advice, low volumes) voice AI in 2026 is not the right tool. A good agency says so. AI strategy without pilots is not a strategy, it is a report. Start small, measure hard, and pick channels based on what the caller wants, not what the tech can do.
Want to know if voice AI fits your situation? Schedule a free discovery call. For the full agent approach and the tools I work with, see /en/ai-agents. For customer-service-specific voice implementations see /en/ai-customer-service. For the AI Act context see /en/ai-act.