Voice AI in 2026: a complete guide for Dutch businesses
Laurens van Dijk
Founder, DataDream
Voice AI is the breakthrough of 2025-2026
A few years ago an AI voice still sounded robotic and latency was so high that a conversation felt unnatural. By 2025 the three underlying technologies (speech recognition, language models, speech synthesis) had become so fast and so good that a well-built voice agent is barely distinguishable from a human on a normal phone call. For Dutch SMB organisations that opens up a new automation category: out-of-hours reception, high-volume intake calls, first-line customer service, appointment confirmations, and in some sectors even sales discovery calls.
This guide is not a shallow intro. It is the full explanation: how voice AI works technically, which tools you use, which use cases are truly production-ready, how to integrate with your telephony, what it costs, how it relates to the AI Act, and (importantly) when you should not use it.
What is voice AI?
Voice AI is software that can hold a spoken conversation. Under the hood voice agents have three layers that either work separately (classic pipeline) or as one integrated model (next-gen voice models like OpenAI Realtime or Gemini Live):
1. ASR (Automatic Speech Recognition). Converting speech to text. Tools: OpenAI Whisper, AssemblyAI, Deepgram. Dutch quality in 2025-2026 is very good; the error rate for clear phone audio sits below 5%.
2. LLM (Large Language Model). Understanding text input, planning and formulating an answer. Tools: GPT-4o, Claude, Gemini. The "brain" of the agent: it understands the question, optionally consults knowledge base or APIs, decides which action is needed.
3. TTS (Text To Speech). Converting text output to natural-sounding speech. Tools: ElevenLabs, OpenAI Voice, PlayHT, Cartesia. Dutch voices broke through in 2025-2026; with top tools the difference from a human voice is only audible if you listen closely.
In the classic pipeline these three steps run in series: user speaks -> ASR -> LLM -> TTS -> reply. Latency: 1.5 to 3 seconds, which is sometimes noticeable in a phone call.
In next-gen integrated voice models (OpenAI Realtime, Gemini Live, some Anthropic experiments) it all sits in one model: speech in, speech out. Latency drops to 300-700 ms, comparable to natural human conversation. For most production use cases a hybrid approach (classic for robustness, integrated where the conversational flow needs speed) is the most practical.
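The classic pipeline can be sketched in a few lines. The three stage functions below are deliberately stubs: in production each one wraps a vendor SDK call (Deepgram, GPT-4o, ElevenLabs or whichever combination you pick), and the function names are illustrative, not any vendor's actual API.

```python
# Minimal sketch of the classic ASR -> LLM -> TTS pipeline.
# All three stage functions are stubs standing in for vendor calls.

def transcribe(audio: bytes) -> str:
    """ASR stub: speech in, text out."""
    return "what are your opening hours"

def generate_reply(text: str, history: list[str]) -> str:
    """LLM stub: understand the question, decide on an answer."""
    history.append(text)
    if "opening hours" in text:
        return "We are open Monday to Friday, 9:00 to 17:00."
    return "Let me transfer you to a colleague."

def synthesize(text: str) -> bytes:
    """TTS stub: text in, audio out."""
    return text.encode("utf-8")  # stand-in for real audio bytes

def handle_turn(audio: bytes, history: list[str]) -> bytes:
    # The three serial steps; their summed latency is why the classic
    # pipeline sits at 1.5-3 seconds end to end.
    text = transcribe(audio)
    reply = generate_reply(text, history)
    return synthesize(reply)

history: list[str] = []
audio_reply = handle_turn(b"<caller audio>", history)
print(audio_reply.decode("utf-8"))
```

In an integrated model these three stages collapse into one speech-to-speech call, which is where the 300-700 ms latency comes from.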
How does it work in production?
A production voice agent typically has six components:
1. Telephony integration. The agent has to hang on a phone number. For the Netherlands this works via SIP trunking (Twilio, Telnyx, Vonage), via cloud PBX integration (RingCentral, Aircall, 3CX), or via WebRTC for browser calls.
2. Voice platform or orchestrator. The layer that glues telephony, ASR, LLM and TTS together and runs the conversational flow. Tools: Vapi, Retell AI, Bland.ai, ElevenLabs Conversational AI. Or going deeper: build on LiveKit Agents or Pipecat (open source frameworks).
3. Knowledge base or RAG. A searchable source of company information that the agent answers from. Typically this is a vector database (Pinecone, Weaviate, Postgres+pgvector) filled with your SOPs, FAQs, product info or website content.
4. Tool integrations. API calls the agent can make during a conversation: book an appointment in calendar, fetch customer data from CRM, create a ticket, look up an invoice.
5. Human escalation. When the agent is unsure or runs outside its scope, it transfers to a human. Not a luxury: for production quality a smooth warm transfer (with context handover) is a must.
6. Monitoring and transcripts. Every call is recorded, transcribed, scored and (on issues) automatically flagged for human review. Without this you cannot guard quality and cannot deliver an audit trail.
For the broader agent context see /en/ai-agents.
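Component 5, the warm transfer, is worth sketching because it is the part self-builders most often skip. The thresholds and scope list below are illustrative assumptions; in practice they come from your own call data.

```python
# Sketch of a warm-transfer decision: escalate when the agent is
# unsure or out of scope, and hand over conversation context.
# IN_SCOPE and the 0.7 threshold are illustrative assumptions.
from dataclasses import dataclass, field

IN_SCOPE = {"opening_hours", "appointment", "pricing"}

@dataclass
class CallState:
    transcript: list[str] = field(default_factory=list)
    intent: str = "unknown"
    confidence: float = 0.0

def should_escalate(state: CallState) -> bool:
    return state.intent not in IN_SCOPE or state.confidence < 0.7

def handover_payload(state: CallState) -> dict:
    """Context the human colleague sees before accepting the call."""
    return {
        "intent": state.intent,
        "confidence": state.confidence,
        "transcript_tail": state.transcript[-5:],  # last 5 turns
    }

call = CallState(transcript=["I want to dispute an invoice"],
                 intent="billing_dispute", confidence=0.9)
if should_escalate(call):
    print("escalating with context:", handover_payload(call))
```

The payload is the difference between a warm and a cold transfer: the human picks up already knowing the intent and the last few turns.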
Use cases that work in production
Concrete examples we run in production with Dutch clients in 2025-2026:
1. Out-of-hours reception. An AI voice that picks up the phone after 17:00, answers first-line questions (opening hours, prices, availability), books appointments and routes urgent items to an emergency number. For practices (GP, dentist, physio), hotels and service organisations this typically cuts daytime call volume by 30-60%.
2. Intake calls. An agent that gives new leads or patients a structured intake (contact info, complaint, urgency, preferred provider), writes the answers into the system, and asks follow-up questions on multi-part answers. For clinics, legal advice practices and intake-heavy sectors.
3. First-line customer service. An agent that handles the first two minutes of a customer service call: identification, complaint categorisation, simple FAQ questions answered directly, more complex ones routed to a human with full context. Deflects 40-60% of volume away from human agents.
4. Appointment confirmations and reminders. Outbound calls that confirm, remind or reschedule. People who say "yes" get an SMS confirmation; those who want to reschedule get a new slot. For care practices this can take over 80% of the calling work assistants currently do.
5. Sales discovery (selective). For specific B2B products with a simple proposition, an AI agent can ask inbound leads qualification questions and book them with the right account manager. Not for complex consultative selling.
6. Booking and reservation. Restaurants, hairdressers, car rental, kayak rental: an agent that walks through "when do you want to come, how many people, which service" and books into calendar software. Works for high volumes with simple variation.
For customer-service specifics see /en/ai-klantenservice: the broader AI customer service approach that integrates voice.
Which tools do you use?
The Dutch 2026 voice AI landscape has roughly four tiers: orchestrators, TTS voices, ASR engines, and open-source frameworks for self-hosting.

Orchestrators:
- Vapi.ai: most used by technical teams, programmable via API, supports any ASR-LLM-TTS combination. Strong for custom builds, thinner on no-code.
- Retell AI: similar to Vapi, slightly more plug-and-play for enterprise. Good call recording and analytics built in.
- Bland.ai: more focused on outbound (sales and service calls). Lower latency, stronger for big outbound campaigns.
- ElevenLabs Conversational AI: ElevenLabs' own orchestrator with their voices built in; strong voice quality, limited tooling flexibility.

TTS:
- ElevenLabs: dominant for Dutch voices. Voice cloning possible, multi-speaker, emotions, regional accents.
- OpenAI Voice: Realtime API, low latency, decent Dutch quality.
- Cartesia / Sonic: newcomer, low latency, good for real-time.
- Google / Azure TTS: enterprise-grade, EU hosting available, slightly less natural than ElevenLabs for Dutch.

ASR:
- Deepgram: fast, accurate for Dutch, EU deployment available.
- AssemblyAI: comparable; strong on real-time.
- OpenAI Whisper: good quality, higher latency than Deepgram/AssemblyAI; suitable for non-real-time transcripts.

Open source / self-hosted:
- LiveKit Agents: WebRTC plus agent framework, popular for those wanting full control.
- Pipecat: Python framework from Daily.co, strong for multi-modal agents.
- Whisper + LLM of choice + open-source TTS: for those wanting everything on their own infrastructure (healthcare, defence, financial).
In practice: for 80% of SMB use cases, Vapi or Retell + ElevenLabs + Deepgram + GPT-4o or Claude is the working combination. For strict GDPR or compliance requirements you move to EU-hosted services or open source on your own stack.
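That combination can be written down as a vendor-neutral configuration. The field names below are illustrative assumptions, not the actual schema of Vapi, Retell or any other platform:

```python
# Vendor-neutral sketch of the "80% stack" named above.
# Field names are illustrative, not any platform's real config schema.
from dataclasses import dataclass

@dataclass
class VoiceAgentConfig:
    orchestrator: str          # e.g. Vapi or Retell
    asr: str                   # e.g. Deepgram (fast, good Dutch)
    llm: str                   # e.g. GPT-4o or Claude
    tts: str                   # e.g. ElevenLabs (Dutch voices)
    language: str = "nl"
    eu_hosting: bool = False   # flip to True for strict GDPR setups

smb_default = VoiceAgentConfig(
    orchestrator="vapi", asr="deepgram", llm="gpt-4o", tts="elevenlabs")
print(smb_default)
```

The point of writing it down like this: each of the four slots is swappable, which is exactly what you use when compliance forces an EU-hosted or self-hosted alternative into one slot.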
Evaluation criteria: how do you know a voice agent works well?
Eight metrics we measure in production:
1. WER (Word Error Rate). What percentage of spoken words does ASR misinterpret? Good: <5% for clear Dutch phone audio.
2. End-to-end latency. The time between the user stopping talking and the agent starting to reply. Good: <1 sec integrated, <2 sec classic pipeline.
3. Task success rate. What percentage of calls reach the conversational goal (appointment booked, question answered, complaint correctly routed)? Good: >85% for defined use cases.
4. Escalation rate. What percentage transfers to a human? No fixed norm; first-line averages 20-40%, fine with smooth transfer.
5. Sentiment score per call. Did the caller experience the call as positive, neutral or negative? Track trends, not individual calls.
6. CSAT (Customer Satisfaction Score). Short post-call survey (3 questions via SMS). Production quality: minimum 4.0 out of 5.0.
7. Hallucination rate. What percentage of calls contain a factually incorrect agent answer? Target: <2%; requires tight RAG and monitoring.
8. AHT (Average Handle Time). How long does the average call take? Compared against human handle time, this gives the real ROI.
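Of these eight, WER is the one you can compute yourself from a reference transcript and the ASR output: word-level edit distance divided by the number of reference words. A self-contained sketch:

```python
# Word Error Rate: word-level edit distance (substitutions, insertions,
# deletions) between reference and hypothesis, over reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "ik wil graag een afspraak maken voor dinsdag"
hyp = "ik wil graag een afspraak maken voor die dag"
print(f"WER: {wer(ref, hyp):.2f}")  # 2 edits over 8 words -> 0.25
```

A 0.25 WER on this sample call would be far above the <5% target; in practice you run this over hundreds of manually checked transcripts, not one.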
AI Act and voice AI
For voice AI at least three AI Act provisions touch your organisation:
Art. 4 (AI literacy). Anyone working with the voice agent or assessing its output must know how it works and what can go wrong. Document training and policy. See /en/ai-training.
Art. 50 (Transparency). AI interacting with humans must make clear it is AI. For voice that means: "Hello, you are speaking with our AI assistant" as opening, or a comparable indication in the first ten seconds. Not negotiable; concealment can mean fines.
High-risk classification. For voice AI in HR (telephone pre-screening of candidates), in healthcare (medical triage or diagnostic questions) or in critical infrastructure, you likely fall into the high-risk category. That means risk management, dataset controls, technical documentation and human escalation are mandatory. For the broader AI Act context see /en/ai-act; for a self-scan see /en/ai-act-checker.
And GDPR obligations have not gone away. Voice recordings and transcripts are personal data. Informed consent, a retention policy and EU hosting are standard (not a luxury).
Integration with your telephony
For most Dutch organisations voice AI sits alongside (not instead of) existing telephony:
Option 1: Direct number for the agent. A new 088 or geographic number routing 24/7 or out-of-hours to the AI agent. Simplest setup; works with Twilio, Telnyx or Vonage SIP trunks.
Option 2: Routing inside your PBX. Aircall, RingCentral, 3CX and most cloud PBX vendors now support an "AI reception" integration where inbound calls go to the agent first, and the agent transfers to humans based on conversation. Works seamlessly with existing numbers and routing rules.
Option 3: Hybrid with human-first transfer. During office hours: human picks up, hits a key to transfer to the agent for specific tasks (e.g. "book this appointment"). For some sectors (GP practices, lawyers) this is the most accepted introduction.
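The routing rule behind options 1 and 3 is simple time-based logic; in practice it lives in your PBX or in the orchestrator's inbound webhook. A sketch with illustrative office hours:

```python
# Out-of-hours routing sketch: office hours go to the human reception,
# everything else to the AI agent. Hours are illustrative assumptions.
from datetime import datetime, time

OFFICE_OPEN = time(8, 30)
OFFICE_CLOSE = time(17, 0)

def route_inbound(call_time: datetime) -> str:
    is_weekday = call_time.weekday() < 5  # Mon=0 .. Fri=4
    in_hours = OFFICE_OPEN <= call_time.time() < OFFICE_CLOSE
    return "human_reception" if (is_weekday and in_hours) else "ai_agent"

print(route_inbound(datetime(2026, 3, 3, 14, 0)))   # Tuesday afternoon
print(route_inbound(datetime(2026, 3, 3, 19, 30)))  # Tuesday evening
```

Option 3 inverts the same rule: during office hours the human answers and hands specific tasks to the agent, instead of the other way around.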
Voice choice and brand identity
A small detail that matters more than people think: which voice does your agent use? In 2025-2026 there are three routes:
1. Pre-set voices. Vapi, Retell and ElevenLabs offer dozens of Dutch voices off the shelf. Fast, no extra work, but it is the same voice other companies use. For reception roles that is fine; for brand-distinctive applications it feels generic.
2. Voice cloning on a voice actor. With ElevenLabs Professional Voice Cloning you record a voice actor and use that voice as your agent. Cost: €250-1,000 for a voice actor recording 30 minutes, plus an ElevenLabs Pro subscription. Result: a unique voice only your company uses. For brands that already have an audio identity (radio spots, podcast intro) it is the logical pick.
3. Voice cloning on an own employee. Technically the same, but you use a colleague as voice source. Works for internal agents (HR bot, IT helpdesk bot) where voice familiarity gives comfort. Do not forget explicit consent and a written agreement for what happens if the employee leaves.
In all three routes a short audio transparency line ("you are speaking with our AI assistant") is mandatory under the AI Act. A cloned voice of a real employee without this indication is misleading under AI Act Art. 50.
An honest word on quality issues
Experience teaches that voice agents in 2025-2026 consistently break on four points:
1. Interruptions. If the caller talks over the agent mid-sentence, the agent has to stop, listen and replan. Good orchestrators (Vapi, Retell) handle this out of the box; in a self-build you will almost certainly forget it in the first pilot, and the production system will feel annoying.
2. Background noise. Cars, construction, kids in the background: ASR stumbles and the agent replies oddly. Invest in good VAD (Voice Activity Detection) and a silence-aware fallback.
3. Code switching. Callers alternating Dutch and English or Dutch and regional dialect cause more ASR errors. For production: pick an ASR that is explicitly multilingual and test on your real caller population.
4. Regional data. Postcodes, Dutch street names with diacritics, numbers in spoken form ("twenty seven" vs "27"): the agent must normalise these before posting to an API or CRM. Forget this and a human ends up manually correcting what the agent misheard.
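Issue 4 is a plain text-normalisation step between the ASR output and your API call. A minimal sketch for spelled-out numbers and Dutch postcode formatting; the word lists are illustrative and far from complete:

```python
# Spoken-form normalisation sketch: convert "twenty seven" style
# phrases to digits and force Dutch "1234 AB" postcode formatting
# before posting to a CRM or booking API. Word lists are illustrative.
import re

UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def normalise_numbers(text: str) -> str:
    """Replace 'twenty seven' / 'thirty' style phrases with digits."""
    def repl(m: re.Match) -> str:
        tens, unit = m.group(1), m.group(2)
        return str(TENS[tens] + (UNITS[unit] if unit else 0))
    pattern = (r"\b(" + "|".join(TENS) + r")"
               r"(?:[ -](" + "|".join(UNITS) + r"))?\b")
    return re.sub(pattern, repl, text)

def normalise_postcode(text: str) -> str:
    """Force Dutch '1234 AB' postcode formatting (digits, space, caps)."""
    return re.sub(r"\b(\d{4})\s*([a-zA-Z]{2})\b",
                  lambda m: f"{m.group(1)} {m.group(2).upper()}", text)

msg = "my house number is twenty seven, postcode 1012ab"
print(normalise_postcode(normalise_numbers(msg)))
```

Real deployments do this per language (Dutch number words differ structurally from English ones) and validate the result against the postcode register before writing it anywhere.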
What does voice AI cost?
Three cost components:
1. Build / setup. A defined voice agent for one use case typically takes two to six weeks of work. At a specialised agency that is €8,000 to €25,000 for the first working version, depending on complexity (knowledge base size, tool integrations, multi-flow scripts).
2. Per-minute cost. During calls you pay for ASR (typically $0.005-0.01/min), LLM tokens (per call $0.05-0.20), TTS (typically $0.05-0.15/min for ElevenLabs), and telephony (Twilio inbound about $0.01/min). Total: €0.20-0.50 per call minute. For 10,000 minutes/month that puts you at €2,000-5,000.
3. Maintenance. Voice agents are not "set and forget". Plan a fixed monthly retainer for prompt tuning, monitoring, quality review, knowledge base updates. Typically €1,000-3,000/month for a production agent in active use.
ROI math: an agent handling 5,000 minutes/month for €1,500 (per-minute cost plus retainer) replaces about 1.5 FTE of customer service costing €4,500 per month in salary and overhead. Break-even within three months for SMBs with enough call volume.
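The ROI math above as a back-of-the-envelope calculation. All figures come from the ranges in this section and are illustrative, not a quote:

```python
# Break-even sketch using the figures from this section (illustrative).
setup_cost = 8_000            # first working version (low end of €8k-25k)
monthly_agent_cost = 1_500    # per-minute usage plus maintenance retainer
replaced_human_cost = 4_500   # ~1.5 FTE customer service, salary + overhead

monthly_saving = replaced_human_cost - monthly_agent_cost
break_even_months = setup_cost / monthly_saving

print(f"monthly saving:  €{monthly_saving:,}")
print(f"break-even after {break_even_months:.1f} months")
```

With these inputs the break-even lands just under three months; with a €25,000 build and lower volume it stretches well past six, which is why the volume question in step 1 below comes first.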
When NOT voice AI?
Honest advice: three scenarios where we steer clients away:
1. Low volumes. Building a voice agent for 100 calls a month is over-engineering. A good human reception or an async channel (chatbot, email form) is then cheaper and more effective.
2. High-emotional content. For grief care, heavy legal matters, mental-health helplines, dismissal calls: do not use an AI voice as first contact. The social and ethical damage outweighs the efficiency gain.
3. Complex consultative conversation. Strategic sales, complex technical helpdesk needing deep probing, custom legal advice: a voice agent can qualify and route but not replace. Do not try.
Voice AI vs chatbot vs receptionist: an honest comparison
For many SMBs the real question is not "which voice tool", but "which channel". Three alternatives and when they win:
Classic human reception. Wins on quality for low volumes and complex conversations. Up to 50 calls a day, or for calls averaging more than 5 minutes, a good human reception is cost-comparable and qualitatively stronger. Above that crossover the human reception becomes the bottleneck.
Text chatbot or WhatsApp bot. Wins on asynchronous traffic and international reach. Customers who send a message do not expect an instant reply; callers do. For B2B organisations a chatbot is often enough; for B2C service voice is usually needed because the audience prefers calling.
Voice AI agent. Wins on high volumes with routine conversations, and on 24/7 reach without night shift. ROI is highest exactly where human reception is expensive and unscalable.
In practice: combine. Voice AI for the first 60-90 seconds, human for escalation, chatbot for asynchronous questions. A good architecture does not have a one-size-fits-all channel.
How do you start?
Three steps for SMBs considering voice AI:
Step 1: Inventory use cases. Not "we want voice AI", but "our reception gets 200 calls a day, of which 60% goes to voicemail out of hours". Per use case: volume, call duration, percentage of routine questions, escalation patterns. A free Quickscan via /en/ai-scan maps this in an hour.
Step 2: Start defined. First use case in six weeks in production with limited scope (e.g. only book appointments + opening hours + transfer). Measure all eight criteria (WER, latency, task success, escalation, sentiment, CSAT, hallucination, AHT) two weeks post go-live. Iterate.
Step 3: Scale only what works. If the first use case shows stable numbers, expand to the next (intake, reminders, confirmations). If the first does not work, you have invested six weeks instead of six months, and you know where the limit sits.
Conclusion
In 2026 voice AI is production-ready for a specific set of use cases at Dutch SMB organisations: out-of-hours reception, high-volume intake, first-line customer service, appointment confirmations, simple bookings. For those use cases the tech stack is mature, the compliance path is clear, and ROI is achievable within three to six months.
For other use cases (high-emotional, complex advice, low volumes) voice AI in 2026 is not the right tool. A good agency says so.
Want to know if voice AI fits your situation? Schedule a free discovery call. For the full agent approach and the tools we work with, see /en/ai-agents. For customer-service-specific voice implementations see /en/ai-klantenservice. For the AI Act context see /en/ai-act.
Curious what AI can do for your business?
Take the free AI Scan and find out in 1 minute.
Start the AI Scan →