Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Overview

ServiceNow AI researchers built a benchmark to test how well speech recognition systems transcribe code-switched speech, the everyday habit of bilingual people who swap languages mid-sentence. They ran seven frontier ASR systems across 918 synthetic utterances covering four language pairs. ElevenLabs Scribe V2 produced the best transcription accuracy, while OpenAI Whisper Large V3 Turbo finished last and often translated the speech instead of transcribing it.

Key Takeaways

Code-switching, where a speaker mixes two languages in one sentence, trips up many customer service voice agents, and this study measures the gap with a purpose-built benchmark.
Seven speech recognition systems were tested on 918 short utterances spanning Spanish-English, French-English, Canadian French-English, and German-English.
ElevenLabs Scribe V2 ranked first on every language pair, both on raw word accuracy and on the metrics that measure whether meaning is preserved.
OpenAI Whisper Large V3 Turbo came in last, with word error rates as high as 0.61, and tended to translate rather than transcribe.
For the strongest systems, the accuracy penalty for mixed-language speech versus single-language speech was small, suggesting code-switching is becoming a normal condition rather than an edge case.
ServiceNow released both the dataset and the AU-Harness evaluation framework so other teams can reproduce and extend the results.

Stats & Key Facts

›7 frontier ASR systems were benchmarked head to head.
›918 synthetic utterances made up the test set, each 12 to 40 words long.
›4 language pairs were covered: Spanish-English (259 utterances), French-English (298), Canadian French-English (188), and German-English (173).
›OpenAI Whisper Large V3 Turbo posted word error rates ranging from 0.16 to 0.61, the weakest scores in the field.
›Mistral AI Voxtral Small, at 24 billion parameters, was the largest open model included in the test.
›3 comprehension questions per utterance fed the downstream Answer Error Rate metric.

Why Bilingual Callers Break Voice Agents

Code-switching is common in real conversations but rare in the data most speech systems learn from.

Bilingual speakers often switch languages in the middle of a sentence without thinking about it. A Spanish-English speaker might start a request in English and finish a phrase in Spanish. For a customer service voice agent, this is a frequent reason transcription breaks down and the caller gets misunderstood.

ServiceNow AI researchers wanted hard numbers on the problem instead of anecdotes. They built a dedicated benchmark focused only on code-switched speech, then ran the leading speech recognition systems against it to see which ones hold up.

Inside the 918-Utterance Code-Switched Benchmark

The test set was designed to stress mixed-language transcription across four bilingual combinations.

›Spanish-English: 259 utterances
›French-English: 298 utterances
›Canadian French-English: 188 utterances
›German-English: 173 utterances
›Each utterance ran 12 to 40 words, was generated through text-to-speech, and was then validated by native-speaker linguists

Seven ASR Systems Put to the Test

The lineup mixed commercial APIs with open models.

›AssemblyAI Universal 3-Pro
›Deepgram Nova 3 Multilang
›ElevenLabs Scribe V2
›Google Gemini 3 Flash
›Mistral AI Voxtral Small 24B
›Nvidia Parakeet TDT 0.6b V3
›OpenAI Whisper Large V3 Turbo

ElevenLabs Scribe V2 Leads, Whisper Trails

Accuracy varied widely across the field.

ElevenLabs Scribe V2 delivered the best transcription accuracy across all four language pairs and also ranked first on the semantic metrics that track whether meaning survives. AssemblyAI Universal 3-Pro and Google Gemini 3 Flash competed for second place, with Gemini doing better on meaning-preservation scores despite weaker raw word matching.

OpenAI Whisper Large V3 Turbo finished last, posting word error rates from 0.16 to 0.61. A recurring failure was that it translated the spoken words into one language instead of transcribing them as spoken, which defeats the purpose for a voice agent that needs the caller's exact words.

How the Researchers Scored Accuracy

The study used three metrics so a single number would not hide the full picture.

›Word Error Rate (WER) measures exact transcription accuracy word for word.
›Semantic WER (SWER) counts only the errors that change meaning, ignoring trivial slips.
›Answer Error Rate (AER) tests downstream usefulness by asking three comprehension questions per utterance and checking whether the transcript supports correct answers.
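
To make the first of these metrics concrete, here is a minimal sketch of WER as it is conventionally defined: the word-level edit distance between the reference and the transcript, divided by the number of reference words. The lowercase-and-split normalization and the example sentences are assumptions made for illustration, not details taken from the study, and SWER and AER are not sketched because the article does not describe how meaning-changing errors or answers were judged.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched request, plus a transcript that translated
# the Spanish portion into English instead of transcribing it.
reference = "I need to reset mi contraseña por favor"
translated = "I need to reset my password please"

print(word_error_rate(reference, reference))   # exact transcript
print(word_error_rate(reference, translated))  # translated transcript
```

In this made-up example the translated transcript keeps the caller's intent yet still scores a WER of 0.5, because four of the eight reference words are missing or replaced. That gap between word matching and meaning is the reason a word-level score alone can mislead.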

Errors Cluster on the English Words

The most surprising finding concerned where the errors fell.

Across every model and language pair, mistakes concentrated on the English portions of each utterance. That is counterintuitive, since English is usually the easier language to transcribe on its own. The act of switching appears to throw the systems off right at the boundary.

Two factors predicted trouble. The number of language switches in a sentence strongly predicted whether errors appeared at all, an effect that was especially clear for French-English. Once an error occurred, a measure of how heavily the two languages are mixed, the Code-Mixing Index, was the better predictor of how large the error would be.
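
Both predictors can be computed from per-word language tags. The sketch below counts switch points and computes the Code-Mixing Index using one common formulation from the literature (Gambäck and Das), 100 × (1 − max_i w_i / (n − u)), where w_i is the number of words in language i, n is the total token count, and u is the number of language-independent tokens. The article does not say which exact formulation the study used, so treat this as an assumption; the hand-written tags are also hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def count_switches(tags: Sequence[Optional[str]]) -> int:
    """Number of points where consecutive tagged words differ in language."""
    # None marks a language-independent token (a name, a number) and is skipped.
    langs = [tag for tag in tags if tag is not None]
    return sum(1 for prev, curr in zip(langs, langs[1:]) if prev != curr)


def code_mixing_index(tags: Sequence[Optional[str]]) -> float:
    """CMI in the assumed Gambäck-Das form: 100 * (1 - max_i(w_i) / (n - u))."""
    langs = [tag for tag in tags if tag is not None]
    if not langs:
        return 0.0  # no language-tagged words, so nothing is mixed
    dominant = max(Counter(langs).values())
    return 100.0 * (1 - dominant / len(langs))


# Hypothetical tags for "I need to reset mi contraseña por favor".
one_switch = ["en", "en", "en", "en", "es", "es", "es", "es"]
# Hypothetical utterance that alternates languages more often.
many_switches = ["en", "es", "en", "es", "en", "en"]

print(count_switches(one_switch), code_mixing_index(one_switch))
print(count_switches(many_switches), code_mixing_index(many_switches))
```

The two measures can disagree: the first utterance switches only once but is split evenly between languages, so its CMI is high, while the second switches four times yet is dominated by English, so its CMI is lower. That difference is consistent with the two measures predicting different things.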

What This Means for Business Voice Tools

The practical takeaway is cautious optimism with a testing caveat.

For the strongest systems, the accuracy cost of code-switching versus single-language speech was small. In some cases, ElevenLabs Scribe V2 even beat its own single-language baseline. The authors read this as a sign that handling mixed-language speech is becoming a normal capability for the best frontier systems rather than a rare edge case.

Performance still ranged widely across models and language pairs, so the researchers advise organizations to test candidate systems against their own customer demographics before deploying. ServiceNow published the dataset on Hugging Face and the AU-Harness evaluation framework on GitHub so any team can rerun the comparison.
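
One simple way to act on that advice is to weight each system's per-pair error rate by the share of calls you actually receive in each language pair. The sketch below does this under stated assumptions: the function name, the pair labels, the WER figures, and the call volumes are all invented for illustration and are not results from the study or part of the AU-Harness framework.

```python
from typing import Mapping


def expected_wer(per_pair_wer: Mapping[str, float],
                 traffic: Mapping[str, float]) -> float:
    """Average per-pair WER, weighted by each pair's share of traffic."""
    total = sum(traffic.values())
    if total <= 0:
        raise ValueError("traffic must contain a positive call volume")
    missing = set(traffic) - set(per_pair_wer)
    if missing:
        raise KeyError(f"no WER measured for: {sorted(missing)}")
    return sum(per_pair_wer[pair] * volume
               for pair, volume in traffic.items()) / total


# Made-up per-pair WER for two hypothetical candidate systems.
system_a = {"es-en": 0.25, "fr-en": 0.50}
system_b = {"es-en": 0.50, "fr-en": 0.25}

# Made-up call volumes: three Spanish-English calls for every French-English one.
traffic = {"es-en": 3, "fr-en": 1}

for name, scores in [("system_a", system_a), ("system_b", system_b)]:
    print(name, expected_wer(scores, traffic))
```

With these invented numbers the two systems have the same unweighted average, but the one that is stronger on the dominant language pair comes out ahead once traffic is taken into account. A real comparison would replace the made-up scores with error rates measured on your own recordings.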

Frequently Asked Questions

What is code-switching in speech?

Code-switching is when a bilingual speaker mixes two languages within a single sentence or conversation. It is common in everyday speech and is the specific condition this benchmark tests speech recognition systems against.

Which speech recognition system performed best?

ElevenLabs Scribe V2 ranked first across all four language pairs and on the metrics that measure preserved meaning. AssemblyAI Universal 3-Pro and Google Gemini 3 Flash were the closest competitors.

Why did OpenAI Whisper Large V3 Turbo do poorly?

It finished last with word error rates from 0.16 to 0.61. A frequent problem was that it translated the spoken words into one language rather than transcribing them as actually said.

Can businesses access this benchmark?

Yes. ServiceNow released the dataset on Hugging Face and the AU-Harness evaluation framework on GitHub, so organizations can reproduce the tests and run their own model comparisons.

Should companies trust any top system for bilingual callers?

Not blindly. Although the best systems handled code-switching with a small accuracy penalty, results varied widely across models and language pairs, so the researchers recommend testing against your own customer demographics first.

The study shows that the leading speech recognition systems now handle mixed-language speech with only a small accuracy penalty, but the wide spread across models means businesses serving bilingual customers should benchmark candidates on their own audio before going live.