Evaluation Gaps in Speech-to-Speech Generation
We have better speech models than ever and worse ways of measuring them. That's not a paradox — it's the natural result of the field moving faster than its benchmarks.
Most S2S evaluation still relies on metrics inherited from TTS and ASR: mean opinion score (MOS), word error rate (WER), and speaker similarity measured as cosine similarity between speaker embeddings. These metrics were designed for systems that synthesize read speech from text or transcribe speech into text. They measure perceived quality, intelligibility, and voice match. They tell you almost nothing about whether a model can hold a conversation.
What we're not measuring
Consider what happens in a real conversation. A speaker hesitates. The listener produces a backchannel — a soft "mm-hm" — at exactly the right moment, with exactly the right prosodic shape. This is not a transcription problem. There is no "correct" text output. There is only an acoustically and temporally appropriate response.
Current benchmarks have no way to score this. They can tell you if a generated utterance is intelligible. They cannot tell you if it was right.
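To make the blind spot concrete, here is a minimal sketch of a word error rate computation applied to two backchannels. It pretends a reference transcript exists, which is already generous to the metric. The two response records and their onset_s values are hypothetical illustration data, not measurements.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Hypothetical responses: same words, very different timing.
# onset_s is seconds after the speaker's hesitation begins.
well_timed = {"text": "mm-hm", "onset_s": 0.2}
badly_timed = {"text": "mm-hm", "onset_s": 2.5}

wer_well = word_error_rate("mm-hm", well_timed["text"])
wer_bad = word_error_rate("mm-hm", badly_timed["text"])
```

Both calls see only the text field, so the two responses receive the same score. The onset never enters the computation, which is the gap described above.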
The same blind spot applies to emotional coherence (does the model's tone match the conversational context?), turn-taking precision (does it interrupt or leave awkward gaps?), and prosodic appropriateness (does a question sound like a question?). These are the dimensions that determine whether a voice model feels human. None of them have standardized evaluation protocols.
A proposed framework
We've been developing an evaluation framework internally that we're now proposing to the broader community. It scores S2S outputs across four axes:
Temporal coherence. Response latency distribution relative to natural conversation baselines. Overlap handling. Backchannel timing accuracy measured against annotated ground truth.
Prosodic appropriateness. Pitch contour alignment with discourse context. Emphasis placement. Speech rate adaptation to interlocutor patterns.
Emotional congruence. Valence-arousal trajectory matching between generated response and conversational context. Measured continuously, not per-utterance.
Conversational naturalness. A composite human evaluation protocol. Unlike MOS, which asks "does this sound good?", it uses a structured rubric that asks "does this sound like a real conversational turn?" The rubric is calibrated across languages with anchored examples.
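As an illustration of how the temporal coherence axis could be scored, the sketch below shows one possible approach; it is not the protocol specification. It assumes backchannel onsets are available as timestamps in seconds, matches them to annotated onsets within a tolerance window, and compares response latency samples against a baseline using a two-sample Kolmogorov-Smirnov statistic. The function names, the 0.25 s tolerance, and the choice of the KS statistic are all assumptions made for this example.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def backchannel_f1(predicted: Sequence[float],
                   reference: Sequence[float],
                   tolerance_s: float = 0.25) -> Tuple[float, float, float]:
    """Greedy one-to-one matching of predicted onsets to annotated onsets.

    Returns (precision, recall, F1). tolerance_s is an assumed window,
    not a value taken from any specification.
    """
    pred, ref = sorted(predicted), sorted(reference)
    matched = 0
    i = j = 0
    while i < len(pred) and j < len(ref):
        diff = pred[i] - ref[j]
        if abs(diff) <= tolerance_s:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # predicted onset is too early for any remaining annotation
        else:
            j += 1  # annotated onset was missed
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs only change at sample points, so checking
    # those points is enough to find the maximum difference.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical data, in seconds.
annotated_onsets = [1.0, 5.0, 8.0]
model_onsets = [1.1, 3.0, 7.9]
precision, recall, f1 = backchannel_f1(model_onsets, annotated_onsets)

baseline_latencies = [0.15, 0.20, 0.25, 0.30]
model_latencies = [0.60, 0.70, 0.80, 0.90]
latency_gap = ks_statistic(model_latencies, baseline_latencies)
```

A KS statistic near 0 means the model's latency distribution resembles the baseline, and a value near 1 means the two barely overlap. Other distribution distances would serve the same purpose; this one was chosen here only because it needs nothing beyond the standard library.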
Why this matters now
The labs building S2S models need evaluation infrastructure that matches the ambition of the models. You can't optimize what you can't measure. And right now, the most important qualities of conversational speech are unmeasured.
We're releasing the evaluation protocol specification and reference annotations for English, Mandarin, and Spanish in Q2 2026. Additional languages will follow based on partner interest.
We're looking for research partners to validate and extend the framework. If you're working on S2S evaluation, we'd like to talk: hello@extrian.com.