Toward Conversational Coherence in Audio-Native Models

A voice model that produces fluent, natural-sounding utterances in isolation can still fail completely at conversation. Fluency is not coherence. A beautiful sentence delivered at the wrong moment, in the wrong tone, after the wrong pause, is worse than a disfluent one that lands right.

Conversational coherence, the property that makes a sequence of turns feel like a dialogue rather than a series of monologues, is the unsolved problem in speech-to-speech (S2S) modeling. This post is about what coherence requires and how we think about building it into the data layer.


The structure of conversation

Much of the structure of conversation lives in timing and sound rather than in words. Linguists have studied it for decades under the banner of Conversation Analysis. The findings are remarkably consistent across languages and cultures:

The typical gap between the end of one speaker's turn and the start of the next is roughly 200 milliseconds. That is too short to plan a reply from scratch, so the timing only works because speakers project the end of their turns through pitch, syntax, and pragmatic completion, and listeners begin formulating responses before the turn is over. Gaps longer than 700ms signal trouble. Overlaps are not errors; they're a feature of engaged dialogue, and their acoustic shape (for example, a rising-pitch overlap vs. a competitive overlap) carries meaning.
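
To make the timing concrete, here is a minimal sketch of how transitions between speakers could be measured from timestamped turns. The Turn class, the label names, and the function are hypothetical and only for illustration; the 700ms threshold is the one mentioned above, and a negative offset is treated as an overlap.

```python
from dataclasses import dataclass

LONG_GAP_MS = 700  # "trouble" threshold taken from the text above


@dataclass(frozen=True)
class Turn:
    """A hypothetical timestamped turn; times are in milliseconds."""
    speaker: str
    start_ms: int
    end_ms: int


def transition_offsets(turns):
    """Return (offset_ms, label) for every speaker change.

    offset_ms is the next turn's start minus the previous turn's end,
    so a negative value means the two speakers overlapped.
    """
    ordered = sorted(turns, key=lambda t: t.start_ms)
    results = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker == nxt.speaker:
            continue  # same speaker continuing, not a transition
        offset = nxt.start_ms - prev.end_ms
        if offset < 0:
            label = "overlap"
        elif offset > LONG_GAP_MS:
            label = "long_gap"
        else:
            label = "gap"
        results.append((offset, label))
    return results


turns = [
    Turn("A", 0, 1800),
    Turn("B", 2000, 3500),  # starts 200 ms after A ends
    Turn("A", 3400, 5000),  # starts 100 ms before B ends
    Turn("B", 5900, 7000),  # starts 900 ms after A ends
]
print(transition_offsets(turns))
# [(200, 'gap'), (-100, 'overlap'), (900, 'long_gap')]
```

A transcript would record these four turns as four lines of text. The offsets are where the conversational information is.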

Backchannels — the "mm-hm"s and "yeah"s and sharp intakes of breath — are not noise. They're the listener's continuous signal that the conversation is alive. Remove them and dialogue collapses. Get their timing wrong by 300 milliseconds and the speaker feels unheard.
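
One way to check backchannel timing is to compare the onsets a model produces against onsets observed in a reference recording. The sketch below is an assumption about how such a check might look, not a description of any established metric: the function name, the greedy nearest-onset matching, and the use of 300 milliseconds as a tolerance (borrowed from the figure above) are all illustrative choices.

```python
def backchannel_timing_errors(reference_ms, predicted_ms, tolerance_ms=300):
    """Greedily match reference backchannel onsets to predicted onsets.

    Returns (errors, missed, spurious):
      errors   - signed timing errors in ms for matched pairs
                 (positive means the prediction came late)
      missed   - reference onsets with no prediction inside the tolerance
      spurious - predicted onsets left unmatched
    """
    remaining = sorted(predicted_ms)
    errors = []
    missed = 0
    for ref in sorted(reference_ms):
        if not remaining:
            missed += 1
            continue
        nearest = min(remaining, key=lambda p: abs(p - ref))
        if abs(nearest - ref) < tolerance_ms:
            errors.append(nearest - ref)
            remaining.remove(nearest)  # each prediction matches at most once
        else:
            missed += 1
    return errors, missed, len(remaining)


reference = [1000, 4000, 7000]   # hypothetical human backchannel onsets
predicted = [1100, 4350, 9000]   # hypothetical model backchannel onsets
print(backchannel_timing_errors(reference, predicted))
# ([100], 2, 2)
```

In this made-up example only the first backchannel lands close enough to count. The second is 350 milliseconds late, which is already outside the tolerance.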

None of this appears in a transcript.

Why current models struggle

Most voice models today are turn-based. They wait for the user to finish, process the input, and generate a response. This is a fundamentally different interaction model from human conversation, where listening and speaking are continuous, overlapping processes.

Even full-duplex models — those that can listen and speak simultaneously — struggle with coherence because the training data doesn't encode it well. If your training set is thousands of hours of isolated utterances, your model will produce isolated utterances. Conversational coherence has to be in the data to end up in the model.

Building coherence into the data

Our approach is to annotate conversational structure explicitly and at scale. Every dialogue in our corpus is segmented not by utterance but by conversational move: turn-constructional units, transition-relevance places, backchannel clusters, repair sequences, and overlap regions. Each is timestamped to the millisecond and classified by type and function.
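
As a rough illustration of what one such annotation record could look like, here is a sketch in Python. It is not our production schema: the class names, field names, and function labels are hypothetical, chosen only to mirror the move types listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class MoveType(Enum):
    TCU = "turn_constructional_unit"
    TRP = "transition_relevance_place"
    BACKCHANNEL = "backchannel_cluster"
    REPAIR = "repair_sequence"
    OVERLAP = "overlap_region"


@dataclass(frozen=True)
class Move:
    """One annotated conversational move (hypothetical layout)."""
    move_type: MoveType
    speakers: Tuple[str, ...]  # overlap regions involve more than one speaker
    start_ms: int
    end_ms: int                # equal to start_ms for point-like events
    function: str              # illustrative functional label

    def __post_init__(self):
        if self.end_ms < self.start_ms:
            raise ValueError("end_ms must not precede start_ms")


def moves_in_window(moves, start_ms, end_ms):
    """Return the moves that intersect the window (start_ms, end_ms)."""
    return [m for m in moves if m.start_ms < end_ms and m.end_ms > start_ms]


moves = [
    Move(MoveType.TCU, ("A",), 0, 1800, "assertion"),
    Move(MoveType.BACKCHANNEL, ("B",), 900, 1100, "continuer"),
    Move(MoveType.TRP, ("A",), 1800, 1800, "possible_completion"),
    Move(MoveType.OVERLAP, ("A", "B"), 3400, 3500, "cooperative"),
]
for move in moves_in_window(moves, 1000, 2000):
    print(move.move_type.name, move.start_ms, move.end_ms)
# TCU 0 1800
# BACKCHANNEL 900 1100
# TRP 1800 1800
```

Keeping moves as overlapping intervals, rather than as a flat sequence of utterances, is what lets a backchannel sit inside another speaker's turn without breaking it.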

This gives S2S models something they've never had before: ground truth for conversational dynamics. Not just what was said, but the precise temporal and acoustic structure of how two people navigated a dialogue together.

Early results from partners training on this data show measurable improvements in turn-taking precision and backchannel timing. The models don't just sound better — they feel more present. More like someone is actually there.

The road ahead

Conversational coherence is where voice AI crosses from impressive to inhabitable. It's the difference between a demo and a product. Between a tool you use and an intelligence you coexist with.

We're still early. But the path is clear: better data, richer annotation, and architectures that can learn from the full acoustic structure of human dialogue. We're building all three.


We're actively collaborating with research teams on conversational S2S. Reach out at hello@extrian.com.
