On the Role of Data in the S2S Capability Overhang
There is a capability overhang in voice AI. The architectures exist. The compute exists. What doesn't exist — not yet, not at the scale and fidelity required — is the data.
This post is about why speech-to-speech data is the binding constraint on the next generation of voice models, and what it takes to remove it.
The asymmetry
Text-based language models were trained on the internet: trillions of tokens, freely available, covering nearly every domain, language, and register. The data was there. The architectures caught up. Scaling laws did the rest.
Speech has no equivalent. The internet is not full of high-quality, multi-speaker, naturalistic conversational audio with clean separation and rich metadata. What exists is podcasts (two speakers, unstructured), call center recordings (narrow domain, legal constraints), and read-aloud corpora (not conversational at all). None of this is what S2S models actually need.
The architectures are ready. The data is not. That's the overhang.
What S2S data actually requires
Training a speech-to-speech model that can hold a real conversation requires data with specific properties that are hard to find in the wild and hard to produce at scale:
Naturalistic dialogue. Not scripted, not read, not prompted with artificial scenarios. Real conversational dynamics — interruptions, repairs, laughter, tangents, the full mess of how people actually talk.
Clean speaker separation. Multi-speaker audio where each speaker's signal can be isolated. This is non-negotiable for training models that need to distinguish between self and other.
Acoustic diversity. Accents, ages, recording environments, emotional states. Models trained on studio-quality read speech from 25-to-35-year-old voice actors will not generalize.
Paralinguistic annotation. Emotion, prosody, turn-taking, discourse function. The metadata layer that transforms raw audio into training signal for the behaviors that matter most.
Scale. Not thousands of hours. Millions. The scaling laws that held for text hold for speech. We're nowhere near the data-saturated regime.
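To make the properties above concrete, here is a minimal sketch of what one annotated conversation record might look like. Every name in it (`Segment`, `Conversation`, the `emotion` and `discourse_function` labels) is a hypothetical chosen for illustration, not a description of any actual schema. It assumes each speaker is recorded on a separate channel, so overlapping speech can be stored as overlapping segments rather than being flattened into a single transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One annotated stretch of speech on a single speaker's channel."""
    speaker_id: str
    start_s: float
    end_s: float
    emotion: str = "neutral"               # hypothetical label set
    discourse_function: str = "statement"  # e.g. "question", "backchannel", "repair"


@dataclass
class Conversation:
    """A multi-speaker recording plus its paralinguistic annotations."""
    conversation_id: str
    language: str
    segments: list[Segment] = field(default_factory=list)

    def hours(self) -> float:
        """Span from the first segment start to the last segment end, in hours."""
        if not self.segments:
            return 0.0
        start = min(s.start_s for s in self.segments)
        end = max(s.end_s for s in self.segments)
        return (end - start) / 3600.0

    def overlaps(self) -> list[tuple[Segment, Segment]]:
        """Pairs of segments from different speakers that overlap in time."""
        ordered = sorted(self.segments, key=lambda s: s.start_s)
        pairs = []
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if b.start_s >= a.end_s:
                    break  # sorted by start, so no later segment can overlap a
                if a.speaker_id != b.speaker_id:
                    pairs.append((a, b))
        return pairs
```

Overlap pairs are one rough proxy for the interruptions and backchannels that scripted or read speech lacks. They can only be recovered when speaker separation is clean enough to timestamp each channel independently.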
The collection problem
You cannot scrape conversational audio the way you scrape text. Every hour of high-quality dialogue data requires infrastructure: speaker recruitment, consent, recording hardware, quality control, annotation, and legal review. There are no shortcuts. It's an operations problem as much as a research problem.
This is Extrian's core competence. We've built the infrastructure to collect, process, and annotate conversational speech at a scale that frontier labs need but cannot efficiently build in-house. Our collection pipelines run across dozens of languages simultaneously, with per-language quality calibration and continuous annotator training.
Unlocking the overhang
The labs that get access to the right data first will build the best voice models. This isn't speculation — it's the same dynamic that played out in text. Data quality and scale were the differentiators, and they will be again.
The capability overhang in voice is real. The architectures are waiting. We're building what they're waiting for.
We're working with frontier labs on both exclusive and non-exclusive data access. If you're building S2S models, let's talk: hello@extrian.com.