Epoch
Every frontier lab is building speech-to-speech. The bottleneck is data.
Existing corpora were built for ASR and TTS: read speech, single-speaker, designed for text-intermediary pipelines. S2S models learn directly from waveforms and need naturalistic multi-turn conversation with the full acoustic signal: prosody, overlap patterns, turn-taking dynamics, affective variation. That data doesn’t exist at scale.
Collection methodology
Speaker pairs are recorded in live two-party conversations with per-speaker channel isolation. Speakers receive structured situational prompts only, never scripts. Scripted speech collapses prosodic variance into read-speech distributions, however natural the text.
Controlled capture stack: 48 kHz sample rate, lossless codec, per-channel SNR validation, environment fingerprinting. Automated QA on every session: SNR thresholds, clipping detection, crosstalk energy ratios, VAD-based segmentation, Whisper-pass verification for transcript alignment.
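To make the per-session checks concrete, here is a minimal sketch of three of them (clipping detection, an SNR estimate, and a crosstalk energy ratio) for one isolated speaker channel. It is an illustration under stated assumptions, not the production pipeline: the function names, the simple energy-threshold VAD, the 10 ms frame size, and every threshold value are hypothetical, and samples are assumed to be already decoded to floats in [-1, 1].

```python
import math

FRAME_LEN = 480        # 10 ms at 48 kHz (assumed frame size)
VAD_THRESHOLD = 0.02   # assumed frame-RMS level separating speech from silence

def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def to_db(ratio):
    """Amplitude ratio in decibels, floored to avoid log(0)."""
    return 20.0 * math.log10(max(ratio, 1e-12))

def frame_levels(samples, frame_len=FRAME_LEN):
    """RMS of each non-overlapping frame; a trailing partial frame is dropped."""
    return [rms(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def pooled(levels):
    """Combine per-frame RMS values (equal-length frames) into one RMS."""
    return math.sqrt(sum(v * v for v in levels) / len(levels))

def clipping_fraction(samples, limit=0.999):
    """Fraction of samples at or beyond the assumed full-scale limit."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)

def estimate_snr_db(samples):
    """Crude SNR: frames above the VAD threshold vs. frames below it."""
    levels = frame_levels(samples)
    speech = [v for v in levels if v >= VAD_THRESHOLD]
    noise = [v for v in levels if v < VAD_THRESHOLD]
    if not speech or not noise:
        return None  # cannot estimate without both speech and silence
    return to_db(pooled(speech) / max(pooled(noise), 1e-12))

def crosstalk_db(talker, listener):
    """Leakage into the listener channel while only the talker is active."""
    pairs = [(t, l)
             for t, l in zip(frame_levels(talker), frame_levels(listener))
             if t >= VAD_THRESHOLD and l < VAD_THRESHOLD]
    if not pairs:
        return None
    return to_db(pooled([l for _, l in pairs]) / pooled([t for t, _ in pairs]))

def qa_channel(samples, other, min_snr_db=30.0, max_clip=0.001,
               max_crosstalk_db=-30.0):
    """Apply the three checks to one channel; all thresholds are illustrative."""
    snr = estimate_snr_db(samples)
    leak = crosstalk_db(samples, other)
    clip = clipping_fraction(samples)
    passed = bool(snr is not None and snr >= min_snr_db
                  and clip <= max_clip
                  and (leak is None or leak <= max_crosstalk_db))
    return {"snr_db": snr, "clipping": clip,
            "crosstalk_db": leak, "passed": passed}
```

Calling `qa_channel(channel_a, channel_b)` and then `qa_channel(channel_b, channel_a)` covers both speakers in a session. A real pipeline would likely use a trained VAD rather than an energy threshold, since leakage loud enough to cross the threshold is invisible to this crosstalk measure.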
Scale
ASR learns a narrow mapping from audio to text. S2S learns semantics, pragmatics, prosody, timing, and affect simultaneously from raw audio. The learning target has higher bandwidth, and the data requirement is proportionally higher.
Epoch is millions of hours of conversational English, purpose-built for S2S foundation model training. Open to exclusive and non-exclusive licensing.