Crossing the Uncanny Valley of Voice: Introducing Cross-Session Multilingual Speech Modeling

March 2, 2025

Sesame AI introduces the Conversational Speech Model (CSM), a large-scale generative model for producing realistic conversational speech. CSM achieves strong results in naturalness but still faces challenges with prosody and multilingual support. Future plans include scaling the models up further, expanding coverage to more than 20 languages, using pretrained language models, and exploring fully duplex AI conversations that better mimic human dynamics. Sesame also aims to open-source key components of its research for community collaboration.

The current limitations are a focus on English, with only limited multilingual capability because of the training data, and no use of pretrained language model weights. The ultimate goal is a model that understands the structure of a conversation, not just its speech content, which will require advances in data curation, modeling techniques, and post-training methodologies.

Complete Article after the Jump: Here!
