Highlights from Our Conversation with Cerebral Valley

    I recently had the chance to speak with Cerebral Valley about what we’re building at Rime, where voice AI is headed, and why speech synthesis is having its moment right now.

    How Rime Got Started

    Before Rime, I was a PhD student at Stanford studying Computational Linguistics. I dropped out in 2023 to build better speech models for real-world applications.

    I was deep into sociophonetics [the study of how social and demographic factors shape speech]. People in Texas sound different from people in California, and we all pick up on these cues.
    Meanwhile, end-to-end attention-based models were getting so good that real-time, near-human speech was starting to feel within reach.

    After months of moonlighting on what would eventually become Rime, I left Stanford and convinced two friends to join me as co-founders: Brooke, who was at Amazon Alexa, and Ares, who was at UC San Francisco working on brain-computer interfaces for people who had lost the ability to speak.

    What Makes Rime Different

    We’re not just building voices. We’re helping companies run high-volume, real-time voice applications that actually work, whether that’s phone ordering for Domino’s, confirming healthcare info, or automating support calls.

    One of our partners, ConverseNow, powers about 80% of all Wingstop and Domino’s phone orders in North America. So there’s a four-in-five chance that if you call Domino’s or Wingstop in North America, you’re hearing our voices.

    In the healthcare use case, for example, a model needs to be able to pronounce things like a member ID number in the same way a human would. That’s still a really hard problem for these models. So we’ve been heavily focused on the data collection effort to support more typical enterprise calling, where 80% of the call is just confirming information. For example, ‘Let me make sure I’ve got this right. Your name is Patrick, spelled P-A-T-R-I-C-K.’ Simple problems, but solving them well makes a huge difference for businesses handling hundreds of thousands of calls a day.
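
    To make the shape of that problem concrete, here is a toy sketch of the kind of text normalization involved. It is purely illustrative and not Rime’s implementation: it expands a raw member ID so a text-to-speech model reads it the way a human agent would, letter by letter, with digits in small comma-separated groups that read as pauses.

        # Illustrative sketch only, not Rime's implementation: expand an
        # alphanumeric member ID into a confirmation string that a TTS
        # model can read the way a human agent would.
        def speakable_id(member_id: str, group: int = 3) -> str:
            parts: list[str] = []
            digit_run: list[str] = []

            def flush_digits() -> None:
                # Emit buffered digits in chunks of `group`; the commas
                # added in the final join read as short pauses.
                for i in range(0, len(digit_run), group):
                    parts.append(" ".join(digit_run[i:i + group]))
                digit_run.clear()

            for ch in member_id:
                if ch.isdigit():
                    digit_run.append(ch)
                elif ch.isalpha():
                    flush_digits()
                    parts.append(ch.upper())
            flush_digits()
            return ", ".join(parts)

        print(speakable_id("AB123456"))  # -> "A, B, 1 2 3, 4 5 6"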

    Built for Customization

    Rime isn’t just building voices; we’re building tools that let developers customize and fine-tune speech output without needing any background in phonetics or audio.

    You simply can’t use most of the other text-to-speech products on the market. They can’t even pronounce “Sbarro” correctly. That’s the kind of problem we’re solving. If a model can’t get the name of the business or the products and specials right, that sounds like a first-mile problem to me, not a last-mile problem.

    So much of what we do falls into two buckets: ‘linguistics as a service’ and ‘demographics as a service.’ Developers shouldn’t need to know the International Phonetic Alphabet just to fix pronunciation.
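
    As a hypothetical illustration of what that could look like for a developer, here is a minimal sketch of a pronunciation override using a plain-text respelling instead of IPA. The endpoint, payload fields, and respelling syntax are assumptions made for illustration, not Rime’s actual API; consult Rime’s documentation for the real interface.

        # Hypothetical sketch: the endpoint, payload fields, and respelling
        # syntax below are assumptions, not Rime's actual API.
        import requests

        API_URL = "https://api.example.com/v1/tts"  # placeholder endpoint

        payload = {
            "text": "Welcome to Sbarro, how can I help you?",
            "speaker": "example_speaker",
            # A plain-text respelling fixes the pronunciation without
            # requiring knowledge of the International Phonetic Alphabet.
            "pronunciations": {"Sbarro": "SBAR-oh"},
        }

        response = requests.post(API_URL, json=payload, timeout=30)
        response.raise_for_status()
        with open("greeting.wav", "wb") as f:
            f.write(response.content)  # assuming the response body is audio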

    Real Business Impact

    Rime doesn’t just improve how bots sound; we improve how they perform.

    When one of our customers switches from Microsoft to Rime, they immediately see 20% more calls getting automated. That represents millions in revenue.

    Better voices mean more engagement, higher automation, and a smoother customer experience.

    Rime Team Culture

    With a small team powering tens of millions of conversations each month, we run lean and focused.

    My hunch on culture is that we’re very kind, down-to-earth people who really care about solving particular problems.

    From a product perspective, we don’t care how we get there. From a modeling perspective, people always ask, ‘What’s your calling card?’ Like, so-and-so has Transformers, this other company has Diffusion.

    Honestly, text-to-speech systems are Frankensteins to begin with. One part might be a Transformer, another part might be Diffusion, some other part might be a state-space model.

    Check out the full interview here!

    — Lily Clifford, Co-Founder and CEO at Rime