Launching Arcana v2: Upgraded Vibes with More Languages and On-Prem ✨

    Rime just made the most realistic TTS model even more human and expressive.

    At Rime, we're on a mission to make voice AI feel more human. Our deep linguistic and machine learning expertise helps us bridge the gap between humans and machines, driving millions of dollars in ROI for enterprise-scale voice agents.

    Today we’re introducing Arcana v2, our next-generation text-to-speech model. It's the most human speech synthesis model yet: better, faster, and more expressive than the boundary-pushing Arcana v1.

    We're super excited to share Arcana v2!

    Visit Rime Try to talk to our voices for free in your browser. 

    Unlike some unnamed competitors, Arcana v2 is available via API from day one. 

    With 300+ total voices and 35 flagship voices—18 English, 4 Spanish, 3 natively bilingual English/Spanish, 5 French, and 5 German—you’ve got options, and they’re all API-accessible.

    Plus, it's easy to code-switch instantly between English, Spanish, and Spanglish, just like a native speaker would.

    Deploy via cloud, VPC, or on-prem. Stream via HTTPS or WebSockets in a range of audio formats, including telephony formats.

    Behind the scenes, we’ve made a bunch of architectural improvements, which will let us fast follow with tons more voices, languages, and workflow tools. Look out for Hindi, Arabic, Japanese, Hebrew, Korean, Portuguese, and many more coming soon.

    With Rime’s voices, companies are driving real impact today: 

    We’re motivated by the belief that it shouldn’t suck when you call your cellular carrier or bank.

    Alright, let’s dive in!

    Hear Voices

    But first, don’t you want to hear how our new voices sound?

    Audio sample: "Hey there, this is Mike from Souther Roofing & Repair. Yeah, we've been swamped with calls this hurricane season..."
    Audio sample: "This weekend vamos a San Francisco to explore la ciudad. I'm thinking un poquito de hiking..."
    Rime Try: live use case demo

    If you’d like to have a full conversation we’ve got some options:


    Since Arcana v2 is a direct 1:1 improvement on Arcana v1, we’ve kept the model API parameter the same (just arcana). So if you’re already building with Arcana, you don’t have to do anything extra. You’re already getting the upgraded voices!

    # Double quotes let the shell expand $RIME_API_KEY; the MP3 response is saved to luna.mp3.
    curl -X POST https://users.rime.ai/v1/rime-tts \
      -H 'Accept: audio/mp3' \
      -H "Authorization: Bearer $RIME_API_KEY" \
      -H 'Content-Type: application/json' \
      --output luna.mp3 \
      -d '{
        "speaker": "luna",
        "text": "Oh hey there! <laugh> This is Luna. I sound even better than I did in v1!",
        "modelId": "arcana"
      }'
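
If you'd rather call the API from Python, here is a minimal sketch of the same request using only the standard library. The endpoint, headers, and JSON fields are copied from the curl call above; the helper names `build_tts_request` and `synthesize_to_file`, and reading the key from a `RIME_API_KEY` environment variable, are our own assumptions for illustration, not part of an official SDK.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
RIME_TTS_URL = "https://users.rime.ai/v1/rime-tts"


def build_tts_request(text, speaker="luna", api_key=None):
    """Build (but do not send) a POST request mirroring the curl call."""
    # Assumption: the API key lives in the RIME_API_KEY environment variable.
    api_key = api_key or os.environ["RIME_API_KEY"]
    payload = {"speaker": speaker, "text": text, "modelId": "arcana"}
    return urllib.request.Request(
        RIME_TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "audio/mp3",
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(build_tts_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)  # number of bytes written
```

With a valid key set, `synthesize_to_file("Oh hey there! <laugh> This is Luna.", "luna.mp3")` should save the audio locally. Building the request separately from sending it makes it easy to inspect the payload without touching the network.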

    (Psst. Here are two forkable open source Rime+LiveKit and Rime+Pipecat repos if you’d like to hack on the demo locally too.)

    Unmatched Realism

    Human-level realism for your application

    From real-time customer conversations to crystal-clear narration, English output is now sharper, smoother, and far more human. This is the kind of upgrade you can hear instantly.

    Audio sample: "Hello, thank you for contacting Acme Corp Support. My name is Astra, how can I assist you today?"

    Plus, here are some fun examples of paralinguistic features (nonverbal aspects of communication) that add authenticity and realism to the Arcana speech synthesis model. Shoot, I'm tellin' y'all, ain't nobody else doing stuff like this!

    Audio sample: small vocal noise: 'tsk'
    Audio sample: false starts, stuttering
    Audio sample: boom!
    Audio sample: small glottal stop between first two words
    Audio sample: breathwork
    Audio sample: he didn't even bother to buy her a drink?!

    While competitors train their models on audiobooks, podcasts, and YouTube data, which sound overly polished and have a performative "influencer" accent, Rime trains on real conversations with everyday people. When your customers hear our voices, they have an intuitive sense that these are real, so their willingness to talk to the bot increases. That means increased revenue and reduced costs through improved customer satisfaction, call success, containment, etc.

    Multi-Lingual Code Switching

    ¡naipes out!

    Do you speak Spanglish? Glide between Spanish and English mid-sentence without a stumble. Fluent, natural, expressive bilingual speech because language isn’t always one thing at a time.

    We also support French+English and German+English multi-lingual code switching.

    Audio sample: "Ich habe die Präsentation fürs team meeting fertig aber der final draft muss noch korrigiert werden, und der deadline pressure macht mich total fertig, verstehst du?"

    Enterprise Deploy Options

    Deploy Rime with maximum scale and minimum latency

    Arcana v2 lets you run on-prem or stream via WebSockets, with the flexibility to use your own GPUs.

    Run Arcana entirely inside your own walls. No cloud hops, no external dependencies. Just raw, expressive voice generation under your control. Enterprise-grade privacy meets full creative power.

    With WebSockets, stream voice in real time with ultra-low latency, no buffering, no dead air.

    If you’re already using Arcana on-prem, please contact sales@rime.ai or your primary contact and we’ll share the Arcana v2 images with you.

    Technical Report: Model Architecture

    Super scientific schema

    Arcana is a multimodal, autoregressive text-to-speech (TTS) model that generates discrete audio tokens from text inputs. These tokens are decoded into high-fidelity speech using a novel codec-based approach, enabling faster-than-real-time synthesis.

    • Backbone: A large language model (LLM) trained on extensive text and audio data.

    • Audio Codec: Utilizes a high-resolution codec to capture nuanced acoustic details.

    • Tokenization: Employs a discrete tokenization process to represent audio features effectively.

    • Training: Pre-trained from scratch and fine-tuned on Rime’s proprietary in-studio conversational data (more on that below).
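
To make the general pattern concrete (text in, discrete audio tokens generated one at a time, tokens decoded to a waveform), here is a toy sketch. It is not Arcana's implementation: the "backbone" and "codec" below are stand-in stubs, and every constant (codebook size, stop token, samples per token, sample rate) is a made-up value chosen only for illustration.

```python
import math
import random

CODEBOOK_SIZE = 1024     # hypothetical number of discrete audio tokens
END_OF_AUDIO = 0         # hypothetical stop token
SAMPLES_PER_TOKEN = 320  # hypothetical number of samples each token decodes to
SAMPLE_RATE = 16_000     # hypothetical output sample rate in Hz


def toy_next_token(text, history, rng):
    """Stub for the LLM backbone: choose the next audio token given the text
    and every token generated so far. A real model would predict a learned
    distribution; this stub samples uniformly and stops once the history
    reaches a length proportional to the text."""
    if len(history) >= 4 * len(text):
        return END_OF_AUDIO
    return rng.randrange(1, CODEBOOK_SIZE)


def generate_tokens(text, seed=0, max_tokens=2000):
    """Autoregressive loop: each step is conditioned on the tokens so far."""
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < max_tokens:
        token = toy_next_token(text, tokens, rng)
        if token == END_OF_AUDIO:
            break
        tokens.append(token)
    return tokens


def toy_codec_decode(tokens):
    """Stub codec decoder: expand each token into SAMPLES_PER_TOKEN samples.
    Here a token simply selects a sine frequency; a real codec is learned."""
    samples = []
    for token in tokens:
        freq = 100.0 + token  # arbitrary token-to-frequency mapping, in Hz
        for n in range(SAMPLES_PER_TOKEN):
            samples.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


tokens = generate_tokens("Hello from Arcana")
waveform = toy_codec_decode(tokens)
print(len(tokens), "tokens ->", len(waveform) / SAMPLE_RATE, "seconds of audio")
```

The point of the sketch is the shape of the loop: because each token depends on the ones before it, audio can be decoded and streamed as tokens arrive rather than after the whole utterance is generated.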

    Arcana is designed for smooth, rhythmic delivery in any language.

    Rime’s Data Advantage

    Rime’s proprietary recording studio in the heart of San Francisco

    Rime’s unmatched voice realism comes from the world’s largest and most richly labeled proprietary dataset of natural, full-duplex speech, where two speakers talk naturally and sometimes overlap. This captures the true rhythm, spontaneity, and emotional texture of real human conversation.

    Our PhD-level annotators use sociolinguistically sophisticated techniques to capture accents, idiolects, filler words, pauses, stress patterns, and multi-lingual code-switching. Every utterance is meticulously labeled with demographic and conversational context, giving Arcana v2 a deep understanding of how people really speak.

    This dataset powers a three-stage training process—pre-training on large text-audio corpora, supervised fine-tuning on our proprietary recordings, and speaker-specific optimization for conversational reliability. The result is a model that can summon infinite voices, each as vivid and expressive as the people they’re inspired by.

    For more information on Rime’s data practices, check out the original Arcana announcement.

    Looking Ahead 👀

    Insane levels of optimism

    We’re continuing to expand Arcana’s linguistic range and expressive depth. Our custom pronunciation dictionary will grow to capture even more names, places, and specialized terminology, ensuring flawless delivery in every context. And with highly anticipated additions like Hindi, Japanese, and Arabic on the horizon, Arcana will soon bring its natural, human-like conversation to an even broader global audience.

    Stay tuned as all of us at Rime keep advancing the frontier of voice AI!
