Sesame AI: Conversational Voice That Doesn't Suck

The first voice assistant worth talking to twice. Founded by an Oculus co-founder, Sesame's CSM model makes conversations feel natural.

Free preview / Open-source model available
★★★★★ 5/5

What It Is and Why You Should Care

You know how every voice assistant feels like talking to a very polite, very confused customer service rep? You say something, there's a three-second pause while it transcribes your words, runs them through an LLM, and converts the response back to speech, and then it delivers the result in a tone that screams "I am reading text out loud." Sesame AI is the first one that doesn't do that.

Sesame is a conversational AI startup founded by Brendan Iribe, the guy who co-founded Oculus and sold it to Facebook for $2 billion. His new mission is building voice assistants that actually hold a conversation -- with natural pacing, emotional inflection, and the ability to interrupt and be interrupted. The company released its CSM (Conversational Speech Model) as open source under Apache 2.0, has a web-based voice demo you can try right now, and just launched an iOS app preview.

The two characters you can talk to are Maya (more polished, better trained) and Miles (still finding his footing). You can talk to Maya in your browser at sesame.com, and it genuinely feels like talking to a person. It laughs, it pushes back, it remembers context within a conversation, and it doesn't have that robotic cadence that makes every other voice assistant feel like a Speak & Spell from 1985.

What it's good for:

  • Voice conversations that actually flow -- interruptions, pace changes, emotional range
  • Brainstorming out loud -- thinking through ideas with something that pushes back
  • Language practice -- conversational reps without judgment
  • Accessibility -- a voice interface that doesn't make users repeat themselves three times
  • Companion AI for daily check-ins, journaling, or just thinking out loud

What Changed

Sesame launched its iOS app preview recently, making it accessible beyond the web demo. The company also open-sourced its 1B-parameter CSM model on Hugging Face under Apache 2.0, and the model has been natively supported in Hugging Face Transformers since version 4.52.1. That means developers can run the model locally, fine-tune it, and build their own applications on top of it. The model itself uses a Llama backbone with a Mimi audio decoder, and it generates RVQ (residual vector quantization) audio codes from text and audio inputs.
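If "RVQ audio codes" sounds like alphabet soup, here's a toy sketch of the general idea behind residual vector quantization: each stage picks the closest entry in its codebook, then hands the leftover error to the next stage. The codebooks, the 2-D "frame", and the function names are all made up for illustration. This is not Mimi's codec or Sesame's code, just the textbook technique in miniature.

```python
# Toy residual vector quantization (RVQ) on 2-D vectors.
# The codebooks are invented for illustration; they are NOT Mimi's.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(codebooks, vec):
    """Quantize stage by stage; each stage encodes the previous stage's residual."""
    codes, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)],      # coarse stage
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],  # fine stage
]

frame = (1.2, 0.9)                      # stand-in for one frame of audio features
codes = rvq_encode(CODEBOOKS, frame)    # one small integer per stage
approx = rvq_decode(CODEBOOKS, codes)   # sum of the selected entries
print(codes, approx)                    # prints: [1, 1] [1.25, 1.0]
```

The point of the exercise: a couple of small integers stand in for a continuous vector, and every extra stage tightens the approximation. A model that emits codes like these is predicting audio, not text to be read aloud later.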

The broader vision is ambitious: Sesame is building intelligent eyewear (think AI glasses) slated for 2027, with the voice assistant as the core interface. But you don't need to wait for the glasses -- the software is worth paying attention to right now.

Pricing

Free web demo: Talk to Maya or Miles at sesame.com/voicedemo. No account required for basic use. Conversations are capped at 30 minutes, with a two-week memory window.

Free iOS preview: Available on the App Store. Same conversational experience, mobile-optimized.

Open-source model (CSM-1B): Free under Apache 2.0. You need a CUDA-compatible GPU and access to Llama-3.2-1B on Hugging Face. Run it locally, no API costs, no per-minute charges.

There's no paid tier yet. The company is in research preview mode, which means everything is free and the business model is still being figured out. That's either exciting or concerning depending on your perspective.

What It Actually Does

Sesame's CSM isn't doing text-to-speech on top of an LLM. That's what makes it different. Traditional voice assistants work in a pipeline: speech-to-text (STT) transcribes what you said, an LLM writes a text reply, and text-to-speech (TTS) reads that reply aloud. Each step adds latency and strips away emotional information. When you laugh, the STT layer flattens it to text. When the LLM responds, it has no idea you were laughing. The TTS layer reads the response without any emotional context from the original exchange.
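To make the information loss concrete, here's a deliberately tiny sketch. The `Utterance` class, the `cues` dictionary, and both function names are hypothetical stand-ins I invented for this example; real audio obviously isn't a tidy dictionary of labels. The only thing it's meant to show is what survives the hand-off in each design.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    words: str
    # Hypothetical stand-in for everything audio carries besides the words.
    cues: dict = field(default_factory=dict)

def speech_to_text(utt: Utterance) -> str:
    """A transcription step keeps the words and drops everything else."""
    return utt.words

def cascaded_context(history):
    """What a text-only LLM receives in an STT -> LLM -> TTS pipeline."""
    return [speech_to_text(u) for u in history]

def audio_conditioned_context(history):
    """What a model conditioned on the audio itself could, in principle, use."""
    return [(u.words, u.cues) for u in history]

history = [Utterance("oh that's great", {"laughing": True, "pace": "fast"})]

print(cascaded_context(history))           # the laugh is gone
print(audio_conditioned_context(history))  # the laugh is still there
```

In the cascaded version, "oh that's great" said through laughter and "oh that's great" said through gritted teeth look identical by the time the LLM sees them.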

CSM generates audio with the conversation's audio as part of its input. It processes the conversational history as audio rather than as a flattened transcript, which means it can pick up on tone, pace, pauses, and emotional cues. The result is a conversation that flows naturally. You can interrupt Maya mid-sentence and she'll react. She'll pause to think. She'll laugh. She'll push back on a bad idea. It's the first time a voice assistant has felt like talking to someone rather than talking at something.

The model takes text and audio prompts as input and generates audio codes using a Llama backbone and a Mimi audio decoder. The open-source 1B variant is fine-tuned for the interactive voice demo, and the company has a hosted Hugging Face space where you can test audio generation without setting up the local environment.
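For the developers in the audience, here's a rough sketch of what generating a clip through Transformers might look like. The class and method names (`CsmForConditionalGeneration`, `apply_chat_template`, `output_audio=True`, `save_audio`) and the `sesame/csm-1b` model ID follow the Hugging Face model card as I understand it, so treat them as assumptions and check the current docs before relying on them. The `build_conversation` helper, the output path, and the sample sentence are mine, purely for illustration.

```python
def build_conversation(turns):
    """Turn (speaker_id, text) pairs into chat-template style messages.

    Assumption: the role is the speaker id as a string, as in the model card.
    """
    return [
        {"role": str(speaker), "content": [{"type": "text", "text": text}]}
        for speaker, text in turns
    ]

def generate_speech(turns, out_path="csm_sample.wav", model_id="sesame/csm-1b"):
    """Generate a WAV file for the given turns.

    Needs torch, transformers >= 4.52.1, and access to the model weights.
    Not called below, so the snippet runs without a GPU or a download.
    """
    # Imported lazily so build_conversation works without the heavy dependencies.
    import torch
    from transformers import AutoProcessor, CsmForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

    inputs = processor.apply_chat_template(
        build_conversation(turns), tokenize=True, return_dict=True
    ).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, out_path)
    return out_path

conversation = build_conversation([(0, "The demo is worth five minutes of your time.")])
print(conversation)
```

Calling `generate_speech([(0, "Hello there.")])` is the part that actually needs the GPU and the weights. If you just want to hear the model, the hosted Space is the faster route.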

Where It Wins

The conversation quality is genuinely remarkable. I've tried every voice assistant on the market -- Siri, Alexa, Google Assistant, ChatGPT voice mode, Gemini Live -- and none of them come close to the naturalness of Sesame. The latency is low enough that you don't feel like you're waiting. The interruptions work. The emotional range is there. Maya sounds like she's actually thinking about what you said, not just transcribing it and fetching a response.

The open-source angle is a massive win. Developers can download the model, run it locally, and build their own applications. No API lock-in, no per-minute costs, no vendor dependency. For a voice model this capable to be available under Apache 2.0 is unusual -- most companies in this space guard their models jealously.

The simplicity of the interface matters too. You go to the website, you talk. No setup, no configuration, no "hey Siri" wake word, no skills to enable. It's just a conversation.

Where It Falls Short

It's still a research preview. That means no integrations with your calendar, your email, your smart home, or anything else. Maya can't book a meeting, set a timer, or turn off your lights. She's a great conversationalist but a terrible assistant in the practical sense. If you want a voice interface for your productivity stack, this isn't it -- at least not yet.

The 30-minute conversation cap is limiting. You can redial, but the flow breaks. And while the two-week memory window is interesting for continuity, it means your conversations aren't truly persistent. This is a demo, not a product.

The open-source model requires a CUDA-compatible GPU (the code has been tested on CUDA 12.4 and 12.6), which rules out most Mac users. There's no CPU-only mode that produces results at a usable speed. The setup process is developer-oriented -- if you're not comfortable with Python environments, the Hugging Face CLI, and command-line tools, you're stuck with the web demo.
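Not sure which camp your machine falls into? A quick check like the one below will tell you before you spend an evening on setup. It's a minimal sketch that assumes PyTorch is how you'd run the model; `pick_device` and `describe` are hypothetical helper names, not part of Sesame's code.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch isn't installed, so there's no GPU path yet
    return "cuda" if torch.cuda.is_available() else "cpu"

def describe(device):
    """Translate the device string into plain-English advice."""
    if device == "cuda":
        return "CUDA GPU found: running CSM-1B locally is worth a try."
    return "No CUDA GPU found: stick with the web demo or the hosted Space."

print(describe(pick_device()))
```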

There are also the usual AI concerns: hallucinations, jailbreaks (people have already gotten Maya to say wild things), and the uncanny valley moments where the conversation is almost perfect but just off enough to feel strange. The company explicitly says they don't use conversations for training, which is good, but the data privacy picture for a voice assistant that remembers you for two weeks isn't fully clear.

How You Can Use This

For everyday users: Go to sesame.com, click the voice demo, and talk to Maya. Use it as a sounding board for ideas, a language practice partner, or just a genuinely interesting conversation. It's free, it's fun, and it'll give you a glimpse of what voice AI is going to feel like in a few years. The iOS app is the same experience on your phone -- good for commutes or walks.

For professionals: The open-source model is the real opportunity here. If you're building anything with voice -- customer service bots, accessibility tools, language learning apps, companion experiences -- CSM-1B gives you a foundation that's dramatically better than stitching together STT, LLM, and TTS. You'll need GPU infrastructure, but the model is free and the license lets you use it commercially. For content creators, the voice generation capabilities could be useful for prototyping voiceovers, podcasts, or interactive experiences. For researchers, it's a rare open look at a state-of-the-art conversational speech model.

The Bottom Line

Sesame AI is the first voice assistant that made me want to keep talking after the first five minutes. Not because it's useful -- it isn't, at least not in the "set a timer and check my calendar" sense. But because the conversation feels real in a way that no other voice AI has achieved. The open-source model is a gift to developers. The web demo is a glimpse of the future. And the glasses coming in 2027 might be the first wearable AI device that actually delivers on the promise.

If you've written off voice assistants as a category because Siri and Alexa burned you, give Sesame five minutes. It won't change your life today, but it'll change your expectations for what voice AI can be.

See you next Tuesday.
-James

Maya never once told me "I'm sorry, I didn't catch that" -- which automatically makes her the best voice assistant I've ever used.