Gemini 3.1 Flash TTS

Google's latest Text-to-Speech (TTS) model (April 2026), part of the Gemini family. Focused on controllability, expressivity, and quality for developers and enterprises building speech applications. Complements Gemini 3.1 Flash Live (real-time dialogue) with higher-quality, more controllable offline generation.

Key Capabilities

  • Audio tags for style control: natural-language inline tags (e.g. [whispers], [laughs], [excited], [sighs], [sarcastic], [very fast]) give granular control over delivery, tone, pace, and non-verbal sounds. Tags can be combined and placed mid-sentence. There is no exhaustive list, so experimentation is encouraged.
  • Creative expressivity: supports stylistic directives such as [like a cartoon dog] or [like dracula], along with scene direction and speaker-level audio profiles.
  • Multi-speaker dialogue: native support for multi-speaker conversations with distinct voices.
  • Broad language coverage: 70+ languages. English tags are recommended even when the transcript is in another language.
  • Quality: 1,211 Elo on the Artificial Analysis TTS leaderboard; positioned in the attractive quadrant for quality-vs-cost.
  • SynthID watermarking: all generated audio carries an imperceptible watermark that flags it as AI-generated.
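
To make the tag and multi-speaker ideas concrete, here is a minimal sketch of how a tagged transcript could be assembled as plain text before it is sent to the model. The helper names (tag, Turn, render_turn, render_dialogue) and the "Speaker: line" layout are assumptions made for illustration; they are not part of any Google SDK, and the exact transcript format the model expects should be checked against the official prompting guide.

```python
from __future__ import annotations

from dataclasses import dataclass


def tag(name: str) -> str:
    """Wrap a free-form style directive in brackets: 'whispers' -> '[whispers]'."""
    name = name.strip()
    if not name or "[" in name or "]" in name:
        raise ValueError(f"invalid tag text: {name!r}")
    return f"[{name}]"


@dataclass
class Turn:
    """One line of dialogue (hypothetical structure, for illustration only)."""
    speaker: str
    text: str                   # may itself contain mid-sentence tags
    tags: tuple[str, ...] = ()  # directives placed before the text


def render_turn(turn: Turn) -> str:
    """Render a turn as 'Speaker: [tag] [tag] text' (assumed layout)."""
    prefix = " ".join(tag(t) for t in turn.tags)
    body = f"{prefix} {turn.text}" if prefix else turn.text
    return f"{turn.speaker}: {body}"


def render_dialogue(turns: list[Turn]) -> str:
    """Join all turns into a single newline-separated transcript."""
    return "\n".join(render_turn(t) for t in turns)


if __name__ == "__main__":
    script = [
        Turn("Ana", "I think they heard us.", ("whispers",)),
        Turn("Ben", "No way " + tag("laughs") + " really?"),
        Turn("Ana", "Run!", ("excited", "very fast")),
    ]
    print(render_dialogue(script))
```

Keeping tags as ordinary strings mirrors the note's point that there is no fixed tag vocabulary: any short natural-language directive can be tried.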

Availability (April 15, 2026)

  • Developers: preview via Gemini API and Google AI Studio (configurable controls with exportable API code).
  • Enterprises: preview on Google Vertex AI.
  • Consumers: available via Google Vids for Workspace users.
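
For developers going through the Gemini API, the following is a rough sketch of what a request body might look like. It only builds a JSON-serializable dictionary and sends nothing. The field names (responseModalities, speechConfig, voiceConfig, multiSpeakerVoiceConfig) follow the request shape documented for earlier Gemini TTS preview models; that Gemini 3.1 Flash TTS keeps the same shape is an assumption, as are the voice names used here. The code exported from Google AI Studio is the authoritative source.

```python
from __future__ import annotations

import json


def voice_config(voice_name: str) -> dict:
    """Voice selection block (shape assumed from earlier Gemini TTS models)."""
    return {"prebuiltVoiceConfig": {"voiceName": voice_name}}


def build_tts_request(transcript: str, voices: dict[str, str] | str) -> dict:
    """Build a request body for single- or multi-speaker speech generation.

    `voices` is either one voice name, or a mapping of speaker label -> voice
    name whose labels match the speaker names used in the transcript.
    """
    if isinstance(voices, str):
        speech_config = {"voiceConfig": voice_config(voices)}
    else:
        speech_config = {
            "multiSpeakerVoiceConfig": {
                "speakerVoiceConfigs": [
                    {"speaker": speaker, "voiceConfig": voice_config(name)}
                    for speaker, name in voices.items()
                ]
            }
        }
    return {
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": speech_config,
        },
    }


if __name__ == "__main__":
    body = build_tts_request(
        "Ana: [whispers] I think they heard us.\nBen: [laughs] Relax.",
        {"Ana": "Kore", "Ben": "Puck"},  # voice names are assumptions
    )
    print(json.dumps(body, indent=2))
```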

Notable Observations

  • Simon Willison flagged the prompting guide as "surprising": effective prompts can span hundreds of words, specifying accent, emotional shading, even "the grin in the audio".
  • Accent control is prompt-driven; switching between UK regions (London, Newcastle, Exeter) in the same base prompt produces distinct regional deliveries.
  • Willison built an interactive playground at tools.simonwillison.net for multi-speaker experimentation.
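
The accent experiment described above amounts to holding a base prompt fixed and changing only the region. A small sketch of that setup follows; the prompt wording and the section labels ("Audio profile", "Accent", "Style", "Transcript") are illustrative assumptions, not the layout from the official prompting guide.

```python
from __future__ import annotations

# Base prompt with a single slot for the accent (wording is hypothetical).
BASE_PROMPT = (
    "Audio profile: a warm radio presenter.\n"
    "Accent: {accent}.\n"
    "Style: friendly, with a grin you can hear in the voice.\n"
    "Transcript:\n{transcript}"
)

# The three UK regions mentioned in the note.
REGIONS = ("London", "Newcastle", "Exeter")


def accent_variants(transcript: str, regions: tuple[str, ...] = REGIONS) -> dict[str, str]:
    """Return one full prompt per region, differing only in the accent line."""
    return {
        region: BASE_PROMPT.format(
            accent=f"a speaker from {region}, England",
            transcript=transcript,
        )
        for region in regions
    }


if __name__ == "__main__":
    prompts = accent_variants("[excited] Welcome back to the show!")
    for region, prompt in prompts.items():
        print(f"--- {region} ---\n{prompt}\n")
```

Generating each variant from the same template keeps everything else constant, so any difference heard in the output can be attributed to the accent line.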

Why It Matters

Moves TTS from "read this text" to "perform this text". Unlocks voice-first agents, character-driven audio, multi-speaker narration, and podcast-style content without manual voice acting. It also makes plausible audio impersonation easier, a risk that SynthID only partially mitigates.

About Sébastien

I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.

I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.
