Gemini 3.1 Flash TTS
Google's latest Text-to-Speech (TTS) model (April 2026), part of the Gemini family. Focused on controllability, expressivity, and quality for developers and enterprises building speech applications. Complements Gemini 3.1 Flash Live (real-time dialogue) with higher-quality, more controllable offline generation.
Key Capabilities
- Audio tags for style control: natural-language inline tags (e.g. [whispers], [laughs], [excited], [sighs], [sarcastic], [very fast]) give granular control over delivery, tone, pace, and non-verbal sounds. Tags can be combined and mixed mid-sentence; no exhaustive list, experimentation encouraged.
- Creative expressivity: supports stylistic directives like [like a cartoon dog] or [like dracula]; scene direction and speaker-level audio profiles.
- Multi-speaker dialogue: native support for multi-speaker conversations with distinct voices.
- Broad language coverage: 70+ languages. For non-English transcripts, English tags recommended.
- Quality: 1,211 Elo on the Artificial Analysis TTS leaderboard, which places it in the leaderboard's attractive quadrant (high quality at low cost) on the quality-vs-cost chart.
- SynthID watermarking: all generated audio is invisibly watermarked to flag AI-generated content.
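Since audio tags are plain text inside the transcript, a tagged multi-speaker script can be assembled with ordinary string handling before it is sent to the model. The sketch below is only an illustration: the helper names `tagged` and `dialogue` are hypothetical (not part of any Google SDK), the `Speaker: line` layout is an assumption about how a multi-speaker transcript might be written, and the API request itself is left out.

```python
def tagged(text: str, *tags: str) -> str:
    """Prefix text with inline audio tags, e.g. [whispers] or [very fast].

    Hypothetical helper: it only builds a string, it does not call any API.
    """
    prefix = " ".join(f"[{t.strip()}]" for t in tags if t.strip())
    return f"{prefix} {text}" if prefix else text


def dialogue(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into a transcript, one turn per line.

    The "Speaker: line" layout is an assumption made for this sketch.
    """
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)


# Tags can be combined on one line and switched mid-sentence.
transcript = dialogue([
    ("Host", tagged("Welcome back to the show.", "excited")),
    ("Guest", tagged("Thanks.", "sighs") + " "
              + tagged("It has been a long week.", "very fast")),
])
print(transcript)
```

Keeping the tags in a helper rather than hand-typing brackets makes it easier to experiment with different deliveries for the same lines, which matters given that there is no fixed tag list.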
Availability (April 15, 2026)
- Developers: preview via Gemini API and Google AI Studio (configurable controls with exportable API code).
- Enterprises: preview on Google Vertex AI.
- Consumers: available via Google Vids for Workspace users.
Notable Observations
- Simon Willison flagged the prompting guide as "surprising" — effective prompts can span hundreds of words, specifying accent, emotional shading, even "the grin in the audio".
- Accent control is prompt-driven; switching between UK regions (London, Newcastle, Exeter) in the same base prompt produces distinct regional deliveries.
- Willison built an interactive playground at tools.simonwillison.net for multi-speaker experimentation.
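Because accent is controlled through the prompt, comparing regions comes down to changing one value in an otherwise identical base prompt. A minimal sketch of that idea follows; the prompt wording, the `BASE_PROMPT` template, and the `prompts_for_regions` helper are all hypothetical and written for illustration, not taken from Google's prompting guide or from Willison's playground.

```python
from string import Template

# Hypothetical base prompt: only $region changes between variants.
BASE_PROMPT = Template(
    "Read the transcript below as a radio presenter from $region, England. "
    "Keep the accent consistent and let a grin come through in the delivery.\n\n"
    "Transcript:\n$transcript"
)


def prompts_for_regions(transcript: str, regions: list[str]) -> dict[str, str]:
    """Return one full prompt per region, sharing the same base wording."""
    return {
        region: BASE_PROMPT.substitute(region=region, transcript=transcript)
        for region in regions
    }


prompts = prompts_for_regions(
    "[excited] Good morning, everyone!",
    ["London", "Newcastle", "Exeter"],
)
for region, prompt in prompts.items():
    print(f"--- {region} ---\n{prompt}\n")
```

Each resulting prompt would then be submitted as a separate generation request, so the outputs differ only in the regional direction.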
Why It Matters
Moves TTS from "read this text" to "perform this text". Unlocks voice-first agents, character-driven audio, multi-speaker narration, and podcast-style content without manual voice acting; also accelerates plausible audio impersonation (mitigated partially by SynthID).
References
- Announcement: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/
- Transcript tags docs: https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags
- Simon Willison's notes: https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/
- Playground: https://tools.simonwillison.net/
About Sébastien
I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.
I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.