
Gemini 3.1 Flash TTS

Sebastien Dubois

16 Apr 2026

This is a note from my public notes. View the canonical version: Gemini 3.1 Flash TTS.

Google's latest text-to-speech (TTS) model as of April 2026, part of the Gemini family. It focuses on controllability, expressivity, and quality for developers and enterprises building speech applications, and it complements Gemini 3.1 Flash Live (real-time dialogue) with higher-quality, more controllable offline generation.

Key Capabilities

Audio tags for style control: natural-language inline tags (e.g. [whispers], [laughs], [excited], [sighs], [sarcastic], [very fast]) give granular control over delivery, tone, pace, and non-verbal sounds. Tags can be combined and placed mid-sentence; there is no exhaustive list, and experimentation is encouraged.
Creative expressivity: supports stylistic directives such as [like a cartoon dog] or [like dracula], as well as scene direction and speaker-level audio profiles.
Multi-speaker dialogue: native support for multi-speaker conversations with distinct voices.
Broad language coverage: 70+ languages. English tags are recommended even for non-English transcripts.
Quality: 1,211 Elo on the Artificial Analysis TTS leaderboard; positioned in the attractive (high-quality, low-cost) quadrant of the quality-vs-cost comparison.
SynthID watermarking: all generated audio is invisibly watermarked to flag AI-generated content.
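
As a rough illustration of the inline tags and multi-speaker transcripts described above, here is a minimal Python sketch that only assembles the transcript text; it does not call any API. The tag names come from the list above. The helper names (tag, line, dialogue), the speaker names, and the "Speaker: text" line layout are assumptions made for this example, not something the announcement specifies.

```python
def tag(name: str) -> str:
    """Wrap a natural-language style directive in square brackets."""
    return f"[{name.strip()}]"


def line(speaker: str, *parts: str) -> str:
    """Join text fragments and tags into one speaker-labelled line.

    The "Speaker: text" layout is an assumption for this sketch.
    """
    body = " ".join(p.strip() for p in parts if p.strip())
    return f"{speaker}: {body}"


def dialogue(*lines: str) -> str:
    """Stack speaker lines into a single multi-speaker transcript."""
    return "\n".join(lines)


# Tags can sit at the start, middle, or end of a sentence.
transcript = dialogue(
    line("Alice", tag("excited"), "We shipped it!", tag("laughs"), "Finally."),
    line("Bob", tag("whispers"), "Don't tell anyone yet.", tag("sighs")),
)

print(transcript)
```

The resulting string is what would be sent as the text to synthesize; which voices map to "Alice" and "Bob" would be configured separately in whatever client is used.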

Availability (April 15, 2026)

Developers: preview via Gemini API and Google AI Studio (configurable controls with exportable API code).
Enterprises: preview on Google Vertex AI.
Consumers: available via Google Vids for Workspace users.

Notable Observations

Simon Willison flagged the prompting guide as "surprising": effective prompts can span hundreds of words, specifying accent, emotional shading, even "the grin in the audio".
Accent control is prompt-driven; switching between UK regions (London, Newcastle, Exeter) in the same base prompt produces distinct regional deliveries.
Willison built an interactive playground at tools.simonwillison.net for multi-speaker experimentation.
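
The accent observation above amounts to holding one base prompt fixed and changing only the region. A minimal sketch of that idea in Python follows; the prompt wording, the BASE_PROMPT template, and the accent_variants helper are hypothetical and written for illustration, not taken from the prompting guide or from Willison's playground.

```python
# Hypothetical base prompt: only the {accent} slot changes between variants.
BASE_PROMPT = (
    "Read the transcript as a warm, upbeat radio host with a {accent} accent. "
    "Keep a smile in the voice throughout.\n\n"
    "Transcript:\n{transcript}"
)


def accent_variants(transcript: str, accents: list[str]) -> dict[str, str]:
    """Return one full prompt per accent, identical apart from the region."""
    return {
        accent: BASE_PROMPT.format(accent=accent, transcript=transcript)
        for accent in accents
    }


prompts = accent_variants(
    "[excited] Good morning, everyone!",
    ["London", "Newcastle", "Exeter"],
)

for accent, prompt in prompts.items():
    print(f"--- {accent} ---")
    print(prompt)
```

Sending each variant to the model and comparing the audio is one way to check how much of the delivery is driven by the single changed word.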

Why It Matters

Moves TTS from "read this text" to "perform this text". Unlocks voice-first agents, character-driven audio, multi-speaker narration, and podcast-style content without manual voice acting. It also accelerates plausible audio impersonation, a risk only partially mitigated by SynthID watermarking.

References

Announcement: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/
Transcript tags docs: https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags
Simon Willison's notes: https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/
Playground: https://tools.simonwillison.net/
