Sarvam AI launches Bulbul V3, wins praise for the Indic text-to-speech model
Per Sarvam cofounder Pratyush Kumar, Bulbul V3 is the fifth of 14 planned launches. In a study, Bulbul V3 topped the charts for 8 kHz audio, setting what Kumar called a new benchmark for speech synthesis for voice agents, with listeners tagging re...

The release has drawn strong praise from across the AI community. Notably, Deedy Das, partner at Menlo Ventures, which backs companies such as Anthropic, walked back his earlier criticism of Sarvam. He said he was “wrong” about the startup and added that Sarvam now offers the best text-to-speech, speech-to-text, and optical character recognition (OCR) models for Indic languages, calling the work “really valuable.”
Bulbul V3: About the model
Introducing the model on X, Sarvam cofounder Pratyush Kumar described Bulbul V3 as the fifth of 14 planned launches. “In an independent third-party human listening study, Bulbul V3 delivers the highest listener preference and low error rates across use cases and languages,” he said.
In a follow-up thread, Kumar explained that the model was tested in a blind listening study conducted by independent research partner Josh Talks AI. Listeners compared Bulbul V3 with ElevenLabs (v3 alpha and v2.5 flash) and Cartesia Sonic-3.
What sets Bulbul V3 apart
In a blog post, Sarvam said Bulbul V3 raises the bar across three areas that matter most for real-world speech systems:
- Naturalness: Achieves high listener preference at 48 kHz and ranks as the most preferred model for 8 kHz telephony, outperforming competitors.
- Robustness: Shows low character error rates on difficult inputs such as code-mixing and numerics.
- Stability: Records the fewest word skips and mispronunciations, even in long-form and high-volume usage.
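For readers unfamiliar with the metric in the robustness point, character error rate (CER) is conventionally the character-level edit distance between a reference text and a transcript, divided by the reference length. The article does not say how Sarvam or Josh Talks AI computed it, so the Python sketch below only illustrates the standard definition. The function names and the sample strings are hypothetical, and it assumes a transcript of the synthesized audio is already available.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum character insertions,
    deletions, and substitutions to turn reference into hypothesis."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a dropped digit in a spoken number.
reference_text = "call 1800 now"
transcript = "call 800 now"
print(f"CER: {character_error_rate(reference_text, transcript):.3f}")
```

A lower value means fewer character-level mistakes. Numerics and code-mixed text tend to raise it because a single misread digit or word changes several characters at once.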
The company said the study covered two test conditions — general full-band audio and 8 kHz telephony-grade audio — to reflect both studio-quality and real-world use. Each language had 50 to 70 annotators, producing around 2,000 votes per language, with more than 500 annotators taking part overall.
In the post, Kumar added that listeners also tagged real failure cases to measure stability. “Bulbul V3 comes out on top, with the lowest average error rates,” he said.
“We also evaluated for the long tail of language challenges, such as speaking numerics, technical content, and named entities. Bulbul V3 consistently has the lowest error rates across languages,” he added.
New voice library
Alongside the model, Sarvam unveiled a new voice library with over 30 professional-quality voices across 11 Indian languages, all recorded by trained voice artists. According to the company, this gives voices greater depth, clarity, and emotional range, especially for long-form audio.
Sarvam said support will soon expand to 22 Indian languages.
The model also supports voice cloning, so custom voices can be created while retaining natural quality. This, the company said, “enables brand-specific voices, consistent character identities, and personalised experiences at scale.”