Microsoft launches 3 AI models for transcription, image, and speech generation

Through these three models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — Microsoft aims to expand its push into multimodal AI capabilities for developers. The models are also being integrated into Microsoft products, including Copilot, Bing, ...

By ETtech | Apr 02, 2026, 09.38 PM IST

Microsoft on Thursday announced three new models from its Microsoft AI (MAI) model family for transcription, image, and speech generation.

The three models are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, with which Microsoft aims to expand its push into multimodal artificial intelligence (AI) capabilities for developers.

The models are available from today on Microsoft Foundry and the MAI Playground. Foundry, formerly Azure AI Studio, is a unified AI platform to build, customise, and scale generative AI (GenAI) applications and agents. The MAI Playground is a public testing environment where users can experiment with features and provide feedback.

“Consistent with our commitment to safe and responsible AI, these MAI models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers get built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale,” wrote Mustafa Suleyman in a blog post. Suleyman leads the AI division at Microsoft.

MAI-Transcribe-1 is a speech-to-text model that supports transcription across the 25 most widely used languages, including Hindi. According to Microsoft, the model has a lower average word error rate (WER) than Google’s Gemini 3.1 Flash and OpenAI’s GPT-Transcribe. WER evaluates the accuracy of Automatic Speech Recognition (ASR) systems by measuring the percentage of words a model gets wrong.
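For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition. It is not Microsoft's evaluation code, the function name `word_error_rate` is made up for illustration, and it assumes simple whitespace tokenisation with no text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Illustrative only: splits on whitespace and compares words exactly,
    with no casing or punctuation normalisation.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current

    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.1%}")
```

Because inserted words count as errors, WER under this definition can exceed 100 percent when a transcript contains many spurious words.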

The model offers batch transcription speeds up to 2.5 times faster than Microsoft’s existing Azure Fast offering. The starting price of the model is $0.36 per hour.

Meanwhile, using MAI-Voice-1, developers will be able to create custom voices with a few seconds of input audio. The model can generate up to 60 seconds of audio in one second, with pricing starting at $22 per one million characters.

Finally, MAI-Image-2, Microsoft’s latest image generation model, introduced only in the MAI Playground last month, is now broadly accessible via Foundry. The model delivers at least twice the generation speed compared to earlier versions, based on production data, while maintaining output quality. Pricing starts at $5 per one million text tokens and $33 per one million image tokens.
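To put the quoted starting prices in context, the Python sketch below estimates costs for a given workload. It uses only the figures reported above and assumes they apply linearly with no tiers, minimums, or discounts, which the announcement as reported here does not confirm. All function and constant names are hypothetical.

```python
# Starting prices as quoted in the article, in US dollars.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost, assuming flat per-hour pricing."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost, assuming flat per-character pricing."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_tokens: int, image_tokens: int) -> float:
    """Estimated MAI-Image-2 cost, assuming flat per-token pricing."""
    return (
        text_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_TOKENS
        + image_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_TOKENS
    )


if __name__ == "__main__":
    # Example workload sizes are arbitrary and chosen only for illustration.
    print(f"100 hours of audio: ${transcription_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"1M text + 1M image tokens: ${image_cost(1_000_000, 1_000_000):.2f}")
```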

The models are also being integrated into Microsoft products, including Copilot, Bing, and PowerPoint, with enterprise adoption already underway.
