Microsoft launches 3 AI models for transcription, image, and speech generation

Through these three models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — Microsoft aims to expand its push into multimodal AI capabilities for developers. The models are also being integrated into Microsoft products, including Copilot, Bing, ...

Microsoft on Thursday announced three new models from its Microsoft AI (MAI) model family for transcription, image, and speech generation.

The three models are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, and they are part of Microsoft’s push to expand multimodal artificial intelligence (AI) capabilities for developers.

The models are available from today on Microsoft Foundry and the MAI Playground. Foundry, formerly Azure AI Studio, is a unified AI platform to build, customise, and scale generative AI (GenAI) applications and agents, while the MAI Playground is a public testing environment where users can experiment with features and provide feedback.


“Consistent with our commitment to safe and responsible AI, these MAI models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers get built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale,” wrote Mustafa Suleyman in a blog post. Suleyman leads the AI division at Microsoft.

MAI-Transcribe-1 is a speech-to-text model that supports transcription across the 25 most widely used languages, including Hindi. According to Microsoft, the model has a lower average word error rate (WER) than Google’s Gemini 3.1 Flash and OpenAI’s GPT-Transcribe. WER evaluates the accuracy of automatic speech recognition (ASR) systems by measuring the percentage of words a model gets wrong.
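
For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the model's transcript into a reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard calculation. It is not Microsoft's evaluation code: the function name `word_error_rate` and the plain whitespace tokenisation are assumptions made for illustration, and real benchmarks usually normalise case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and does no text normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words.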

The model offers batch transcription speeds up to 2.5 times faster than Microsoft’s existing Azure Fast offering. The starting price of the model is $0.36 per hour.

Meanwhile, using MAI-Voice-1, developers will be able to create custom voices with a few seconds of input audio. The model can generate up to 60 seconds of audio in one second, with pricing starting at $22 per one million characters.

Finally, MAI-Image-2, Microsoft’s latest image generation model, introduced only in the MAI Playground last month, is now broadly accessible via Foundry. The model delivers at least twice the generation speed compared to earlier versions, based on production data, while maintaining output quality. Pricing starts at $5 per one million text tokens and $33 per one million image tokens.
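
To put the quoted starting prices in perspective, the rough Python sketch below estimates what a workload might cost. It rests on several assumptions that the announcement does not confirm: that billing scales linearly at the starting price with no tiers or minimums, and that the transcription rate of $0.36 per hour refers to an hour of audio. The function names are hypothetical and do not correspond to any Microsoft API.

```python
# Starting prices as quoted in the article (USD).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36    # assumed to mean per hour of audio
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost, assuming linear per-hour billing."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost, assuming linear per-character billing."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_tokens: int, image_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text and image token counts."""
    return (
        text_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_TOKENS
        + image_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_TOKENS
    )


if __name__ == "__main__":
    print(f"100 hours of audio:   ${transcription_cost(100):.2f}")
    print(f"500,000 characters:   ${voice_cost(500_000):.2f}")
    print(f"1M text + 1M image:   ${image_cost(1_000_000, 1_000_000):.2f}")
```

Under these assumptions, transcribing 100 hours of audio would cost about $36 and synthesising 500,000 characters of speech about $11. Actual bills would depend on Microsoft's published pricing terms.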

The models are also being integrated into Microsoft products, including Copilot, Bing, and PowerPoint, with enterprise adoption already underway.