AI4Bharat to collect ten trillion tokens of data to power AI in Indian languages
“Several startups, academic institutes and deeptech institutes are using this data to build their own models to accelerate the adoption of language technologies,” said Mitesh Khapra, cofounder of AI4Bharat.

Tokens are the basic building blocks that AI uses to understand language. They are usually parts of words or sometimes whole words.
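To make the idea of a token concrete, here is a minimal sketch of how a word can be split into sub-word pieces. It is an illustration only: the vocabulary `TOY_VOCAB` and the helpers `tokenize` and `count_tokens` are hypothetical, and the greedy longest-match rule is a simplification, not the method used by AI4Bharat or any particular model.

```python
# Toy sub-word tokenizer: greedy longest-match against a small vocabulary.
# TOY_VOCAB is a made-up vocabulary for illustration; real models learn
# theirs from large amounts of text.
TOY_VOCAB = {"token", "s", "un", "break", "able", "the"}


def tokenize(word, vocab):
    """Split one word into the longest pieces found in vocab, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary;
        # fall back to a single character if nothing matches.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def count_tokens(text, vocab=TOY_VOCAB):
    """Count tokens in a text by tokenizing each whitespace-separated word."""
    return sum(len(tokenize(word, vocab)) for word in text.lower().split())


print(tokenize("unbreakable", TOY_VOCAB))
print(tokenize("tokens", TOY_VOCAB))
print(count_tokens("the tokens"))
```

In this sketch a whole word such as "the" is a single token, while "tokens" is split into two parts, which is why token counts are usually higher than word counts.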
“We have 200 million spoken words… four states where it is already live or in an active stage. We have use cases supporting farmers, children, digital payments and agriculture. Over the past three years, we have gone to almost every district in the country where we’ve tried to cover almost all the 22 official languages of the land,” Khapra said at the People+ai Mela in Bengaluru on Saturday.
AI4Bharat has ensured that its voice samples are spread across several demographics and across different professions, both blue-collar and white-collar, he said, adding, “Several startups, academic institutes and deeptech institutes are using this data to build their own models to accelerate the adoption of language technologies.”
The tools required for data collection have been built from the ground up, according to Khapra. “Our data, models and scripts are open sourced. You can build on top of that,” he said.
Ten trillion token project
To serve India’s diverse population, AI needs to understand Indian languages as well as it understands English, People+ai noted in a blog post.
Building AI that works for India requires something different from what works for English. English data is everywhere on the internet, making it easy to train AI models.
India has 22 major languages, each with its own script, grammar rules and cultural context, and current AI approaches simply don’t work well enough for this diversity, People+ai’s website said.
It further said, “That's why we started the ‘Ten Trillion Token’ project. We’re building the foundation for AI that can properly understand and work with Indian languages – from formal government documents to casual conversations at the local tea shop. Our goal is to collect and organise the massive amount of data needed to make AI work well for everyone in India, no matter what language they speak.”