Project Vaani

December 26, 2022

Project Vaani will be implemented jointly by the Indian Institute of Science (IISc), ARTPARK (AI and Robotics Technology Park), and Google. It will gather speech data from across India to build an AI-based language model that can understand diverse Indian languages and dialects.

What is Project Vaani?

Under Project Vaani, the diverse languages used across India will be mapped by collecting speech samples from around 1 million people in 773 districts over 3 years.
The estimated cost of this project is around 30 to 40 million USD.
It is part of the Bhasha AI project run by the Bengaluru-based IISc and ARTPARK, which also includes RESPIN (Recognizing Speech in Indian languages) and SYSPIN (Synthesizing Speech in Indian languages).
The project would involve IISc and Google recording around 1.5 lakh (150,000) hours of speech, part of which will be transcribed in local scripts.
The project uses a district-anchored approach: local speech is recorded from over 1,000 randomly selected people in each district.
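The targets above can be cross-checked with some simple arithmetic. The sketch below is illustrative only: it assumes the 1 million speakers are spread evenly across the 773 districts and that all 1.5 lakh hours come from those speakers, neither of which the article states. The function and constant names are made up for this example.

```python
# Figures quoted in the article (rounded; "1.5 lakh" = 150,000).
TARGET_SPEAKERS = 1_000_000
TARGET_DISTRICTS = 773
TARGET_HOURS = 150_000


def speakers_per_district(speakers: int, districts: int) -> float:
    """Average number of speakers per district, assuming an even spread."""
    return speakers / districts


def minutes_per_speaker(hours: int, speakers: int) -> float:
    """Average recording time per speaker in minutes, assuming every
    recorded hour comes from the counted speakers."""
    return hours * 60 / speakers


if __name__ == "__main__":
    per_district = speakers_per_district(TARGET_SPEAKERS, TARGET_DISTRICTS)
    per_speaker = minutes_per_speaker(TARGET_HOURS, TARGET_SPEAKERS)
    print(f"About {per_district:.0f} speakers per district")
    print(f"About {per_speaker:.0f} minutes of speech per speaker")
```

Under these assumptions the average works out to roughly 1,294 speakers per district, consistent with the "over 1,000 people from each district" figure, and about 9 minutes of recorded speech per speaker.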

What are the objectives of the initiative?

One of the main objectives of this project is the development of technologies like automatic speech recognition, speech-to-speech translation and natural language understanding.
Its ultimate goal is to remove the language barriers that currently exist in technology and make that technology accessible to a wider range of people.
Once the data collection is complete, efforts will be made to create an artificial intelligence-based language model that can understand the diverse languages and dialects used in India.
The new model proposed under the Vaani project would support both speech and text. This would be a leap from Multilingual Representations for Indian Languages (MuRIL), which supports only text. The new model would be trained on speech and text from over 100 Indian languages, each spoken by over 1 lakh (100,000) people.

What is the current status of the project?

Over the past few months, linguistic data have been collected from nearly 69 districts across India.
So far, over 150 hours of data have been collected, covering more than 30 languages from 841 different pin codes in a gender- and age-balanced manner.
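To put these status figures in context, the short sketch below compares them with the overall targets quoted earlier (773 districts and 1.5 lakh hours). It treats the reported numbers as exact, although the article gives them as approximations ("nearly 69", "over 150"), and the helper name is made up for this example.

```python
def percent_complete(done: float, target: float) -> float:
    """Share of a target reached so far, as a percentage."""
    return 100 * done / target


if __name__ == "__main__":
    # Reported progress versus the stated targets.
    districts = percent_complete(69, 773)       # districts covered
    hours = percent_complete(150, 150_000)      # hours recorded
    print(f"Districts covered: {districts:.1f}%")
    print(f"Hours recorded: {hours:.1f}%")
```

On these figures, roughly 8.9% of districts had been covered and about 0.1% of the targeted hours had been recorded.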
