🗣 Voice data collection

People across the world contribute voice clips under a Creative Commons Zero (CC0) licence to build a dataset that voice technologies can use to train models in different languages, democratizing voice technology.

How is my data being used?

The Common Voice dataset has been used to build everything from Welsh voice assistants and Divehi text-to-speech tools to Kinyarwanda Covid-19 support chatbots. 🎬 Learn how Alessio made offline voice message transcription.

You can use the dataset too!

For data collected through the Common Voice platform, please read our terms of service and privacy notice.

How much data do you need to make any voice tech tool?

Techniques such as transfer learning can leverage existing datasets to help build speech recognition models for languages that lack representation. For example, 77 hours of Welsh Common Voice data, combined with “transfer learning and a domain-specific language model”, was enough for “acceptable speech recognition”. Learn more in Dewi’s paper.

Mobilising people and organisations to support the growth of your language can also help to build more advanced speech-to-text (STT) models.

At least 1,000 unique speakers per language.
2,000 validated hours of voice to train a near-human general STT model.
10,000 validated hours of voice for a very high-quality, general, large-vocabulary, continuous speech recognition model.
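
As a rough illustration of how the guideline figures above could be tracked for a language, here is a minimal Python sketch. The function name, the dictionary keys and the example numbers are hypothetical and are not part of the Common Voice platform; only the three thresholds come from the list above.

```python
# Guideline figures from the list above. All names here are illustrative.
MIN_SPEAKERS = 1_000          # unique speakers per language
NEAR_HUMAN_HOURS = 2_000      # validated hours for a near-human general STT model
HIGH_QUALITY_HOURS = 10_000   # validated hours for a very high-quality model


def remaining_gaps(validated_hours, unique_speakers):
    """Return how far a language's dataset is from each guideline (0 if met)."""
    return {
        "speakers_to_minimum": max(0, MIN_SPEAKERS - unique_speakers),
        "hours_to_near_human": max(0, NEAR_HUMAN_HOURS - validated_hours),
        "hours_to_high_quality": max(0, HIGH_QUALITY_HOURS - validated_hours),
    }


# Hypothetical language with 120 validated hours and 400 unique speakers.
print(remaining_gaps(120, 400))
```

A result of 0 for a key means that guideline has been reached. These figures are guidelines rather than hard limits, since techniques such as transfer learning can produce useful models with far less data.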

Data quantities

What should I consider when contributing Voice Data?

Validation guidelines were influenced by community contributors. We encourage you to help localise the guidelines to better fit your language’s needs.

Representation Bias

Speech recognition doesn’t work equally well for all demographics. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language. Balancing the dataset across demographics can help to tackle bias in voice technology.
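
To make the idea of balance concrete, the following Python sketch tallies what share of clips falls into each value of one demographic field. It assumes the clip metadata has already been loaded as a list of dictionaries (for example with `csv.DictReader`). The field name `gender`, the labels and the sample rows are illustrative assumptions, not a description of the exact release format.

```python
from collections import Counter


def demographic_shares(rows, field):
    """Return each value's share of clips for one demographic field.

    Rows with a missing or empty value are counted as "unspecified".
    """
    counts = Counter((row.get(field) or "unspecified") for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Hypothetical metadata rows, one per clip.
rows = [
    {"gender": "female"},
    {"gender": "male"},
    {"gender": "male"},
    {"gender": ""},  # contributor did not share this information
]
print(demographic_shares(rows, "gender"))
```

A large "unspecified" share is a signal that more contributors could be encouraged to fill in their profile, and a skewed split suggests where outreach for new voices might focus.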

🎬 Watch the Break the Bias Workshop

Share information about your accent, age group, gender and dialect on your profile
Take part in the Language Variant selection process
Register your interest in being part of the Gender Action Group
Read the Mozilla Kiswahili Project Gender Action Plan

⚠️ Note: Once you have recorded a decent number of clips in your language (around 300), it is more valuable, for less effort, to help bring in new voices from other people and to focus on voice validation. This will increase the dataset quality.

Playbook Pages

Localization: Translating project tools and material to be understood by contributors in their language
Text Corpus: Gathering, validating and processing public domain sentences
Voice Corpus: Recording and validating voice clips to create a public domain dataset
Communities: Connect with the variety of communities participating in Common Voice
Mobilization: Resources and tips for mobilizing your community