🗣 Voice data collection

People across the world contribute voice clips under a Creative Commons Zero (CC0) licence to build a dataset that voice technologies can use to train models in different languages, democratizing voice technology.

How is my data being used?

The Common Voice dataset has been used to build everything from Welsh voice assistants and Divehi text-to-speech tools to Kinyarwanda Covid-19 support chatbots. 🎬 Learn how Alessio built an offline voice message transcription tool.

You can use the dataset too!

For details on data collected through the Common Voice platform, please read our terms of service and privacy notice.

How much data do you need to make any voice tech tool?

Techniques such as transfer learning can leverage existing datasets to help build speech recognition models for languages that lack representation. For example, 77 hours of Welsh Common Voice data, combined with “transfer learning and a domain-specific language model”, were enough to reach “acceptable speech recognition”. Learn more in Dewi’s paper.

Mobilising people and organisations to support the growth of your language can also help to build more advanced STT models.

Data quantities

What should I consider when contributing Voice Data?

The validation guidelines were influenced by community contributors. We encourage you to help localise the guidelines to better fit your language's needs.

Representation Bias

Speech recognition doesn’t work equally well for all demographics. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language. Balancing the dataset across demographics can help to tackle bias in voice technology.
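
As a rough illustration of what checking the balance of a dataset can look like, here is a minimal Python sketch that computes the share of clips per demographic value. The field names (`gender`, `age`), the sample rows, and the helper `demographic_shares` are assumptions made for this example, not part of any official Common Voice tooling.

```python
from collections import Counter


def demographic_shares(rows, field):
    """Return the share of clips for each value of `field`.

    Rows where the field is missing or blank are counted as "unspecified".
    """
    counts = Counter((row.get(field) or "unspecified") for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical clip metadata, one dict per recorded clip.
clips = [
    {"client_id": "a", "gender": "female", "age": "twenties"},
    {"client_id": "b", "gender": "male", "age": "thirties"},
    {"client_id": "c", "gender": "male", "age": "twenties"},
    {"client_id": "d", "gender": "", "age": ""},
]

shares = demographic_shares(clips, "gender")
for value, share in sorted(shares.items()):
    print(f"{value}: {share:.0%}")
```

A large "unspecified" share, or one group far ahead of the others, suggests where new contributors would help the most.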

🎬 Watch the Break the Bias Workshop

⚠️ _Note: Once you have recorded a decent number of clips in your language (around 300), it is more valuable, for less effort, to help bring in new voices from other people and to focus on validating clips. This will increase the quality of the dataset._

