Community Playbook

https://commonvoice.mozilla.org

🗣 Voice corpus

Our purpose

Contribute and validate voice clips under a public domain licence to generate a dataset that Speech to Text technologies can use to train models in different languages, democratizing voice technology.

Who we are

We are a community of voice tech enthusiasts who want to help collect and generate a large dataset of public domain voices that can be freely used to train Speech to Text technologies.

What success looks like

Collect and validate as many voices as possible in our languages. The more validated voices we have, the more advanced the STT models we can train.

Data quantities

How to join

🔨 You don’t need any specialized skills to contribute to this community. You only need to be able to speak into a microphone or listen to audio clips.

We have developed a site that allows you to donate your voice by reading sentences collected by the community.

  1. Check if your language is currently available for voice contributions on the languages page.

⚠️ If your language has not launched yet, you can contribute by supporting localization or the text corpus. A language needs at least 5000 validated sentences before it can be enabled on our site.

  1. Learn about how we are using your data by reading our terms of service and privacy notice.

  1. Feel free to create an account to track your progress, compare with other contributors, set yourself goals, or earn award badges. You can also optionally share more information about your voice on your profile.

Help tackle bias in speech data

💬 Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language. Balancing the dataset across demographics can help to tackle bias in voice technology.

  1. If your language is launched, you can contribute voice data or validate other people’s voice clips.

ℹ️ Each recording needs at least two positive validations from different people. Please read the community guidelines to learn how to produce better voice clips.

⚠️ Note: Once you have recorded a decent number of clips in your language (around 300), you can add more value for less effort by helping bring in new voices from other people and by focusing on voice validation, which increases the dataset quality.
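For contributors who like to see rules spelled out, the validation requirement mentioned above (at least two positive validations from different people) can be sketched in a few lines of Python. This is only an illustration of the rule as stated on this page, not the code the Common Voice site runs; the function name, the constant, and the shape of the vote data are all hypothetical, and the sketch ignores how negative votes are handled.

```python
from typing import Iterable, Tuple

# Threshold taken from the guideline on this page; the constant name is hypothetical.
REQUIRED_POSITIVE_VALIDATIONS = 2


def has_enough_validations(votes: Iterable[Tuple[str, bool]]) -> bool:
    """Return True when enough different people have approved a clip.

    Each vote is a (validator_id, is_positive) pair. Repeated positive
    votes from the same validator count only once.
    """
    positive_voters = {voter_id for voter_id, is_positive in votes if is_positive}
    return len(positive_voters) >= REQUIRED_POSITIVE_VALIDATIONS
```

For example, `has_enough_validations([("ana", True), ("ben", True)])` is true, while two positive votes from the same person are not enough.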

Community mobilization

You can help the community by organizing activities and encouraging others to do the same. Use the channels we have at our disposal to engage with other contributors in your language, talk about your ideas to grow the community and collect and validate more voices.

ℹ️ Check a few ideas from the Contribute to Common Voice activity.

⭐️ You can re-use any graphical material we have produced to support the project.

Community support

Help other contributors in our Discourse and Matrix channels by answering their questions about how to use the site or by helping document reported issues on GitHub.

Tooling development

The main development of our site is led by our staff team, but anyone can submit pull requests for open issues or minor UI bugs.

ℹ️ Please read the contribution guidelines before submitting any code.

Dataset releases

The complete text and voice dataset for languages where we have data is currently generated by the Common Voice staff team.

Currently, we generate a new version of the datasets twice a year and publish it on our site.

ℹ️ Note that we ask for an email address to send the dataset link (instead of offering a direct download) because we want a way to contact everyone who downloaded the data in case we get deletion requests from contributors.

We understand that some people might want more frequent releases, and we are working on a more continuous release model to accommodate these needs.

Roles

These are some roles you can take as part of this community.

Channels