Community Playbook

https://voice.mozilla.org

View the Project on GitHub Common-Voice/community-playbook

👥 📕 Mozilla Voice Community Playbook V1.1

Mozilla Voice communities empower the collection of machine-learning based voice technologies – including software, tools, and data – that Mozilla stands behind.

Goals for this playbook

Participation Guidelines

Mozilla Voice communities are governed by Mozilla’s code of conduct and etiquette guidelines. We take these very seriously, and violations are not tolerated.

Please read the Mozilla Community Participation Guidelines before contributing to this project.

For more information on how to report violations of the Community Participation Guidelines, please read our ‘How to Report’ page.

Governance

Mozilla Corporation owns the overall Mozilla Voice project governance and is the ultimate decision maker for its direction and goals. It’s also in charge of the development of some tools and channels described here to support our communities.

The voice communities are self-organized, and you don’t need to ask for permission to participate or mobilize any of these communities in your language. All the data generated by communities is published under open licences.

Community roles exist both formally and informally, and all of them should follow the Mozilla leadership shared agreements.

The Voice communities

Mozilla Voice has a variety of communities that support the project in different important areas. They are usually grouped by language.

The work done by these communities advances a language from having no presence in Mozilla Voice at all to having a functional speech-to-text (STT) model that can understand how people speak.

[Figure: Voice journey quantities]

✈️ The journey involves collecting and validating public domain sentences (1), recording and validating voices reading those sentences (2), repeating these steps to grow the size of the data (3), generating a dataset (4), and using machine learning to train speech-to-text models on this dataset (5).

📝 Text corpus

Gathering, validating and processing public domain sentences.
Purpose - Who we are - Success - How to join - What we do - Roles - Channels

🗣 Voice corpus

Recording and validating voices to create a public domain dataset.
Purpose - Who we are - Success - How to join - What we do - Roles - Channels

🌍 Localization

Adapting the project tools and materials to be understood by a specific audience.
Purpose - Who we are - Success - How to join - What we do - Roles - Channels

🤖 Model training (TBD)

Using our text and voice datasets to train and optimize STT models in specific languages using machine learning.

👥 As you can read in this playbook, you will need a multidisciplinary team of committed people to support your language journey.

🔨 Make sure you check the required skills for each section and look for people who fit them.

💬 Check the channels section to learn how to set up your local forums and chat to communicate with other people in your language.

ℹ️ Note: Mozilla’s focus is to optimize this project, tools and communities for the goals and measures of success described in this document. We welcome small and minority language communities, and we understand these goals may seem out of reach. In that case, feel free to share with us how they are different for you. Nevertheless, we welcome all language communities!

📝 Text corpus

Our purpose

Collect or generate a text corpus under a public domain licence that people can read aloud to make their voice donations.

Who we are

We are a community of text collectors and creators, always looking for sources of text that we can extract and process into short, simple sentences for people to read.

What’s success

Generate as many sentences as possible in our languages. Having more sentences allows contributors to donate more hours of voice data.

⚠️ You will need at least 5000 validated sentences to have your language enabled for voice contributions on our voice collection site.

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat, introduce yourself, and jump into our sentence tools right away.

What we do

Sentence extraction

We have developed a tool to extract sentences from large sources of public domain text, with a focus on easy-to-read corpora and Wikipedia.

This is the easiest and fastest way to get more than a million sentences for your language.

ℹ️ Please read the tool documentation on how to generate specific rules for your language.

⚠️ Important: For legal reasons, Mozilla needs to be the one running the final extraction, so please don’t do any manual processing on the extracted output during your tests. We can apply manual clean-up after the final version is generated by Mozilla.

🔨 Skills required to help: Command line usage and Git, familiarity with regular expressions.
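
To give a feel for the kind of regular-expression filtering involved, here is a minimal Python sketch that keeps only short, simple sentences. It is illustrative only: the function name, the word limit, and the patterns are assumptions made for this example, and they are not the extractor's actual rule format, which is described in the tool documentation.

```python
import re

# Hypothetical limit chosen for illustration; real rules are language-specific.
MAX_WORDS = 14
# Reject sentences containing digits or symbols that are hard to read aloud.
DISALLOWED = re.compile(r"[0-9<>+*#@%^\[\]()/]")
# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_sentences(text):
    """Return the sentences in `text` that pass the illustrative filters."""
    kept = []
    for sentence in SENTENCE_END.split(text.strip()):
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty fragments
        if len(sentence.split()) > MAX_WORDS:
            continue  # too long to read comfortably
        if DISALLOWED.search(sentence):
            continue  # contains digits or symbols
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = "The cat sat on the mat. It cost 5 dollars. Hello there!"
    for line in extract_sentences(sample):
        print(line)
```

In this sketch the second sample sentence is dropped because it contains a digit. Each language will need its own limits and patterns.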

Sentence collection

We have also created a sentence collection tool that allows contributors to collect and validate sentences created by the community. You can also use this tool to import and clean up small-to-medium-sized public domain corpora you have found or collected.

ℹ️ Please read the collector how-to before using this tool and check the community guidelines on how to validate sentences.

🔨 Skills required to help: Strong grammar knowledge of the target language you are contributing to.

Large corpus validation

If you have found an existing public domain corpus larger than 100K sentences, we have a separate process to handle it, since we understand that manual validation using the sentence collector is not ideal.

ℹ️ Please create a new topic on our Discourse, so we can evaluate whether your corpus fits the licence and size requirements to run this process.

🔨 Skills required to help: Expertise in processing and cleaning up text, and linguistic or language expertise to check the quality of the resulting sentences.
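
As a rough guide to choosing between the sentence collector and the large corpus process described above, the following Python sketch routes a corpus by its sentence count. The 100K threshold comes from this playbook; the function name and its return labels are hypothetical and used only for illustration.

```python
# "bigger than 100K sentences", as stated in this playbook
LARGE_CORPUS_THRESHOLD = 100_000


def suggest_process(sentence_count):
    """Suggest which process fits a public domain corpus of the given size."""
    if sentence_count < 0:
        raise ValueError("sentence_count must be non-negative")
    if sentence_count > LARGE_CORPUS_THRESHOLD:
        # Too large for manual validation in the sentence collector.
        return "large corpus validation"
    return "sentence collector"
```

The final decision still rests on the licence and size evaluation done through the forum topic.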

Tooling development

Contributors also develop, maintain and update the sentence extractor and collector code.

Roles

These are some roles you can take as part of this community.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse and Matrix room. If that’s not the case, please create a new topic on Discourse asking for one to be created.

🗣 Voice corpus

Our purpose

Donate and validate our voices under a public domain licence to generate a dataset that speech-to-text technologies can use to train models in different languages, democratizing voice technology.

Who we are

We are a community of voice tech enthusiasts, who want to help collect and generate a large dataset of public domain voices that can be freely used to train Speech to Text technologies.

What’s success

Collect and validate as many voices as possible in our languages. Having more voices validated allows us to then train more advanced STT models.

[Figure: Data quantities]

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat and introduce yourself. Then jump into the Common Voice site, get familiar with it, and start donating your voice.

🔨 You don’t need any specialized skills to contribute to this community. You only need to be able to speak into a microphone or listen to audio clips.

What we do

⚠️ To have a language enabled on our site, you will need at least 5000 validated sentences. See the previous section on the text corpus for reference.

Voice donation

We have developed a site that allows you to donate your voice by reading sentences collected by the community.

Feel free to create an account to track your progress and add more information about your voice to your profile. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language.

ℹ️ Please read the following community guidelines to know how to produce better voice donations.

⚠️ Note: Once you have recorded a decent number of clips in your language (around 300), you will add more value for less effort by helping to get new voices from other people and by focusing on voice validation, which increases the quality of the dataset.

Voice validation

The same site allows you to review other people’s voices by listening to clips donated by the community. Each recording needs at least two positive validations from different people. Feel free to create an account to track your progress, compare with other contributors, set yourself goals, or earn award badges.

ℹ️ Please read the following community guidelines to know how to better validate voices.
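
The validation rule stated above (at least two positive validations from different people) can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name and the `(validator_id, is_positive)` vote format are hypothetical, and the sketch covers only the condition stated in this playbook, not how the site handles negative votes.

```python
def is_validated(votes, required=2):
    """Return True if enough distinct people gave a positive validation.

    `votes` is an iterable of (validator_id, is_positive) pairs.
    """
    # A set ensures repeated votes from the same person count once.
    positive_validators = {
        validator_id for validator_id, is_positive in votes if is_positive
    }
    return len(positive_validators) >= required
```

For example, two positive votes from the same person do not validate a clip, while one positive vote each from two different people does.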

Community mobilization

You can help the community by organizing activities and encouraging others to do the same. Use the channels we have at our disposal to engage with other contributors in your language, talk about your ideas to grow the community and collect and validate more voices.

ℹ️ Check a few ideas from the Contribute to Common Voice activity.

⭐️ You can re-use any graphical material we have produced to support the project.

Community support

Help other contributors in our Discourse and Matrix channels by answering their questions about how to use the site or by helping to document reported issues on GitHub.

Tooling development

The main development of our site is led by our staff team, but anyone can submit pull requests for open issues or minor UI bugs.

ℹ️ Please read the contribution guidelines before submitting any code.

Dataset releases

The complete text and voice dataset for languages where we have data is currently generated by the Common Voice staff team.

Currently, we generate a new version of the datasets twice a year and publish them on our site.

ℹ️ Note that we ask for an email address, to which we send the dataset link (instead of offering a direct download), because we want a way to contact everyone who downloaded the data in case we get deletion requests from contributors.

We understand that some people might want more frequent releases, and we are working on a more continuous release model to accommodate these needs.

Roles

These are some roles you can take as part of this community.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse and Matrix room. If that’s not the case, please create a new topic on Discourse asking for one to be created.

🌍 Localization

Our purpose

Adapting the project tools and material to be understood by a specific audience.

Who we are

We are a community of translators and linguists who localize the original English content into our languages.

🔨 English knowledge and deep understanding of our local language and culture are key for this work.

What’s success

Localize the project tools into our language, mainly the Common Voice site.

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat and introduce yourself. Then jump into our localization tool and check the status of your language.

What we do

Localization

We use Mozilla’s localization tool, Pontoon, to translate the Common Voice strings. Please create an account and check your language in the Common Voice section of Pontoon.

ℹ️ Please read how to use Pontoon before starting to use the tool. You might need to ask the Mozilla localization team for permissions to validate suggestions.

🔨 Skills required to help: English knowledge, strong knowledge of your language.

Roles

These are some roles you can take as part of this community.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse and Matrix room. If that’s not the case, please create a new topic on Discourse asking for one to be created.