Community Playbook

https://commonvoice.mozilla.org

📝 Text corpus

Our purpose

Collect or generate a text corpus under a public domain licence that people can read aloud to make their voice donations. Public domain texts are creative works to which no exclusive intellectual property rights apply.

Who we are

We are a community of text collectors and creators, always looking for sources of text corpora that we can extract and process into short, simple sentences for people to read.

What success looks like

Generate as many sentences as possible in our languages. Having more sentences allows contributors to donate more hours of voice data.

⚠️ You will need at least 5000 validated sentences to have your language enabled for voice contributions on our voice collection site.

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat, introduce yourself and jump into our sentence tools right away.

What we do

Sentence extraction

We have developed a tool to extract sentences from large sources of public domain text, with a focus on easy-to-read corpora and Wikipedia.

This is the easiest and fastest way to get more than a million sentences for your language.

ℹ️ Please read the tool documentation on how to generate specific rules for your language.

⚠️ Important: For legal reasons, Mozilla needs to be the one running the final extraction, so please don’t manually process the extracted output during your tests. Manual clean-up can be applied after Mozilla has generated the final version.

🔨 Skills required to help: Command-line usage, git, and familiarity with regular expressions.
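To give a feel for the kind of regular-expression work involved, here is a minimal Python sketch of language-specific filtering rules. It is only an illustration: the rule names (`min_words`, `max_words`, `disallowed`, `abbreviation`) and the thresholds are hypothetical and are not the sentence extractor's actual configuration format, which is described in the tool documentation.

```python
import re

# Hypothetical rule set for illustration only; these names and values are
# NOT the sentence extractor's real configuration keys.
RULES = {
    "min_words": 3,
    "max_words": 14,
    # Reject digits and symbols that are hard to read aloud.
    "disallowed": re.compile(r"[0-9@#<>{}\[\]|/\\]"),
    # Reject likely abbreviations and acronyms such as "U.S." or "NASA".
    "abbreviation": re.compile(r"\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b"),
}


def is_readable(sentence: str, rules: dict = RULES) -> bool:
    """Return True if the sentence passes every rule in the rule set."""
    words = sentence.split()
    if not rules["min_words"] <= len(words) <= rules["max_words"]:
        return False
    if rules["disallowed"].search(sentence):
        return False
    if rules["abbreviation"].search(sentence):
        return False
    return True


candidates = [
    "The cat sat on the mat.",
    "She was born in 1984 in a small town.",
    "He works for NASA in the summer.",
]
kept = [s for s in candidates if is_readable(s)]
```

Rules like these usually need tuning per language, for example for scripts without upper case or without spaces between words.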

Sentence collection

We have also created a sentence collection tool that lets contributors collect and validate sentences created by the community. You can also use this tool to import and clean up small-to-medium-sized public domain corpora that you have found or collected.

ℹ️ Please read the collector how-to before using this tool and check the community guidelines on how to validate sentences.

🔨 Skills required to help: Strong grammar knowledge of the target language you are contributing to.

Bulk submission

If you know of a public domain corpus with more than 100k sentences, you can submit a pull request to add it as a bulk dataset. However, you will need to manually perform QA (quality assurance) to make sure the sentences are valid and high-quality.

This Discourse post has a more detailed guide for how to do manual QA, but in brief:

We’re looking for an error rate below 5% on a random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review.

Feel free to set up this QA however makes most sense for you, but here’s a sample Google Spreadsheets template.

Once the review is complete, submit a pull request with the number of sentences submitted, a link to the manual QA results, and the percentage error rate. Here’s an example PR.
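The QA arithmetic above can be sketched in Python. This is a rough illustration, assuming the standard sample-size formula for a proportion (Cochran's formula with a finite population correction and the most conservative proportion, p = 0.5); whether the linked tool uses exactly this formula is an assumption, and the function names are made up for this example.

```python
import math
import random
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.99,
                margin: float = 0.02, p: float = 0.5) -> int:
    """Number of sentences to review, with finite population correction."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z * z * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def draw_sample(sentences: list, size: int, seed: int = 0) -> list:
    """Pick a reproducible random sample of sentences to review."""
    return random.Random(seed).sample(sentences, size)


def error_rate(invalid: int, reviewed: int) -> float:
    """Percentage of reviewed sentences that were marked invalid."""
    return 100 * invalid / reviewed


n = sample_size(100_000)     # roughly four thousand sentences
rate = error_rate(40, 1000)  # 4.0, which is under the 5% target
```

Fixing the random seed makes the sample reproducible, so reviewers and maintainers can check the same set of sentences.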

Flagging problematic sentences

If you notice sentences that need to be deleted, first check where they come from. If the file containing the sentence is called sentence-collector.txt, it was automatically exported from Sentence Collector. In that case, please file an issue on the Sentence Collector repo and attach a plaintext file listing all of the problem sentences, one sentence per line.

If the sentence is from a different source, you can file a pull request that modifies the text file directly. If possible, also attach a separate plaintext file that has all of the problem sentences, with one sentence per line.
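For sentences that do not come from sentence-collector.txt, the direct edit can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name and file paths are hypothetical, both files are assumed to be UTF-8 plaintext with one sentence per line, and matching is done on the exact sentence text after trimming whitespace.

```python
from pathlib import Path


def remove_flagged(source: Path, flagged: Path, cleaned: Path) -> int:
    """Copy `source` to `cleaned`, skipping sentences listed in `flagged`.

    Returns the number of lines that were removed.
    """
    problem = {
        line.strip()
        for line in flagged.read_text(encoding="utf-8").splitlines()
        if line.strip()
    }
    kept, removed = [], 0
    for line in source.read_text(encoding="utf-8").splitlines():
        if line.strip() in problem:
            removed += 1
        else:
            kept.append(line)
    # Keep the one-sentence-per-line layout, ending with a newline.
    cleaned.write_text("".join(s + "\n" for s in kept), encoding="utf-8")
    return removed
```

The `flagged` file doubles as the separate plaintext list of problem sentences that you attach to the pull request.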

Automatic Translation

Some language communities have automatically translated CC0-licensed corpora from high-resource languages and adapted the material to the cultural context of their own language. After translating the texts, they reviewed the sentences through the Sentence Collector. The Common Voice Team doesn’t have a formal position on automatic translation for sentence collection.

Tooling development

Contributors also develop, maintain and update the sentence extractor and collector code.

Collaborating with publishers of copyrighted works

We build relationships with organisations or individuals who would be happy to donate their text under a public domain licence.

⚠️ Important: For legal reasons, Mozilla will have to confirm the agreement with the copyright owner. This process is outlined on the cc0waiver_process page.

Roles

These are some roles you can take as part of this community.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse category and Matrix room. If there isn’t one yet, please create a new topic on Discourse asking for one to be created.