📝 Text Corpus

Contributors and collaborators help develop the text corpus from original and new sources that are licensed under Creative Commons Zero (CC0).

You can use a variety of methods such as;

⚠️ Mozilla Common Voice datasets are released under a CC0 “No Rights Reserved” license and are part of the public domain. This means that works subject to copyright cannot be added to Common Voice datasets. However, some copyright owners are willing to make a CC0 waiver, dedicating their work to the public domain so that it can be contributed to Common Voice.

Why is sentence collection important?

Currently, Common Voice requires voice donations to be tied to sentences, so sourcing more sentences lets people donate more hours of voice data. Sentence Collection Bands were introduced to support the entry level for the voice collection stage.

What should I consider when contributing sentences?

It’s also important to ensure sentences are readable to speakers across all backgrounds.

Sentence Diversity

Phoneme, variant and domain diversity are crucial in ensuring that the dataset captures the vastness of language. For example, some languages have grammatical gender: in Spanish, "abogado" and "abogada" mean male lawyer and female lawyer respectively.

All sentences in the dataset can be viewed on the Common Voice GitHub. If you notice a gap regarding sources or types of content, we encourage you to add more sentences to help diversify the text corpus.

⚠️ As part of the Common Voice 2022 Product Roadmap, we are scoping and delivering a domain-specific text corpus on the platform.

Community Participation Guidelines (CPG)

It’s important that everyone and every language can have an enjoyable experience contributing to Common Voice. Sentences that include harmful content or violate the CPG will be reviewed and subsequently deleted.

Skills Needed

Sentence extraction 🔨 Skills required to help: Command line usage and git, familiarity with regular expressions.

Sentence collection 🔨 Skills required to help: Strong grammar knowledge of the target language you are contributing to.

Large corpus validation 🔨 Skills required to help: Expertise in processing and cleaning up text, linguistics/language expertise to check the quality of the resulting sentences.

Tooling development

Contributors also develop, maintain and update the sentence extractor and collector code.
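
To give a feel for the regular-expression and text-cleanup skills listed above, here is a minimal Python sketch of splitting raw text into candidate sentences and filtering out ones that are likely hard to read aloud. It is not the Common Voice sentence extractor. The function names, word-count thresholds and disallowed-character list are all hypothetical and chosen only for illustration, and real rules would need to be tuned per language.

```python
import re

# Hypothetical thresholds for illustration only; real limits vary per language.
MIN_WORDS = 3
MAX_WORDS = 14

# Naive boundary: whitespace that follows ".", "!" or "?".
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# Assumed rule: reject digits and symbols that speakers may read inconsistently.
DISALLOWED = re.compile(r"[0-9@#$%^&*_+=<>/\\|~\[\]{}]")


def split_sentences(text):
    """Collapse whitespace, then split the text into candidate sentences."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    if not collapsed:
        return []
    return [s.strip() for s in SENTENCE_BOUNDARY.split(collapsed) if s.strip()]


def is_readable(sentence):
    """Keep sentences of moderate length that contain no disallowed characters."""
    word_count = len(sentence.split())
    if not MIN_WORDS <= word_count <= MAX_WORDS:
        return False
    return DISALLOWED.search(sentence) is None


def extract_sentences(text):
    """Return readable, de-duplicated sentences in their original order."""
    seen = set()
    kept = []
    for sentence in split_sentences(text):
        key = sentence.lower()
        if is_readable(sentence) and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = "The cat sat on the mat. It cost 25 dollars! Is the river far from here?"
    for line in extract_sentences(sample):
        print(line)
```

A filter like this only removes obvious problems. The output still needs review by someone with knowledge of the target language, and the source text must be CC0 before any sentence is contributed.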