Hey HN, the Common Crawl Foundation is trying to expand the coverage
of our crawl to more languages, regions and cultures, and if you speak
a language other than English (LOTE), you can help in two ways:
- Validate Language Identification (LangID or LID) data: https://dynabench.org/tasks/text-language-identification
- Contribute URLs for our seed crawl: https://github.com/commoncrawl/web-languages
We're also organizing a Workshop on Multilingual Data Quality Signals
(WMDQS) with MLCommons and EleutherAI where we have a call for papers
open (https://wmdqs.org/cfp/) and an upcoming shared task on language
identification (https://wmdqs.org/shared-task/).