Mozilla Data Collective seeks to build AI's data economy around trust
Key Takeaways
- Generative artificial intelligence has a data problem. The result has been increasingly powerful technology, but also growing concerns about bias, consent, ownership and the uneven distribution of value created from the world's information.
- "It's a big, structural problem, and it requires a structural solution." Datasets assembled through indiscriminate web scraping often reproduce the same limitations and biases found online, Lewis-Jong said.
- Common Voice demonstrated that people are willing to contribute data when they believe their contributions are meaningful and they have a voice in how the project is governed. More than half a million contributors have participated across hundreds of languages, helping create one of the world's largest publicly available voice datasets.
- In the collective's model, sovereignty doesn't necessarily mean restricting access. Instead, it gives communities the ability to decide for themselves how their data will be used.
- Many of these resources would be difficult or impossible to find through conventional commercial data channels.
Stats & Key Facts
- Today, the collective hosts hundreds of curated datasets representing more than 300 languages.

By Paul Gillin, SiliconANGLE. Updated 15:23 EDT, June 13, 2026.
Generative artificial intelligence has a data problem. For years, the typical approach to building gen AI models has been to gather as much data as possible by scraping vast swaths of the internet, training at enormous scale and dealing with the consequences later. The result has been increasingly powerful technology, but also growing concerns about bias, consent, ownership and the uneven distribution of value created from the world's information.
Mozilla Data Collective was created to fill the gaps in this model. The organization, which launched last November, is attempting to create a different kind of marketplace for AI data built around community ownership, consent and what founder and Chief Executive E. Lewis-Jong calls "fair value exchange."
"We need clean, abundant, contextualized, consentful datasets to build AI models worth having," Lewis-Jong said in a recent e-mail interview. "It's a big, structural problem, and it requires a structural solution." Datasets assembled through indiscriminate web scraping often reproduce the same limitations and biases found online, Lewis-Jong said.
For more details please read the original article at SiliconANGLE AI.