In 2023, researchers collected 45 terabytes of text to train a major language model. Less than 0.1 percent of it came from the world’s roughly 3,000 minority languages. While the model’s proficiency increased dramatically, numerous linguistic communities remained barely represented.
“For speaking passes away and is forgotten, whereas writing remains.” — Herodotus
When ancient Athenians transitioned from oral storytelling to written texts, they preserved knowledge in a lasting and portable format. Today, a similar process unfolds digitally, as vast datasets feed into powerful algorithms. Societies contributing significantly to these datasets see their perspectives reflected; those without substantial written corpora risk being underrepresented in emerging technologies. Herodotus's observation from around 450 BCE remains relevant: oral traditions fade quickly, but writing preserves ideas and culture.
Currently, a gap exists between spoken languages and digital representation. UNESCO estimates there are about 7,000 languages globally, yet fewer than 350 have dedicated Wikipedia editions, and only 76 include more than 10,000 articles [UNESCO 2024, World Languages Report]. Research from the Allen Institute indicates that English alone accounts for over 60 percent of texts used to train major language models [Dodge 2024, Allen AI]. Consequently, many AI-driven systems—from policy recommendations to automated customer support—primarily reflect the linguistic and cultural contexts of widely documented languages.
The implications of limited digital representation are tangible for many communities. Aisha Abdullahi, a 27-year-old teacher in northern Nigeria, compiles Hausa folktales via WhatsApp because written Hausa content online is scarce. “If something isn’t online, people assume it doesn’t exist,” Abdullahi explained, highlighting how digital absence can obscure cultural knowledge. Her grassroots archive contrasts sharply with commercial models trained primarily on formal government texts, whose outputs are poorly suited to everyday contexts.
Training data significantly influences AI systems, often unintentionally privileging certain narratives. When most digitized Hausa texts are official government documents, summarization and translation models disproportionately reflect bureaucratic language, sidelining informal speech. Sociolinguists have documented the pattern: AI-powered content moderation systems incorrectly label Indigenous protest speech as “toxic” at significantly higher rates than comparable posts in English [Bender et al. 2021, Stochastic Parrots]. Such biases can subtly disadvantage certain populations in areas like employment, finance, and online moderation.
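The disparity auditors look for in cases like this is a gap in false-positive rates: how often benign posts are flagged as toxic, broken down by language. A minimal sketch of that calculation follows; the post records below are invented toy data, not real moderation outputs.

```python
# Toy audit: compare a toxicity classifier's false-positive rate across
# languages. Each record is (language, model_flagged_toxic, actually_toxic).
from collections import defaultdict

posts = [
    ("en", False, False), ("en", False, False), ("en", True, False),
    ("en", True, True),
    ("ha", True, False), ("ha", True, False), ("ha", False, False),
    ("ha", True, True),
]

def false_positive_rate(records):
    """Share of genuinely benign posts that the model flagged as toxic."""
    benign_flags = [flagged for _, flagged, toxic in records if not toxic]
    return sum(benign_flags) / len(benign_flags) if benign_flags else 0.0

by_lang = defaultdict(list)
for lang, flagged, toxic in posts:
    by_lang[lang].append((lang, flagged, toxic))

for lang, records in sorted(by_lang.items()):
    print(lang, round(false_positive_rate(records), 2))
```

On this toy data the Hausa false-positive rate is double the English one; a real audit would use held-out labeled posts per language and test whether the gap is statistically significant.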
Speech-to-text technologies offer a partial solution. Initiatives such as Meta’s Massively Multilingual Speech project, which has compiled audio content from approximately 4,000 languages, represent meaningful progress [Pratap 2023, Meta AI]. However, these projects predominantly rely on structured audio sources—such as news broadcasts or religious texts—leaving everyday conversational speech underrepresented. Furthermore, without parallel written datasets, important applications such as legal documentation, medical advice, or educational materials still face significant limitations.
Targeted initiatives can nonetheless improve digital inclusion. Community-driven projects—from Māori dictionaries in New Zealand to Sámi-language websites in Northern Europe—have demonstrated that modest funding can significantly enhance the volume of available written text. Government policies can support these efforts by mandating that publicly funded documents be published digitally in minority languages under open licenses. Technology companies can also proactively address these gaps by deliberately incorporating underrepresented languages into their training datasets. Researchers at Google Brain recently achieved a 43 percent improvement in Swahili-language recognition accuracy by implementing targeted training adjustments [Li 2025, Google Brain].
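One standard way to incorporate underrepresented languages without collecting new data is temperature-based sampling over the training mix: each language's corpus share is raised to a power below one, flattening the distribution so low-resource languages are seen more often during training. The sketch below illustrates the idea; the corpus sizes and function name are assumptions for demonstration, not figures from any cited project.

```python
# Sketch of temperature-based language sampling for a multilingual corpus.
# Corpus sizes are illustrative placeholders, not real dataset statistics.
corpus_sizes = {"en": 1_000_000, "sw": 20_000, "ha": 5_000}  # docs per language

def sampling_weights(sizes, temperature=0.3):
    """Raise each language's corpus share to `temperature` and renormalize.

    temperature = 1.0 reproduces proportional sampling; values below 1.0
    flatten the distribution, upweighting low-resource languages.
    """
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** temperature for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

raw = sampling_weights(corpus_sizes, temperature=1.0)   # proportional to size
flat = sampling_weights(corpus_sizes, temperature=0.3)  # low-resource boosted
print({k: round(v, 3) for k, v in raw.items()})
print({k: round(v, 3) for k, v in flat.items()})
```

With these toy numbers, Hausa's sampling probability rises from about half a percent to over ten percent, while English still dominates; the temperature is a tunable trade-off between coverage and fidelity to the natural distribution.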
Ultimately, addressing these challenges requires coordinated efforts across communities, governments, and technology companies. Ensuring linguistic diversity in AI systems depends largely on the availability of comprehensive, representative data. Expanding digital resources and improving dataset inclusivity will help AI models better reflect and serve the diverse societies they intend to support.