What needs to be done?

Tasks

Cultural and Linguistic Influence: The impact of LLMs on cultural and linguistic identity should be recognized and examined thoroughly, because these models have the potential to shape and even define that identity.

Control and Ownership: Control over the corpora, which ultimately determine the behavior of LLMs, is concentrated in the hands of private entities. The corpora of linguistic and cultural communities should instead be created, curated and maintained by the communities themselves and published under appropriate licenses.

Bias and Representation: Curating training data involves review, filtration, and pre-processing to minimize bias. Communities should take responsibility for bias and representation in their respective corpora.

Approach

Communities need to take control of their cultural and linguistic identity by

  1. creating, curating and maintaining their own corpora, and
  2. applying reinforcement learning from human feedback

Case Study

Iceland provides an interesting case study, described in “How Iceland is using GPT-4 to preserve its language”. The corpus contains 300,000 Icelandic language examples, but the corpus alone was not sufficient to produce grammatically correct Icelandic. A team of 40 volunteers trained GPT-4 on proper Icelandic grammar and cultural knowledge using Reinforcement Learning from Human Feedback.

Governance

Communities need to give a mandate to a group of qualified people to manage their corpus. The curation process of that group must be transparent.

Technical Considerations

A global corpus could be a federated system of community corpora that are stored and processed in a decentralized way. Progress has recently been made on decentralized training, as reported in a paper by Binhang Yuan et al.
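
As a rough illustration of the federated idea, the sketch below models a global index that stores only metadata about each community corpus, while the texts themselves remain with the community that owns them. All names here (CorpusManifest, FederatedIndex, the field names, the example communities and URLs) are assumptions made for illustration; they do not describe an existing system or the approach of the paper mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CorpusManifest:
    """Metadata a community publishes about its corpus (hypothetical fields)."""
    community: str      # name of the community that owns the corpus
    language: str       # language code chosen by the community
    license: str        # license under which the corpus is shared
    endpoint: str       # location where the community hosts its own data
    num_documents: int  # size of the corpus as reported by the community


@dataclass
class FederatedIndex:
    """Global index holding manifests only; texts stay with each community."""
    manifests: dict[str, CorpusManifest] = field(default_factory=dict)

    def register(self, manifest: CorpusManifest) -> None:
        # A later registration by the same community replaces the earlier one.
        self.manifests[manifest.community] = manifest

    def by_language(self, language: str) -> list[CorpusManifest]:
        return [m for m in self.manifests.values() if m.language == language]

    def total_documents(self) -> int:
        return sum(m.num_documents for m in self.manifests.values())


# Placeholder data, not real corpora.
index = FederatedIndex()
index.register(CorpusManifest(
    "community-a", "xx", "CC-BY-4.0", "https://corpus-a.example.org", 1000))
index.register(CorpusManifest(
    "community-b", "yy", "CC-BY-SA-4.0", "https://corpus-b.example.org", 500))

print(index.total_documents())
print([m.community for m in index.by_language("xx")])
```

Keeping only manifests in the shared index is one possible design choice: each community retains control over storage, licensing and access, and a training process would have to contact each endpoint under the terms the community has set.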

Contributions

The institute for 

  1. Help raise awareness, specifically in non-English communities
  2. Provide conceptual and technical guidance to linguistic and cultural communities