Situation
The European AI Act lays the groundwork by requiring disclosure of summaries of the copyrighted data used for training.
The EU has also sponsored initiatives towards achieving full digital language equality in Europe by 2030, specifically the European Language Equality (ELE) and European Language Grid (ELG) projects. As of June 2023, the Language Data Space is being built.
The Icelandic linguistic and cultural community set the example for self-sovereignty in partnership with OpenAI and is an excellent case study to follow.
International publishers, like Germany’s Axel Springer, are prepared to license their archives as training data and are currently negotiating deals with LLM service providers, the Financial Times reported.
Tasks
Five tasks follow: recognize and study the issue, take control, take responsibility, create governance models, and build the enabling technology.
1. Cultural and Linguistic Influence: The impact of LLMs on cultural and linguistic identity should be recognized and examined thoroughly, as these models have the potential to shape and define these aspects.
2. Control and Ownership: Control over the corpora and the supplementary reinforcement learning from human feedback, which together determine the behavior of LLMs, is concentrated in the hands of private entities. The corpora of linguistic and cultural communities should instead be created, curated and maintained by the communities themselves. The corpora should be published under appropriate licenses, with fair compensation to copyright holders.
3. Bias and Representation: The process of curating training data involves review, filtration, and pre-processing to minimize bias. Additionally, reinforcement learning from human feedback might be necessary. Communities should share or own the responsibility for bias and representation in their respective corpora and human feedback.
4. Governance: Communities need to give a mandate to a group of qualified people to manage their corpus and apply a suitable governance process. The curation process must be transparent.
5. Technical considerations: Corpora should form a federated system, with the data sets held in decentralized storage, for example using Filecoin / IPFS. Progress has recently been made with decentralized training, as reported in a paper by Binhang Yuan et al.
Web3 technology is suitable for implementing governance structures and processes, and for building a smart-contract-based licensing and payment platform.
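To make the idea of a federated, community-governed corpus more concrete, the following is a minimal sketch of a manifest whose entries are identified by a hash of their content. It rests on several assumptions: the `CorpusEntry` schema, its field names and the helper functions are hypothetical, and a plain SHA-256 digest stands in for a real IPFS content identifier. It does not call any Filecoin, IPFS or smart contract API. It only illustrates why content-derived identifiers let anyone verify a data set fetched from an untrusted storage node against the manifest the community has approved.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CorpusEntry:
    """One data set in a community corpus manifest (hypothetical schema)."""
    community: str    # linguistic or cultural community curating the data
    title: str        # human-readable name of the data set
    license_id: str   # license chosen by the community's mandated group
    content_id: str   # SHA-256 hex digest; a stand-in for an IPFS CID


def content_id(data: bytes) -> str:
    """Derive the identifier from the content itself."""
    return hashlib.sha256(data).hexdigest()


def make_entry(community: str, title: str, license_id: str, data: bytes) -> CorpusEntry:
    """Create a manifest entry for a data set."""
    return CorpusEntry(community, title, license_id, content_id(data))


def verify(entry: CorpusEntry, data: bytes) -> bool:
    """Check that data fetched from any storage node matches the entry."""
    return content_id(data) == entry.content_id


def manifest_json(entries: list[CorpusEntry]) -> str:
    """Serialize the manifest with sorted keys so the output is stable."""
    return json.dumps([asdict(e) for e in entries], sort_keys=True, ensure_ascii=False)


if __name__ == "__main__":
    text = "Halló heimur".encode("utf-8")
    entry = make_entry("Icelandic", "Sample texts", "CC-BY-4.0", text)
    print(verify(entry, text))          # unmodified copy
    print(verify(entry, text + b"!"))   # tampered copy
    print(manifest_json([entry]))
```

In a design along these lines, the manifest is the object the governance group curates and publishes, while the data itself can live on any number of storage nodes. Licensing and payment terms would be attached to manifest entries rather than to storage locations.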
Contributions
You can support this effort by:
- Helping to raise awareness
- Supporting conceptual and technical guidance for linguistic and cultural communities
- Joining an open-source project for the necessary tech infrastructure