Problem

LLMs’ influence on language and culture

The influence of artificial intelligence, particularly Large Language Models (LLMs) such as ChatGPT, on the cultural and linguistic identity of communities raises critical concerns. The content, style, and language of LLM responses have the potential to establish a dominant cultural and linguistic benchmark.

Control of LLMs

Control over LLMs lies with those who curate and manage the training data. This corpus, drawn from sources such as books, publications, social media posts, and websites, is reviewed, filtered, and pre-processed to improve data quality and reduce bias.

In practice, however, this training data is predominantly created and maintained by private companies such as OpenAI, Google, Baidu, and Alibaba. While Google’s C4 (Colossal Clean Crawled Corpus) is publicly accessible, OpenAI’s training data remains proprietary, which limits transparency and public participation.

Copyright and licenses

Rich corpora exist, but cannot be used for training AI models because the necessary licenses are lacking.