Issue 1, Article 5

How Centralized LLMs Prove the Case for Decentralized AI

Tom Trowbridge

Big Tech has raced to roll out conversational AI models since the launch of ChatGPT in late 2022, but the companies building those models seem unable to resist manipulating them to reflect their corporate culture or to advance a particular political or ideological agenda. Because the models are closed-source black boxes, the training data and the model mechanics are hidden, giving users no insight into how responses are generated. The alternative is open, transparent models, trained and run on decentralized systems, that will be more trusted than the closed, corporate models we see today.

Bias in centralized LLMs

Even before the launch of ChatGPT, critics warned about the dangers of bias in closed systems. These warnings often came from progressive critics of AI: those who said large language models (LLMs) were just “stochastic parrots” also warned that they “overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations.” Ironically, some of the strongest reactions to ChatGPT’s biases came from the other side of America’s political divide. Users quickly noticed that the model could discuss Russian interference in the 2020 election but would not respond when queried about Hunter Biden's laptop, a story that was also widely reported. Research has supported the allegation of bias: “We find robust evidence that ChatGPT presents a significant and systematic political bias toward the Democrats in the US, Lula in Brazil, and the Labour Party in the UK,” noted one study. Given the human role in constructing models, some bias is unavoidable, but when models are trained opaquely and then marketed as ‘neutral,’ users can unknowingly be subject to the biases of the training data or of the trainers, biases they are unable to inspect.

And the bias can go beyond the data inputs. In early 2024, Google Gemini’s image creator received such scathing reviews that it was quickly ‘paused’ for ‘updates.’ In its quest to avoid offending what it saw as mainstream political and social sensibilities, Google forced its model to insert diversity into nearly all images, producing results that were preposterously inaccurate, such as African and Asian Nazis and a diverse group of American founding fathers. Not only were these images wildly inaccurate, they were also offensive. Most importantly, they lifted the veil on the hidden manipulation risks inherent in proprietary, closed AI models developed and run by corporations.

How do the models work?

All models are subject to the biases of their creators, but image prompts to Google’s Gemini model are also run through an additional set of rules designed to steer output toward what Google believes are acceptable or desirable answers, such as increasing diversity. These rules may be well-intentioned, but they are hidden from users. With Gemini, the diversity rules were so obvious and so clumsy that the output quickly became the subject of global ridicule, as users vied to generate the most absurd result. Because image requests pass through the same underlying model and the same kind of hidden rule layer as text requests, we can assume that similar biases, and likely similar rules, underlie every answer. With images, the bias was plain for everyone to see; with text responses, given the closed nature of these models, the same manipulation is much harder to discern.
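To make that mechanism concrete, here is a minimal sketch of how a hidden prompt-rewrite layer of this kind might work. The rule text and function names are illustrative assumptions, not Gemini’s actual implementation.

```python
# Illustrative only: a toy prompt-rewrite layer of the kind described above.
# The hidden rule and the function names are hypothetical, not Gemini's code.

HIDDEN_RULES = [
    # Silently appended to the user's request before it reaches the model.
    "depict a diverse range of ethnicities and genders",
]


def rewrite_prompt(user_prompt: str) -> str:
    """Return the prompt the model actually sees, hidden rules included."""
    return user_prompt + ", " + ", ".join(HIDDEN_RULES)


def generate_image(user_prompt: str) -> str:
    # Stand-in for a real image-model call; it returns the rewritten prompt
    # so the hidden transformation stays visible.
    final_prompt = rewrite_prompt(user_prompt)
    return f"[image generated from: {final_prompt}]"


if __name__ == "__main__":
    # The user never sees that their request was modified before generation.
    print(generate_image("a portrait of an 18th-century European monarch"))
```

The point of the sketch is that the rewriting happens between the user and the model, so nothing in the visible interface reveals that it occurred.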

Open, transparent AI is the answer

For LLMs to be widely trusted, they need to be built on a transparent foundation that is openly inspectable and free from opaque biases, rather than trained and tuned behind closed doors by corporations. That is only possible with open-source models that are provably trained on specific data sets. A number of open-source projects, such as Hugging Face, which has raised $400 million, are making great progress in building, developing, and training these open models. Such models are available for anyone to inspect, and they can run on a decentralized network of computers, like the Fluence platform, that proves each result was executed against the model without manipulation. Highly resilient decentralized networks already exist for payments and storage, and GPU marketplaces such as Aethir, Akash, Gensyn, and Io.net are being optimized to train and even run AI models. Decentralized networks are necessary because they operate globally, on a wide range of infrastructure with no single owner, making them very hard to pressure or shut down. This quickly growing ecosystem includes GPU marketplaces for training and running models, platforms like Filecoin for storing the data, CPU platforms like Fluence for running models with provability, and open tooling for developing the models. With this infrastructure, open models will be a powerful force.
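As one illustration of what “provably trained on specific data sets” could mean in practice, the sketch below builds a manifest of content hashes for a training corpus, so anyone can later check that a model’s claimed training data matches the bytes actually used. The directory name is a hypothetical placeholder, and this is a simplified illustration rather than any specific project’s mechanism.

```python
# Illustrative sketch of dataset provenance: publish a manifest of content
# hashes for the training files so third parties can verify, byte for byte,
# the data a model claims to have been trained on. The directory name is a
# hypothetical placeholder.

import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, streamed in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: str) -> dict:
    """Map each training file to the hash of its contents."""
    return {
        str(path): file_digest(path)
        for path in sorted(Path(data_dir).rglob("*"))
        if path.is_file()
    }


if __name__ == "__main__":
    manifest = build_manifest("training_data")  # hypothetical corpus directory
    # Publishing this JSON (for example, on a decentralized storage network)
    # lets anyone verify the training inputs against the published hashes.
    print(json.dumps(manifest, indent=2))
```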

Is this realistic?

Google and Microsoft have spent billions of dollars developing their LLMs, which looks like an insurmountable lead, but we have seen these huge companies outcompeted before. Linux overcame Windows’ decade-long head start and Microsoft’s billions of dollars to become the leading operating system. The open-source community worked together to build Linux, and we can expect a similar level of success in building and training open-source LLMs, especially if there is a common platform that facilitates development. One near-term strategy is not to compete head-to-head with monolithic LLMs like ChatGPT, but to build smaller, domain-specific models with unique data sets that become more trusted within their particular fields. For example, we could see a children’s oncology model with exclusive use of data from top children’s hospitals, and a single frontend could pull from a wide range of these domain-specific models, replicating the ChatGPT experience but on a transparent and trusted foundation.
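A toy sketch of that aggregation idea follows: a single frontend routes each question to the most relevant domain-specific model. The model functions and the keyword-based routing rule are hypothetical stand-ins, not real services.

```python
# A toy frontend that routes each question to a domain-specific model.
# The model functions and the keyword routing rule are hypothetical stand-ins.

from typing import Callable, Dict


def oncology_model(question: str) -> str:
    return f"[children's-oncology model answers: {question}]"


def tax_model(question: str) -> str:
    return f"[tax-law model answers: {question}]"


def general_model(question: str) -> str:
    return f"[general open model answers: {question}]"


# Domain keyword -> specialized model trained on that domain's data.
ROUTES: Dict[str, Callable[[str], str]] = {
    "oncology": oncology_model,
    "tax": tax_model,
}


def answer(question: str) -> str:
    """Send the question to the most specific model whose domain it mentions."""
    lowered = question.lower()
    for keyword, model in ROUTES.items():
        if keyword in lowered:
            return model(question)
    return general_model(question)


if __name__ == "__main__":
    print(answer("What are current treatment protocols in children's oncology?"))
    print(answer("Summarize the history of the printing press."))
```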

Model aggregation is a viable path to a trusted alternative to corporate LLMs, but running the model in a way that is verifiable is just as important as building and training it. No matter the inputs, scrutiny falls on the outputs, and any organization running a model will be subject to pressure. Companies are subject to influence from politicians, regulators, shareholders, employees, and the general public, as well as armies of Twitter bots. But a model hosted by any storage provider anywhere in the world and run on an open, decentralized compute network like Fluence, which can process auditable queries, is immune from both hidden bias and censorship, and will be far more trustworthy.
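As a rough illustration of what an auditable query could look like, the sketch below commits to the exact model, prompt, and response with cryptographic hashes, so a third party can later verify that a published answer came from the stated model. The field names are assumptions made for illustration, not Fluence’s actual interface.

```python
# A minimal sketch of an auditable query record: hash the model weights, the
# prompt, and the response together so a third party can later verify that a
# published answer was produced against the stated model. Field names are
# illustrative, not any platform's actual interface.

import hashlib
import json
import time


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def audit_record(model_weights: bytes, prompt: str, response: str) -> dict:
    """Bundle a query and response with commitments to the exact model used."""
    return {
        "model_hash": sha256_hex(model_weights),
        "prompt_hash": sha256_hex(prompt.encode("utf-8")),
        "response_hash": sha256_hex(response.encode("utf-8")),
        "timestamp": int(time.time()),
    }


if __name__ == "__main__":
    record = audit_record(
        model_weights=b"...model weight bytes...",  # placeholder, not real weights
        prompt="Who won the 2020 US presidential election?",
        response="Joe Biden won the 2020 US presidential election.",
    )
    # Publishing this record on a decentralized network makes the query
    # auditable: the same model and inputs must reproduce the same hashes.
    print(json.dumps(record, indent=2))
```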

Big Tech is aware of its bias problem, but it will have a very hard time supporting models that give answers unpopular with its employees, governments, and customer constituencies, even when those answers are accurate. OpenAI will take steps to reduce the most obvious bias, and Google will update Gemini to be more historically accurate, but hidden bias will remain in both. We should treat this revelation of Big Tech’s manipulation as a welcome warning about the risks of relying on any centralized company to develop and run AI models, no matter how well-intentioned. This is our call to build open, transparent, and decentralized AI systems we can trust.