Amnesty International Reveals Widespread Data Scraping Fuels Generative AI Systems, Endangers Privacy and Environment

Companies are extracting vast troves of online data through unlawful web scraping to build their generative artificial intelligence (AI) products, Amnesty International has warned in a new briefing today. This practice is enabling a mass invasion of privacy, making these systems unlawful by design.

“Companies across the world are supplying generative AI products under the veneer of efficiency and sophistication, but in reality, these systems perpetuate mass invasions of privacy through unlawful web scraping,” said Likhita Banerji, Head of the Algorithmic Accountability Lab at Amnesty International. “The extractive data pipeline, inherent design choices made by tech companies, and exploitative supply chains have enabled a paradigm of technology development that opens up a risk of mass abuse of human rights.”

Amnesty International’s report, titled “Unlawful by Design: Exposing the Human Rights Costs of Generative AI,” documents serious risks in large-scale data scraping and processing being used to build these systems. The briefing highlights privacy violations and adverse environmental impacts, particularly on historically marginalized communities.

The organization researched models powering some of the most popular standalone generative AI tools, including GPT 3 by OpenAI, Google’s Gemini, Meta’s Llama, DeepSeek, and tools from Midjourney and Stable Diffusion. These systems rely on extracting information from billions of public online posts and images, often without explicit consent from the individuals who appear in them or created them.

“This infringes on privacy by design,” Banerji explained. “As datasets powering AI models scale up, the presence of hateful and discriminatory content in their outputs also gets amplified, along with negative stereotypes and prejudices, especially along racial and gendered lines.”

The report emphasizes that these choices are not inevitable. Banerji urged companies to challenge their reliance on non-consensual training data extracted on a grand scale.

Racial, gender, and cultural biases are consistent features of generative AI systems, reflecting the real-world biases present on the web. This poses risks to the right to freedom of thought, as these systems can influence users’ thoughts and shape personal beliefs through predictive suggestions.

The report also highlights the environmental costs of producing generative AI. Larger models have higher processing needs, which require more energy-intensive chips and larger data centers, and consequently more energy and water to operate. Generative AI production often harms communities already affected by droughts and electricity shortages.

Google’s own sustainability report from 2024 noted a 48% increase in the company’s greenhouse gas emissions since 2019, attributable to data center and supply chain emissions. Similarly, Microsoft’s emissions increased by 29% between 2020 and 2024.

Amnesty International wrote to Google, OpenAI, Meta, Stability AI, Midjourney, and DeepSeek, giving them an opportunity to respond to the findings of the research briefing. The organization also wrote to Intel and VMware regarding risks of discrimination, and to Google, Microsoft, and Amazon about environmental harms associated with their generative AI systems.

At the time of publication, only Microsoft, Amazon, Intel, OpenAI, and Meta had responded to Amnesty International. A summary of their responses is included in the briefing.

Amnesty International is calling on states to prohibit standalone generative AI systems that have been built using unlawful web scraping, defined as the bulk and mass collection of training data through the web. Companies must immediately cease the practice of unlawful non-consensual web scraping for AI training purposes, and states must hold companies accountable for their involvement in any human rights abuses linked to their design and business choices.

The briefing provides a detailed human rights analysis of the ‘data pipeline’ that powers generative AI products, including stages of data capture, analysis, and processing. This involves examining parameters and implications of design choices related to training data of generative AI models, with a focus on methods and sources of data collection, data processing, model scaling, and data outputs.

Amnesty International defines standalone generative AI tools as products that are developed, deployed, and marketed solely and specifically for their generative AI capabilities. This does not include products where generative AI is an added feature or function in a larger suite of products, such as word processing software with optional generative AI features.

Source: https://www.amnesty.org/en/latest/news/2026/05/global-enormous-data-pipelines-powering-major-generative-ai-systems-are-rooted-in-mass-invasions-of-privacy-by-design/
