US artificial intelligence (AI) developers are sounding the alarm about “industrial scale” distillation attacks by Chinese labs looking to exfiltrate a variety of data from their models, but those same firms have also been widely accused of using others’ data without permission to train the models in the first place.
Distillation techniques are a common method for training AI, whereby small models are trained on the outputs of larger, more advanced models in an effort to replicate their performance and behaviour.
While distillation techniques allow AI labs to create smaller, more tailored models for customers at much lower cost, US firms are worried that the adversarial use of such methods by Chinese competitors presents a fundamental risk to their businesses.
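The core idea can be illustrated with a deliberately toy sketch (our illustration, not any lab's actual pipeline): a small "student" model is fitted to a "teacher" model's outputs rather than to ground-truth labels, so the student inherits the teacher's behaviour without access to its internals. The teacher here is a stand-in linear rule and the student a perceptron; all names and numbers are made up for illustration.

```python
# Toy sketch of distillation: train a student on a teacher's outputs.
# The teacher stands in for a large model whose responses can be
# queried but whose weights are hidden.
import random

random.seed(0)

# Hypothetical teacher: a fixed linear decision rule.
TEACHER_W = [2.0, -1.0]

def teacher_predict(x):
    return 1 if TEACHER_W[0] * x[0] + TEACHER_W[1] * x[1] > 0 else -1

# Step 1: query the teacher to build a synthetic training set,
# mirroring how distillation harvests prompt/response pairs at scale.
inputs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
dataset = [(x, teacher_predict(x)) for x in inputs]

# Step 2: train a small student (a perceptron) on the teacher's outputs.
w = [0.0, 0.0]
for _ in range(20):
    for x, y in dataset:
        if (1 if w[0] * x[0] + w[1] * x[1] > 0 else -1) != y:
            w[0] += y * x[0]
            w[1] += y * x[1]

# Step 3: measure how closely the student replicates the teacher.
test = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
agree = sum(
    (1 if w[0] * x[0] + w[1] * x[1] > 0 else -1) == teacher_predict(x)
    for x in test
)
print(f"student/teacher agreement: {agree / len(test):.0%}")
```

The same pattern, scaled up to millions of prompt/response exchanges with a frontier model, is what the labs describe as adversarial distillation.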
In a blog post about detecting and preventing such attacks, AI developer Anthropic accused three Chinese firms – DeepSeek, MiniMax Group Inc and Moonshot AI – of violating its terms of service by collectively creating more than 24,000 fraudulent accounts, which were then used to generate more than 16 million exchanges with its publicly available Claude models.
“Distillation is a widely used and legitimate training method,” it said. “For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.”
It further warned that, because such campaigns are “growing in intensity and sophistication”, addressing the threat to US artificial intelligence companies “will require rapid, coordinated action among industry players, policymakers and the global AI community”.
OpenAI, developer of ChatGPT, has also recently flagged the threat of model distillation to US lawmakers, warning that DeepSeek had been using such techniques as part of “ongoing efforts to free-ride on the capabilities developed by OpenAI and other US frontier labs”.
In a letter to the US House Select Committee on Strategic Competition between the US and the Chinese Communist Party, dated 12 February 2026, OpenAI highlighted how Chinese firms are using “third-party routers” to circumvent access restrictions and lift the data.
“More generally, over the past year, we’ve seen a significant evolution in the broader model-distillation ecosystem,” it said. “For example, Chinese actors have moved beyond chain-of-thought (CoT) extraction toward more sophisticated, multi-stage pipelines that blend synthetic-data generation, large-scale data cleaning, and reinforcement-style preference optimisation.
“We have also seen Chinese companies rely on networks of unauthorised resellers of OpenAI’s services to evade our platform’s controls,” it continued. “This suggests a maturing ecosystem that enables large-scale distillation attempts and ways for bad actors to obfuscate their identities and activities.”
In the case of Anthropic, the developer detailed how Chinese firms were using commercial proxy services that resell access to Claude and other frontier AI models at scale. “These services run what we call ‘hydra cluster’ architectures: sprawling networks of fraudulent accounts that distribute traffic across our API [application programming interface] as well as third-party cloud platforms,” it said.
It added that each distillation campaign by the three Chinese firms was detectable due to abnormal usage patterns, with the volume, structure and focus of the prompts highlighting that a deliberate capability extraction was in progress.
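A minimal sketch of that kind of detection logic (our illustration with made-up account data and thresholds, not Anthropic's actual detector) might flag accounts whose request volume is a statistical outlier or whose prompts are overwhelmingly variations on a single template:

```python
# Hedged sketch of usage-pattern anomaly detection: flag accounts whose
# request volume or prompt uniformity deviates sharply from the rest.
from statistics import mean, stdev

# Hypothetical per-account stats:
# (account_id, daily_requests, fraction of prompts matching one template).
accounts = [
    ("user-001", 40, 0.10),
    ("user-002", 55, 0.15),
    ("user-003", 35, 0.05),
    ("user-004", 60, 0.20),
    ("scrape-9", 9000, 0.97),  # huge volume, near-identical prompts
]

volumes = [v for _, v, _ in accounts]
mu, sigma = mean(volumes), stdev(volumes)

def suspicious(volume, uniformity, z_cut=3.0, uniform_cut=0.9):
    # A z-score on request volume plus a prompt-uniformity threshold;
    # both cut-offs are illustrative values, not real ones.
    z = (volume - mu) / sigma if sigma else 0.0
    return z > z_cut or uniformity > uniform_cut

flagged = [a for a, v, u in accounts if suspicious(v, u)]
print(flagged)  # the bulk-scraping account is flagged
```

Real systems would combine many more signals (request metadata, prompt structure, account creation patterns), but the principle is the same: capability-extraction campaigns leave a statistical fingerprint.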
“In one notable technique, their prompts asked Claude to imagine and articulate the internal reasoning behind a completed response and write it out step by step – effectively generating chain-of-thought training data at scale,” it said. “By examining request metadata, we were able to trace these accounts to specific researchers.”
Google has also separately complained in a report published on 12 February that its Gemini model has increasingly been targeted by distillation attacks, with one campaign creating over 100,000 prompts designed to “replicate Gemini’s reasoning ability in non-English target languages across a wide variety of tasks”.
It added that the “model extraction and subsequent knowledge distillation enable an attacker to accelerate AI model development quickly and at a significantly lower cost. This activity effectively represents a form of intellectual property (IP) theft.”
‘Fair use’ for me, ‘data theft’ for thee
Despite the concerns raised by AI developers, each of these firms has also been widely accused of stealing the underlying data used to train its own models.
In September 2025, for example, Anthropic agreed to pay $1.5bn to settle a class action lawsuit over its use of more than seven million pirated books to train Claude, and is currently facing a separate $3bn lawsuit from music publishers over its alleged pirating of more than 20,000 songs.
OpenAI is also facing 12 copyright cases in New York over its use of materials to train models without consent or compensation.
While those cases were consolidated in April 2025 – largely against the wishes of the individuals and news publishers suing the companies – a transfer order made by the US Judicial Panel on Multidistrict Litigation said the cases “share factual questions arising from allegations that OpenAI and Microsoft used copyrighted works, without consent or compensation, to train their large language models (LLMs) … which underlie defendants’ generative artificial intelligence products”.
AI model training without consent
In the UK, both Google and Microsoft are set to be sued over the allegedly unlawful collection and use of people’s personal data to train their AI models without consent.
The claim – which is being brought by Barings Law – has so far attracted 15,000 claimants, with the law firm alleging a raft of data privacy transgressions, including the collection of information regarding users’ voices, demographics and time spent on apps, as well as personal information such as email addresses and the contents of emails.
A submission to the US Copyright Office on 30 October 2023 by Anthropic highlights how, in the eyes of model developers at least, the use of copyrighted material is integral to creating generative AI systems.
“To the extent copyrighted works are used in training data, it is for analysis (of statistical relationships between words and concepts) that is unrelated to any expressive purpose of the work,” it said. “This sort of transformative use has been recognised as lawful in the past and should continue to be considered lawful in this case.”
It added that using copyrighted works to train its Claude model would count as “fair use” because “it does not prevent the sale of the original works, and, even where commercial, is still sufficiently transformative”.
As part of a separate legal case brought against Anthropic by major music publishers in November 2023, the firm took the argument further, claiming “it would not be possible to amass sufficient content to train a large language model like Claude in arm’s-length licensing transactions, at any price”.
Computer Weekly contacted Anthropic, OpenAI and Google about how the data practices of DeepSeek and other Chinese firms are materially distinct from their own use of others’ IP, but received no response by the time of publication.