Leading AI models are more vulnerable to malicious prompts than vendors claim


Dive Brief:

Major AI developers’ model-safety claims rest on incorrect assumptions about how hackers behave, Cisco researchers said in a report published on Wednesday.
AI vendors assume that their models are safe from hijacking if they can fend off a single malicious prompt at a time, but hackers are increasingly using multistage prompts to evade model defenses, Cisco said, and most models aren’t prepared for those kinds of attacks.
The new report illustrates a largely underappreciated danger lurking inside AI models, one that could expose businesses using these tools to a wide range of disruptions and harm.

Dive Insight:

Cisco’s evaluation of 15 leading AI models from OpenAI, Anthropic, Google, Amazon and xAI “found that single-turn attack success rate (ASR) is not a reliable proxy for what happens when an attacker can adapt across turns,” researchers Nicholas Conley and Amy Chang wrote.

Their tests revealed that AI models were much more susceptible to multi-turn malicious prompts: success rates ranged from 8% to 88%, compared with a range of 2% to 65% for single-turn prompts.

“Every model we tested exhibited non-trivial multi-turn ASR,” Conley and Chang wrote.

The two researchers previously collaborated on a November 2025 report that found open-weight AI models were between two and 10 times as vulnerable to multi-turn attacks as they were to single-turn attacks.

“The pattern we documented in open models holds in closed ones,” they wrote in their new study. “No frontier closed model in this cohort can be characterized as safe under iterative attack. This is a claim about the current state of the closed-model frontier, not about any single vendor.”

One of the study’s most significant findings was a correlation between AI companies’ priorities and their models’ safety. Conley and Chang found that AI developers that publicly emphasized their models’ increasing power produced models with the biggest gap between vulnerability to single-turn attacks and vulnerability to multi-turn attacks. Developers whose public statements emphasized model safety had smaller disparities, suggesting a more concerted effort to minimize risks.

The researchers tested five strategies: role-playing, misdirection, information decomposition, refusal reframing and incremental escalation. An xAI model, Grok 4.1 Fast Non-Reasoning, performed the worst, with researchers succeeding in 88% of their multi-turn attacks. (They succeeded in 34% of single-turn attacks against the model.)

The best-performing model, Amazon’s Nova 2 Lite, failed to withstand only 8% of multi-turn attacks, although the researchers said the figure “still represents meaningful residual risk.”

Conley and Chang noted that Grok 4.1 performed significantly better with reasoning enabled, suggesting that AI vendors should “document the safety-relevant effects” of configuration decisions like reasoning status.

OpenAI, Anthropic, Google, Amazon and xAI did not immediately respond to requests for comment.

Vendors need to rethink how they evaluate AI model safety, the researchers said, and businesses need more information about potential gaps between models’ single-turn and multi-turn attack resilience.

“For business decisions made on the basis of published single-turn scores, this presents security and governance risk,” Conley and Chang wrote. “A model with 2.74% single-turn ASR is not the same product as a model that holds the line at 24.68% multi-turn ASR. Without paired-regime data, the two are indistinguishable on most public evaluations, and the end user never sees the gap.”
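
To make the "paired-regime" idea concrete, here is a minimal Python sketch of how a buyer might record single-turn and multi-turn ASR side by side and compute the gap between them. The `PairedASR` class and its field names are hypothetical, not something from the Cisco report; the figures are the two pairs quoted in this article, and the second pair is labeled as unnamed because the article does not say which model it belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedASR:
    """Hypothetical record pairing both attack-success rates for one model."""
    label: str
    single_turn: float  # single-turn ASR, in percent
    multi_turn: float   # multi-turn ASR, in percent

    @property
    def gap_points(self) -> float:
        # Absolute gap in percentage points between the two regimes.
        return self.multi_turn - self.single_turn

    @property
    def ratio(self) -> float:
        # How many times more often multi-turn attacks succeed.
        # A zero single-turn ASR makes the ratio undefined, so report infinity.
        if self.single_turn == 0:
            return float("inf")
        return self.multi_turn / self.single_turn


# Figures as quoted in the article; nothing here is measured by this sketch.
records = [
    PairedASR("Grok 4.1 Fast Non-Reasoning", single_turn=34.0, multi_turn=88.0),
    PairedASR("unnamed model in the researchers' example", single_turn=2.74, multi_turn=24.68),
]

for r in records:
    print(f"{r.label}: +{r.gap_points:.2f} points, {r.ratio:.1f}x")
```

Reporting both the point gap and the ratio matters because they can tell different stories: in these two examples the model with the lower single-turn score has the smaller absolute gap but the larger relative one.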
