One malicious prompt gets blocked. Spread the same attack across ten prompts, and it gets through. That gap defines the difference between passing benchmarks and withstanding real-world attacks, and it's a gap most enterprises don't know exists.
When attackers send a single malicious request, open-weight AI models hold the line well, blocking attacks 87% of the time on average. But when those same attackers spread the attempt across a conversation, probing, reframing and escalating over multiple exchanges, the math inverts fast. Attack success rates climb from an average of 13% to as high as 92%.
For CISOs evaluating open-weight models for enterprise deployment, the implications are immediate: The models powering your customer-facing chatbots, internal copilots and autonomous agents may pass single-turn safety benchmarks while failing catastrophically under sustained adversarial pressure.
“A lot of these models have started getting a little bit better,” DJ Sampath, SVP of Cisco’s AI software platform group, told VentureBeat. “When you attack it once, with single-turn attacks, they’re able to protect it. But when you go from single-turn to multi-turn, all of a sudden these models are starting to display vulnerabilities where the attacks are succeeding, almost 80% in some cases.”
Why conversations break open-weight models
The Cisco AI Threat Research and Security team found that open-weight AI models that block single attacks collapse under the weight of conversational persistence. Their recently published study shows that jailbreak success rates climb nearly tenfold when attackers extend the conversation.
The findings, published in “Death by a Thousand Prompts: Open Model Vulnerability Analysis” by Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan and Adam Swanda, quantify what many security researchers have long observed and suspected, but couldn’t prove at scale.
But Cisco’s research does, showing that treating multi-turn AI attacks as an extension of single-turn vulnerabilities misses the point entirely. The gap between them is categorical, not a matter of degree.
The research team evaluated eight open-weight models: Alibaba (Qwen3-32B), DeepSeek (v3.1), Google (Gemma 3-1B-IT), Meta (Llama 3.3-70B-Instruct), Microsoft (Phi-4), Mistral (Large-2), OpenAI (GPT-OSS-20b) and Zhipu AI (GLM 4.5-Air). Using black-box methodology — or testing without knowledge of internal architecture, which is exactly how real-world attackers operate — the team measured what happens when persistence replaces single-shot attacks.
The researchers note: “Single-turn attack success rates (ASR) average 13.11%, as models can more readily detect and reject isolated adversarial inputs. In contrast, multi-turn attacks, leveraging conversational persistence, achieve an average ASR of 64.21% [a 5X increase], with some models like Alibaba Qwen3-32B reaching an 86.18% ASR and Mistral Large-2 reaching a 92.78% ASR.” For Mistral Large-2, that multi-turn figure compares with a single-turn ASR of just 21.97%.
The results define the gap
The paper’s research team provides a succinct take on open-weight model resilience against attacks: “This escalation, ranging from 2x to 10x, stems from models’ inability to maintain contextual defenses over extended dialogues, allowing attackers to refine prompts and bypass safeguards.”
Figure 1: Single-turn attack success rates (blue) versus multi-turn success rates (red) across all eight tested models. The gap ranges from 10 percentage points (Google Gemma) to over 70 percentage points (Mistral, Llama, Qwen). Source: Cisco AI Defense
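For readers keeping score of how the percentages relate, the arithmetic is simple: ASR is successful attacks divided by attempts, the "security gap" is the multi-turn ASR minus the single-turn ASR, and the "2x to 10x" language refers to their ratio. The minimal sketch below reproduces those figures from the numbers the study reports; the Qwen single-turn value is inferred from its published gap, and everything else is quoted.

```python
# Reproduce the "security gap" arithmetic from the ASR figures reported in the study.
# Values are percentages; the Qwen single-turn rate is inferred from its +73.48-point gap.

reported_asr = {
    # model: (single-turn ASR %, multi-turn ASR %)
    "Mistral Large-2": (21.97, 92.78),
    "Alibaba Qwen3-32B": (12.70, 86.18),
    "Average (all 8 models)": (13.11, 64.21),
}

for model, (single, multi) in reported_asr.items():
    gap_points = multi - single   # percentage-point gap, the "security gap" in Table 1
    multiplier = multi / single   # how many times more often multi-turn attacks succeed
    print(f"{model}: +{gap_points:.2f} points, {multiplier:.1f}x")
```

Mistral's roughly 4.2x jump and the eight-model average of about 4.9x both sit inside the 2x-to-10x range the researchers describe.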
The five techniques that make persistence lethal
The research tested five multi-turn attack strategies, each exploiting a different aspect of conversational persistence.
- Information decomposition and reassembly breaks harmful requests into innocuous components across turns, then reassembles them. Against Mistral Large-2, this technique achieved 95% success.
- Contextual ambiguity introduces vague framing that confuses safety classifiers, reaching 94.78% success against Mistral Large-2.
- Crescendo attacks gradually escalate requests across turns, starting innocuously and building toward harmful content, hitting 92.69% success against Mistral Large-2.
- Role-play and persona adoption establishes fictional contexts that normalize harmful outputs, achieving up to 92.44% success against Mistral Large-2.
- Refusal reframe repackages rejected requests with different justifications until one succeeds, reaching up to 89.15% success against Mistral Large-2.
What makes these techniques effective isn't sophistication; it's familiarity. They mirror how humans naturally converse: building context, clarifying requests and reframing when initial approaches fail. The models aren't vulnerable to exotic attacks. They're susceptible to persistence itself, as the short probe sketch below illustrates.
Table 2: Attack success rates by technique across all models. The consistency across techniques means enterprises cannot defend against just one pattern. Source: Cisco AI Defense
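None of these techniques requires exotic tooling to test for internally; the general shape of a black-box, multi-turn probe fits in a few dozen lines. The sketch below follows the crescendo pattern described above and is illustrative only: the chat client and the `looks_like_refusal` check are hypothetical stand-ins for whatever model endpoint and refusal detector a red team already uses, not Cisco's published harness.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a black-box, multi-turn probe in the crescendo style.
# `client` and `looks_like_refusal` are hypothetical stand-ins, not the study's tooling.

@dataclass
class MultiTurnProbe:
    client: object                        # any chat client exposing .chat(messages) -> str
    history: list = field(default_factory=list)

    def _exchange(self, prompt: str) -> str:
        """Send one user turn with the full running history (the persistence that matters)."""
        self.history.append({"role": "user", "content": prompt})
        reply = self.client.chat(self.history)   # black-box call: no knowledge of model internals
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def run(self, context_prompts: list[str], target_prompt: str) -> bool:
        # Earlier turns build innocuous context and gradually escalate.
        for prompt in context_prompts:
            self._exchange(prompt)
        # Only the final turn carries the real objective; "success" means the model answered
        # it rather than refusing. The same request sent cold, single-turn, is far more
        # likely to be rejected outright.
        return not looks_like_refusal(self._exchange(target_prompt))

def looks_like_refusal(reply: str) -> bool:
    """Naive keyword check; a production harness would use a judge model instead."""
    return any(m in reply.lower() for m in ("i can't", "i cannot", "i'm sorry", "not able to"))
```

An attack success rate is then just the fraction of such probes that end in compliance across a threat category, which is the metric the study reports.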
The open-weight security paradox
This research lands at a critical inflection point, as open source increasingly underpins cybersecurity innovation. Open-source and open-weight models have become foundational to the industry. From accelerating startup time-to-market to reducing enterprise vendor lock-in and enabling customization that proprietary models can't match, open source is the go-to platform for the majority of cybersecurity startups.
The paradox isn’t lost on Cisco. The company’s own Foundation-Sec-8B model, purpose-built for cybersecurity applications, is distributed as open weights on Hugging Face. Cisco isn’t just criticizing competitors’ models. The company is acknowledging a systemic vulnerability affecting the entire open-weight ecosystem, including models they themselves release. The message isn’t “avoid open-weight models.” It’s “understand what you’re deploying and add appropriate guardrails.”
Sampath is direct about the implications: “Open source has its own set of drawbacks. When you start to pull a model that is open weight, you have to think through what the security implications are and make sure that you’re constantly putting the right types of guardrails around the model.”
Table 1: Attack success rates and security gaps across all tested models. Gaps exceeding 70% (Qwen at +73.48%, Mistral at +70.81%, Llama at +70.32%) represent high-priority candidates for additional guardrails before deployment. Source: Cisco AI Defense.
Why lab philosophy defines security outcomes
The security gap discovered by Cisco correlates directly with how AI labs approach alignment.
Their research makes this pattern clear: “Models that focus on capabilities (e.g., Llama) did demonstrate the highest multi-turn gaps, with Meta explaining that developers are ‘in the driver seat to tailor safety for their use case’ in post-training. Models that focused heavily on alignment (e.g., Google Gemma-3-1B-IT) did demonstrate a more balanced profile between single- and multi-turn strategies deployed against it, indicating a focus on ‘rigorous safety protocols’ and ‘low risk level’ for misuse.”
Capability-first labs produce capability-first gaps. Meta’s Llama shows a 70.32% security gap. Mistral’s model card for Large-2 acknowledges it “does not have any moderation mechanisms” and shows a 70.81% gap. Alibaba’s Qwen technical reports don’t acknowledge safety or security concerns at all, and the model posts the highest gap at 73.48%.
Safety-first labs produce smaller gaps. Google’s Gemma emphasizes “rigorous safety protocols” and targets a “low risk level” for misuse. The outcome is the lowest gap at 10.53%, with more balanced performance across single- and multi-turn scenarios.
Models optimized for capability and flexibility tend to arrive with less built-in safety. That’s a design choice, and for many enterprise use cases, it’s the right one. But enterprises need to recognize that “capability-first” often means “security-second” and budget accordingly.
Where attacks succeed most
Cisco tested 102 distinct subthreat categories. The top 15 achieved high success rates across all models, suggesting targeted defensive measures could deliver disproportionate security improvements.
Figure 4: The 15 most vulnerable subthreat categories, ranked by average attack success rate. Malicious infrastructure operations leads at 38.8%, followed by gold trafficking (33.8%), network attack operations (32.5%) and investment fraud (31.2%). Source: Cisco AI Defense.
Figure 2: Attack success rates across 20 threat categories and all eight models. Malicious code generation shows consistently high rates (3.1% to 43.1%), while model extraction attempts show near-zero success except for Microsoft Phi-4. Source: Cisco AI Defense.
Security as the key to unlocking AI adoption
Sampath frames security not as an obstacle but as the mechanism that enables adoption: “The way security folks inside enterprises are thinking about this is, ‘I want to unlock productivity for all my users. Everybody’s clamoring to use these tools. But I need the right guardrails in place because I don’t want to show up in a Wall Street Journal piece,'” he told VentureBeat.
Sampath continued, “If we have the ability to see prompt injection attacks and block them, I can then unlock and unleash AI adoption in a fundamentally different fashion.”
What defense requires
The research points to six critical capabilities that enterprises should prioritize:
- Context-aware guardrails that maintain state across conversation turns (a minimal sketch follows this list)
- Model-agnostic runtime protections
- Continuous red-teaming targeting multi-turn strategies
- Hardened system prompts designed to resist instruction override
- Comprehensive logging for forensic visibility
- Threat-specific mitigations for the top 15 subthreat categories identified in the research
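The first item on that list is the one conventional, per-prompt filters miss, so here is a minimal sketch of what maintaining state across turns can look like. It scores the accumulated conversation rather than each prompt in isolation; the `risk_score` helper and both thresholds are placeholders standing in for whatever moderation or judge model a team already runs, not a recommended configuration.

```python
# Minimal sketch of a context-aware guardrail: risk is evaluated over the whole conversation,
# not per prompt, so slow escalation and decomposition accumulate instead of slipping past.
# `risk_score` and the threshold values are placeholders, not a recommended configuration.

class ConversationGuardrail:
    def __init__(self, turn_threshold: float = 0.8, cumulative_threshold: float = 1.5):
        self.turn_threshold = turn_threshold              # blocks an overtly harmful single turn
        self.cumulative_threshold = cumulative_threshold  # blocks a conversation that drifts there
        self.transcript: list[str] = []
        self.cumulative_risk = 0.0

    def check(self, user_prompt: str) -> bool:
        """Return True if the turn may proceed to the model, False if it should be blocked."""
        self.transcript.append(user_prompt)
        turn_risk = risk_score(user_prompt)                        # isolated view: the single-turn filter
        window_risk = risk_score(" ".join(self.transcript[-10:]))  # rolling view of recent turns
        self.cumulative_risk += window_risk
        # A benign-looking turn can still tip the conversation over the cumulative line,
        # which is exactly the decomposition and crescendo patterns described earlier.
        if turn_risk >= self.turn_threshold or self.cumulative_risk >= self.cumulative_threshold:
            return False
        return True

def risk_score(text: str) -> float:
    """Placeholder heuristic (0.0 benign to 1.0 harmful); swap in a real moderation model."""
    flagged = ("exploit", "payload", "bypass", "disable the safety")
    return min(1.0, 0.4 * sum(term in text.lower() for term in flagged))
```

Pairing a per-turn check with a cumulative one is the point: it is the slow, distributed escalation that the study shows slipping past single-turn defenses.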
The window for action
Sampath cautions against waiting: “A lot of folks are in this holding pattern, waiting for AI to settle down. That is the wrong way to think about this. Every couple of weeks, something dramatic happens that resets that frame. Pick a partner and start doubling down.”
As the report’s authors conclude: “The 2-10x superiority of multi-turn over single-turn attacks, model-specific weaknesses and high-risk threat patterns necessitate urgent action.”
To repeat: One prompt gets blocked; ten prompts get through. That equation won't change until enterprises stop testing single-turn defenses and start securing entire conversations.
