Jailbroken Frontier Models Stay Smart: The Vanishing "Jailbreak Tax"

A jailbroken frontier model is still nearly as capable as it was before, and the stronger the model, the smaller that loss is. In an April 2026 paper, Anthropic's Daniel Zhu and colleagues evaluated 28 jailbreaks across five benchmarks and found that Haiku 4.5 loses an average of 33.1% of its benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. In other words, the "jailbreak tax" shrinks as a model gets more capable.

An Old Safety Assumption Just Broke

The jailbreak tax is the capability degradation a model suffers when it is jailbroken. Prior safety discussions leaned on the assumption that "complex jailbreaks make a model dumb, so even harmful outputs are useless." Anthropic's team tested this assumption head-on in 2026 and showed that it collapses for frontier models.

Intuitively, jailbreaks encrypt inputs and outputs or twist the model through roleplay, so they ought to disrupt reasoning. But the measured numbers say that disruption gets smaller the more capable the model is.

The Curve Where Capability Buys the Tax Down

The jailbreak tax is inversely related to model capability, so a stronger Claude model loses less of its skill when jailbroken. Lining up Claude models by capability from Haiku 4.5 to Opus 4.6, the average performance loss under jailbreak falls steadily from 33.1% to 7.7%.

Model	Average performance loss when jailbroken
Haiku 4.5	~33.1%
(mid-capability tier)	gradual decline
Opus 4.6 (max thinking)	~7.7%

The takeaway is simple. A highly capable model recovers most of its original skill even when twisted by encryption or roleplay.

Why Stronger Models Are Less Vulnerable (Analysis)

Read the curve in reverse and an uncomfortable implication appears. Jailbreaks eat into capability because unwrapping the "shell" of encryption or roleplay costs the model cognitive effort. A weaker model spends all of that effort on the shell and then falls apart on the actual task, while a strong model passes through the shell easily and deploys its full underlying skill.

The troubling part is the direction. The capability-improvement curve the industry celebrates lines up exactly with the shrinking-safety-margin curve. The very work of making models smarter is what produces models that "stay smart even when jailbroken." If capability and safety share the same dial, every acceleration of the performance race automatically thins this defense.

The Strongest Jailbreaks Cost Almost Nothing

Top-tier jailbreaks are essentially free in capability terms, meaning the most advanced attacks impose almost no measurable loss. Boundary Point, the strongest jailbreak against deployed classifiers, achieves near-perfect classifier evasion while causing near-zero degradation across safeguarded models.

The 28 jailbreaks the paper evaluated break down as follows:

19 cipher-based jailbreaks — obfuscate inputs and outputs to bypass safeguards.
9 non-cipher jailbreaks — rely on roleplay, prompt injection, and adversarial suffixes.

Both groups were run through five benchmarks, letting the authors compare jailbreak strength against capability loss in one sweep.

Which Tasks Break the Most?

The jailbreak tax is uneven across task types in the Anthropic study, and reasoning-heavy work is hit far harder than knowledge recall. Across all models, reasoning-heavy tasks showed considerably more degradation than knowledge-recall tasks. Even for the most capable models, reasoning tasks retained some loss, while knowledge-recall benchmarks were nearly unaffected by jailbreaks.

So the "jailbreaks make models dumb" shield works only partially, and only on reasoning tasks; it does essentially nothing to stop a model from surfacing dangerous factual knowledge intact.

How to Read the Numbers (Analysis)

The 7.7% figure should be read as a warning, not a reassurance. What matters in a risk assessment is not the average loss but the capability that remains. If Opus 4.6 keeps more than 92% of its skill, then from an attacker's point of view a jailbroken frontier model is effectively an intact tool.

The variance across tasks is no comfort either. The finding that knowledge recall is nearly untouched by jailbreaks means capability loss offers no defense in exactly the scenario where an attacker wants to extract dangerous factual information. That said, this paper measures only Claude models on a specific set of five benchmarks. Whether the same curve reproduces on other architectures or other task families needs separate verification, and the sample is too narrow to generalize "capability equals a shrinking safety margin" to every model.

Implications for Safety Design

A safety case is unsound if it rests on the premise that jailbroken models become dumb. The Anthropic authors conclude that safety cases for frontier models should not depend on a meaningful capability degradation from jailbreaks, because that assumption's safety margin disappears by Opus 4.6 in 2026.

Two practical consequences follow. First, external defenses such as refusals and detection or classifiers become the more central line of defense than the side effect of capability loss. Second, risk assessments should start from the worst-case assumption that "a jailbroken model can do nearly everything," not "a jailbroken model is less useful."

Reference: Jailbroken Frontier Models Retain Their Capabilities (Zhu et al., 2026)