Jailbroken Frontier Models Stay Smart: The Vanishing "Jailbreak Tax"
A jailbroken frontier model retains most of its capability, and the stronger the model, the less it loses. In an April 2026 paper, Anthropic's Daniel Zhu and colleagues evaluated 28 jailbreaks across five benchmarks and found that Haiku 4.5 loses an average of 33.1% of its benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. In other words, the "jailbreak tax" shrinks as a model gets more capable.
What Is the Jailbreak Tax?
The jailbreak tax is the capability degradation a model suffers when it is jailbroken. Prior safety discussions leaned on the assumption that "complex jailbreaks make a model dumb, so even harmful outputs are useless." Anthropic's team tested this assumption head-on in 2026 and showed that it collapses for frontier models.
Intuitively, jailbreaks encrypt inputs and outputs or twist the model through roleplay, so they ought to disrupt reasoning. But the measured numbers say that disruption gets smaller the more capable the model is.
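The definition above can be made concrete with a small calculation. This is a minimal sketch that assumes the tax is a relative loss, that is, the drop in score divided by the un-jailbroken baseline score. The article does not spell out the paper's exact formula, so the function names and the example scores here are hypothetical.

```python
def jailbreak_tax(baseline_score: float, jailbroken_score: float) -> float:
    """Relative capability loss as a fraction of the baseline score.

    Assumption: "tax" means (baseline - jailbroken) / baseline.
    """
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - jailbroken_score) / baseline_score


def mean_tax(pairs):
    """Average the tax over (baseline, jailbroken) score pairs."""
    taxes = [jailbreak_tax(baseline, jailbroken) for baseline, jailbroken in pairs]
    return sum(taxes) / len(taxes)


# Hypothetical accuracy pairs, not figures from the paper.
example_pairs = [(0.80, 0.60), (0.50, 0.45)]
example_mean = mean_tax(example_pairs)
```

Under this reading, a tax of 0.331 corresponds to the 33.1% average loss reported for Haiku 4.5, and 0.077 to the 7.7% reported for Opus 4.6.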
The Tax Shrinks as Models Get Stronger
The jailbreak tax is inversely related to model capability, so a stronger Claude model loses less of its skill when jailbroken. Lining up Claude models by capability from Haiku 4.5 to Opus 4.6, the average performance loss under jailbreak falls steadily from 33.1% to 7.7%.
| Model | Average performance loss when jailbroken |
|---|---|
| Haiku 4.5 | ~33.1% |
| (mid-capability tier) | gradual decline |
| Opus 4.6 (max thinking) | ~7.7% |
The takeaway is simple. A highly capable model retains most of its original skill even when twisted by encryption or roleplay.
The Strongest Jailbreaks Cost Almost Nothing
Top-tier jailbreaks are essentially free in capability terms, meaning the most advanced attacks impose almost no measurable loss. Boundary Point, the strongest jailbreak against deployed classifiers, achieves near-perfect classifier evasion while causing near-zero degradation across safeguarded models.
The 28 jailbreaks the paper evaluated break down as follows:
- 19 cipher-based jailbreaks: obfuscate inputs and outputs to bypass safeguards.
- 9 non-cipher jailbreaks: rely on roleplay, prompt injection, and adversarial suffixes.
Both groups were run through five benchmarks, letting the authors compare jailbreak strength against capability loss in one sweep.
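One way to picture the comparison of jailbreak strength against capability loss is to filter for attacks that evade safeguards well while costing little capability. The sketch below is illustrative only: the `JailbreakResult` record, the thresholds, and every number in it are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JailbreakResult:
    name: str
    evasion_rate: float  # fraction of attempts that bypass safeguards
    tax: float           # relative capability loss, 0.0 means no loss


def cheap_and_effective(results, min_evasion=0.9, max_tax=0.05):
    """Return attacks with high evasion and low tax, lowest tax first.

    The thresholds are arbitrary illustrative defaults.
    """
    hits = [
        r for r in results
        if r.evasion_rate >= min_evasion and r.tax <= max_tax
    ]
    return sorted(hits, key=lambda r: r.tax)


# Made-up measurements for three made-up attacks.
results = [
    JailbreakResult("hypothetical_cipher_a", evasion_rate=0.95, tax=0.30),
    JailbreakResult("hypothetical_roleplay_b", evasion_rate=0.60, tax=0.02),
    JailbreakResult("hypothetical_attack_c", evasion_rate=0.98, tax=0.01),
]
worrying = cheap_and_effective(results)
```

An attack that lands in `worrying` is the kind the section describes: it evades safeguards almost every time and leaves the model's capability nearly intact.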
Which Tasks Break the Most?
The jailbreak tax is uneven across task types in the Anthropic study. Across all models, reasoning-heavy tasks degraded considerably more than knowledge-recall tasks. Even the most capable models still lost some performance on reasoning tasks, while knowledge-recall benchmarks were nearly unaffected by jailbreaks.
So the "jailbreaks make models dumb" shield works only partially, and only on reasoning tasks; it does essentially nothing to stop a model from surfacing dangerous factual knowledge intact.
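The split between reasoning and recall can be sketched as a per-category average of the tax. This again assumes the tax is a relative loss against the baseline score. The category labels and scores below are hypothetical placeholders, not the paper's benchmarks or results.

```python
from collections import defaultdict


def mean_tax_by_category(rows):
    """Average relative loss per task category.

    rows: iterable of (category, baseline_score, jailbroken_score),
    with every baseline_score greater than zero.
    """
    buckets = defaultdict(list)
    for category, baseline, jailbroken in rows:
        buckets[category].append((baseline - jailbroken) / baseline)
    return {category: sum(v) / len(v) for category, v in buckets.items()}


# Made-up scores chosen to mirror the qualitative pattern described above.
rows = [
    ("reasoning", 0.80, 0.60),
    ("reasoning", 0.50, 0.40),
    ("knowledge_recall", 0.90, 0.90),
    ("knowledge_recall", 0.80, 0.76),
]
by_category = mean_tax_by_category(rows)
```

Reporting the tax per category rather than as one pooled average keeps a near-zero recall tax from being hidden behind a larger reasoning tax.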
Implications for Safety Design
A safety case is unsound if it rests on the premise that jailbroken models become dumb. The Anthropic authors conclude that safety cases for frontier models should not depend on a meaningful capability degradation from jailbreaks, because the safety margin that assumption once provided has all but disappeared by Opus 4.6.
Two practical consequences follow. First, external defenses such as refusals, detection, and classifiers become the central line of defense, rather than the side effect of capability loss. Second, risk assessments should start from the worst-case assumption that "a jailbroken model can do nearly everything," not "a jailbroken model is less useful."
Reference: Jailbroken Frontier Models Retain Their Capabilities (Zhu et al., 2026)