Artificial Analysis announced Fable 5 as No. 1 in its coding-agent rankings. The configuration pairing Claude Code with Fable 5 [max] took the lead on the Coding Agent Index with a score of 77, followed by OpenAI Codex + GPT-5.5 [xhigh] at 76 and Claude Code + Opus 4.8 [max] at 73. Fable 5 is Anthropic's latest model, released on June 9, 2026.

What is the DeepSWE benchmark?

DeepSWE is a benchmark that measures an AI's coding ability using real-world software-development tasks built entirely from scratch. Artificial Analysis said it concluded that SWE-Bench Pro, the benchmark it used previously, had become prone to inflated scores because of issues such as repository-history leakage, and replaced it with Datacurve's DeepSWE. The goal of DeepSWE is to reduce benchmark gaming by posing new tasks that do not rely on publicly available code history.

When an evaluator retires a benchmark that has grown vulnerable to gaming and adopts a new one, it reflects a broader push to preserve trust in AI performance figures. And with only a single point separating Fable 5 and GPT-5.5, the result shows how close the race for the lead in coding agents has become.

Anthropic's Fable 5 Takes No. 1 on the DeepSWE Coding Benchmark

Anthropic's Claude Fable 5 has claimed the top spot in the coding-agent evaluation run by AI benchmarking firm Artificial Analysis. Artificial Analysis replaced SWE-Bench Pro, which it used previously, with Datacurve's "DeepSWE," and Fable 5 paired with Claude Code took the lead with a score of 77. OpenAI's GPT-5.5 (Codex) followed with 76. Fable 5 was released on June 9, 2026, and its coding ability is now backed by a result from an independent evaluator.

Reading the Leaderboard Closely

Taken at face value, the results are just three lines: Fable 5 [max] + Claude Code at 77, GPT-5.5 [xhigh] + Codex at 76, and Opus 4.8 [max] + Claude Code at 73. Yet that arrangement says two things at once. First, the gap between first and second is a single point; within any reasonable benchmark margin of error, it is safer to read that as a statistical tie. Second, on the same Claude Code harness, Fable 5 and Opus 4.8 differ by four points. In other words, holding the tooling constant, a change of model generation still moves measured performance meaningfully. The real message of this table is less the No. 1 headline and more the fact that model and tool are now being scored as a single package.
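The two readings above come down to simple arithmetic on the three reported scores. The sketch below is a minimal illustration, not Artificial Analysis's method: the scores are the ones quoted in this article, while `ASSUMED_NOISE_POINTS` is a hypothetical noise band chosen only for demonstration, since no margin of error is given here.

```python
# Scores as quoted in this article.
LEADERBOARD = [
    ("Claude Code + Fable 5 [max]", 77),
    ("OpenAI Codex + GPT-5.5 [xhigh]", 76),
    ("Claude Code + Opus 4.8 [max]", 73),
]

# Hypothetical noise band (an assumption for illustration only):
# the article does not state a margin of error for the index.
ASSUMED_NOISE_POINTS = 2


def gaps_to_leader(board):
    """Return (name, points behind the top score) for each entry."""
    top = max(score for _, score in board)
    return [(name, top - score) for name, score in board]


def within_noise(gap, noise=ASSUMED_NOISE_POINTS):
    """Treat a gap smaller than the assumed noise band as a tie."""
    return gap < noise


if __name__ == "__main__":
    for name, gap in gaps_to_leader(LEADERBOARD):
        label = "treat as tie with leader" if within_noise(gap) else "behind leader"
        print(f"{name}: {gap} pt(s), {label}")
```

Under that assumed band, the one-point gap between the top two configurations counts as a tie, while the four-point gap between Fable 5 and Opus 4.8 on the same harness does not. A different noise assumption would shift where that line falls.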

Why Retire a Benchmark at All

Artificial Analysis retiring SWE-Bench Pro in favor of DeepSWE is more than a routine metric update. Scores on evaluations that lean on public repository history grow inflated over time, because models absorb the answer trails into their training. DeepSWE, which authors fresh tasks from scratch, is an attempt to cut off that contamination. The trend worth noting here is that benchmark shelf lives are getting shorter: the more famous a metric becomes, the faster it gets swallowed into training data and loses its power to discriminate. Trustworthy coding evaluations are likely to keep drifting toward private, continually refreshed task sets.

How Strong Is Fable 5 at Coding?

Fable 5 has posted top-tier results across multiple coding metrics. According to published figures, Fable 5 scored 95.0% on SWE-bench Verified and 80.0% on SWE-bench Pro, and ranked first on the code-focused evaluation FrontierCode. Together with its new No. 1 finish on the DeepSWE-based Coding Agent Index, these results place Fable 5 among the strongest coding models available at the time of its launch.

How a Korean Engineering Team Should Read These Numbers

A local team needs a layer of interpretation before treating this result as a reason to switch tools. DeepSWE tasks are English-based and do not reflect real-world variables such as parsing Korean-language requirements or wrangling in-house legacy code. A one-point lead is a thin basis for a migration. The more useful takeaway for practitioners is buried in the framing: what gets scored is not the model alone but its pairing with an agent tool like Claude Code. If you are weighing adoption, a small pilot on your own repository, one that directly measures code-review pass rates and re-roll costs, is far more reliable than a leaderboard rank.
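One way to keep such a pilot honest is to log every task the same way for each model-and-tool pairing and compare the aggregates. The following is a minimal sketch under stated assumptions: `PilotTask`, its fields, and `summarize` are hypothetical names invented for this example, and "re-roll cost" is approximated here as the number of extra agent runs per task rather than money or time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PilotTask:
    """One task from an in-house pilot (hypothetical record layout)."""
    config: str          # label for the pairing, e.g. "tool-A + model-X"
    review_passed: bool  # did the final change pass human code review?
    attempts: int        # agent runs needed, including re-rolls (>= 1)


def summarize(tasks):
    """Return per-config task count, review pass rate, and mean re-rolls."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task.config].append(task)

    summary = {}
    for config, items in grouped.items():
        n = len(items)
        summary[config] = {
            "tasks": n,
            "pass_rate": sum(t.review_passed for t in items) / n,
            # A re-roll is any run beyond the first attempt.
            "mean_rerolls": sum(t.attempts - 1 for t in items) / n,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        PilotTask("tool-A + model-X", True, 1),
        PilotTask("tool-A + model-X", False, 3),
        PilotTask("tool-B + model-Y", True, 2),
    ]
    for config, stats in summarize(sample).items():
        print(config, stats)
```

Running the same task list through each pairing and comparing `pass_rate` against `mean_rerolls` gives a team numbers drawn from its own codebase and its own requirements, which a public leaderboard cannot supply.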

ASAP's Take: What to Watch For

This announcement carries a useful signal and a risk of over-reading in equal measure. The signal is clear: an evaluator voluntarily discarding a metric that had grown gameable is a healthy move to protect trust in the numbers. The caution is just as clear. A No. 1 finish is a snapshot at one moment on one family of tasks, and the instant a one-point gap is declared a verdict, it slides into marketing language. Fable 5 reaching the top right after its June 9 debut is impressive, but benchmarks flip fast. Treat the numbers as a reference and draw the conclusion inside your own workflow.

Sources: Artificial Analysis · AI Times · LLM-Stats