Why do long-horizon agents fail?

According to LongDS-Bench, 52% to 69% of long-horizon agent failures stem from losing or corrupting the evolving state. The dominant cause is not a single weak reasoning step but the inability to carry accumulated state accurately into later turns, across dependencies that span 11.3 turns on average. Accuracy drops by roughly 47 points from early to late turns, and even the best model reaches only 48.45% average accuracy.

Why are single-turn evaluations insufficient?

Single-turn benchmarks grade one answer per question, so they measure only the quality of a single reasoning step. By construction they cannot capture whether an agent carries state across a dependency chain that averages 11.3 turns. LongDS-Bench isolates three state-evolution skills (rollback, counterfactual perturbation, and multi-state composition) to expose the moment an agent loses state.

Why AI Agents Collapse at Turn 11: The State-Tracking Wall Behind Flashy Demos

The central finding of LongDS-Bench is that long-horizon agents fail not because of a single weak reasoning step but because they lose or corrupt evolving state. Researchers at Zhejiang University and DataMind built the benchmark in May 2026 from 68 real Kaggle notebooks (2,225 turns total, with an average dependency span of 11.3 turns), isolating the state-evolution skills that single-turn evaluations miss. The results: accuracy drops by roughly 47 points from early to late turns, 52%–69% of all failures are long-horizon dependency errors, and even the best model reaches only 48.45% average accuracy.

Why "State" Became the Axis of the Benchmark

LongDS-Bench is a long-horizon, multi-turn data-analysis benchmark that measures an agent's ability to maintain, update, restore, and compose evolving analytical state. Released in 2026 by researchers at Zhejiang University and DataMind, it was built by extracting 2,225 turns from 68 real Kaggle notebooks, spanning six domains including Geoscience, Business, and Education.

Here "state" means the cumulative context in which variables, filters, and intermediate results created in earlier turns become inputs for later ones. An average dependency span of 11.3 turns means that a single decision still affects results roughly 11 turns later, on average. Most prior benchmarks grade one answer per question and thus measure the quality of a single reasoning step; LongDS-Bench instead makes this accumulating context itself the thing being scored. In effect, the unit of evaluation shifts from a single move to a whole game.
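To make "dependency span" concrete, here is a minimal sketch in Python. It assumes one plausible reading of the term: the distance from a turn back to the earliest turn it reads from directly. The article does not give the benchmark's exact definition, and the six-turn session, the depends_on table, and the function name are all hypothetical.

```python
# Hypothetical toy session: each turn lists the earlier turns whose
# outputs (variables, filters, intermediate results) it reads.
depends_on = {
    1: [],      # load the data
    2: [1],     # filter rows
    3: [2],     # derive a column
    4: [1],     # summary of the raw data
    5: [2, 3],  # model built on the filtered, derived table
    6: [5, 1],  # compare the model against the raw baseline
}


def dependency_span(turn: int, deps: dict[int, list[int]]) -> int:
    """Distance from `turn` back to the earliest turn it reads directly (0 if none)."""
    parents = deps[turn]
    return turn - min(parents) if parents else 0


spans = {t: dependency_span(t, depends_on) for t in depends_on}

# Average over turns that depend on something (an assumed convention).
dependent_turns = [s for s in spans.values() if s > 0]
mean_span = sum(dependent_turns) / len(dependent_turns)
print(spans, mean_span)
```

In this toy session the longest reach is five turns. A benchmark whose average is 11.3 asks the agent to keep far older state usable than anything in the example.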

The Three Skills That Split State Apart

LongDS-Bench isolates state-evolution ability along three axes. Rollback is the ability to return to a faulty intermediate step and restore a prior state. Counterfactual perturbation is the ability to consistently update dependent results when one earlier assumption is changed. Multi-state composition is the ability to combine multiple states from different points in time into one.
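The three skills can be illustrated with a small Python sketch. It is not the benchmark's harness: the AnalysisState class, the pipeline function, the sales figures, and the threshold premise are all made up for illustration, and snapshotting with deepcopy is just one simple way to make past states recoverable.

```python
import copy


class AnalysisState:
    """Toy evolving state: named values plus a snapshot taken after each turn."""

    def __init__(self):
        self.values = {}
        self.snapshots = []  # snapshots[i] is the state right after turn i + 1

    def run_turn(self, update):
        """Apply one turn (a function from current values to new values)."""
        self.values.update(update(self.values))
        self.snapshots.append(copy.deepcopy(self.values))

    def rollback(self, turn):
        """Restore the state as it was right after `turn`, discarding later turns."""
        self.values = copy.deepcopy(self.snapshots[turn - 1])
        self.snapshots = self.snapshots[:turn]


def pipeline(threshold):
    """Three dependent turns; `threshold` is the premise fixed in turn 2."""
    state = AnalysisState()
    state.run_turn(lambda v: {"sales": [120, 80, 200, 40]})
    state.run_turn(lambda v: {"kept": [s for s in v["sales"] if s >= threshold]})
    state.run_turn(lambda v: {"mean_kept": sum(v["kept"]) / len(v["kept"])})
    return state


# Rollback: a faulty turn 4 wipes the filter result, so return to turn 3.
base = pipeline(threshold=100)
base.run_turn(lambda v: {"kept": []})
base.rollback(3)

# Counterfactual perturbation: change the turn-2 premise and recompute
# everything that depends on it.
alt = pipeline(threshold=60)

# Multi-state composition: combine the raw data from turn 1 with the
# filtered mean from turn 3.
raw = base.snapshots[0]["sales"]
lift = base.snapshots[2]["mean_kept"] - sum(raw) / len(raw)
print(base.values["kept"], alt.values["mean_kept"], lift)
```

The counterfactual case reruns the whole pipeline, which is the simplest way to keep dependent results consistent. An agent that patches only the changed turn and leaves mean_kept stale makes exactly the kind of error this axis is meant to catch.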

These three axes map almost exactly onto how a real analyst works: reverting a hypothesis, changing a premise and recomputing, and merging several intermediate results into a conclusion. The key shift is that evaluation moves from "getting the answer right" to "sustaining the workflow."

How to Read the 47-Point Late-Turn Drop

Agent accuracy plunges by roughly 47 points from early turns to late turns. When LongDS-Bench evaluated five state-of-the-art models in 2026, even models that performed well early on consistently collapsed in later turns as dependencies accumulated. This decline is not a flaw of one model but a shared pattern, and even the highest-performing model stayed at 48.45% average accuracy.

A caveat on reading this number: the 47-point drop is better explained by earlier state getting lost than by the problems getting harder. On this reading, later turns are not intrinsically tougher questions; an early mistake propagates some 11 turns downstream and the error compounds. That interpretation also fits what practitioners see.
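A back-of-the-envelope model shows why compounding alone can produce a steep late-turn drop. The Python sketch below assumes each turn preserves the needed state independently with the same probability. That assumption is a deliberate simplification that is not taken from the paper, and the fidelity values are illustrative rather than measured.

```python
def p_state_intact(per_turn_fidelity: float, turns: int) -> float:
    """Chance the state survives `turns` steps if each step keeps it
    intact independently with probability `per_turn_fidelity`."""
    return per_turn_fidelity ** turns


# Illustrative fidelities only; none of these is a benchmark measurement.
for p in (0.99, 0.95, 0.90):
    print(f"per-turn fidelity {p:.2f}: intact after 11 turns = {p_state_intact(p, 11):.3f}")
```

Under this toy model, an agent that handles each individual turn correctly 95% of the time still has only about a 57% chance of reaching turn 11 with its state intact, without any single step being especially hard.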

Why More Than Half of Failures Are "State Corruption"

Between 52% and 69% of long-horizon agent failures stem from losing or corrupting the evolving state. The diagnosis is that models mostly fail not at any one reasoning step but at carrying accumulated state accurately into later turns.

Dimension	Single-Turn View	LongDS-Bench Diagnosis
Where failure sits	Individual reasoning step	Corruption of accumulated state
Unit of measurement	One question	11.3-turn average dependency
Late-turn accuracy	Not measured	~47-point drop vs. early
Dominant error	Fragmentary	Long-horizon errors 52%–69%

The implication is clear: mechanisms that safely store, restore, and compose state matter more for a reliable agent than smarter single-step reasoning.

ASAP's Take: Adoption and the Open Questions

The result lands directly on practitioners. Teams evaluating a data-analysis or research agent should judge it not by demo success rates but by its ability to carry state past ten turns. Asking up front "how many turns is this task?" and "will we need to roll back or change a premise midway?" lowers the cost of failure before it happens.
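One way to act on this is to report accuracy by turn position instead of as a single average. The Python sketch below assumes an evaluation log of (turn index, correct) pairs and places the early/late boundary at turn 10. The log rows, the boundary, and the function name are hypothetical, and the numbers are not benchmark results.

```python
from collections import defaultdict

# Hypothetical evaluation log: (turn index within the session, answer correct?).
# These rows are made up for illustration.
log = [
    (1, True), (2, True), (3, True), (4, False), (5, True),
    (9, True), (10, False), (11, False), (12, True), (13, False),
]


def accuracy_by_bucket(records, boundary=10):
    """Score early turns (< boundary) and late turns (>= boundary) separately."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for turn, correct in records:
        bucket = "late" if turn >= boundary else "early"
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


scores = accuracy_by_bucket(log)
drop_points = (scores["early"] - scores["late"]) * 100
print(scores, round(drop_points, 1))
```

A single average over this log would read 60%, which hides that the late bucket sits at 25%. Splitting by turn position is what exposes the gap.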

Limits and open questions remain, though. Because the benchmark is built on Kaggle notebooks, it leans toward structured data analysis, and whether the same pattern carries over to other long-horizon tasks such as code execution or document writing is still to be confirmed. The diagnosis itself, that state corruption is the cause, is well supported. How much of the 47-point gap can be closed by remedies such as external memory or checkpointing is a question for follow-up work. The direction of travel is nonetheless evident: rather than competing to raise single-turn scores, agent research is shifting its center of gravity toward evaluating and strengthening state tracking, rollback, and composition.

Reference: LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis (Xu, Lu, Qiao et al., 2026)