ASAPAi Soon As Possible · AI & tech, delivered fastest

Why AI Agents Collapse at Turn 11: The State-Tracking Wall Behind Flashy Demos

ASAP
2026-06-19 · 3 min read

Long-horizon agents fail not because of a single weak reasoning step but because they lose or corrupt evolving state. That is the central finding of LongDS-Bench. Researchers at Zhejiang University and DataMind built the benchmark in May 2026 from 68 real Kaggle notebooks (2,225 turns total, with an average dependency span of 11.3 turns), isolating the state-evolution skills that single-turn evaluations miss. The results are stark: accuracy drops by roughly 47 points from early to late turns, 52%–69% of all failures are long-horizon dependency errors, and even the best model reaches only 48.45% average accuracy.

What Does LongDS-Bench Measure?

LongDS-Bench is a long-horizon, multi-turn data-analysis benchmark that measures an agent's ability to maintain, update, restore, and compose evolving analytical state. Released in 2026 by researchers at Zhejiang University and DataMind, it was built by extracting 2,225 turns from 68 real Kaggle notebooks, spanning six domains including Geoscience, Business, and Education.

The key concept is "state." Rather than a single question-and-answer pair, the benchmark targets cumulative work in which variables, filters, and intermediate results created in earlier turns become inputs for later ones. The average dependency span is 11.3 turns, meaning that a result typically depends on state set roughly 11 turns earlier.
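To make "dependency span" concrete, here is a minimal sketch. It assumes the span of a turn is the distance back to the earliest turn it depends on; the benchmark's exact definition may differ. The function name and the toy session are hypothetical and are not benchmark data.

```python
from statistics import mean


def dependency_spans(deps: dict[int, list[int]]) -> list[int]:
    """Return, for each turn with dependencies, the distance to the
    earliest turn it depends on (an assumed definition of "span")."""
    spans = []
    for turn, earlier in sorted(deps.items()):
        if earlier:  # turns with no dependencies contribute no span
            spans.append(turn - min(earlier))
    return spans


# Hypothetical session: turn number -> earlier turns whose state it reuses.
toy_deps = {1: [], 2: [1], 3: [1, 2], 4: [3], 12: [1, 4]}

spans = dependency_spans(toy_deps)
print(spans)        # [1, 2, 1, 11]
print(mean(spans))  # 3.75
```

In this toy session most turns reach back only one or two steps, but turn 12 reuses a value from turn 1. A benchmark-wide average of 11.3 implies that long reaches like that are the norm rather than the exception.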

Why Single-Turn Evaluations Fall Short

Single-turn benchmarks are structurally blind to the moment an agent loses state. In LongDS-Bench, 52%–69% of all failures are long-horizon dependency errors of exactly this kind. Grading one answer per question measures the quality of a single reasoning step but not the ability to carry state across an average dependency span of 11.3 turns.

To address this, LongDS-Bench isolates and evaluates three state-evolution skills.

  1. Rollback: the ability to undo a faulty intermediate step and restore the state that preceded it.
  2. Counterfactual perturbation: the ability to consistently update dependent results when one earlier assumption is changed.
  3. Multi-state composition: the ability to combine multiple states from different points in time into one.
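The three skills above can be illustrated with a small sketch. It is not the benchmark's harness: the `run` helper, the `Step` type, and the three-turn session are hypothetical, and replaying from scratch is just one simple way to keep dependent results consistent.

```python
import copy
from typing import Callable, Optional

# A step reads the current state and returns the keys it sets or updates.
Step = Callable[[dict], dict]


def run(steps: list[Step], upto: Optional[int] = None) -> list[dict]:
    """Replay steps from an empty state; return one snapshot per turn."""
    state: dict = {}
    snapshots = []
    for step in steps[:upto]:
        state.update(step(state))
        snapshots.append(copy.deepcopy(state))  # freeze the state after this turn
    return snapshots


# Hypothetical session: set a filter, apply it, aggregate the result.
steps = [
    lambda s: {"threshold": 10},
    lambda s: {"rows": [v for v in [4, 12, 25, 8] if v > s["threshold"]]},
    lambda s: {"total": sum(s["rows"])},
]
snaps = run(steps)

# 1. Rollback: restore the state exactly as it was after turn 1.
restored = copy.deepcopy(snaps[0])

# 2. Counterfactual perturbation: change turn 1, then replay every
#    later turn so dependent results stay consistent.
counterfactual = run([lambda s: {"threshold": 5}] + steps[1:])

# 3. Multi-state composition: combine values from different turns
#    (and from both timelines) into one result.
composed = {
    "threshold_at_turn_1": snaps[0]["threshold"],
    "total_at_turn_3": snaps[2]["total"],
    "total_if_threshold_5": counterfactual[2]["total"],
}
print(restored)  # {'threshold': 10}
print(composed)  # {'threshold_at_turn_1': 10, 'total_at_turn_3': 37, 'total_if_threshold_5': 45}
```

Each skill depends on the state at every turn being recoverable, which is what an agent loses when it tracks state only implicitly in a long context.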

Accuracy Collapses Toward the Later Turns

Agent accuracy plunges by roughly 47 points from early turns to late turns. When LongDS-Bench evaluated five state-of-the-art models in 2026, even models that performed relatively well early on consistently collapsed in later turns as dependencies accumulated.

This decline is not specific to one model but a shared pattern. Even the highest-performing model stayed at 48.45% average accuracy, and a sub-50% score shows just how hard it is to hold state to the end of a long-horizon task.
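An early-to-late drop of this kind is a difference between two accuracies, measured in percentage points. The sketch below shows the arithmetic on a toy record list. The cutoff between "early" and "late" turns and the records themselves are assumptions for illustration, not the benchmark's actual bucketing or data.

```python
def accuracy(records: list[tuple[int, bool]]) -> float:
    """Percentage of records that were answered correctly."""
    return 100.0 * sum(ok for _, ok in records) / len(records)


def early_late_drop(records: list[tuple[int, bool]], cutoff: int) -> float:
    """Accuracy on turns <= cutoff minus accuracy on later turns,
    in percentage points."""
    early = [r for r in records if r[0] <= cutoff]
    late = [r for r in records if r[0] > cutoff]
    return accuracy(early) - accuracy(late)


# Hypothetical (turn, correct) records for one agent run.
toy_records = [
    (1, True), (2, True), (3, True), (4, False),
    (9, True), (10, False), (11, False), (12, False),
]
print(early_late_drop(toy_records, cutoff=4))  # 50.0
```

Here early accuracy is 75% and late accuracy is 25%, a 50-point drop. A single average over all eight turns would hide that split.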

The Cause of Failure Is "State Corruption"

Between 52% and 69% of all agent failures on the benchmark are long-horizon dependency errors, in which the evolving state is lost or corrupted. In other words, the dominant cause of failure is not that the model fails at any one reasoning step, but that it cannot carry accumulated state accurately into the later turns.

Dimension            | Single-Turn View          | LongDS-Bench Diagnosis
Where failure sits   | Individual reasoning step | Corruption of accumulated state
Unit of measurement  | One question              | 11.3-turn average dependency
Late-turn accuracy   | Not measured              | ~47-point drop vs. early
Dominant error       | Fragmentary               | Long-horizon errors 52%–69%

The implication is clear: mechanisms that safely store, restore, and compose state matter more for a reliable agent than smarter single-step reasoning.

The Gap Between Flashy Demos and Real Reliability

A flashy short demo is not evidence of long-horizon reliability, the property LongDS-Bench was built to measure. Demos usually end within a few turns, but real data analysis runs past 11 turns while continuously depending on earlier state. LongDS-Bench quantifies exactly this "wall at turn 11" and exposes the gap between what agents show in demos and what they deliver in practice as of 2026.

The next challenge is also clear. Rather than competing to raise single-turn reasoning scores, agent research should next evaluate and strengthen state tracking, rollback, and composition.


Reference: LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis (Xu, Lu, Qiao et al., 2026)
