Why do AI scientists fail at autonomous discovery?

The 2026 arXiv paper by Bisht and colleagues argues it is because they are structurally mis-designed, not under-scaled. The four root causes are (1) problem selection skewed by the McNamara fallacy, (2) training corpora that omit tacit laboratory and failure knowledge, (3) preference tuning that compresses diversity toward consensus, and (4) benchmarks with no physical-experiment feedback loop.

What does the McNamara fallacy have to do with AI scientists?

The McNamara fallacy is the trap of optimizing only what is easy to measure while ignoring what matters. AI scientists drift toward problems that are easy to quantify, a selection bias that pushes scientifically important but hard-to-measure problems down the queue.

Can a bigger model or more scaffolding fix it?

The paper says no. The limits are a matter of design choices, not scale or scaffolding, so the fundamental design (problem selection, training data, post-training, and evaluation benchmarks) must be revisited.

Why "AI Scientists" Fail: Four Design Flaws That a Bigger Model Won't Fix

Today's "AI scientists" were not built for autonomous scientific discovery. A position paper posted to arXiv in May 2026 by Harshit Bisht and colleagues argues that today's agentic AI scientists fall short of autonomy not because they are under-scaled but because they are structurally mis-designed, and it pins the failure on four root causes. This article lays out those four design flaws, the "McNamara fallacy" framing, and why neither scale nor scaffolding can fix them, all in a direct-answer format.

The Core Claim: The Problem Is Design, Not Scale

The paper's central claim is that the limits of AI scientists are a matter of design choices, not scale. Bisht and colleagues observe that, as of 2026, AI scientists already function as "co-scientists," yet they are structurally unfit for fully autonomous discovery.

The prescription therefore changes too. Stacking on a bigger model or a more elaborate prompt chain (scaffolding) will not work; the fundamental design choices, namely problem selection, training corpora, post-training, and benchmarks, must be revisited.

First, the McNamara Fallacy: Optimizing Only the Measurable

The first design flaw is that problem selection is skewed by the "McNamara fallacy." The McNamara fallacy is the trap of looking only at easy-to-measure metrics while ignoring what actually matters, and AI scientists drift toward problems that are easy to quantify and measure.

As a result, scientifically important but hard-to-measure or ambiguous problems get pushed down the queue. The very starting point of autonomous discovery, the question "what is worth solving?", is already tilted to one side.
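To make the tilt concrete, here is a minimal sketch, not taken from the paper. The `Problem` class, the three example problems, and their scores are all hypothetical, chosen only to show how ranking by measurability can invert a ranking by importance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    name: str
    measurability: float  # hypothetical 0-1 score: how easy progress is to quantify
    importance: float     # hypothetical 0-1 score: how much the answer would matter


# Illustrative values only; none of these numbers come from the paper.
problems = [
    Problem("improve benchmark accuracy by 1%", measurability=0.95, importance=0.30),
    Problem("explain an anomalous lab result", measurability=0.40, importance=0.85),
    Problem("find a new mechanism for a disease", measurability=0.20, importance=0.95),
]


def rank(candidates, key):
    """Return problem names sorted from highest to lowest score under `key`."""
    return [p.name for p in sorted(candidates, key=key, reverse=True)]


by_measurability = rank(problems, key=lambda p: p.measurability)
by_importance = rank(problems, key=lambda p: p.importance)

print("Ranked by measurability:", by_measurability)
print("Ranked by importance:   ", by_importance)
```

With these made-up scores the two rankings are exact reverses of each other. A selector that only sees the measurability column will reliably start from the least important problem.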

Second and Third: Missing Tacit Knowledge and Consensus-Collapsing Preference Tuning

The second and third flaws are about data and tuning, not scale: what the underlying models are trained on, and how they are then tuned. Both structurally erode the "learning from failure" and "novelty" that autonomous discovery requires.

Second, missing tacit knowledge in the corpus. The large language models that AI scientists ride on are trained on corpora that fail to capture the tacit procedural and failure knowledge of laboratory practice. The paper points to the limit of a literature that records only successes.
Third, preference tuning that collapses diversity. Preference optimization during post-training compresses output diversity toward consensus, producing an anti-novelty bias that treats originality as a threat.

The paper reinforces this with a "hypothesis hivemind" experiment: frontier models from independent providers converge semantically on interpretive and open-ended hypothesis-generation tasks, showing that diversity compression is not the quirk of a single model.
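One way to picture "semantic convergence" is as a high average similarity between hypotheses produced by different models. The sketch below is an assumption-laden toy and not the paper's method: it uses bag-of-words cosine similarity from the standard library as a stand-in for real sentence embeddings, and the example hypotheses are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average cosine similarity over all unordered pairs of texts."""
    vectors = [Counter(t.lower().split()) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Invented outputs standing in for three independent models.
converged = [
    "protein misfolding drives the disease",
    "the disease is driven by protein misfolding",
    "protein misfolding drives disease progression",
]
diverse = [
    "protein misfolding drives the disease",
    "gut microbiome shifts trigger inflammation",
    "a viral infection early in life primes immunity",
]

print("converged set:", round(mean_pairwise_similarity(converged), 3))
print("diverse set:  ", round(mean_pairwise_similarity(diverse), 3))
```

A score near 1 means the models are saying roughly the same thing; a score near 0 means they share almost no vocabulary. Token overlap misses paraphrase, so a real measurement would swap in an embedding model, but the aggregation step would look the same.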

Fourth, Benchmarks With No Feedback Loop

The fourth flaw Bisht and colleagues name is that most scientific benchmarks measure only single-turn prediction accuracy and lack a feedback loop back to physical experiments. Measurement stops at getting one prediction right, so real experimental results never flow back into the computational model.

Autonomous science runs on a closed loop of hypothesis → experiment → revision. Current benchmarks omit this loop, so even a higher score does not verify the actual "ability to perform discovery."
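The difference between the two evaluation styles can be sketched in a few lines. Everything here is a hypothetical toy rather than the paper's benchmark design: `run_experiment` stands in for a physical experiment by returning the gap between a hidden true value and the current hypothesis, and the update rule, the value 37.0, and the step size are arbitrary assumptions.

```python
from typing import Callable, List

TRUE_VALUE = 37.0  # hidden quantity the "experiment" measures (arbitrary toy value)


def run_experiment(hypothesis: float) -> float:
    """Hypothetical stand-in for a lab experiment: returns the signed residual."""
    return TRUE_VALUE - hypothesis


def single_turn_error(guess: float) -> float:
    """Single-turn scoring: one prediction is graded once, with no revision."""
    return abs(TRUE_VALUE - guess)


def closed_loop(
    initial_guess: float,
    experiment: Callable[[float], float],
    rounds: int = 5,
    step: float = 0.5,
) -> List[float]:
    """Hypothesis -> experiment -> revision, repeated for `rounds` iterations."""
    hypothesis = initial_guess
    history = [hypothesis]
    for _ in range(rounds):
        residual = experiment(hypothesis)  # experimental result flows back
        hypothesis += step * residual      # revise the hypothesis using it
        history.append(hypothesis)
    return history


history = closed_loop(20.0, run_experiment)
print("single-turn error:", single_turn_error(20.0))
print("closed-loop error:", abs(TRUE_VALUE - history[-1]))
```

A single-turn benchmark only ever sees the first number. Whether a system can shrink its error once results come back, which is the ability the loop exercises, is invisible to it.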

Comparing the Four Flaws and the Prescription

The four design flaws in the 2026 paper are each tied to a different stage of the scientific discovery process. The table below maps each flaw to the stage it breaks and the direction of fix the paper points to.

Design flaw	Stage it breaks	Direction of fix
McNamara fallacy	Problem selection	Set problems by importance, not measurability
Corpus missing tacit knowledge	Knowledge base	Include failure and procedural knowledge in data
Diversity collapse in preference tuning	Hypothesis generation	Reduce consensus compression, preserve novelty
Benchmarks without feedback	Verification and revision	Adopt closed-loop physical-experiment evaluation

The Thread That Ties the Four Together: Why "Bigger" Isn't the Answer

Read separately, each flaw looks like its own bottleneck; read together, they all point the same way. What scaling mostly buys you is competence inside the training distribution. The four points this paper names all sit outside it: problems that are hard to measure, failures never recorded in the literature, hypotheses that stray from consensus, and results that only an experiment can confirm. A signal that is absent from the data does not appear because you added parameters. That is why the "bigger model" prescription spins its wheels.

This lens also explains why the critique arrives now. That AI scientists are already useful as co-scientists suggests that in-distribution competence is nearing saturation. The remaining gap is the part that scale does not close, so the conversation necessarily moves to fundamental design. The four flaws are best read not as a list of independent bugs but as one question showing up differently at four stages: how do you acquire ability that lies outside the training distribution?

Reading the Paper Critically

A strong diagnosis comes with matching caveats. This is a critique-and-position paper rather than a data-heavy one. The structural diagnosis of the four flaws is persuasive, but it does not quantitatively decompose how much each flaw contributes to the autonomy gap. The "hypothesis hivemind" observation is close to the only empirical support, and even there it remains open whether convergence is always harmful, on which tasks it hurts, and by how much.

The prescriptions, too, are directions rather than methods. "Include failure knowledge in the data" and "adopt closed-loop evaluation" point the right way, but how to obtain that data and which experiments to put in the loop are left to follow-up work. So the piece reads more productively as a reframing that turns attention from the scale race toward design choices, rather than as pessimism that "AI scientists can't work."

Taken together, the conclusion is clear: to turn the AI scientist of 2026 into an autonomous-discovery tool, you must change design choices (problem selection, data, post-training, and evaluation) rather than reach for a bigger model. Just keep in mind that the paper's diagnosis is strong while its concrete solutions are left to follow-up work.

Reference: Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery (Bisht et al., 2026)