The longer the document, the more AI fabricates: a 172-billion-token hallucination study

In document question answering, AI hallucination gets worse as the context grows longer. A March 2026 paper, "How Much Do LLMs Hallucinate in Document Q&A Scenarios?", evaluated 35 open-weight models across 172 billion tokens. At 32K tokens even the top models fabricated at rates of 5 to 7%, and at 200K tokens every model exceeded 10%. The key point is that the ability to find facts and the ability not to fabricate are separate. ASAP summarizes the results from the primary source.

At 200K tokens every model exceeded 10%

The paper found that hallucination rises sharply as the context grows longer. At 32K tokens the top models fabricated at rates of 5 to 7%, the rate nearly tripled at 128K, and at 200K all 35 models exceeded 10%. These figures come from measurements spanning 172 billion tokens.

"Finding facts" and "not fabricating" are different

The paper's key finding is that the ability to find evidence and the ability not to fabricate are separate. A model that is good at locating facts can still generate falsehoods, so the two abilities must be measured separately.
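
As a rough illustration of what measuring the two abilities separately could look like, here is a minimal Python sketch. It assumes a hypothetical grading scheme in which each question is labeled as answerable or unanswerable from the document; the `Graded` record, both function names, and the sample grades are invented for this example and are not the paper's evaluation method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graded:
    """One graded response (hypothetical schema, not from the paper)."""
    answerable: bool  # True if the document actually contains the answer
    answered: bool    # True if the model answered instead of abstaining
    correct: bool     # True if the given answer matches the document


def retrieval_accuracy(items):
    """Share of answerable questions that the model answered correctly."""
    pool = [g for g in items if g.answerable]
    if not pool:
        raise ValueError("no answerable questions to score")
    return sum(g.answered and g.correct for g in pool) / len(pool)


def fabrication_rate(items):
    """Share of unanswerable questions where the model answered anyway."""
    pool = [g for g in items if not g.answerable]
    if not pool:
        raise ValueError("no unanswerable questions to score")
    return sum(g.answered for g in pool) / len(pool)


# Made-up grades for illustration only; not data from the paper.
grades = [
    Graded(answerable=True, answered=True, correct=True),
    Graded(answerable=True, answered=True, correct=True),
    Graded(answerable=True, answered=True, correct=False),
    Graded(answerable=True, answered=True, correct=True),
    Graded(answerable=False, answered=True, correct=False),
    Graded(answerable=False, answered=False, correct=False),
]

print(retrieval_accuracy(grades))  # 0.75
print(fabrication_rate(grades))    # 0.5
```

Because the two scores are computed over disjoint sets of questions, a model can score well on one and poorly on the other, which is why a single combined accuracy figure can hide fabrication.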

The temperature paradox

The temperature setting holds a paradox as well. Temperature 0.0 gave the best accuracy in about 60% of cases, yet coherence loss occurred 48 times more often at temperature 0.0 than at temperature 1.0. For most models, a higher temperature actually reduced fabrication.
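
One practical reading is that a temperature should not be chosen on accuracy alone. The Python sketch below, offered only as an illustration, tracks three metrics per temperature and reports the best setting for each one separately. The `TemperatureRun` record, the `best_temperature` helper, and every number in `runs` are placeholders invented for this example; none of them are results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemperatureRun:
    """Metrics for one temperature setting (hypothetical schema)."""
    temperature: float
    accuracy: float             # higher is better
    fabrication_rate: float     # lower is better
    coherence_loss_rate: float  # lower is better


def best_temperature(runs, metric, higher_is_better):
    """Return the temperature whose run is best on the named metric."""
    choose = max if higher_is_better else min
    return choose(runs, key=lambda run: getattr(run, metric)).temperature


# Placeholder numbers for illustration only; NOT results from the paper.
runs = [
    TemperatureRun(0.0, accuracy=0.90, fabrication_rate=0.08, coherence_loss_rate=0.020),
    TemperatureRun(0.5, accuracy=0.89, fabrication_rate=0.07, coherence_loss_rate=0.005),
    TemperatureRun(1.0, accuracy=0.87, fabrication_rate=0.06, coherence_loss_rate=0.001),
]

print(best_temperature(runs, "accuracy", higher_is_better=True))              # 0.0
print(best_temperature(runs, "fabrication_rate", higher_is_better=False))     # 1.0
print(best_temperature(runs, "coherence_loss_rate", higher_is_better=False))  # 1.0
```

When the three answers disagree, as they do with these placeholder values, the choice of temperature becomes an explicit trade-off rather than a default.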

It was independent of hardware

The results were consistent across all three hardware platforms tested in the study. Hallucination rates were similar on Nvidia H200, AMD MI300X, and Intel Gaudi 3. That means you need not pick specific hardware to reduce hallucination.

What it means: do not blindly trust long-context RAG

The paper shows with numbers that long-document Q&A should not be trusted blindly. The longer the context you stuff in, the more AI fabricates, so feeding only the key evidence, kept short, is safer. In retrieval-augmented generation (RAG), more is not the answer.
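
To make "only the key evidence, kept short" concrete, here is a minimal Python sketch of trimming retrieved passages to a fixed budget before they reach the model. It is an illustration under stated assumptions, not a method from the paper: relevance is approximated by simple word overlap with the question, passage length is counted in words as a stand-in for real tokenizer counts, and `select_evidence` and the sample passages are hypothetical.

```python
import re


def words(text):
    """Lowercased word list; a crude stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def select_evidence(question, passages, word_budget):
    """Keep the passages that overlap most with the question, within a budget."""
    question_words = set(words(question))
    scored = []
    for index, passage in enumerate(passages):
        passage_words = words(passage)
        overlap = len(question_words & set(passage_words))
        if overlap > 0:  # drop passages that share nothing with the question
            scored.append((overlap, index, passage, len(passage_words)))

    # Highest overlap first; ties keep their original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    chosen, used = [], 0
    for _, index, passage, size in scored:
        if used + size <= word_budget:
            chosen.append((index, passage))
            used += size

    chosen.sort()  # restore document order for the prompt
    return [passage for _, passage in chosen]


passages = [
    "The warranty period is two years from the purchase date.",
    "Our office is closed on public holidays.",
    "Warranty claims require the original receipt.",
]
question = "How long is the warranty period?"

print(select_evidence(question, passages, word_budget=16))
```

With a budget of 16 words, the sketch keeps the two warranty passages and drops the one about office hours. A production system would likely replace word overlap with an embedding or reranker score and count tokens with the model's own tokenizer.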

Wrap-up

"How Much Do LLMs Hallucinate in Document Q&A Scenarios?" shows, across 172 billion tokens, that hallucination rises as the context grows longer. Its core findings are that all models exceeded 10% at 200K tokens, that finding facts is a separate ability from not fabricating, and that the results hold regardless of hardware. Long-document RAG should feed only the essentials, kept short.

Source: ASAP summary of "How Much Do LLMs Hallucinate in Document Q&A Scenarios?" (arXiv 2603.08274, March 9, 2026; JV Roig; 35 open-weight models; 172 billion tokens; top models fabricating 5 to 7% at 32K and all models exceeding 10% at 200K; temperature 0.0 most accurate but with 48x more coherence loss; results consistent across H200, MI300X, and Gaudi 3).