Unlimited OCR: A 3B Model That Keeps the KV Cache Constant to Read Dozens of Pages in One Pass
Unlimited OCR is a 3B-parameter OCR model from Baidu researchers that replaces every attention layer in the DeepSeek OCR decoder with Reference Sliding Window Attention (R-SWA), keeping the KV cache at a constant size no matter how long the output grows. The technical report "Unlimited OCR Works" (arXiv 2606.23050, Youyang Yin et al.), released in June 2026, targets a decoder whose memory balloons and whose generation slows as the output lengthens. The resulting model transcribes dozens of pages in a single forward pass under a standard maximum length of 32K tokens. The weights are released under an MIT license.
Why End-to-End OCR Slows Down on Long Outputs
End-to-end OCR uses an LLM as the decoder to exploit its language prior, but this carries a clear downside: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows generation. The DeepSeek OCR family, which exemplifies this approach, is powerful at turning document images into text, yet its cost rises with the amount of text handled in one pass.
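The growth described above is linear in sequence length, which a few lines of arithmetic make concrete. The following is a minimal sketch for a standard full-attention decoder. The layer count, head count, and head dimension are hypothetical values chosen for illustration and are not taken from the report or the model card.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes held by a full-attention KV cache for one sequence.

    Each layer stores one key vector and one value vector per token,
    hence the leading factor of 2. bytes_per_value=2 assumes fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical decoder shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 24, 16, 128

for tokens in (1_024, 8_192, 32_768):
    mib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**20
    print(f"{tokens:>6} tokens -> {mib:,.0f} MiB")
```

Under these assumed dimensions each token adds 192 KiB to the cache, so a 32K-token output holds 32 times the memory of a 1K-token one. Every decoding step also has to read all of it.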
The report contrasts this with human working memory. People show no such efficiency decline during long copying tasks, while a conventional decoder grows steadily slower as tokens accumulate.
What R-SWA Changes: Replacing Decoder Attention Wholesale
Reference Sliding Window Attention (R-SWA) is a new attention mechanism. Taking DeepSeek OCR as the baseline, the authors replace every attention layer in the decoder with it. R-SWA reduces the cost of attention computation while keeping the KV cache constant throughout the entire decoding process.
The core idea is to refuse to let memory grow without bound. Because the cache stays fixed as the output lengthens, the generation-speed curve that used to sag with length flattens out. The design is modeled on the working memory people rely on when parsing a document.
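A toy cache shows how a bounded memory behaves during decoding. This sketch rests on an assumption the article does not confirm: that the "reference" part is a fixed block that is always kept (for example the encoder's image tokens), while generated tokens live in a sliding window that evicts the oldest entry. The class name `BoundedKVCache` and its methods are hypothetical, not part of any released API.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: a fixed reference block plus a sliding window of recent entries."""

    def __init__(self, reference, window):
        # Assumption: reference entries (e.g. image tokens) are never evicted.
        self.reference = list(reference)
        # deque(maxlen=...) drops the oldest entry automatically once full.
        self.recent = deque(maxlen=window)

    def append(self, entry):
        self.recent.append(entry)

    def visible(self):
        """Entries the next decoding step may attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


cache = BoundedKVCache(reference=[f"img{i}" for i in range(4)], window=3)
sizes = []
for step in range(10):
    cache.append(f"tok{step}")
    sizes.append(len(cache))

print(sizes)            # [5, 6, 7, 7, 7, 7, 7, 7, 7, 7]
print(cache.visible())  # ['img0', 'img1', 'img2', 'img3', 'tok7', 'tok8', 'tok9']
```

Once the window fills, the cache size stops changing however many tokens are generated, which is the property the report attributes to R-SWA.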
One Forward Pass, Dozens of Pages, a Standard 32K Length
Unlimited OCR combines the high compression rate of DeepSeek OCR's encoder with R-SWA's constant KV cache to transcribe dozens of pages in a single forward pass under a standard maximum length of 32K tokens. Unlike pipelines that split a long document page by page across many calls, this model reads long stretches in one pass.
The two halves reinforce each other. The encoder compresses each page into a small number of tokens, and the decoder consumes those tokens with constant memory, so efficiency does not collapse as length grows.
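The effect on compute can be estimated by counting how many cached entries each decoding step reads. This is a rough sketch under stated assumptions: the prefix stands in for the encoder's compressed page tokens, the window for the recent outputs kept by the decoder, and all sizes below are hypothetical rather than figures from the report.

```python
def full_attention_reads(n_steps, prefix):
    """Cached entries read in total when step t sees the prefix and all t prior outputs."""
    return sum(prefix + t for t in range(n_steps))


def bounded_attention_reads(n_steps, prefix, window):
    """Same count when each step sees the prefix and at most `window` prior outputs."""
    return sum(prefix + min(t, window) for t in range(n_steps))


# Hypothetical sizes, for illustration only.
PREFIX, WINDOW = 2_000, 512

for n in (1_000, 8_000, 30_000):
    full = full_attention_reads(n, PREFIX)
    bounded = bounded_attention_reads(n, PREFIX, WINDOW)
    print(f"{n:>6} steps: full={full:,}  bounded={bounded:,}  ratio={full / bounded:.1f}x")
```

The full-attention total grows quadratically with output length while the bounded total grows linearly, so the gap widens as more pages are transcribed in one pass.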
Release and Metrics: MIT License and ParseBench
Unlimited OCR is a 3B-parameter model published on Hugging Face under an MIT license, and it runs on Transformers and SGLang. The model card reports an evaluation on LlamaIndex's ParseBench: a mean score of 46.17, with 86.81 on text content and 0.97 on text formatting.
Reading these scores calls for care. A high text-content score alongside a low formatting score shows that reading characters and reconstructing layout, tables, and formatting are different axes. OmniDocBench scores and throughput figures cited by some secondary outlets are not confirmed in the primary sources (the arXiv report and the model card), so they are omitted here.
The Bottom Line
The value of Unlimited OCR lies not in topping a new benchmark but in attacking the structural reason OCR decoders slow down on long outputs with a single fix: holding the KV cache constant. Replacing every decoder attention layer with R-SWA so that dozens of pages can be read in one pass reshapes the cost curve of long-document parsing. That said, the published quantitative metrics center on ParseBench, and generalization to other benchmarks or to large-scale real-world use needs further verification.
Reference: Unlimited OCR Works (Youyang Yin et al., Baidu, 2026, arXiv 2606.23050) · Model card (baidu/Unlimited-OCR, MIT)
