What does Unlimited OCR's R-SWA change?

Reference Sliding Window Attention (R-SWA) takes DeepSeek OCR as the baseline and replaces all of the decoder's attention layers. It reduces attention computation cost while maintaining a constant KV cache throughout decoding, so memory does not balloon and generation does not slow as the output lengthens.

How long a document can it read at once?

By combining DeepSeek OCR's high-compression encoder with R-SWA's constant KV cache, Unlimited OCR transcribes dozens of pages in a single forward pass under a standard maximum length of 32K tokens.

How was it released and how does it perform?

It is a 3B-parameter model published on Hugging Face under an MIT license, and it runs on Transformers and SGLang. The model card reports scores on llamaindex's ParseBench of 46.17 mean, 86.81 on text content, and 0.97 on text formatting. OmniDocBench and throughput figures cited by some secondary outlets are not confirmed in the primary sources, so they are omitted.

Unlimited OCR: A 3B Model That Keeps the KV Cache Constant to Read Dozens of Pages in One Pass

Unlimited OCR is a 3B-parameter OCR model from Baidu researchers that replaces every attention layer in the DeepSeek OCR decoder with Reference Sliding Window Attention (R-SWA), keeping the KV cache at a constant size no matter how long the output grows. The technical report, "Unlimited OCR Works" (arXiv 2606.23050, Youyang Yin et al.), was released in June 2026. It targets a decoder that balloons in memory and slows down as generation lengthens, and the resulting model transcribes dozens of pages in a single forward pass under a standard maximum length of 32K tokens. The weights are released under an MIT license.

Why End-to-End OCR Slows Down on Long Outputs

End-to-end OCR uses an LLM as the decoder to leverage the prior distribution of language, but it carries a clear downside: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows generation. The DeepSeek OCR family that exemplifies this approach is powerful at turning document images into text, yet costs rise as the amount handled in one pass grows.

The report contrasts this with human working memory. People show no such efficiency decline during long copying tasks, while a conventional decoder grows steadily slower as tokens accumulate.
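The growth described above can be made concrete with a little arithmetic. The sketch below estimates the size of a conventional KV cache as a function of sequence length. The layer count, head count, and head dimension are hypothetical values chosen for illustration, not the published configuration of DeepSeek OCR or Unlimited OCR.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder shape, for illustration only (16-bit values).
config = dict(num_layers=12, num_kv_heads=8, head_dim=128)

for tokens in (1_000, 8_000, 32_000):
    mib = kv_cache_bytes(tokens, **config) / 2**20
    print(f"{tokens:>6} tokens -> {mib:8.1f} MiB")
```

Whatever the real dimensions are, the cache size is linear in the number of cached tokens, which is why a decoder that keeps every past token pays more memory, and more attention work per step, the longer it writes.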

What R-SWA Changes: Replacing Decoder Attention Wholesale

Reference Sliding Window Attention (R-SWA) is a new attention mechanism that takes DeepSeek OCR as the baseline and replaces all of the decoder's attention layers. R-SWA reduces attention computation cost while maintaining a constant KV cache throughout the entire decoding process.

The core idea is to refuse to let memory grow without bound. Because the cache stays fixed as the output lengthens, generation speed no longer degrades with length. The design imitates the working memory people rely on while parsing and copying text.
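The article does not spell out the internals of R-SWA, so the following is only a toy sketch of how a cache can stay bounded. It assumes, going by the name alone, that a fixed set of "reference" entries is always kept and that generated tokens live in a sliding window. The class `BoundedKVCache` and its parameters are hypothetical and are not taken from the report or the model code.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: a fixed reference segment plus a sliding window of recent tokens."""

    def __init__(self, reference_tokens, window_size):
        # Assumption: reference entries (e.g. encoded page tokens) are never evicted.
        self.reference = list(reference_tokens)
        # A deque with maxlen drops its oldest entry automatically once full.
        self.window = deque(maxlen=window_size)

    def append(self, token_kv):
        self.window.append(token_kv)

    def visible(self):
        """Entries a newly generated token may attend to."""
        return self.reference + list(self.window)

    def __len__(self):
        return len(self.reference) + len(self.window)


cache = BoundedKVCache(reference_tokens=range(256), window_size=128)
sizes = []
for step in range(10_000):
    cache.append(("k", "v", step))
    sizes.append(len(cache))

print(max(sizes))  # prints 384: 256 reference entries + 128 window entries
```

After 10,000 generated tokens the cache still holds 384 entries, which is the property the report describes: per-step memory and attention cost stop depending on how much has already been written.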

One Forward Pass, Dozens of Pages, a Standard 32K Length

Unlimited OCR combines the high compression rate of DeepSeek OCR's encoder with R-SWA's constant KV cache to transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K tokens. Unlike pipelines that split long documents page by page across many calls, this model is defined by reading long stretches in one pass.

The two parts are complementary. The encoder compresses each page into few tokens, and the decoder consumes those tokens with constant memory, so efficiency does not collapse as length grows.
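A rough budget shows why heavy compression matters under a 32K limit. The sketch below assumes that both the encoded page tokens and the transcribed text count against the maximum length, that 32K means 32,768 tokens, and that a page costs 256 vision tokens and about 800 text tokens. All of these numbers are illustrative assumptions, not figures from the report.

```python
def pages_per_pass(max_length, vision_tokens_per_page, text_tokens_per_page):
    """How many whole pages fit in one pass under a total token limit."""
    per_page = vision_tokens_per_page + text_tokens_per_page
    return max_length // per_page


# Hypothetical per-page costs, for illustration only.
print(pages_per_pass(32_768, vision_tokens_per_page=256, text_tokens_per_page=800))  # prints 31
```

Under these assumptions a single pass covers roughly thirty pages. Fewer vision tokens per page or sparser text raises the count, and denser pages lower it.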

Release and Metrics: MIT License and ParseBench

Unlimited OCR is a 3B-parameter model published on Hugging Face under an MIT license and runs on Transformers and SGLang. The model card reports evaluation on llamaindex's ParseBench, scoring 46.17 mean, 86.81 on text content, and 0.97 on text formatting.

Reading these scores calls for care. A high text-content score alongside a low formatting score shows that reading characters and reconstructing layout, tables, and formatting are different axes. OmniDocBench scores and throughput figures cited by some secondary outlets are not confirmed in the primary sources (the arXiv report and the model card), so they are omitted here.

The Bottom Line

The value of Unlimited OCR lies not in topping a new benchmark but in targeting the structural cause of OCR decoders slowing on long outputs with a single fix: holding the KV cache constant. Replacing every decoder attention layer with R-SWA to read dozens of pages in one pass reshapes the cost curve of long-document parsing. That said, the published quantitative metrics are ParseBench-centric, and generalization to other benchmarks or large-scale real use needs further verification.

Reference: Unlimited OCR Works (Youyang Yin et al., Baidu, 2026, arXiv 2606.23050) · Model card (baidu/Unlimited-OCR, MIT)
