Rethinking RL for LLM Reasoning: Its Real Job Is Selecting Answers, Not Learning New Capabilities

The widespread belief that reinforcement learning with verifiable rewards (RLVR) teaches LLMs new reasoning capabilities is challenged by token-level analysis. In a May 2026 paper, a USC and DEVCOM ARL team led by Ömer Faruk Akgül, Willie Neiswanger, and Viktor Prasanna argues that RLVR does not teach new strategies; it redistributes probability mass over solutions the base model already contains, and the beneficial changes concentrate on the 1–3% of tokens that are high-entropy "fork" positions. This article lays out that evidence and its implications in a direct-answer format.

What RLVR Actually Changes: Distribution, Not Capability

Reinforcement learning with verifiable rewards (RLVR) does not inject new capability; it reshuffles a distribution the base model already has. RLVR is the dominant paradigm for improving LLM reasoning as of 2026, yet the USC team's token-level analysis shows that RL does not create strategies the base model lacked.

The core word is "redistribution." RL merely shifts probability mass onto solutions the base model already contains, and after training the promoted token almost always lies within the base model's top-5 alternatives. In other words, the correct answer was inside the model from the start, and RL simply selected and elevated it.
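The top-5 claim can be checked position by position. The following is a minimal sketch, not the authors' code: it assumes you already have next-token logits from a base model and from an RL-trained model at the same position, and the function names and toy logits are hypothetical. It treats the RL model's most likely token as the "promoted" token and asks where that token ranks under the base model.

```python
def rank_in_base(base_logits, token_id):
    """1-based rank of token_id under the base logits (1 = most likely)."""
    target = base_logits[token_id]
    return 1 + sum(1 for x in base_logits if x > target)

def promoted_token_in_top_k(base_logits, rl_logits, k=5):
    """Return (promoted_token, base_rank, inside_top_k) for one position."""
    # The promoted token is taken to be the RL model's argmax (an assumption).
    promoted = max(range(len(rl_logits)), key=lambda i: rl_logits[i])
    rank = rank_in_base(base_logits, promoted)
    return promoted, rank, rank <= k

# Hypothetical toy logits over a 6-token vocabulary.
base = [2.0, 1.8, 0.5, -1.0, -2.0, -3.0]
rl = [1.0, 3.0, 0.2, -1.0, -2.0, -3.0]

print(promoted_token_in_top_k(base, rl))  # (1, 2, True)
```

Here RL moved the top choice from token 0 to token 1, but token 1 was already the base model's second choice, which is the pattern the article describes.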

Why This Finding Matters: The Learning Story Flips

This result matters less as an optimization tip than as a reversal of the prevailing narrative. The field has long believed that RL makes a model "smarter," and on that premise it justified the enormous cost of online rollouts. But if the promoted token almost always sits inside the base model's top-5, then what RL does looks less like expanding intelligence and more like a selection step that funnels probability mass toward one of the choices that already exist.

This lens also changes how a benchmark bump should be read. A higher score does not mean a new capability appeared; it means the model now surfaces, more often, an answer it could already produce. Once "capability" is separated from "accessibility," RL can be reframed not as capacity building but as a decoding-policy problem.
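One way to make the capability/accessibility distinction concrete is a pass@k calculation. This is an illustration rather than a result from the paper: it assumes each sample succeeds independently with probability p, and the value p = 0.10 is hypothetical.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

# Hypothetical single-sample success rate for a base model on one problem.
p_base = 0.10

for k in (1, 4, 16):
    print(k, round(pass_at_k(p_base, k), 3))
```

Under these assumptions the answer is rarely hit by a single sample but is very likely to appear somewhere in 16 samples. Raising the single-sample rate changes how accessible the answer is, not whether the model can produce it.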

Where the Effect Concentrates: 1–3% Fork Tokens

RL's benefit is confined to a small set of positions that are only 1–3% of all tokens. Across multiple model families and RL algorithms, the researchers found that RL's beneficial changes cluster sparsely at high-entropy decision points where the model is uncertain which branch to take.

These fork tokens have three defining traits.

Sparsity: changes occur at only 1–3% of token positions.
Predictability: the promoted token almost always lies within the base model's top-5.
Causality: correcting just those few positions recovers a large fraction of RL's accuracy gain.
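The sparsity trait suggests a simple way to locate candidate fork positions from the base model alone. The sketch below is an assumption-laden illustration, not the paper's procedure: it takes per-position next-token probability distributions, computes Shannon entropy in nats, and keeps the highest-entropy fraction of positions. The function names, the 2% fraction, and the toy distributions are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fork_positions(per_position_probs, fraction=0.03):
    """Indices of the highest-entropy positions, keeping about `fraction` of them."""
    ents = [entropy(p) for p in per_position_probs]
    n_keep = max(1, round(fraction * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:n_keep]), ents

# Toy sequence: 100 positions, two of which are genuinely uncertain.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.40, 0.30, 0.20, 0.10]
sequence = [confident] * 100
sequence[10] = uncertain
sequence[57] = uncertain

forks, ents = fork_positions(sequence, fraction=0.02)
print(forks)  # [10, 57]
```

A fixed fraction is only one possible selection rule; an absolute entropy threshold would serve the same purpose.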

ReasonMaxxer: Matching RL Without RL

ReasonMaxxer is a contrastive method that touches only fork tokens and matches full RL without any reinforcement learning. The base model's own entropy identifies the positions that need correction, so no RL-trained model is required: the method runs on a few hundred base-model rollouts plus entropy gating, with no online generation during training.

The cost difference is decisive. ReasonMaxxer rivals full RL with only tens of problems and a few minutes of single-GPU training, cutting training cost by roughly three orders of magnitude (about 1,000x). The correction is low-dimensional, representable in a tiny fraction of model parameters, which further supports the conclusion that this is policy selection, not capability learning.
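The idea of entropy gating can be sketched as follows. The article does not specify the form of ReasonMaxxer's correction, so this is only an illustration of gating in general: a hypothetical additive logit correction `delta` is applied where the base model's entropy exceeds a hypothetical threshold, and the base logits pass through unchanged everywhere else.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_logits(base_logits, delta, threshold):
    """Apply delta only at high-entropy positions; otherwise keep base logits."""
    if entropy(softmax(base_logits)) < threshold:
        return list(base_logits)
    return [b + d for b, d in zip(base_logits, delta)]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical values: one confident position, one fork position.
confident = [6.0, 0.0, 0.0, 0.0]
fork = [1.0, 0.9, 0.8, 0.0]
delta = [0.0, 0.5, 0.0, 0.0]   # small nudge toward token 1
threshold = 1.0                # nats

print(argmax(gated_logits(confident, delta, threshold)))  # 0 (untouched)
print(argmax(gated_logits(fork, delta, threshold)))       # 1 (promoted)
```

In this toy case the correction only promotes a token that was already a close alternative under the base model, and it leaves the confident position exactly as it was.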

How Full RL and ReasonMaxxer Differ

Full RLVR and ReasonMaxxer are sharply different in cost and mechanism. They target the same reasoning improvement, but what they "learn" is fundamentally different.

Aspect	Full RLVR	ReasonMaxxer
Mechanism	Reward-based policy optimization	Contrastive correction at fork tokens
Target tokens	Entire sequence	High-entropy 1–3%
Online generation	Required	Not required
Training cost	Baseline	~1,000x cheaper
New capability learning	Assumed by convention	Unnecessary (already in top-5)

How to Read the "1,000x" Number

The roughly 1,000x saving should be read in context to avoid overstating it. That figure comes from a specific setup (tens of problems and a few minutes on a single GPU), and savings of this scale are achievable because online rollouts are removed entirely. So it reflects less a three-order-of-magnitude jump in algorithmic efficiency than evidence that RL was already spending most of its compute re-confirming answers the model already knew.

For a small lab or startup, this number lands especially hard. Even without a large GPU cluster, there is now room to lift reasoning performance using only a base model, a few hundred rollouts, and entropy gating. For small teams that saw online RL infrastructure as the barrier to entry, the low-cost path becomes a real option.

Implications: What You See Once You Drop the "Capability Myth"

The implication of the USC study is a redefinition of RL as sparse policy selection rather than capability injection. If the essence of reasoning improvement is selecting answers the base model already holds, then a low-cost path opens: correcting fork points directly instead of running expensive online RL.

One caveat remains. This conclusion was observed on reasoning tasks with verifiable rewards and on specific model families, so it cannot be assumed to generalize to every domain. Even so, it is significant because it uses token-level evidence to shake the dominant 2026 assumption that "RL teaches new capabilities."

Reference: Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning (Akgül et al., 2026)