
Rethinking RL for LLM Reasoning: Its Real Job Is Selecting Answers, Not Learning New Capabilities

2026-06-19 · 3 min read

The widespread belief that reinforcement learning with verifiable rewards (RLVR) teaches LLMs new reasoning capabilities is contradicted by token-level analysis. In a May 2026 paper, a USC and DEVCOM ARL team led by Ömer Faruk Akgül, Willie Neiswanger, and Viktor Prasanna reports that RLVR does not teach new strategies. Instead, it redistributes probability mass over solutions the base model already contains, and its benefit concentrates on the 1–3% of tokens that are high-entropy "fork" positions. This article lays out that evidence and its implications in a direct-answer format.

What RLVR Actually Changes: Distribution, Not Capability

RLVR does not inject new capability; it reshuffles an existing distribution. Reinforcement learning with verifiable rewards (RLVR) is the dominant paradigm for improving LLM reasoning as of 2026, yet the USC team's token-level analysis shows that it does not create strategies the base model lacked.

The key word is "redistribution." RL shifts probability mass onto solutions the base model already contains: at the positions where training changes the model's choice, the promoted token almost always lies within the base model's top-5 alternatives. In other words, the winning continuation was already among the base model's likely candidates, and RL selected and elevated it.
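The top-5 claim can be checked position by position. The following is a minimal sketch, not the paper's evaluation code: the helper names (`softmax`, `top_k_ids`, `promoted_within_top_k`) and the toy logits are assumptions made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_ids(probs, k):
    """Indices of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def promoted_within_top_k(base_logits, rl_logits, k=5):
    """True if the RL model's most likely token is in the base model's top-k."""
    base_probs = softmax(base_logits)
    rl_probs = softmax(rl_logits)
    promoted = max(range(len(rl_probs)), key=lambda i: rl_probs[i])
    return promoted in top_k_ids(base_probs, k)

# Toy 8-token vocabulary with made-up logits (hypothetical data).
# The base model slightly prefers token 0; the RL model prefers token 1.
base_logits = [2.0, 1.8, 0.5, 0.1, -1.0, -2.0, -3.0, -4.0]
rl_logits = [1.0, 3.5, 0.5, 0.1, -1.0, -2.0, -3.0, -4.0]

print(promoted_within_top_k(base_logits, rl_logits))
```

Here the RL model promotes token 1, which was the base model's second choice, so the check passes. A token the base model ranked eighth would fail it.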

Where the Effect Concentrates: 1–3% Fork Tokens

RL's benefit is confined to a small set of positions, only 1–3% of all tokens. Across multiple model families and RL algorithms, the researchers found that RL's beneficial changes cluster sparsely at high-entropy decision points, where the model is uncertain which branch to take.
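"High-entropy" here can be read as the standard Shannon entropy of the next-token distribution. The notation below is an assumption for illustration (the symbols p_θ, V, T, and the threshold τ do not appear in this article), not the paper's exact definition.

```latex
% Entropy of the next-token distribution at position t, over vocabulary V
H_t = -\sum_{v \in \mathcal{V}} p_\theta(v \mid x_{<t}) \, \log p_\theta(v \mid x_{<t}),
\qquad 0 \le H_t \le \log |\mathcal{V}|

% Fork positions: those whose entropy exceeds a chosen threshold \tau
F_\tau = \{\, t \in \{1, \dots, T\} : H_t > \tau \,\}

% Sparsity as reported in the article: forks are about 1--3% of positions
\frac{|F_\tau|}{T} \approx 0.01 \text{ to } 0.03
```

A position where one token has nearly all the probability gives H_t close to 0. A position where several continuations are about equally likely gives a large H_t, and that is what marks a fork.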

These fork tokens have three defining traits.

  1. Sparsity: changes occur at only 1–3% of token positions.
  2. Predictability: the promoted token almost always lies within the base model's top-5.
  3. Causality: correcting just those few positions recovers a large fraction of RL's accuracy gain.
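The sparsity trait above can be measured directly from per-position token distributions. This is a minimal sketch on made-up data: the function names, the threshold of 1.0 nats, and the toy distributions are all assumptions, not values from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fork_positions(step_probs, threshold):
    """Positions whose entropy exceeds the threshold."""
    return [t for t, probs in enumerate(step_probs) if entropy(probs) > threshold]

def sparsity(step_probs, threshold):
    """Fraction of positions flagged as forks."""
    return len(fork_positions(step_probs, threshold)) / len(step_probs)

# Hypothetical 100-token sequence: the model is confident almost everywhere
# and uncertain at two positions.
confident = [0.97, 0.01, 0.01, 0.01]   # entropy about 0.17 nats
uncertain = [0.40, 0.30, 0.20, 0.10]   # entropy about 1.28 nats
step_probs = [confident] * 100
step_probs[10] = uncertain
step_probs[57] = uncertain

print(fork_positions(step_probs, threshold=1.0))
print(sparsity(step_probs, threshold=1.0))
```

In this toy sequence, two of 100 positions are flagged, which falls inside the 1–3% range the article reports. In practice the threshold would have to be tuned per model, since entropy scales with vocabulary size.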

ReasonMaxxer: Matching RL Without RL

ReasonMaxxer is a contrastive method that touches only fork tokens and matches full RL without any reinforcement learning. The base model's own entropy identifies the positions that need correction, so no RL-trained model is required. The method runs on a few hundred base-model rollouts with entropy gating and needs no online generation.
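To make "entropy gating" plus a contrastive signal concrete, here is a minimal sketch. It is not ReasonMaxxer's actual objective, which this article does not specify. It assumes a simple log-probability margin between the token taken in a correct rollout and the token taken in an incorrect one, and every name, threshold, and number in it is hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_contrastive_loss(step_probs, good_ids, bad_ids, threshold):
    """Mean of log p(bad) - log p(good), counted only at fork positions.

    step_probs: per-position next-token distributions from the base model
    good_ids:   token chosen at each position in a correct rollout
    bad_ids:    token chosen at each position in an incorrect rollout
    Lower values mean the correct branch is already preferred.
    """
    terms = []
    for probs, good, bad in zip(step_probs, good_ids, bad_ids):
        if entropy(probs) <= threshold:
            continue  # gate: skip positions where the model is confident
        if good == bad:
            continue  # rollouts agree here, so there is nothing to contrast
        terms.append(math.log(probs[bad]) - math.log(probs[good]))
    return sum(terms) / len(terms) if terms else 0.0

# Hypothetical two-position example: one confident step, one fork.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident: gated out
    [0.40, 0.30, 0.20, 0.10],  # fork: contributes to the loss
]
good_ids = [0, 1]  # the correct rollout took token 1 at the fork
bad_ids = [1, 0]   # the incorrect rollout took token 0 at the fork

print(gated_contrastive_loss(step_probs, good_ids, bad_ids, threshold=1.0))
```

Only the second position contributes. The loss is positive because the base model currently prefers the token from the incorrect rollout, so minimizing it would shift mass toward the correct branch at that fork and leave confident positions untouched.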

The cost difference is decisive. ReasonMaxxer rivals full RL with only tens of problems and a few minutes of single-GPU training, cutting training cost by roughly three orders of magnitude (about 1,000x). The correction is low-dimensional, representable in a tiny fraction of model parameters, which further supports the conclusion that this is policy selection, not capability learning.

How Full RL and ReasonMaxxer Differ

Full RLVR and ReasonMaxxer are sharply different in cost and mechanism. They target the same reasoning improvement, but what they "learn" is fundamentally different.

| Aspect | Full RLVR | ReasonMaxxer |
|---|---|---|
| Mechanism | Reward-based policy optimization | Contrastive correction at fork tokens |
| Target tokens | Entire sequence | High-entropy 1–3% |
| Online generation | Required | Not required |
| Training cost | Baseline | ~1,000x cheaper |
| New capability learning | Assumed by convention | Unnecessary (already in top-5) |

Implications: What You See Once You Drop the "Capability Myth"

The implication of the USC study is a redefinition of RL as sparse policy selection rather than capability injection. If the essence of reasoning improvement is selecting answers the base model already holds, then a low-cost path opens: correcting fork points directly instead of running expensive online RL.

One caveat remains. This conclusion was observed on reasoning tasks with verifiable rewards and on specific model families, so it cannot be assumed to generalize to every domain. Even so, the study is significant because it brings token-level evidence against the dominant 2026 assumption that "RL teaches new capabilities."


Reference: Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning (Akgül et al., 2026)
