ASAPAi Soon As Possible · AI & tech, delivered fastest

VibeThinker-3B: Weibo's 3-Billion-Parameter Model Scores 94.3 on AIME 2026

ASAP
2026-06-24 · 3 min read

VibeThinker-3B is a 3-billion-parameter reasoning model from Weibo (Sina Weibo) that scores 94.3 on the AIME 2026 math benchmark. The accompanying technical report, released in June 2026 as arXiv 2606.16140, argues that a "Spectrum-to-Signal" post-training pipeline can elicit large-model-level verifiable reasoning from a small model. The authors report that the model matches or exceeds flagship models hundreds of times larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro, on math and coding reasoning.

What VibeThinker-3B Trains and How

VibeThinker-3B pairs a 3-billion-parameter model with a post-training pipeline called "Spectrum-to-Signal." The pipeline runs in three stages. First, curriculum-based supervised fine-tuning (SFT) raises problem difficulty gradually so that the model acquires a broad, diverse range of solutions.

Next, multi-domain reinforcement learning (RL) across math, coding, and other domains amplifies the verifiably correct answers among the diverse candidate solutions that the SFT stage produces. The final stage, offline self-distillation, compresses the strengthened ability back into the model. The authors frame the underlying intuition as a "Parametric Compression-Coverage Hypothesis," arguing that small models benefit from widening solution coverage first and then narrowing toward correct answers.
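To make "verifiable correct-answer signal" concrete, here is a minimal sketch of the kind of binary reward an RL stage on math problems could use. It is an illustration only, not the paper's implementation: the function names, the \boxed{...} answer convention, and the sample completions are all assumptions.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the completion, if any.

    Assumption: the model marks its final answer with \\boxed{...}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no checkable final answer, so no reward
    return 1.0 if answer == reference.strip() else 0.0


# Hypothetical candidate solutions to one problem whose reference answer is 204.
candidates = [
    "... so the total is \\boxed{204}",
    "... therefore \\boxed{210}",
    "I think the answer is 204",  # correct value, but not in the checkable format
]
rewards = [verifiable_reward(c, "204") for c in candidates]
print(rewards)
```

Only the first candidate earns a reward here. The point of the sketch is that the signal comes from an automatic check rather than a human or model judgment, which is why this style of training fits domains like math and code where answers can be graded.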

Benchmark Results: AIME 94.3, LiveCodeBench 80.2

VibeThinker-3B's headline numbers come from math competitions and coding tasks: 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6. The main reported scores are as follows.

  1. AIME 2026 math: 94.3 (97.1 with claim-level test-time scaling).
  2. AIME 2025: 91.4, and HMMT 2025: 89.3.
  3. LiveCodeBench v6: 80.2 Pass@1.
  4. Unseen LeetCode weekly and biweekly contests from late April to late May 2026: 96.1% acceptance rate.
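For readers unfamiliar with the metric, Pass@1 is the chance that a single sampled solution is correct, averaged over problems. The sketch below uses the widely used unbiased pass@k estimator from code-generation evaluation; the article does not say how the paper computed its figures, so the sampling setup and the data here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems, expressed as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three problems, 8 samples each, with 8, 6, and 0 correct.
results = [(8, 8), (8, 6), (8, 0)]
score = benchmark_pass_at_k(results, k=1)
print(f"Pass@1 = {score:.1f}")
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so a benchmark score need not correspond to a whole number of solved problems when several samples are drawn per problem.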

On IFEval, which measures instruction following, VibeThinker-3B scored 93.4. These figures show a 3-billion-parameter model approaching large-model results in verifiable domains such as math and code.

The "On Par With Giants" Claim and the Benchmark Debate

VibeThinker-3B's most provocative element is its performance-per-parameter claim. The paper's abstract states that the model matches or exceeds flagship models hundreds of times larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Some secondary outlets stretched the comparison to include Claude Opus 4.5, running headlines like "a 3B model that beats Opus."

This comparison reignited a benchmark debate. VentureBeat relayed criticism that a small model's benchmark edge is confined to narrow math and coding tasks like AIME and LiveCodeBench, and that heavy post-training optimization on those same tasks can widen the gap between benchmark scores and general capability. So "beating the giants" is accurate only as a result on specific reasoning benchmarks, not as parity across all domains.

Why It Matters: Small, Verifiable, Open-Weight Reasoning

VibeThinker-3B's significance lies in the direction it points rather than in any single top score: small models can be pushed toward stronger verifiable reasoning. As of 2026, reasoning performance is still largely assumed to scale with parameter count and training budget, so a 3-billion-parameter model reaching 94.3 on AIME challenges that assumption.

The practical implication is that post-training built on verifiable rewards can sharply raise a small model's reasoning. That said, the reported strengths concentrate in domains where answers can be graded, such as math and code, and generalization to open-ended dialogue, knowledge, and safety needs further verification.

Summary

VibeThinker-3B scores 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6 at just 3 billion parameters, raising the bar for verifiable reasoning in small models through "Spectrum-to-Signal" post-training. The paper claims parity with DeepSeek V3.2, GLM-5, and Gemini 3 Pro on math and coding, but that holds for narrow reasoning benchmarks and should not be generalized to all-domain parity.


Reference: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models (Sen Xu et al., Weibo, 2026)


Ai Soon As Possible · asapai.co.kr
