VibeThinker-3B: Weibo's 3-Billion-Parameter Model Scores 94.3 on AIME 2026
VibeThinker-3B is a 3-billion-parameter reasoning model from Weibo (Sina Weibo) that scores 94.3 on the AIME 2026 math benchmark. The technical report, released in June 2026 as arXiv 2606.16140, argues that a "Spectrum-to-Signal" post-training pipeline can elicit from a small model the kind of verifiable reasoning usually associated with large ones. The authors report that the model matches or exceeds flagship models hundreds of times larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro, on math and coding reasoning.
What VibeThinker-3B Trains and How
VibeThinker-3B combines a 3-billion-parameter model with a post-training pipeline called "Spectrum-to-Signal," which runs in three stages. First, curriculum-based supervised fine-tuning (SFT) raises difficulty gradually so that the model learns a broad, diverse set of solution strategies.
Next, multi-domain reinforcement learning (RL) across math, coding, and other domains strengthens the signal from verifiably correct answers among the diverse candidate solutions that the SFT stage produced. The final stage, offline self-distillation, compresses the strengthened ability back into the model. The authors frame the underlying intuition as a "Parametric Compression-Coverage Hypothesis," arguing that small models benefit from widening solution coverage first and then narrowing toward correct answers.
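To make the idea of a verifiable correct-answer signal concrete, here is a minimal sketch. It is not the paper's implementation. It assumes each final answer is a single number, and the helper names (parse_number, verifiable_reward, keep_verified) are hypothetical. A binary reward marks each sampled answer as right or wrong, and a filter keeps only the verified samples, which is one common way to collect data for a later distillation step. The report's actual reward design and filtering rules are not described in this article.

```python
from fractions import Fraction
from typing import Optional


def parse_number(text: str) -> Optional[Fraction]:
    """Parse a final answer such as '42', '0.5' or '3/4' into an exact fraction."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        # Not a number (or a zero denominator): treat as unparseable.
        return None


def verifiable_reward(candidate: str, reference: str) -> float:
    """Binary reward: 1.0 if the candidate equals the reference exactly, else 0.0."""
    got, want = parse_number(candidate), parse_number(reference)
    return 1.0 if got is not None and got == want else 0.0


def keep_verified(samples: list[str], reference: str) -> list[str]:
    """Keep only samples whose answers verify (a rejection-sampling style filter)."""
    return [s for s in samples if verifiable_reward(s, reference) == 1.0]
```

For example, keep_verified(["70", "71", " 70 "], "70") keeps the first and third samples and discards the second.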
Benchmark Results: AIME 94.3, LiveCodeBench 80.2
VibeThinker-3B's headline numbers concentrate on math competitions such as AIME 2026 and coding tasks such as LiveCodeBench v6, where it scores 94.3 and 80.2 Pass@1, respectively. The main reported scores are as follows.
- AIME 2026 math: 94.3 (97.1 with claim-level test-time scaling).
- AIME 2025: 91.4, and HMMT 2025: 89.3.
- LiveCodeBench v6: 80.2 Pass@1.
- Unseen LeetCode weekly and biweekly contests from late April to late May 2026: 96.1% acceptance rate.
On IFEval, which measures instruction following, VibeThinker-3B scores 93.4. Taken at face value, these reported figures show a 3-billion-parameter model approaching large-model results in verifiable domains such as math and code.
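The article does not say how the Pass@1 figures were estimated. A common convention, assumed here rather than taken from the report, is the unbiased pass@k estimator introduced with the HumanEval benchmark: draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). For k = 1 this reduces to c / n. The function names below are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    if not counts:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
```

With 10 samples and 3 correct, pass_at_k(10, 3, 1) is approximately 0.3, matching the plain fraction c / n. Averaging the per-problem values gives the benchmark-level score.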
The "On Par With Giants" Claim and the Benchmark Debate
VibeThinker-3B's most provocative element is its performance-per-parameter claim. The paper's abstract states that the model matches or exceeds flagship models hundreds of times larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Some secondary outlets stretched the comparison to include Claude Opus 4.5, running headlines like "a 3B model that beats Opus."
This comparison reignited a benchmark debate. VentureBeat relayed criticism that a small model's benchmark edge is confined to narrow math and coding tasks like AIME and LiveCodeBench, and that heavy post-training optimization on those same tasks can widen the gap between benchmark scores and general capability. So "beating the giants" is most accurately read as a result on specific reasoning benchmarks, not parity across all domains.
Why It Matters: Small, Verifiable, Open-Weight Reasoning
VibeThinker-3B's significance lies in the direction it points, not in a single top score: small models can be pushed toward stronger verifiable reasoning. Reasoning performance has largely been assumed to scale with parameter count and training budget, so a 3-billion-parameter model reaching 94.3 on AIME challenges that assumption.
The practical implication is that post-training built on verifiable rewards can sharply raise a small model's reasoning. That said, the reported strengths concentrate in domains where answers can be graded, such as math and code, and generalization to open-ended dialogue, knowledge, and safety needs further verification.
Summary
VibeThinker-3B scores 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6 at just 3 billion parameters, raising the bar for verifiable reasoning in small models through "Spectrum-to-Signal" post-training. The paper claims parity with DeepSeek V3.2, GLM-5, and Gemini 3 Pro on math and coding, but that holds for narrow reasoning benchmarks and should not be generalized to all-domain parity.
