What is VibeThinker-3B?

It is a 3-billion-parameter reasoning model from Weibo (Sina Weibo), released in June 2026 as arXiv 2606.16140. A 'Spectrum-to-Signal' post-training pipeline (curriculum SFT, multi-domain RL, offline self-distillation) elicits verifiable reasoning from the small model, which scores 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6.

Does it really beat much larger models?

The abstract claims parity with or superiority to DeepSeek V3.2, GLM-5, and Gemini 3 Pro on math and coding reasoning. But this holds for narrow benchmarks like AIME and LiveCodeBench, not for general dialogue, knowledge, or safety. Some secondary outlets extended the comparison to Claude Opus 4.5, reigniting a benchmark debate.

What are the main benchmark scores?

AIME 2026: 94.3 (97.1 with test-time scaling), AIME 2025: 91.4, HMMT 2025: 89.3, LiveCodeBench v6: 80.2 Pass@1, IFEval: 93.4, and a 96.1% acceptance rate on unseen LeetCode contests.

VibeThinker-3B: Weibo's 3-Billion-Parameter Model Scores 94.3 on AIME 2026

VibeThinker-3B is a 3-billion-parameter reasoning model from Weibo (Sina Weibo) that scores 94.3 on the AIME 2026 math benchmark. The accompanying technical report, released in June 2026 as arXiv 2606.16140, argues that a "Spectrum-to-Signal" post-training pipeline can elicit large-model-level verifiable reasoning from a small model. The authors report that the same model matches or exceeds flagship models hundreds of times larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro, on math and coding reasoning.

What VibeThinker-3B Trains and How

VibeThinker-3B pairs a small 3-billion-parameter size with a post-training pipeline called "Spectrum-to-Signal." The pipeline runs in three stages. First, curriculum-based supervised fine-tuning (SFT) raises difficulty gradually to secure broad solution diversity.

Next, multi-domain reinforcement learning (RL) across math, coding, and other domains amplifies the verifiable correct-answer signal among the diverse candidate solutions produced by the first stage. The final stage, offline self-distillation, compresses the strengthened ability back into the model. The authors frame the underlying intuition as a "Parametric Compression-Coverage Hypothesis," arguing that small models benefit from widening solution coverage first and then narrowing toward correct answers.

Benchmark Results: AIME 94.3, LiveCodeBench 80.2

VibeThinker-3B's headline numbers concentrate on math competitions and coding tasks: it scores 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6. The main reported scores are as follows.

AIME 2026 math: 94.3 (97.1 with claim-level test-time scaling).
AIME 2025: 91.4, and HMMT 2025: 89.3.
LiveCodeBench v6: 80.2 Pass@1.
Unseen LeetCode weekly and biweekly contests from late April to late May 2026: 96.1% acceptance rate.

On IFEval, which measures instruction following, VibeThinker-3B scored 93.4. These figures show a 3-billion-parameter model approaching large-model results in verifiable domains such as math and code.
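For readers unfamiliar with the Pass@1 figures above, the metric is commonly computed with the unbiased pass@k estimator introduced alongside the HumanEval benchmark. The Python sketch below is a minimal illustration under stated assumptions. It is not taken from the VibeThinker-3B paper, whose exact evaluation protocol (samples per problem, decoding settings) is not described here, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimate for one problem:
    # n = samples generated, c = samples that passed, k = attempt budget.
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    # Average the per-problem estimate over (n, c) pairs, one pair per problem.
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of sampled solutions that pass, averaged over problems. A reported Pass@1 of 80.2 is therefore an average single-attempt success rate, not a best-of-many figure.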

The "On Par With Giants" Claim and the Benchmark Debate

VibeThinker-3B's most provocative element is its performance-per-parameter claim. The paper's abstract states that the model matches or exceeds flagship models hundreds of times larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Some secondary outlets stretched the comparison to include Claude Opus 4.5, running headlines like "a 3B model that beats Opus."

This comparison reignited a benchmark debate. VentureBeat relayed criticism that a small model's benchmark edge is confined to narrow math and coding tasks like AIME and LiveCodeBench, and that heavy post-training optimization on those same tasks can widen the gap between benchmark scores and general capability. So "beating the giants" is most accurately read as a result on specific reasoning benchmarks, not parity across all domains.

Why It Matters: Small, Verifiable, Open-Weight Reasoning

VibeThinker-3B's significance lies in the direction it points rather than in any single top score: small models can be pushed toward stronger verifiable reasoning. As of 2026, reasoning performance is largely assumed to scale with parameter count and training budget, so a 3-billion-parameter model reaching 94.3 on AIME challenges that assumption.

The practical implication is that post-training built on verifiable rewards can sharply raise a small model's reasoning performance. That said, the reported strengths concentrate in domains where answers can be graded, such as math and code, and generalization to open-ended dialogue, knowledge, and safety needs further verification.
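What a "verifiable reward" means in practice can be illustrated with a toy grader for integer-answer math problems such as AIME. The sketch below is an assumption-laden example, not the paper's reward function: the \boxed{...} answer convention, the function names, and the binary 0/1 reward are all hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Assumed convention: the model writes its final integer answer inside \boxed{...}.
_BOXED_INT = re.compile(r"\\boxed\{\s*(-?\d+)\s*\}")


def extract_final_answer(completion: str) -> Optional[int]:
    # Take the last boxed integer so earlier intermediate boxes are ignored.
    matches = _BOXED_INT.findall(completion)
    return int(matches[-1]) if matches else None


def verifiable_reward(completion: str, reference: int) -> float:
    # Binary reward: 1.0 only when a boxed answer exists and matches the reference.
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference else 0.0
```

Because the check is mechanical, the reward cannot be earned by fluent but wrong reasoning. The same property explains why such training signals are hard to extend to open-ended dialogue, where no comparable grader exists.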

Summary

VibeThinker-3B scores 94.3 on AIME 2026 and 80.2 Pass@1 on LiveCodeBench v6 at just 3 billion parameters, raising the bar for verifiable reasoning in small models through "Spectrum-to-Signal" post-training. The paper claims parity with DeepSeek V3.2, GLM-5, and Gemini 3 Pro on math and coding, but that holds for narrow reasoning benchmarks and should not be generalized to all-domain parity.

Reference: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models (Sen Xu et al., Weibo, 2026)
