Best LLM for Reasoning

Ranked on MMLU-Pro, GPQA, and AIME, which together make up 80% of the score; price makes up the other 20%. Reasoning quality dominates for reasoning-heavy work.

Updated April 2026. Top 3 this month: DeepSeek: R1 0528, Qwen: Qwen3.5 Plus 2026-02-15, and Qwen: Qwen3.5 397B A17B.

How we rank

Reasoning workloads (math, logic, science, multi-step planning) reward frontier models disproportionately. The accuracy gap between the best model and the runner-up can be as large as 20 points. We therefore weight reasoning benchmarks heavily, 80% combined, and give price the remaining 20%.

Pillars and weights: MMLU-Pro (35%) · GPQA (25%) · AIME (20%) · price (20%). Our full methodology is published on the methodology page.
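As a rough illustration of how these weights combine, here is a minimal Python sketch of a composite score. The min-max price normalization, the function names, and every input number are assumptions made for the example; the actual formula is the one on the methodology page.

```python
# Pillar weights as listed above.
WEIGHTS = {"mmlu_pro": 0.35, "gpqa": 0.25, "aime": 0.20, "price": 0.20}


def price_score(price, cheapest, priciest):
    """Map a price onto 0-100 so the cheapest model scores 100.

    Assumption: simple min-max normalization; the real ranking may differ.
    """
    if priciest == cheapest:
        return 100.0
    return 100.0 * (priciest - price) / (priciest - cheapest)


def composite(scores):
    """Weighted sum of pillar scores, each expected on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical model: these benchmark numbers are placeholders, not results.
example = {
    "mmlu_pro": 80.0,
    "gpqa": 70.0,
    "aime": 60.0,
    "price": price_score(1.0, cheapest=0.5, priciest=2.5),  # 75.0
}
print(round(composite(example), 2))  # 72.5
```

With price at 20%, a model can only gain or lose 20 composite points on cost, so a large benchmark lead is hard to overturn.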

Top ranked models

| Rank | Model | Provider | Input $/1M | Output $/1M | Context (tokens) |
|---|---|---|---|---|---|
| 1 | DeepSeek: R1 0528 | DeepSeek | $0.50 | $2.15 | 163,840 |
| 2 | Qwen: Qwen3.5 Plus 2026-02-15 | Qwen | $0.26 | $1.56 | 1,000,000 |
| 3 | Qwen: Qwen3.5 397B A17B | Qwen | $0.39 | $2.34 | 262,144 |
| 4 | MiniMax: MiniMax M2.1 | MiniMax | $0.29 | $0.95 | 196,608 |
| 5 | Anthropic: Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | 1,000,000 |
| 6 | Xiaomi: MiMo-V2-Flash | Xiaomi | $0.09 | $0.29 | 262,144 |
| 7 | Qwen: Qwen3.5-122B-A10B | Qwen | $0.26 | $2.08 | 262,144 |
| 8 | Qwen: Qwen3.5-27B | Qwen | $0.20 | $1.56 | 262,144 |
| 9 | AllenAI: Olmo 3 32B Think | Allen AI | $0.15 | $0.50 | 65,536 |
| 10 | Qwen: Qwen3.5-35B-A3B | Qwen | $0.16 | $1.30 | 262,144 |

Tips for reasoning

  • Turn on native reasoning mode if the model offers it — the accuracy gains are real.
  • Reasoning mode uses more output tokens, typically 2–5x. Budget accordingly.
  • Put a cheap model and a reasoning model behind a router, and send only the hard prompts to the reasoning model to control cost.
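The router idea in the last tip could look something like the Python sketch below. The keyword heuristic, the length threshold, and the `Route` and `pick_route` names are all hypothetical; the model strings are display names from the table, not API identifiers. A production router would more likely use a small classifier than a keyword list.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str
    reasoning: bool  # whether to enable reasoning mode for this request


# Hypothetical hints that a prompt needs multi-step reasoning.
HARD_HINTS = ("prove", "derive", "step by step", "optimize", "how many")


def pick_route(prompt, cheap_model, reasoning_model, max_cheap_chars=2000):
    """Send prompts that look hard to the reasoning model, the rest to the cheap one."""
    text = prompt.lower()
    looks_hard = any(hint in text for hint in HARD_HINTS) or len(prompt) > max_cheap_chars
    if looks_hard:
        return Route(model=reasoning_model, reasoning=True)
    return Route(model=cheap_model, reasoning=False)


# Display names from the table above, used here only as labels.
CHEAP = "Xiaomi: MiMo-V2-Flash"
STRONG = "DeepSeek: R1 0528"

print(pick_route("Translate 'hello' into French.", CHEAP, STRONG))
print(pick_route("Prove that the square root of 2 is irrational.", CHEAP, STRONG))
```

The savings depend on how many prompts the heuristic keeps on the cheap path, so it is worth logging the routing decisions and spot-checking the misroutes.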

Frequently asked questions

Which LLM reasons best?

As of April 2026, our weighted top 3 for reasoning are DeepSeek: R1 0528, Qwen: Qwen3.5 Plus 2026-02-15, and Qwen: Qwen3.5 397B A17B.

Does reasoning mode cost more?

Yes. Reasoning mode typically produces 2–5x as many output tokens, occasionally more. Check your billing.
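To see what that multiplier does to a bill, here is a small worked example in Python. It uses the Claude Sonnet 4.5 prices from the table ($3.00 input, $15.00 output per 1M tokens); the token counts and the `request_cost` helper are assumptions chosen for illustration, and it assumes the extra reasoning tokens are billed at the output rate.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars for one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Prices from the table row for Anthropic: Claude Sonnet 4.5.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

# Hypothetical request: 1,000 input tokens, 500 output tokens without reasoning.
INPUT_TOKENS, BASE_OUTPUT_TOKENS = 1_000, 500

for multiplier in (1, 2, 5):
    cost = request_cost(INPUT_TOKENS, BASE_OUTPUT_TOKENS * multiplier, INPUT_PRICE, OUTPUT_PRICE)
    print(f"{multiplier}x output tokens: ${cost:.4f}")
# 1x output tokens: $0.0105
# 2x output tokens: $0.0180
# 5x output tokens: $0.0405
```

Because output tokens are priced well above input tokens here, a 5x increase in output roughly quadruples the per-request cost in this example.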

Do smaller models reason?

Not well on frontier benchmarks. For simple chains of thought they can be OK, but multi-step reasoning clearly separates the top tier.

Related tasks

Want to model your own workload? Use the volume and switch-cost calculators on the main tool page.

Data refreshed daily via our snapshot cron. See our public JSON API for programmatic access.