Best LLM for Reasoning
Ranked on MMLU-Pro, GPQA, and AIME, with price as a minor factor: reasoning quality dominates for reasoning-heavy work.
Updated April 2026. Top 3 this month: DeepSeek: R1 0528, Qwen: Qwen3.5 Plus 2026-02-15, Qwen: Qwen3.5 397B A17B.
How we rank
Reasoning workloads (math, logic, science, multi-step planning) reward the top-tier frontier models disproportionately. The gap between the best and second-best can be a 20-point accuracy swing. We therefore weight the reasoning benchmarks heavily (80% combined) and give price a smaller 20% weight, so it mainly separates closely matched models.
Pillars and weights: MMLU-Pro (35%) · GPQA (25%) · AIME (20%) · price (20%). Our full methodology is published on the methodology page.
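As a rough illustration of how the stated weights could combine into one score, here is a minimal Python sketch. The weights come from the line above; everything else is an assumption made for the example, not the published methodology: the function names, the 0 to 100 scale for each pillar, and the min-max scaling that turns a price into a score.

```python
# Weights as stated in "Pillars and weights"; they sum to 1.0.
WEIGHTS = {"mmlu_pro": 0.35, "gpqa": 0.25, "aime": 0.20, "price": 0.20}


def price_score(output_usd_per_1m, cheapest, priciest):
    """Hypothetical price pillar: min-max scaling, cheapest model maps to 100."""
    if priciest == cheapest:
        return 100.0
    return 100.0 * (priciest - output_usd_per_1m) / (priciest - cheapest)


def composite_score(mmlu_pro, gpqa, aime, price):
    """Weighted sum of four pillar scores, each assumed to be on a 0..100 scale."""
    pillars = {"mmlu_pro": mmlu_pro, "gpqa": gpqa, "aime": aime, "price": price}
    return sum(WEIGHTS[name] * value for name, value in pillars.items())


# Placeholder benchmark values (not real results), with a price pillar built
# from the cheapest ($0.29) and priciest ($15.00) output prices in the table.
example = composite_score(80.0, 60.0, 50.0, price_score(0.29, 0.29, 15.00))
print(round(example, 2))
```

Under this scaling, a model that matches another on all three benchmarks but costs less scores higher, while a large benchmark lead still outweighs a price advantage because the benchmarks hold 80% of the weight.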
Top ranked models
| Rank | Model | Provider | Input $/1M | Output $/1M | Context |
|---|---|---|---|---|---|
| 1 | DeepSeek: R1 0528 | DeepSeek | $0.50 | $2.15 | 163,840 |
| 2 | Qwen: Qwen3.5 Plus 2026-02-15 | Qwen | $0.26 | $1.56 | 1,000,000 |
| 3 | Qwen: Qwen3.5 397B A17B | Qwen | $0.39 | $2.34 | 262,144 |
| 4 | MiniMax: MiniMax M2.1 | MiniMax | $0.29 | $0.95 | 196,608 |
| 5 | Anthropic: Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | 1,000,000 |
| 6 | Xiaomi: MiMo-V2-Flash | Xiaomi | $0.09 | $0.29 | 262,144 |
| 7 | Qwen: Qwen3.5-122B-A10B | Qwen | $0.26 | $2.08 | 262,144 |
| 8 | Qwen: Qwen3.5-27B | Qwen | $0.20 | $1.56 | 262,144 |
| 9 | AllenAI: Olmo 3 32B Think | Allen AI | $0.15 | $0.50 | 65,536 |
| 10 | Qwen: Qwen3.5-35B-A3B | Qwen | $0.16 | $1.30 | 262,144 |
Tips for reasoning
- Turn on native reasoning mode if the model offers it — the accuracy gains are real.
- Reasoning mode costs more tokens. Budget accordingly.
- Pair a cheap model with a reasoning model behind a router to control cost: send easy prompts to the cheap model and escalate only the hard ones.
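The router tip can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Route` type, the `route` function, the keyword list, and the length threshold are all hypothetical, and a production router would more likely use a trained classifier or a confidence signal than keyword matching.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str
    reasoning: bool  # whether to switch on the model's reasoning mode


# Hypothetical hints that a prompt needs multi-step reasoning.
HARD_HINTS = ("prove", "derive", "step by step", "optimize", "how many")


def route(prompt, cheap_model, reasoning_model, max_cheap_chars=2000):
    """Send prompts that look hard (or are very long) to the reasoning model."""
    text = prompt.lower()
    looks_hard = any(hint in text for hint in HARD_HINTS)
    if looks_hard or len(prompt) > max_cheap_chars:
        return Route(model=reasoning_model, reasoning=True)
    return Route(model=cheap_model, reasoning=False)


# Model names are taken from the table above purely as labels.
choice = route(
    "Prove that the square root of 2 is irrational.",
    cheap_model="Xiaomi: MiMo-V2-Flash",
    reasoning_model="DeepSeek: R1 0528",
)
print(choice.model, choice.reasoning)
```

The design choice is to default to the cheap model and escalate only on a positive signal, so the extra tokens that reasoning mode consumes are spent only on prompts that are likely to need them.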
Frequently asked questions
Which LLM reasons best?
As of April 2026, our weighted top 3 for reasoning are DeepSeek: R1 0528, Qwen: Qwen3.5 Plus 2026-02-15, and Qwen: Qwen3.5 397B A17B.
Does reasoning mode cost more?
Yes — typically 2–5x in output tokens, occasionally more. Check your billing.
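To see what a 2x to 5x increase in output tokens means in dollars, here is a small Python sketch using the DeepSeek: R1 0528 prices from the table ($0.50 input, $2.15 output per 1M tokens). The request size (1,000 input tokens, 500 output tokens without reasoning) is a hypothetical example, and the sketch assumes reasoning tokens are billed at the output rate, which you should confirm with your provider.

```python
def request_cost_usd(input_tokens, output_tokens, input_usd_per_1m, output_usd_per_1m):
    """Cost of one request given token counts and per-million-token prices."""
    total = input_tokens * input_usd_per_1m + output_tokens * output_usd_per_1m
    return total / 1_000_000


IN_PRICE, OUT_PRICE = 0.50, 2.15  # DeepSeek: R1 0528, from the table
INPUT_TOKENS, BASE_OUTPUT_TOKENS = 1_000, 500  # hypothetical request size

base = request_cost_usd(INPUT_TOKENS, BASE_OUTPUT_TOKENS, IN_PRICE, OUT_PRICE)
low = request_cost_usd(INPUT_TOKENS, BASE_OUTPUT_TOKENS * 2, IN_PRICE, OUT_PRICE)
high = request_cost_usd(INPUT_TOKENS, BASE_OUTPUT_TOKENS * 5, IN_PRICE, OUT_PRICE)

print(f"no reasoning: ${base:.6f}")
print(f"2x output:    ${low:.6f}")
print(f"5x output:    ${high:.6f}")
```

For this example the per-request cost rises from $0.001575 to between $0.002650 and $0.005875. The total grows by less than the output multiplier because the input cost stays fixed.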
Do smaller models reason?
Not well on frontier benchmarks. For simple chains of thought they can be OK, but multi-step reasoning clearly separates the top tier.
Related tasks
Want to model your own workload? Use the volume and switch-cost calculators on the main tool page.
Data refreshed daily via our snapshot cron. See our public JSON API for programmatic access.