Best LLM for Coding

Ranked on SWE-Bench, HumanEval, and dollars-per-1M output tokens. Balanced for autonomous and assistive coding workflows.

Updated July 2026. Top 3 this month: GPT-4o (2024-11-20), Claude Sonnet 4.5, GPT-5 Codex.

Podium

This month’s top three.

Rank	Model	Provider	Input $/1M	Output $/1M	Context
1	GPT-4o (2024-11-20)	OpenAI	$2.50	$10.00	128,000
2	Claude Sonnet 4.5	Anthropic	$3.00	$15.00	1,000,000
3	GPT-5 Codex	OpenAI	$1.25	$10.00	400,000

How we rank

Weights tuned for coding.

Choosing an LLM for coding comes down to three things: how well it turns specifications into working code, how well it reasons about large repositories, and how much it will cost once you wire it into CI or an agent loop. We weight SWE-Bench heaviest (50%) because it best predicts real-world coding-agent success. HumanEval follows (30%) as a check on short-form correctness, and a price pillar (20%) keeps the recommendation defensible in a finance review.

Our full methodology is published on the methodology page.

Pillars and weights:

SWE-Bench: 50%
HumanEval: 30%
Price: 20%
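
As a rough illustration of how the 50/30/20 weights could combine into a single score, here is a minimal sketch. Only the weights come from this page. The function names, the min-max price scaling, and the example pillar scores are assumptions made for illustration; the published methodology may normalize each pillar differently.

```python
# Weights are taken from this page; everything else is illustrative.
WEIGHTS = {"swe_bench": 0.50, "humaneval": 0.30, "price": 0.20}


def price_score(output_price: float, cheapest: float, priciest: float) -> float:
    """Map an output price to [0, 1], where the cheapest model scores 1.0.

    Min-max scaling is an assumption, not the documented method.
    """
    if priciest == cheapest:
        return 1.0
    return (priciest - output_price) / (priciest - cheapest)


def weighted_score(swe_bench: float, humaneval: float, price: float) -> float:
    """Combine three pillar scores, each already scaled to [0, 1]."""
    return (
        WEIGHTS["swe_bench"] * swe_bench
        + WEIGHTS["humaneval"] * humaneval
        + WEIGHTS["price"] * price
    )


# Placeholder pillar scores for a hypothetical model, not real benchmark results.
# The price bounds ($2.00 and $15.00) are the lowest and highest output prices
# listed in the ranking table.
example = weighted_score(
    swe_bench=0.60,
    humaneval=0.90,
    price=price_score(output_price=10.00, cheapest=2.00, priciest=15.00),
)
print(f"{example:.3f}")
```

Because SWE-Bench carries half the weight, a model that is cheap but weak on SWE-Bench cannot climb far on price alone under this scheme.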

Full ranking

Top ranked models

Rank	Model	Provider	Input $/1M	Output $/1M	Context
1	GPT-4o (2024-11-20)	OpenAI	$2.50	$10.00	128,000
2	Claude Sonnet 4.5	Anthropic	$3.00	$15.00	1,000,000
3	GPT-5 Codex	OpenAI	$1.25	$10.00	400,000
4	Gemini 2.5 Pro	Google	$1.25	$10.00	1,048,576
5	Gemini 2.5 Pro Preview 06-05	Google	$1.25	$10.00	1,048,576
6	GPT-5.1-Codex	OpenAI	$1.25	$10.00	400,000
7	o3	OpenAI	$2.00	$8.00	200,000
8	Claude 3.7 Sonnet	Anthropic	$3.00	$15.00	200,000
9	Claude 3.7 Sonnet (thinking)	Anthropic	$3.00	$15.00	200,000
10	GPT-5 Mini	OpenAI	$0.25	$2.00	400,000
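
Per-1M prices are easier to compare once they are applied to a concrete workload. The sketch below turns the listed input and output prices into a per-run cost. The prices are copied from the table above; the `Pricing` and `run_cost` names and the 400k-input / 30k-output token counts are hypothetical and should be replaced with figures measured from your own agent or CI runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens


# Prices copied from the ranking table on this page.
PRICING = {
    "GPT-4o (2024-11-20)": Pricing(2.50, 10.00),
    "Claude Sonnet 4.5": Pricing(3.00, 15.00),
    "GPT-5 Codex": Pricing(1.25, 10.00),
    "GPT-5 Mini": Pricing(0.25, 2.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run at standard (non-batch) list prices."""
    p = PRICING[model]
    return (input_tokens * p.input_per_1m + output_tokens * p.output_per_1m) / 1_000_000


# Hypothetical agent run: 400k input tokens, 30k output tokens.
for name in PRICING:
    print(f"{name}: ${run_cost(name, 400_000, 30_000):.2f}")
```

Agent loops tend to re-read far more than they write, so the input price often dominates the bill even though output tokens cost more per unit.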

Field notes

Tips for coding

01. Prefer a model with a large context window if your repo is bigger than ~200 files.
02. Use batch pricing for CI / nightly refactor jobs; interactive IDE work stays on the standard price.
03. Check function-calling reliability before committing to an agentic flow.
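
For the context-window tip, file count is only a proxy; what matters is how many tokens the repo occupies. The sketch below estimates that and lists which models could hold the repo with room to spare. The context windows are copied from the ranking table. The 4-characters-per-token ratio, the 50% headroom, the file extensions, and the function names are all assumptions; a real tokenizer will give different counts.

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic, assumed; varies by tokenizer and language

# Context windows (in tokens) copied from the ranking table on this page.
CONTEXT_WINDOWS = {
    "GPT-4o (2024-11-20)": 128_000,
    "Claude Sonnet 4.5": 1_000_000,
    "GPT-5 Codex": 400_000,
    "Gemini 2.5 Pro": 1_048_576,
    "o3": 200_000,
}


def estimate_repo_tokens(root: str, extensions: tuple = (".py", ".ts", ".go")) -> int:
    """Estimate the token count of source files under root from character counts."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git by pruning them in place.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.endswith(extensions):
                continue
            try:
                with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as fh:
                    total_chars += len(fh.read())
            except OSError:
                continue  # unreadable file: leave it out of the estimate
    return total_chars // CHARS_PER_TOKEN


def models_that_fit(token_count: int, headroom: float = 0.5) -> list:
    """Return models whose window holds the repo within the given fraction,

    leaving the rest for instructions, tool output, and the model's reply.
    """
    return [
        name
        for name, window in CONTEXT_WINDOWS.items()
        if token_count <= window * headroom
    ]


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens; fits in: {models_that_fit(tokens)}")
```

If nothing fits, that is a signal to rely on retrieval over the repo rather than loading it whole.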

FAQ

Frequently asked questions

The questions teams ask before picking a model for coding.

As of July 2026, our weighted top 3 are GPT-4o (2024-11-20), Claude Sonnet 4.5, and GPT-5 Codex.

Claude wins long-horizon refactoring; GPT wins short-burst correctness. The right answer depends on your workload mix; see the scoring pillars under "How we rank".

Rarely. Fine-tuning on proprietary code still helps, but for 90% of shops a strong frontier model with RAG over the repo gets you most of the way.

DeepSeek and Meta Llama variants are competitive on price. We list their hosted pricing here; self-host economics live in our Shadow AI audit tool.
