Best LLM for RAG (Retrieval-Augmented Generation)
Ranked on long-context accuracy, groundedness, and input-token price — RAG is input-token-heavy by design.
Updated April 2026. Top 3 this month: DeepSeek: R1 0528, Tencent: Hunyuan A13B Instruct, DeepSeek: DeepSeek V3.
How we rank
RAG workloads push enormous amounts of retrieved context through a model. Three questions matter: does the model faithfully use what you retrieved (groundedness), does its accuracy degrade as the context grows (needle-in-a-haystack performance), and how much will a million input tokens cost? Because RAG is input-heavy, the input-price pillar carries more weight here than it does for agentic or generative workloads.
Pillars and weights: Long-context accuracy (50%) · MMLU (20%) · input price (30%). Our full methodology is published on the methodology page.
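The weighting above can be sketched as a simple weighted sum over normalized pillar scores. This is a minimal illustration, not the real scoring code; the pillar scores below are made-up numbers, and the price pillar is assumed to be pre-inverted so that cheaper means a higher score.

```python
# Illustrative sketch of the three-pillar weighted ranking.
# Pillar scores are assumed normalized to 0..1; the input-price
# pillar is assumed inverted (cheaper = higher score).

WEIGHTS = {"long_context": 0.50, "mmlu": 0.20, "input_price": 0.30}

def rag_score(long_context: float, mmlu: float, price_score: float) -> float:
    """Weighted sum of normalized pillar scores (higher is better)."""
    return (WEIGHTS["long_context"] * long_context
            + WEIGHTS["mmlu"] * mmlu
            + WEIGHTS["input_price"] * price_score)

# Hypothetical model: strong long-context recall, cheap input tokens.
print(round(rag_score(long_context=0.92, mmlu=0.80, price_score=0.95), 3))  # 0.905
```

Because long-context accuracy carries half the weight, a model that holds up at 100k+ tokens can outrank a cheaper model with weaker recall.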
Top ranked models
| Rank | Model | Provider | Input $/1M | Output $/1M | Context |
|---|---|---|---|---|---|
| 1 | DeepSeek: R1 0528 | DeepSeek | $0.50 | $2.15 | 163,840 |
| 2 | Tencent: Hunyuan A13B Instruct | Tencent | $0.14 | $0.57 | 131,072 |
| 3 | DeepSeek: DeepSeek V3 | DeepSeek | $0.32 | $0.89 | 163,840 |
| 4 | Qwen: Qwen3.5 Plus 2026-02-15 | Qwen | $0.26 | $1.56 | 1,000,000 |
| 5 | Arcee AI: Trinity Large Preview | Arcee AI | $0.00 | $0.00 | 131,000 |
| 6 | MiniMax: MiniMax M2.1 | MiniMax | $0.29 | $0.95 | 196,608 |
| 7 | Qwen: Qwen3.5 397B A17B | Qwen | $0.39 | $2.34 | 262,144 |
| 8 | Xiaomi: MiMo-V2-Flash | Xiaomi | $0.09 | $0.29 | 262,144 |
| 9 | MiniMax: MiniMax-01 | MiniMax | $0.20 | $1.10 | 1,000,192 |
| 10 | Meta: Llama 3.3 70B Instruct | Meta | $0.12 | $0.38 | 131,072 |
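To turn the table's per-million-token prices into a monthly budget, multiply out your query volume and per-query token counts. A rough sketch, using the rank-2 Hunyuan A13B prices from the table and an assumed workload of 100k queries a month:

```python
# Back-of-envelope monthly spend from $/1M token prices.
# RAG is input-heavy, so the input side dominates the bill.

def monthly_cost(queries: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Total $/month for a query volume and per-query token counts."""
    total_in_m = queries * in_tokens / 1_000_000   # millions of input tokens
    total_out_m = queries * out_tokens / 1_000_000  # millions of output tokens
    return total_in_m * in_price + total_out_m * out_price

# Assumed workload: 100k queries/month, 20k retrieved input tokens and
# 500 output tokens per query, at $0.14 in / $0.57 out (Hunyuan A13B).
print(round(monthly_cost(100_000, 20_000, 500, 0.14, 0.57), 2))  # 308.5
```

Note that input tokens account for $280 of the $308.50 here, which is why this ranking weights input price rather than output price.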
Tips for RAG (retrieval-augmented generation)
- A 1M+ token context window is usually overkill. Optimize retrieval quality first.
- Prompt caching matters: pin the system prompt and retrieved context into the cache tier if available.
- Use batch pricing for bulk backfills over your corpus.
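The prompt-caching tip hinges on prompt ordering: provider prompt caches generally match on an exact prefix, so the repeated parts (system prompt, pinned docs) should come first and the per-query parts last. A minimal sketch; all strings and names here are illustrative placeholders, not any provider's API:

```python
# Cache-friendly prompt assembly: stable, repeated content first,
# volatile retrieval results and the user question last. Changing
# anything in the prefix would invalidate the cached portion.

SYSTEM_PROMPT = "Answer only from the provided context. Cite chunk ids."
PINNED_CONTEXT = "[policy] Refunds are processed within 14 days."
STABLE_PREFIX = f"{SYSTEM_PROMPT}\n\n{PINNED_CONTEXT}\n\n"

def build_prompt(retrieved_chunks: list[str], question: str) -> str:
    """Stable cacheable prefix first; per-query content appended after."""
    volatile = "\n".join(retrieved_chunks) + f"\n\nQuestion: {question}"
    return STABLE_PREFIX + volatile

# Two different queries still share the same cacheable prefix:
p1 = build_prompt(["[doc 12] Shipping takes 3 days."], "How long is shipping?")
p2 = build_prompt(["[doc 7] Returns need a receipt."], "Do I need a receipt?")
assert p1.startswith(STABLE_PREFIX) and p2.startswith(STABLE_PREFIX)
```

If your pinned context changes per tenant, keep one stable prefix per tenant rather than interleaving tenant data into a shared prefix.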
Frequently asked questions
Which LLM is best for RAG?
As of April 2026, our weighted top 3 are DeepSeek: R1 0528, Tencent: Hunyuan A13B Instruct, DeepSeek: DeepSeek V3.
Do I need a model with a 1M+ token context?
Almost never. Most RAG systems send 10–50k tokens per query. A 200k context is plenty; a 1M context is a nice-to-have for edge cases.
Does cached input pricing help?
A lot. If your retrieved context has repeating chunks — documentation, policy, FAQs — cached-input pricing can cut your bill by 70–80%.
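The 70–80% figure follows from simple blending: if most input tokens hit the cache and cached input is billed at a steep discount, the effective input price drops sharply. A back-of-envelope sketch with assumed numbers (hit rate and discount vary by provider):

```python
# Effective $/1M input tokens given a cache hit rate. Illustrative
# numbers only; actual cached-input discounts are provider-specific.

def blended_input_price(base: float, cached: float, hit_rate: float) -> float:
    """Blend cached and uncached input prices by cache hit rate."""
    return hit_rate * cached + (1 - hit_rate) * base

# Assume 80% of input tokens hit a cache billed at $0.05/1M,
# against a $0.50/1M base input price:
eff = blended_input_price(base=0.50, cached=0.05, hit_rate=0.80)
print(round(eff, 2))  # 0.14, i.e. a 72% reduction vs. $0.50
```

At an 80% hit rate and a 90% cached discount, the blended price lands squarely in the 70–80% savings range quoted above.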
Does reasoning mode improve RAG quality?
For ambiguous queries, yes. For lookup-style queries, it just adds cost without improving grounding.
Related tasks
Want to model your own workload? Use the volume and switch-cost calculators on the main tool page. Sign in with Google to unlock compare-my-prompt with real tokenizer counts.
Data refreshed daily via our snapshot cron. See our public JSON API for programmatic access.