Best Multimodal LLM (Vision + Text)

Ranked on vision benchmark accuracy, context window, and combined per-query cost for image + text workloads.

Updated April 2026. Top 3 this month: Qwen: Qwen3.5 Plus 2026-02-15, Qwen: Qwen3.5 397B A17B, OpenAI: GPT-4o (2024-11-20).

How we rank

Multimodal workloads — document parsing, screenshot understanding, chart reading — demand a model that can handle dense visual information without hallucinating fields that are not present. We weight MMMU for vision reasoning and DocVQA for document tasks, alongside price, because vision-token pricing varies widely across providers.

Pillars and weights: MMMU (40%) · DocVQA (30%) · price (30%). Our full methodology is published on the methodology page.
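The pillar blend reduces to a one-line weighted sum. A minimal sketch: the pillar scores passed in below are made-up placeholders, and the 0–1 normalization (higher is better, including for price, so cheaper models score higher) is an illustrative assumption, not the full published methodology.

```python
def weighted_score(mmmu: float, docvqa: float, price_score: float) -> float:
    """Blend the three pillars: MMMU 40%, DocVQA 30%, price 30%.

    All inputs are assumed pre-normalized to a 0-1 scale, where a
    higher price_score means a *cheaper* model.
    """
    return 0.40 * mmmu + 0.30 * docvqa + 0.30 * price_score

# Hypothetical pillar scores for two candidate models
print(weighted_score(0.70, 0.90, 0.95))  # strong on docs, cheap
print(weighted_score(0.85, 0.80, 0.30))  # strong on vision, pricey
```

Note that with a 30% price pillar, a cheap model with middling benchmarks can outrank an expensive frontier model — which is why free-tier entries appear high in the table below.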

Top ranked models

| Rank | Model | Provider | Input $/1M | Output $/1M | Context |
|------|-------|----------|------------|-------------|---------|
| 1 | Qwen: Qwen3.5 Plus 2026-02-15 | Qwen | $0.26 | $1.56 | 1,000,000 |
| 2 | Qwen: Qwen3.5 397B A17B | Qwen | $0.39 | $2.34 | 262,144 |
| 3 | OpenAI: GPT-4o (2024-11-20) | OpenAI | $2.50 | $10.00 | 128,000 |
| 4 | MiniMax: MiniMax-01 | MiniMax | $0.20 | $1.10 | 1,000,192 |
| 5 | Anthropic: Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | 1,000,000 |
| 6 | Qwen: Qwen3.5-122B-A10B | Qwen | $0.26 | $2.08 | 262,144 |
| 7 | Qwen: Qwen3.5-27B | Qwen | $0.20 | $1.56 | 262,144 |
| 8 | Meta: Llama 4 Maverick | Meta | $0.15 | $0.60 | 1,048,576 |
| 9 | Google: Gemma 4 31B | Google | $0.00 | $0.00 | 262,144 |
| 10 | Google: Gemma 4 31B | Google | $0.13 | $0.38 | 262,144 |

Tips for best multimodal LLM (vision + text)

  • Verify the provider's image-size and token-per-image policy before committing — pricing varies dramatically.
  • For structured extraction, combine vision with JSON mode.
  • Chart and graph understanding is still fragile; add a text-only sanity check when the downstream stakes are high.
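For the structured-extraction tip, a minimal sketch of pairing an image with JSON mode. The payload follows the OpenAI-style Chat Completions shape (`response_format`, `image_url` content parts); field names vary between vendors, so verify against your provider's docs before shipping. Nothing is sent over the network here — the sketch only builds the request body.

```python
import base64
import json

def build_extraction_request(image_bytes: bytes, schema_hint: str) -> dict:
    """Build an OpenAI-style chat request that pairs an image with JSON
    mode, so the model must return parseable JSON. Prompting the model
    to use null for missing fields guards against hallucinated values.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o-2024-11-20",
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract these fields as JSON; use null for "
                         f"anything not visible in the image: {schema_hint}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# Hypothetical invoice-extraction request (image bytes truncated)
req = build_extraction_request(b"\x89PNG...", "invoice_number, total, due_date")
print(json.dumps(req["response_format"]))
```

Asking for `null` on absent fields is the cheap defense against the hallucinated-fields failure mode described above.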

Frequently asked questions

Which LLM is best for vision tasks?

As of April 2026, our weighted top 3 are Qwen: Qwen3.5 Plus 2026-02-15, Qwen: Qwen3.5 397B A17B, OpenAI: GPT-4o (2024-11-20).

Are image tokens cheap?

Usually cheaper per-image than people assume, but high-res detail modes can 10x the cost. Benchmark on your actual images.
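The detail-mode cost gap is easy to model. A sketch under stated assumptions: the per-image token counts below are illustrative (roughly matching common low-detail flat rates versus tiled high-res rates), not any provider's published numbers — measure your own images before trusting them.

```python
def image_cost_usd(tokens_per_image: int, input_price_per_1m: float,
                   images: int) -> float:
    """Input cost for the vision tokens of a batch (text prompt excluded)."""
    return tokens_per_image * images * input_price_per_1m / 1_000_000

# Illustrative token counts -- real values depend on the provider's
# tiling rules and your image resolution.
LOW_DETAIL_TOKENS = 85       # assumed flat low-res charge
HIGH_DETAIL_TOKENS = 1_105   # assumed tiled high-res charge

price = 2.50  # GPT-4o input $/1M, from the rankings table
print(image_cost_usd(LOW_DETAIL_TOKENS, price, 10_000))   # 2.125
print(image_cost_usd(HIGH_DETAIL_TOKENS, price, 10_000))  # 27.625
```

Under these assumptions the same 10,000-image batch costs roughly 13x more in high-detail mode — the "10x" warning above is not hypothetical.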

Does any LLM do video natively?

Gemini and a few research-tier OpenAI SKUs handle video. The ecosystem is still early — expect to preprocess to frames for most use cases.
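A common frame-preprocessing path is sampling with ffmpeg's `fps` filter. A minimal sketch — the file names are placeholders, and actually running the command assumes ffmpeg is installed and on PATH:

```python
import subprocess

def extract_frames_cmd(video_path: str, out_dir: str,
                       fps: float = 1.0) -> list[str]:
    """ffmpeg command that samples `fps` frames per second to numbered
    PNGs. One frame per second is often enough context for an LLM pass.
    """
    return ["ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",
            f"{out_dir}/frame_%05d.png"]

cmd = extract_frames_cmd("clip.mp4", "frames", fps=0.5)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment once ffmpeg is available
```

Sampled frames can then be sent as ordinary image inputs to any model in the table above.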

Related tasks

Want to model your own workload? Use the volume and switch-cost calculators on the main tool page. Sign in with Google to unlock compare-my-prompt with real tokenizer counts.

Data refreshed daily via our snapshot cron. See our public JSON API for programmatic access.
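For programmatic consumers, a hypothetical sketch of working with a rankings payload. The schema below is invented to mirror the table on this page — it is not the API's actual contract, so check the API documentation for real field names before building against it.

```python
import json

# Placeholder payload shaped like the rankings table above; treat the
# field names as assumptions, not a published schema.
sample = json.loads("""
{"updated": "2026-04-01",
 "models": [
   {"rank": 1, "model": "Qwen: Qwen3.5 Plus 2026-02-15",
    "input_per_1m": 0.26, "output_per_1m": 1.56, "context": 1000000},
   {"rank": 3, "model": "OpenAI: GPT-4o (2024-11-20)",
    "input_per_1m": 2.50, "output_per_1m": 10.00, "context": 128000}
 ]}
""")

# Example query: cheapest model by input price
cheapest = min(sample["models"], key=lambda m: m["input_per_1m"])
print(cheapest["model"])
```

The same pattern works against a live HTTP response once the real endpoint and schema are substituted in.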