Idéal pour : (Vision + Text)

Best Multimodal LLM (Vision + Text)

Ranked on vision benchmark accuracy, context window, and combined per-query cost for image + text workloads.

Mis à jour July 2026. Top 3 ce mois-ci : Qwen3.5 Plus 2026-02-15, Qwen3.5 397B A17B, GPT-4o (2024-11-20).

Podium

This month’s top three.

1
Qwen3.5 Plus 2026-02-15
Qwen
Input / 1M
$0.26
Output / 1M
$1.56
Context
1,000,000
Model page
2
Qwen3.5 397B A17B
Qwen
Input / 1M
$0.39
Output / 1M
$2.34
Context
262,144
Model page
3
GPT-4o (2024-11-20)
OpenAI
Input / 1M
$2.50
Output / 1M
$10.00
Context
128,000
Model page

Notre classement

Weights tuned for (vision + text).

Multimodal workloads — document parsing, screenshot understanding, chart reading — demand a model that can handle dense visual information without hallucinating fields that are not present. We weight MMMU for vision reasoning and DocVQA for document tasks, alongside price because vision tokens are priced differently per provider.

Our full methodology is published on the page méthodologie.

Piliers et poids :

MMMU40%
DocVQA30%
price30%

Full ranking

Modèles les mieux classés

Rang	Modèle	Fournisseur	Entrée $/1M	Sortie $/1M	Contexte
1	Qwen3.5 Plus 2026-02-15	Qwen	$0.26	$1.56	1,000,000
2	Qwen3.5 397B A17B	Qwen	$0.39	$2.34	262,144
3	GPT-4o (2024-11-20)	OpenAI	$2.50	$10.00	128,000
4	MiniMax-01	MiniMax	$0.20	$1.10	1,000,192
5	Claude Sonnet 4.5	Anthropic	$3.00	$15.00	1,000,000
6	Qwen3.5-122B-A10B	Qwen	$0.26	$2.08	262,144
7	Qwen3.5-27B	Qwen	$0.20	$1.56	262,144
8	Llama 4 Maverick	Meta	$0.15	$0.60	1,048,576
9	Gemma 4 31B	Google	$0.00	$0.00	262,144
10	Gemma 4 31B	Google	$0.13	$0.38	262,144

Field notes

Conseils pour (vision + text)

01
Verify the provider's image-size and token-per-image policy before committing — pricing varies dramatically.
02
For structured extraction, combine vision with JSON mode.
03
Chart/graph understanding is still fragile. Add a text-only sanity check if the downstream consequence is high.

FAQ

Questions fréquentes

The questions teams ask before picking a model for (vision + text).

Get instant answers from our AI agent

As of July 2026, our weighted top 3 are Qwen3.5 Plus 2026-02-15, Qwen3.5 397B A17B, GPT-4o (2024-11-20).

Usually cheaper per-image than people assume, but high-res detail modes can 10x the cost. Benchmark on your actual images.

Gemini and a few research-tier OpenAI SKUs handle video. The ecosystem is still early — expect to preprocess to frames for most use cases.

About

Insights

Streamline

Integration

Solutions

Healthcare AI

Use Cases

Industries

Best Multimodal LLM (Vision + Text)

This month’s top three.

Weights tuned for (vision + text).

Modèles les mieux classés

Conseils pour (vision + text)

Questions fréquentes

Model your own workload.

Best Multimodal LLM (Vision + Text)

This month’s top three.

Weights tuned for (vision + text).

Modèles les mieux classés

Conseils pour (vision + text)

Questions fréquentes

Tâches apparentées

Model your own workload.