# Best LLM for Function Calling / Tool Use
Ranked on tool-selection accuracy, multi-tool consistency, and price. Tool-use quality compounds in agent loops.
Updated April 2026. Top 3 this month: GPT-5, Gemini 2 Pro, Claude Opus 4.7.
## How we rank
Function calling is the connective tissue of agent systems. A model that picks the wrong tool even once in 20 calls is unacceptable for non-trivial automation, because errors compound across multi-step loops. We weight tool-selection accuracy and multi-tool benchmarks heavily, with price as a secondary factor.
Pillars and weights: tool selection (45%) · multi-tool (30%) · price (25%). Our full methodology is published on the methodology page.
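The weighted ranking above can be sketched in a few lines. The pillar scores used in the example are hypothetical placeholders, not real benchmark numbers; only the weights come from the methodology stated above.

```python
# Pillar weights as published: tool selection 45%, multi-tool 30%, price 25%.
WEIGHTS = {"tool_selection": 0.45, "multi_tool": 0.30, "price": 0.25}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine normalized 0-1 pillar scores into a single ranking score."""
    return sum(WEIGHTS[pillar] * scores[pillar] for pillar in WEIGHTS)

# Hypothetical model: strong on tool selection, weaker on price.
overall = weighted_score({"tool_selection": 0.95, "multi_tool": 0.90, "price": 0.60})
```

Because price only carries 25% of the weight, an expensive model with excellent tool selection can still outrank a cheap one, which is how the table below can mix frontier and budget tiers.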
## Top ranked models
| Rank | Model | Provider | Input $/1M tokens | Output $/1M tokens | Context (tokens) |
|---|---|---|---|---|---|
| 1 | GPT-5 | OpenAI | $1.25 | $10.00 | 200,000 |
| 2 | Gemini 2 Pro | Google | $3.50 | $10.50 | 2,000,000 |
| 3 | Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | 200,000 |
| 4 | GPT-5 nano | OpenAI | $0.05 | $0.40 | 400,000 |
| 5 | Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1,000,000 |
| 6 | GPT-4.1 nano | OpenAI | $0.10 | $0.40 | 1,000,000 |
| 7 | GPT-4o mini | OpenAI | $0.15 | $0.60 | 128,000 |
| 8 | DeepSeek V3.2 | DeepSeek | $0.27 | $1.10 | 128,000 |
| 9 | DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 128,000 |
| 10 | GPT-4.1 mini | OpenAI | $0.40 | $1.60 | 1,000,000 |
## Tips for function calling / tool use
- Keep the tool list short and well-named. Long tool lists degrade accuracy.
- Use JSON schemas with required fields to reduce malformed calls.
- Log tool failures and retry with a fallback model tier if needed.
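The second and third tips above can be sketched together. The `get_weather` schema and the `call_model` client wrapper are hypothetical; wire `call_model` to your provider's actual SDK.

```python
import json

# Hypothetical tool schema in the JSON Schema style most providers accept.
# "required" plus "additionalProperties": False rejects malformed calls early.
GET_WEATHER = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
        "additionalProperties": False,
    },
}

def call_with_fallback(call_model, tiers, prompt, tools, max_retries=2):
    """Try each model tier in order; retry a tier before falling back.

    `call_model(model, prompt, tools)` is an assumed client wrapper that
    returns a dict like {"name": ..., "arguments": "<json string>"}.
    """
    last_err = None
    for model in tiers:
        for _attempt in range(max_retries):
            try:
                result = call_model(model, prompt, tools)
                json.loads(result["arguments"])  # reject unparseable arguments
                return result
            except (KeyError, ValueError) as err:
                last_err = err  # log the failure, then retry / fall back
    raise RuntimeError(f"all tiers failed: {last_err}")
```

A typical setup passes a cheap tier first (e.g. a nano/mini model) and escalates to a frontier model only when the cheap tier keeps producing malformed calls.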
## Frequently asked questions
### Which LLM is best at tool use?
As of April 2026, our weighted top 3 are GPT-5, Gemini 2 Pro, and Claude Opus 4.7.
### How many tools is too many?
Accuracy drops noticeably past ~30 tools in a single call. Route to a smaller toolset per conversation turn when you can.
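Per-turn routing can be as simple as matching the user's message against keyword tags on each tool. The catalog entries and tags below are hypothetical; a production router might use embeddings instead of keywords.

```python
# Hypothetical tool catalog: each tool carries a set of routing keywords.
CATALOG = {
    "get_weather": {"tags": {"weather", "forecast", "temperature"}},
    "create_invoice": {"tags": {"invoice", "billing", "charge"}},
    "search_docs": {"tags": {"docs", "search", "documentation"}},
}

def route_tools(user_message: str, max_tools: int = 8) -> list[str]:
    """Return only the tools whose tags overlap the user's message,
    best match first, so the model sees a small per-turn toolset."""
    words = set(user_message.lower().split())
    scored = [(len(meta["tags"] & words), name) for name, meta in CATALOG.items()]
    hits = [name for score, name in sorted(scored, reverse=True) if score > 0]
    return hits[:max_tools]
```

Capping the routed list well below the ~30-tool cliff keeps the model's selection accuracy in the safe zone even as the full catalog grows.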
### Are tool-use benchmarks trustworthy?
Directionally, yes. But benchmark tool catalogs are generic, so run the top two candidates against your actual tool catalog before committing.
## Related tasks
Want to model your own workload? Use the volume and switch-cost calculators on the main tool page. Sign in with Google to unlock compare-my-prompt with real tokenizer counts.
Data refreshed daily via our snapshot cron. See our public JSON API for programmatic access.