Methodology.

Exactly how the comparison, the calculators, and the Best-for rankings are built — so you, and the AI engines citing our data, can trust the output.

Where the data comes from

Sourced, timestamped, auditable.

Every model row has a last_verified_at timestamp. Models not re-verified within 30 days are flagged in the admin UI for refresh.
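The 30-day staleness rule above can be sketched as follows. This is a minimal illustration, not the site's actual code: only `last_verified_at` and the 30-day window come from the text, while `is_stale`, `STALE_AFTER`, and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Re-verification window stated in the text.
STALE_AFTER = timedelta(days=30)

def is_stale(last_verified_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when a row was last verified more than 30 days ago."""
    now = now or datetime.now(timezone.utc)
    return now - last_verified_at > STALE_AFTER

# Hypothetical rows: model name -> last_verified_at timestamp.
now = datetime(2025, 6, 30, tzinfo=timezone.utc)
rows = {
    "model-a": datetime(2025, 6, 15, tzinfo=timezone.utc),
    "model-b": datetime(2025, 5, 1, tzinfo=timezone.utc),
}

# Rows that would be flagged for refresh in an admin view.
flagged = [name for name, ts in rows.items() if is_stale(ts, now)]
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test.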

Pricing & specs
Official provider pricing pages and API docs. Input, output, cached, batch — all as published.
Benchmarks
Provider model cards first, then widely-cited third-party leaderboards (MMLU, SWE-Bench, HumanEval, GPQA, AIME, MMMU, DocVQA). Source noted on each row.
Regions & compliance
Provider trust centers and certification pages: SOC 2, HIPAA, GDPR, FedRAMP, regional data-residency.

Update cadence

Refreshed every morning, audited on every change.

Step 01
Daily · 02:00 UTC
Snapshot cron
Captures current price and status; diffs against yesterday to detect changes.
Step 02
Daily · 02:30 UTC
Alerts cron
Emails subscribed users about price changes, deprecations, and sunsets.
Step 03
Monthly · 1st @ 09:00 UTC
Market Pulse newsletter
One short email with the month’s price moves, new launches, and quiet deprecations.
Step 04
Ad hoc
Admin edits
New launches land the same day. Every change is written to a public audit log.
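The snapshot step compares today's capture with yesterday's. A minimal sketch of such a diff is shown here, assuming each snapshot is a dictionary keyed by model id; the field names (`input_price`, `output_price`, `status`) and the function `diff_snapshots` are hypothetical and not taken from the actual cron job.

```python
from typing import Any, Dict, List

# Hypothetical fields compared between two daily snapshots.
TRACKED_FIELDS = ("input_price", "output_price", "status")

def diff_snapshots(
    yesterday: Dict[str, Dict[str, Any]],
    today: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """List added models, removed models, and changed tracked fields."""
    changes: List[Dict[str, Any]] = []
    for model_id in sorted(set(yesterday) | set(today)):
        old, new = yesterday.get(model_id), today.get(model_id)
        if old is None:
            changes.append({"model": model_id, "kind": "added"})
        elif new is None:
            changes.append({"model": model_id, "kind": "removed"})
        else:
            for field in TRACKED_FIELDS:
                if old.get(field) != new.get(field):
                    changes.append({
                        "model": model_id,
                        "kind": "changed",
                        "field": field,
                        "old": old.get(field),
                        "new": new.get(field),
                    })
    return changes

yesterday = {"m1": {"input_price": 3.0, "output_price": 15.0, "status": "active"}}
today = {
    "m1": {"input_price": 2.5, "output_price": 15.0, "status": "active"},
    "m2": {"input_price": 1.0, "output_price": 4.0, "status": "active"},
}
changes = diff_snapshots(yesterday, today)
```

A list of change records like this could feed both the alert emails and the audit log, though how the real pipeline connects them is not stated here.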

Scoring the Best-for pages

Weights match the task.

Each Best-for page defines a set of pillars with explicit weights (visible at the top of the page). For tasks where quality dominates economics — reasoning, agents, healthcare — price is weighted under 30%. For tasks where price dominates — cheap-bulk, long-context with large input — price is weighted above 40%.

Missing benchmarks are treated as the category median. We don't assume a model is bad because a score isn't published.

Quality benchmarks
MMLU, SWE-Bench, HumanEval, GPQA, AIME, MMMU, DocVQA.
Price
Input + output per 1M tokens.
Memory
Context window size.
Capabilities
Function calling, JSON mode, vision, structured output.
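To illustrate how pillar weights shift a ranking, here is a minimal sketch of a weighted pillar score. The specific weights, the 0 to 1 pillar normalization, and the names `TASK_WEIGHTS` and `best_for_score` are assumptions for illustration; the real weights are the ones shown at the top of each Best-for page.

```python
from typing import Dict

# Hypothetical weights. They only respect the stated bounds:
# price under 30% for quality-led tasks, above 40% for price-led tasks.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "reasoning": {"quality": 0.50, "price": 0.20, "memory": 0.15, "capabilities": 0.15},
    "cheap-bulk": {"quality": 0.25, "price": 0.50, "memory": 0.10, "capabilities": 0.15},
}

def best_for_score(pillars: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of pillar scores (each assumed in [0, 1]), scaled to 0-100."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("pillar weights must sum to 1")
    return 100 * sum(weights[name] * pillars[name] for name in weights)

# Hypothetical model: strong quality, mediocre price score.
model = {"quality": 0.9, "price": 0.4, "memory": 0.5, "capabilities": 1.0}
reasoning_score = best_for_score(model, TASK_WEIGHTS["reasoning"])    # about 75.5
cheap_bulk_score = best_for_score(model, TASK_WEIGHTS["cheap-bulk"])  # about 62.5
```

The same model scores lower on the price-led task because its weak price pillar carries more weight there.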

Buzzi Intelligence Index

One score, six benchmarks, explicit weights.

The Quality-vs-Price scatter on the results page uses our own composite score (0–100) built from published benchmark scores. Missing benchmarks fall back to the category median so a model isn't penalized for data we don't have.

We don't import Artificial Analysis's index or any third-party composite. The math is ours and the inputs are auditable.

25%
MMLU
Broad knowledge and reasoning — 57 subjects.
20%
GPQA
Expert-level science questions (physics, chemistry, biology).
20%
HumanEval
Python code generation from docstrings.
20%
SWE-Bench
Real-world GitHub issue-fixing tasks.
15%
MMMU
Multimodal (text + image) college-level problems.
10%
AIME
High-school math olympiad problems.

The six weights sum to 1.1, so the weighted sum is divided by the total weight: a model that scores 100/100 on all six benchmarks lands at exactly 100. When a benchmark is missing, the denominator shrinks proportionally.
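The composite can be sketched as a normalized weighted average. The weights below are the published ones; everything else is an assumption for illustration, including the function name `intelligence_index`, the 0 to 100 score scale, and the rule that a benchmark is dropped from the denominator only when no median is available for it.

```python
from typing import Dict, Optional, Tuple

# Published weights; they sum to 1.1 before normalization.
WEIGHTS: Dict[str, float] = {
    "MMLU": 0.25, "GPQA": 0.20, "HumanEval": 0.20,
    "SWE-Bench": 0.20, "MMMU": 0.15, "AIME": 0.10,
}

def intelligence_index(
    published: Dict[str, Optional[float]],
    medians: Dict[str, float],
) -> Tuple[float, Dict[str, str]]:
    """Return (score on a 0-100 scale, per-benchmark provenance flags)."""
    weighted_sum = 0.0
    weight_used = 0.0
    detail: Dict[str, str] = {}
    for name, weight in WEIGHTS.items():
        value = published.get(name)
        if value is not None:
            detail[name] = "published"
        elif medians.get(name) is not None:
            value = medians[name]      # fall back to the median
            detail[name] = "median"
        else:
            detail[name] = "missing"   # assumption: drop it, shrinking the denominator
            continue
        weighted_sum += weight * value
        weight_used += weight
    if weight_used == 0:
        raise ValueError("no benchmark data available")
    return weighted_sum / weight_used, detail

perfect, _ = intelligence_index({name: 100.0 for name in WEIGHTS}, {})
no_aime = {name: 100.0 for name in WEIGHTS if name != "AIME"}
partial, flags = intelligence_index(no_aime, {"AIME": 50.0})
```

With every benchmark at 100 the score is 100; with AIME replaced by a median of 50 it is 105 / 1.1, roughly 95.5, and `flags["AIME"]` reads `"median"`.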

Cost formulas

The math, in detail.

Volume cost uses the standard per-million-tokens model. Switch cost assumes a 40-hour engineering week at your chosen rate, with a configurable risk premium.

monthly_cost = (uncached_input_tokens  / 1M) × input_price_per_1M
             + (cached_input_tokens    / 1M) × cached_input_price_per_1M
             + (batch_input_tokens     / 1M) × batch_input_price_per_1M
             + (standard_output_tokens / 1M) × output_price_per_1M
             + (batch_output_tokens    / 1M) × batch_output_price_per_1M
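The formula above translates directly into code. The sketch below mirrors its five terms; the `Prices` dataclass, the token dictionary layout, and the sample prices are hypothetical and do not correspond to any specific provider.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Prices:
    """Prices in USD per 1M tokens (illustrative values only)."""
    input: float
    cached_input: float
    batch_input: float
    output: float
    batch_output: float

def monthly_cost(tokens: Dict[str, int], prices: Prices) -> float:
    """Apply the per-million-tokens formula term by term."""
    per_million = 1_000_000
    return (
        tokens["uncached_input"] / per_million * prices.input
        + tokens["cached_input"] / per_million * prices.cached_input
        + tokens["batch_input"] / per_million * prices.batch_input
        + tokens["standard_output"] / per_million * prices.output
        + tokens["batch_output"] / per_million * prices.batch_output
    )

prices = Prices(input=3.00, cached_input=0.30, batch_input=1.50,
                output=15.00, batch_output=7.50)
tokens = {
    "uncached_input": 100_000_000,
    "cached_input": 50_000_000,
    "batch_input": 20_000_000,
    "standard_output": 10_000_000,
    "batch_output": 5_000_000,
}
total = monthly_cost(tokens, prices)  # 300 + 15 + 30 + 150 + 37.5, about 532.50
```

Each token bucket is billed at its own rate, so shifting traffic from uncached to cached or batch input lowers the total without changing volume.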

Token counts come from the provider's own tokenizer when we have it (for example, tiktoken with the o200k_base encoding); otherwise we apply a per-family coefficient with a ±7% error envelope.
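When no provider tokenizer is available, the estimate carries a ±7% envelope. The sketch below shows one way to express that range, assuming the family coefficient multiplies a token count from a reference tokenizer; that interpretation, the coefficient value, and the name `estimate_tokens` are hypothetical.

```python
from typing import Tuple

ERROR_ENVELOPE = 0.07  # the ±7% envelope stated in the text

def estimate_tokens(reference_tokens: int, family_coefficient: float) -> Tuple[float, float, float]:
    """Return (low, mid, high) token estimates for a model family.

    Assumption: the coefficient scales a count from a reference tokenizer.
    """
    mid = reference_tokens * family_coefficient
    return mid * (1 - ERROR_ENVELOPE), mid, mid * (1 + ERROR_ENVELOPE)

# Hypothetical family that uses about 20% more tokens than the reference.
low, mid, high = estimate_tokens(1_000_000, 1.2)
```

Feeding `low` and `high` into a cost formula gives a cost range instead of a single figure.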

What we don't do

Three rules that keep the data honest.

No sponsorships.
We do not take money from LLM providers. No affiliate fees, no paid placements.
No vibes.
We do not weight gut feelings. Every rank is a formula you can audit.
No guessed benchmarks.
If a score has no citable source, we fall back to the median for that benchmark rather than invent a number.

Frequently asked questions

How we work, in detail.

The non-obvious parts of sourcing, scoring, and refreshing the data.

A snapshot cron runs every night at 02:00 UTC and captures current prices from provider pricing pages. A second cron at 02:30 UTC emails subscribers about changes or deprecations. Spot-checks from third-party aggregators (pricepertoken, costgoat) catch any misses.

No. Ranking is never pay-to-play. Providers pay us nothing. The methodology documents every weight so anyone can reproduce our rankings.

Our index uses our own weights (MMLU 0.25, GPQA 0.2, HumanEval 0.2, SWE-Bench 0.2, MMMU 0.15, AIME 0.1) and pulls from published benchmark scores on the provider's model card. We don't import any third-party composite score — the math and inputs are both ours and auditable.

Missing benchmarks fall back to the cohort median for that benchmark, so a model isn't penalized for data the provider hasn't published. The Intelligence Index detail object flags which scores are "published" versus "median" so you know which numbers are direct and which are estimates.

Open-weight models with no managed endpoint get pricing_availability="self_host" and don't appear in cost calculations. If the same weights are hosted on Together, Fireworks, or Replicate, we include that row with pricing_availability="estimated" and cite the host in the pricing_url.

"Public" = verified directly from the provider's pricing page. "Estimated" = open-weight model where we cite a common hosting provider's published price. "Self-host" = no managed endpoint exists, so cost depends on your own hardware. "Unknown" = announced but not yet priced publicly; the model is shown but excluded from cost calculations.
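The four availability states can be modeled as an enum with a single rule for cost eligibility. In this sketch the values `"self_host"` and `"estimated"` appear in the text, while `"public"` and `"unknown"` are assumed spellings; the class and function names are hypothetical.

```python
from enum import Enum
from typing import Any, Dict

class PricingAvailability(str, Enum):
    PUBLIC = "public"        # assumed spelling; verified on the provider's pricing page
    ESTIMATED = "estimated"  # open weights priced via a common hosting provider
    SELF_HOST = "self_host"  # no managed endpoint; cost depends on your hardware
    UNKNOWN = "unknown"      # assumed spelling; announced but not yet priced

# Only rows with a citable price take part in cost calculations.
COSTABLE = {PricingAvailability.PUBLIC, PricingAvailability.ESTIMATED}

def included_in_cost(row: Dict[str, Any]) -> bool:
    """Return True when a model row should appear in cost calculations."""
    return PricingAvailability(row["pricing_availability"]) in COSTABLE

rows = [
    {"model": "a", "pricing_availability": "public"},
    {"model": "b", "pricing_availability": "estimated"},
    {"model": "c", "pricing_availability": "self_host"},
    {"model": "d", "pricing_availability": "unknown"},
]
costable_models = [r["model"] for r in rows if included_in_cost(r)]
```

An unrecognized value raises `ValueError` when the enum is constructed, so a mistyped state fails loudly instead of slipping into a calculation.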

Rankings change whenever inputs do — a price drop, a new benchmark score, a deprecation. The pages are ISR-cached for 10 minutes, so if you see a shift it's because the underlying inputs moved.

Email [email protected] with a link to the source. We typically correct within 24 hours and log the change in our public audit trail.

Brand marks. Provider logos shown across the comparison tool are used under nominative fair use for factual product comparison. All marks are property of their respective owners. Where an official logo isn't available, we display a generated monogram wordmark as a placeholder.

Spotted an error?

Corrections are welcome.

Spotted a missing model or a stale price? Email us with a link to the source. We typically correct within 24 hours.

[email protected]
