Free · 90 seconds · No login

Should this workload run on a frontier LLM, or on a small language model?

Describe your workload. We compare 10 models (frontier LLMs and SLMs) on monthly cost, task-specific accuracy, latency fit, and data residency. The answer comes with the right hosting mode.

How it works

Three steps, one decision.
No tokens, no spreadsheets.

  1. Describe

    Tell us about your workload.

    Nine inputs: task, volume, token profile, accuracy tolerance, latency SLA, data residency, language, current spend. About 90 seconds.

  2. Score

    A rules engine, not vibes.

    Hard filters drop any option that fails data residency, language, or accuracy. Soft scoring ranks the rest on cost (35%), task accuracy (35%), latency fit (15%), and a sovereignty bonus (15%).

  3. Decide

    Top 3 plus a hosting mode.

    Side-by-side cost across 10 models. The right hosting mode (API / managed / self-hosted / on-prem). Savings relative to what you spend today.
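The two-stage scoring described above (hard filters first, then a weighted soft score) can be sketched as follows. Only the 35/35/15/15 weights come from the text; the field names, sub-score normalization, and `shortlist` helper are illustrative assumptions:

```python
from dataclasses import dataclass

# Soft-scoring weights from the methodology: cost 35%, task accuracy 35%,
# latency fit 15%, sovereignty bonus 15%.
WEIGHTS = {"cost": 0.35, "accuracy": 0.35, "latency": 0.15, "sovereignty": 0.15}

@dataclass
class Candidate:
    name: str
    residency_ok: bool   # passes the data-residency hard filter
    language_ok: bool    # passes the language hard filter
    accuracy_ok: bool    # meets the accuracy tolerance
    scores: dict         # sub-scores normalized to 0..1 (illustrative)

def shortlist(candidates, top_n=3):
    # Hard filters: any single failure removes the model outright.
    viable = [c for c in candidates
              if c.residency_ok and c.language_ok and c.accuracy_ok]
    # Soft scoring: weighted sum of normalized sub-scores, highest first.
    ranked = sorted(viable,
                    key=lambda c: sum(WEIGHTS[k] * c.scores[k] for k in WEIGHTS),
                    reverse=True)
    return ranked[:top_n]
```

Because the engine is a pure function of its inputs, the same workload description always yields the same shortlist.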

Who it's for

Built for the moment the AI bill becomes a board topic.

  • CTO / VP Engineering

    The AI bill grew 5x, and you're questioning whether you still need a frontier LLM. The shortlist plus break-even gives you the answer.

  • CFO / Finance

    You need a defensible savings number for the board. Enter current spend; the result comes back in dollars.

  • Head of AI / ML Lead

    You're running an architecture review. A top 3 with fit scores and accuracy deltas; a PoC can start within a week.

  • Sovereign-AI founder

    Data residency or national AI policy is the first filter. The tool surfaces region-aligned SLMs (Mistral, Qwen, Falcon, BharatGen) on merit.

Methodology

Deterministic. Reproducible. Cited.

The scoring engine is rules-based: no LLM calls on the hot path. The same inputs always produce the same shortlist. Pricing refreshes monthly from the shared Buzzi LLM pricing database (Tool 01), with a daily snapshot cron catching mid-month moves. Benchmarks are cited by source, never invented.

No vendor sponsorships.

No paid placement in pricing.

Benchmarks cited, never invented.

Read the full methodology

FAQ

Common questions about SLMs vs LLMs.

What does this tool do?

It takes nine details about your AI workload — task, volume, token profile, accuracy tolerance, latency SLA, residency, language, current spend — and returns a side-by-side monthly cost across 10 models, an accuracy delta on your task, the right hosting mode, and a top-3 shortlist with fit scores. No login, runs in 90 seconds.

How is this different from the LLM Pricing Comparison tool?

LLM Pricing Comparison compares token prices across models you pick. This tool picks models for a workload you describe. Same dataset, two lenses for two different buyer moments.

What's the difference between an SLM and an LLM?

An SLM (small language model) typically has 1–10B parameters and matches frontier models on narrow tasks at a fraction of the cost. An LLM here means a frontier general-purpose model such as GPT-5, Claude Opus 4.7, or Gemini 2.5 Pro, which are stronger on agentic and reasoning workloads.

When does a small language model win?

Classification, extraction, summarization, translation. Cost-sensitive workloads at high volume. Residency-constrained deployments. Latency-critical paths where every millisecond counts. Anywhere accuracy on the specific task is good enough at much lower cost.

What assumptions does the cost formula make?

Monthly cost = monthly volume × (average input tokens × published input price + average output tokens × published output price). A caching discount of up to 90% is applied to the input portion, scaled by cache hit rate; a batch discount of up to 50% applies when "Batch-tolerant" is selected. Self-hosted cost adds amortized setup plus monthly GPU cost.
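Those assumptions translate into a small helper. The function name and the per-1M-token price convention are illustrative; the formula, the 90% input-side caching cap, and the 50% batch discount come from the answer above:

```python
def monthly_cost(volume, in_tokens, out_tokens,
                 in_price_per_m, out_price_per_m,
                 cache_hit_rate=0.0, batch_tolerant=False):
    """Estimate monthly API cost in dollars. Prices are per 1M tokens."""
    input_cost = volume * in_tokens * in_price_per_m / 1_000_000
    output_cost = volume * out_tokens * out_price_per_m / 1_000_000
    # Caching: up to 90% off the input portion, scaled by cache hit rate.
    input_cost *= 1 - 0.90 * cache_hit_rate
    total = input_cost + output_cost
    # Batch-tolerant workloads get up to 50% off the total.
    if batch_tolerant:
        total *= 0.5
    return total
```

For example, 1M queries/month at 1,000 input and 200 output tokens, priced at $3/$15 per 1M tokens, comes to $6,000/month before discounts.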

How much do caching and batch discounts change the numbers?

Up to 90% off the input portion when the cache hit rate is 100% (rare), and 50% off the total when batch mode is selected. Real workloads typically see 20–40% savings from caching, and 50% from batch on async workloads.

How accurate are the benchmark scores?

They are public-benchmark proxies, not your workload. We strongly recommend a 100–500-sample PoC before committing. Benchmarks come from Artificial Analysis, the HuggingFace Open LLM Leaderboard, Stanford HELM, HumanEval / MBPP, AgentBench, plus task-specific suites.

How do I pick the right hosting mode?

Use the matrix: under 100K queries/month → API. 100K–1M queries/month with EU residency → managed inference in the EU. Over 1M with a sub-second latency SLA → self-hosted GPU. On-prem or air-gapped requirements → an open-weight SLM on your hardware.
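The matrix can be sketched as a decision function. The rule precedence (air-gapped first, then volume and latency) is an assumption for combinations the matrix leaves undefined:

```python
def hosting_mode(queries_per_month, eu_residency=False,
                 sub_second_latency=False, air_gapped=False):
    """Map workload traits to a hosting mode per the matrix above."""
    if air_gapped:
        # Hard constraint: only open weights on your own hardware qualify.
        return "open-weight SLM on your hardware"
    if queries_per_month > 1_000_000 and sub_second_latency:
        return "self-hosted GPU"
    if queries_per_month >= 100_000 and eu_residency:
        return "managed inference in EU"
    # Low volume, or no constraint forcing another mode: plain API.
    return "API"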

When does self-hosted beat API?

Typically past 1M–10M queries/month depending on token profile. The break-even chart on the results page shows the exact crossover for your inputs.
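As a sanity check on that crossover, a sketch under simplified assumptions (a flat per-query API cost against a fixed monthly GPU cost; both numbers below are illustrative, not tool output):

```python
def break_even_queries(api_cost_per_query, gpu_monthly_fixed):
    """Monthly query volume where a fixed-cost GPU matches per-query API pricing."""
    return gpu_monthly_fixed / api_cost_per_query

# Illustrative: $0.002/query via API vs a $4,000/month GPU node
# crosses over around 2M queries/month.
```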

How do I size a GPU for self-hosted Llama 3 / Phi-3 / Mistral?

Use the min_vram_gb column on each model card. Phi-3.5 Mini fits on an L4 (24GB). Llama 3.x 8B and Mistral 7B fit comfortably on a single A100 40GB. Llama 3.3 70B needs 2× A100 80GB minimum at production throughput.
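A minimal sketch of that sizing lookup. The min_vram_gb values are read off the examples above and are illustrative, as is the simple ceiling-division packing (real sizing also depends on quantization, KV-cache, and batch size):

```python
# Illustrative min_vram_gb values, taken from the model-card examples above.
MIN_VRAM_GB = {
    "phi-3.5-mini": 24,    # fits an L4 (24 GB)
    "llama-3.1-8b": 40,    # single A100 40GB
    "mistral-7b": 40,      # single A100 40GB
    "llama-3.3-70b": 160,  # 2x A100 80GB minimum
}

def gpus_needed(model, gpu_vram_gb):
    """How many GPUs of a given VRAM size are needed to host a model."""
    need = MIN_VRAM_GB[model]
    return -(-need // gpu_vram_gb)  # ceiling division
```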

What are the implications of data residency?

Frontier APIs offer some regional hosting (Anthropic EU, OpenAI EU via Azure, Gemini in EU/SG/IN). For strict on-prem requirements, only open-weight SLMs apply: Llama, Mistral, Phi, Qwen, Falcon, BharatGen.

Which models are best for multilingual workloads?

Qwen for Chinese / Japanese / Korean. Mistral for European languages. Llama 3.x for broad multilingual baseline. GPT-5 / Claude Opus / Gemini 2.5 Pro for global coverage when budget allows.

What regional SLMs should I know about?

Mistral (EU sovereign), Falcon (UAE / TII), Qwen (APAC), BharatGen (India). The tool surfaces these neutrally on cost + compliance + language merit when residency is selected — not by default.

How often is the data updated?

Pricing — monthly vendor refresh + human review, with a daily snapshot cron catching mid-month moves. Benchmarks — quarterly. Sovereign-model coverage — quarterly + as new models ship.

Does Buzzi have a vendor bias?

No. No vendor sponsorships, no pay-to-play placement, every benchmark cited with source URL and capture date. We list all models we track and rank them on cost, accuracy, latency, residency — not relationships.

Ready to migrate?

Cut your AI bill 30–60% without losing accuracy.

Buzzi has delivered SLM migrations for teams running classification, extraction, and RAG at scale. Two-week PoC, four-week migration, real cost data.