Methodology · matrix version 2026-04-01

How we score multi-agent frameworks.

Three commitments shape every score on this page: no pay-per-placement, no guessed scores, and no demo-ware. Scores come from a named senior engineer at Buzzi who applies the public rubrics below; every framework page also carries a last_verified_at timestamp so freshness can be audited.

10 frameworks, neutral cards

  • LangGraph · ×1.0 overhead · LangChain · MIT · primary language: Python
  • CrewAI · ×1.3 overhead · CrewAI · MIT · primary language: Python
  • AutoGen / AG2 · ×2.5 overhead · Microsoft / AG2 community · CC-BY-4.0 / Apache-2.0 · primary language: Python
  • OpenAI Agents SDK · ×1.1 overhead · OpenAI · MIT · primary language: Python
  • Pydantic AI · ×1.0 overhead · Pydantic · MIT · primary language: Python
  • Anthropic Claude Agent SDK · ×1.1 overhead · Anthropic · MIT · primary language: Python
  • Google Agent Development Kit · ×1.2 overhead · Google · Apache-2.0 · primary language: Python
  • Microsoft Semantic Kernel · ×1.2 overhead · Microsoft · MIT · primary language: multi
  • LlamaIndex Agents · ×1.4 overhead · LlamaIndex · MIT · primary language: Python
  • Haystack · ×1.3 overhead · deepset · Apache-2.0 · primary language: Python

15 capability axes, with rubrics

  1. Sequential workflows

    Pipeline-style chains where one agent finishes before the next starts (illustrated, together with parallel fan-out, in the first sketch after this list).

    10 / 10
    Pipelines are a first-class primitive with explicit ordering and typed handoff.
    5 / 10
    Sequential chains are possible via orchestration code but not a native primitive.
    0 / 10
    Framework cannot guarantee deterministic sequential ordering.
  2. Parallel workflows

    Concurrent fan-out / fan-in across multiple agents.

    10 / 10
    Native parallel execution with built-in result merging and back-pressure.
    5 / 10
    Parallel execution requires custom asyncio / threading code on top.
    0 / 10
    No support for concurrent agent execution.
  3. Hierarchical workflows

    Supervisor-and-worker patterns with delegation and aggregation.

    10 / 10
    Supervisor pattern is documented, idiomatic, and replayable.
    5 / 10
    Achievable but requires hand-rolled message routing.
    0 / 10
    No first-class supervisor primitive.
  4. Adaptive workflows

    Dynamic routing where agents pick the next step based on intermediate state.

    10 / 10
    Router/handoff primitives are first-class with conditional edges.
    5 / 10
    Possible via tool calls but not the framework's sweet spot.
    0 / 10
    Control flow is rigid; no dynamic routing.
  5. State management

    Persistent, typed memory across runs and across agents.

    10 / 10
    Typed state schema, persistent checkpoints, replay support.
    5 / 10
    Session memory is supported; persistence requires external store.
    0 / 10
    Stateless by default; users must build persistence themselves.
  6. Human-in-the-loop

    Pause-resume primitives so humans can approve, edit, or reject actions (see the second sketch after this list).

    10 / 10
    Native interrupt/resume with serialisable checkpoints.
    5 / 10
    Approval gates can be bolted on; not a first-class primitive.
    0 / 10
    No interrupt mechanism — the framework runs to completion.
  7. Python support

    Production-grade Python SDK with active maintenance.

    10 / 10
    Reference implementation; active releases; complete typing.
    5 / 10
    Functional Python SDK lagging the primary language.
    0 / 10
    No Python SDK.
  8. TypeScript support

    Production-grade TypeScript / Node SDK at parity with Python.

    10 / 10
    First-class TS SDK with parity to Python in features and types.
    5 / 10
    TS SDK exists but trails Python in feature coverage.
    0 / 10
    No TS SDK.
  9. .NET / Java support

    First-class JVM (Java/Kotlin) and/or .NET SDK.

    10 / 10
    Reference-quality .NET and/or Java SDK with feature parity.
    5 / 10
    Community port or partial SDK.
    0 / 10
    No .NET or Java SDK.
  10. MCP support

    Native Model Context Protocol client and/or server primitives.

    10 / 10
    Authored or reference implementation of MCP.
    5 / 10
    MCP available as an adapter or community plugin.
    0 / 10
    No MCP support.
  11. A2A support

    Native Agent-to-Agent (Google) protocol primitives.

    10 / 10
    Authored or reference implementation of A2A.
    5 / 10
    A2A available via adapter; partial coverage.
    0 / 10
    No A2A support.
  12. Observability

    Tracing, token accounting, replay, and audit-grade logs.

    10 / 10
    Built-in tracing dashboard, structured token accounting, replay, exportable audit log.
    5 / 10
    OpenTelemetry hooks exist; user must wire dashboards themselves.
    0 / 10
    Print-statement debugging only.
  13. Deployment flexibility

    Range of supported deployment targets (cloud, on-prem, edge).

    10 / 10
    Cloud, on-prem, and edge all documented and tested.
    5 / 10
    Cloud-first; on-prem requires extra work.
    0 / 10
    Tied to a single hosted backend.
  14. Maturity

    Production track record, release cadence, community size.

    10 / 10
    2+ years of production use across many large deployments.
    5 / 10
    6-18 months in the wild; growing but evolving rapidly.
    0 / 10
    Pre-1.0; APIs change every release.
  15. Learning curve (higher = easier)

    Time-to-prototype for a developer new to the framework.

    10 / 10
    A working prototype in under 30 minutes from a clean machine.
    5 / 10
    Prototype in half a day with the docs open.
    0 / 10
    Multi-week onboarding before the first useful run.
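
To make the workflow axes concrete, here is a minimal, framework-agnostic sketch of axes 1 and 2 in plain asyncio. call_agent is a hypothetical stand-in for whatever agent invocation a given framework exposes; no framework's actual API is shown.

# Sketch of sequential (axis 1) and parallel (axis 2) workflows.
import asyncio

async def call_agent(role: str, prompt: str) -> str:
    # Hypothetical placeholder: a real implementation would invoke an
    # LLM-backed agent with this role's prompt, tools, and memory.
    await asyncio.sleep(0)
    return f"[{role}] {prompt}"

async def sequential(task: str) -> str:
    # Axis 1: strict ordering; each agent receives the previous output.
    draft = await call_agent("researcher", task)
    review = await call_agent("reviewer", draft)
    return await call_agent("editor", review)

async def parallel(task: str) -> str:
    # Axis 2: fan out concurrently, then fan in by merging results.
    results = await asyncio.gather(
        call_agent("analyst-a", task),
        call_agent("analyst-b", task),
    )
    return await call_agent("synthesiser", "\n".join(results))

print(asyncio.run(sequential("compare vector stores")))

Frameworks scoring 10/10 on these axes offer sequential and parallel execution as named primitives with typed handoffs and built-in merging, rather than hand-written orchestration like the above.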
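For axis 6, a second sketch shows the shape of an approval gate. The blocking input() call is only there to keep the example self-contained; frameworks scoring 10/10 replace it with a serialisable interrupt that checkpoints the run and resumes later.

# Sketch of a human-in-the-loop approval gate (axis 6).
def approval_gate(action: str) -> bool:
    # A real system would serialise state, notify a reviewer, and resume
    # asynchronously; this blocking prompt only illustrates the contract.
    answer = input(f"Approve '{action}'? [y/n] ")
    return answer.strip().lower() == "y"

def run_with_hitl(plan: list[str]) -> None:
    for action in plan:
        if approval_gate(action):
            print(f"executing: {action}")
        else:
            print(f"rejected: {action}")

run_with_hitl(["draft email", "send email"])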

Scoring formula

# Ranking
def rank(inputs, frameworks):
    weights = build_weight_vector(inputs)   # 15 weights, one per axis, from the wizard inputs
    scored = []
    for fw in frameworks:
        score = sum(fw.capabilities[cap] * weights[cap] for cap in CAPS)
        if hard_constraint_fails(inputs, fw):
            score = 0   # disqualified; shown with the reason
        scored.append((fw, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
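
The hard-constraint filter follows the narrowing rules described in the FAQ below. A sketch of how it could look, assuming hypothetical field names (inputs.stack, inputs.compliance_observability, fw.name) rather than the production schema:

# Sketch of the hard-constraint filter; field names are illustrative.
def hard_constraint_fails(inputs, fw) -> bool:
    if inputs.stack == ".NET":
        return fw.name != "Microsoft Semantic Kernel"
    if inputs.stack == "Java":
        return fw.name not in {"Microsoft Semantic Kernel",
                               "Google Agent Development Kit"}
    if inputs.stack == "TypeScript" and inputs.compliance_observability:
        return fw.name not in {"LangGraph", "OpenAI Agents SDK",
                               "Anthropic Claude Agent SDK"}
    return False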

# Cost per task
estimated_tokens_per_task = (
    base_task_tokens
    * framework_overhead_multiplier
    * (1 + (roles - 1) * 0.3)     # each extra role adds ~30% coordination tokens
    * (1.2 if hitl else 1.0)      # human-in-the-loop adds a 20% premium
)
per_task_usd = (
    0.7 * estimated_tokens_per_task / 1_000_000 * input_rate    # 70% priced as input
    + 0.3 * estimated_tokens_per_task / 1_000_000 * output_rate # 30% priced as output
)
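
A worked example with placeholder prices (the real token rates come from the llm_models table, not from here): a 10,000-token base task on a ×1.3-overhead framework with 3 roles and HITL enabled.

# Worked cost example; input_rate and output_rate are hypothetical
# USD-per-1M-token prices, not values from the llm_models table.
tokens = 10_000 * 1.3 * (1 + (3 - 1) * 0.3) * 1.2   # 24,960 tokens
input_rate, output_rate = 3.00, 15.00
per_task_usd = (0.7 * tokens / 1_000_000 * input_rate
                + 0.3 * tokens / 1_000_000 * output_rate)  # ~= 0.16 USD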

Glossary

Hierarchical
A supervisor agent delegates work to sub-agents, reviews their output, and composes the final answer. Good for multi-stage tasks with clear ownership.
Adaptive
Agents decide dynamically which other agents or tools to invoke based on intermediate results. Best when the control flow cannot be fixed upfront.
Agent
A named role with its own prompt, tools, and memory. "Roles" counts unique agent identities, not the number of LLM calls.
HITL (Human-in-the-Loop)
The workflow pauses for a human to approve, edit, or reject an agent action before continuing. Critical for regulated or high-risk automations.
MCP (Model Context Protocol)
Anthropic-led open standard for connecting LLM agents to tools, data, and other servers. Look for MCP support if you want vendor-portable tool integrations.
A2A (Agent-to-Agent Protocol)
Google-led open standard for agents from different vendors to discover and call each other. Emerging spec; relevant for federated agent systems.
Observability
Structured traces, token accounting, replayable runs, and exportable audit logs. "Regulated-grade" means immutable audit trails and retention controls.

Public dataset

The full capability matrix is published as JSON for AI engines and researchers:

FAQ

  1. How are scores assigned?

    A named senior engineer at Buzzi rates every framework on every axis using the public rubrics on this page. Scores are reviewed quarterly; we publish a per-framework last_verified_at timestamp in the public dataset.

  2. Do vendors pay for placement?

    No. Scores are editorial and never for sale. Requests to change a score must be filed as public PRs against the open matrix repository, with a technical justification.

  3. How do you decide which frameworks to track?

    Active GitHub repositories with more than 10k stars, or backed by Anthropic, Google, Microsoft, OpenAI, or LangChain. We add or retire frameworks once a quarter based on momentum and production adoption.

  4. How is cost per task calculated?

    estimated_tokens_per_task = base_task_tokens × framework_overhead_multiplier × (1 + (roles − 1) × 0.3) × (1.2 if HITL else 1.0). Token rates come from our llm_models table; users can override the model in the wizard.

  5. How do hard constraints work?

    A .NET stack narrows the shortlist to Microsoft Semantic Kernel. Java narrows it to Semantic Kernel or Google ADK. TypeScript with compliance-grade observability narrows it to LangGraph.js, the OpenAI Agents SDK, or the Anthropic Claude SDK. Disqualified frameworks are shown together with the reason.

  6. Where can I file a correction?

    Open a pull request against the buzzi-ai/agent-framework-matrix repository or email research@buzzi.ai. We review correction requests within 10 business days.

Found a score you disagree with?

Open a PR against the open matrix repository or email research@buzzi.ai. All correction requests receive a public reply within 10 business days.
