Scoring Methodology
The RAG vs Fine-Tuning Decision Engine scores four architecture classes (RAG, Fine-Tuning, Long-Context, and Hybrid) against nine dimensions of your use case. This page explains how each dimension is weighted, how cost estimates are derived, and how confidence and risk are reported.
1. The nine scoring dimensions
Each dimension contributes positive or negative points to one or more architecture classes. Points are not percentages; they are additive signals. The class with the highest total score wins, and the margin between the first- and second-placed classes determines confidence. A minimal sketch of this aggregation follows.
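The aggregation can be illustrated with a short sketch. This is not the engine's actual code; the class keys, function name, and input shape are all assumptions made for illustration:

```python
# Illustrative sketch of the additive scoring loop. Class keys, the
# function name, and the input shape are assumptions, not the real API.
CLASSES = ["rag", "fine_tuning", "long_context", "hybrid"]

def score_architectures(signals: dict[str, dict[str, int]]) -> dict:
    """`signals` maps each of the nine dimensions to per-class points."""
    totals = {cls: 0 for cls in CLASSES}
    for dimension_points in signals.values():
        for cls, points in dimension_points.items():
            totals[cls] += points  # additive signals, not percentages

    ranked = sorted(CLASSES, key=lambda cls: totals[cls], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    margin = totals[winner] - totals[runner_up]  # drives the confidence band
    return {"totals": totals, "winner": winner, "margin": margin}
```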
Data Freshness
How frequently your source data changes. Real-time data (score 1) strongly favours RAG, because fine-tuned models cannot incorporate new information without a retraining cycle. Static data (score 5) removes RAG's key advantage.
Document Volume
The size of your knowledge corpus. Tiny corpora (<10K docs, score 1) may fit in a long-context window. Massive corpora (>10M docs, score 5) rule out long-context and strongly favour vector-based retrieval.
Monthly Query Volume
Total inference calls per month. At very high volumes (>1M/mo), per-query retrieval costs compound and can make fine-tuning more cost-efficient. At low volumes (<10K/mo), infrastructure overhead tips the balance toward long-context.
Citation Accuracy
Whether your use case requires verifiable source references. Audit-grade citation (score 4) strongly favours RAG or Hybrid, because fine-tuned models hallucinate provenance: they cannot cite sources they did not see at training time.
Latency SLA
Your end-to-end latency budget in milliseconds. RAG adds a retrieval hop of 100–400 ms, so if your SLA is below 500 ms, fine-tuning (no retrieval) may be necessary. Long-context adds time-to-first-token (TTFT) overhead at large token counts.
Data Sensitivity
Regulatory and confidentiality classification of your data. High sensitivity (score 4–5) limits which hosted API providers you can use for retrieval and may require self-hosted embedding and inference infrastructure.
Domain Specificity
How specialised your domain vocabulary and output format are. Highly specialised domains (score 4–5) with proprietary jargon, output schemas, or brand voice benefit more from fine-tuning's weight-level adaptation than from retrieval alone.
ML Capability
Your in-house ML engineering maturity (1 = no ML team, 5 = world-class). Fine-tuning and hybrid architectures require ML expertise to design, train, evaluate, and maintain. Low-capability teams should default to RAG or long-context.
Budget Ceiling
Maximum monthly spend. If the estimated cost of the leading approach exceeds 120% of your ceiling, the engine applies a penalty. A ceiling below $2K generally rules out Hybrid; below $5K it may rule out Fine-Tuning even with training cost amortised.
2. Compound signals
Beyond individual dimension scores, the engine applies compound signals that capture interactions between dimensions (sketched in code after the list):
- High volume + strict citations: If monthly queries ≥ 1M and citations = 4, Hybrid receives an additional +20, because RAFT amortises training cost while preserving citation accuracy.
- Low volume + low budget + not air-gapped: Long-Context receives +15, because standing up vector infrastructure is not economically justified.
- On-premises or air-gapped: Fine-Tuning and Hybrid receive +15 and +10 respectively, because they can be deployed self-hosted, while Long-Context (which requires hosted API calls) is penalised −20.
- Budget penalty: If the estimated monthly cost of an approach exceeds 120% of your stated ceiling, that approach receives −15 points.
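A rough encoding of these rules, continuing the earlier sketch. The input field names are assumptions, and so is the low-budget threshold, which this page does not define numerically:

```python
# Hypothetical encoding of the compound signals. Input field names are
# assumptions; the $2K low-budget threshold is borrowed from the Budget
# Ceiling dimension and may not match the engine's real cutoff.
def apply_compound_signals(totals: dict, inputs: dict, est_costs: dict) -> dict:
    if inputs["monthly_queries"] >= 1_000_000 and inputs["citations"] == 4:
        totals["hybrid"] += 20  # high volume + strict citations

    low_volume = inputs["monthly_queries"] < 10_000
    low_budget = inputs["budget_ceiling"] < 2_000  # assumed threshold
    if low_volume and low_budget and not inputs["air_gapped"]:
        totals["long_context"] += 15  # vector infra not economically justified

    if inputs["air_gapped"]:
        totals["fine_tuning"] += 15
        totals["hybrid"] += 10
        totals["long_context"] -= 20  # hosted API calls unavailable

    for cls, cost in est_costs.items():
        if cost > 1.2 * inputs["budget_ceiling"]:
            totals[cls] -= 15  # budget penalty
    return totals
```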
3. Cost estimation methodology
Cost estimates are derived from your monthly query volume, average token counts, and live LLM pricing data fetched from our model database. The formula for each class:
RAG (monthly)
One-time embedding cost amortised over 6 months + vector DB fee (tiered by corpus volume) + retrieval tokens priced at the generation model's input rate + generation input and output tokens + 15% operational overhead.
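A sketch of this formula, assuming per-token prices and that the 15% overhead applies to the whole subtotal (the page does not state which terms it covers):

```python
def rag_monthly_cost(embedding_cost_once: float, vector_db_fee: float,
                     retrieval_tokens: float, gen_in_tokens: float,
                     gen_out_tokens: float, in_price: float,
                     out_price: float) -> float:
    """Token counts are per month; prices are $ per token.
    Parameter names are illustrative, not the engine's real API."""
    amortised_embedding = embedding_cost_once / 6   # 6-month amortisation
    retrieval = retrieval_tokens * in_price         # priced at gen input rate
    generation = gen_in_tokens * in_price + gen_out_tokens * out_price
    subtotal = amortised_embedding + vector_db_fee + retrieval + generation
    return subtotal * 1.15                          # +15% overhead (assumed blanket)
```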
Fine-Tuning (monthly)
Training run cost ($1,200–$25,000, driven by domain specificity) amortised over 6 months + fine-tuned inference at 1.2× the base model price + retraining reserve (2× initial training cost per year).
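The same formula as a sketch. The monthly retraining reserve is read here as one twelfth of the annual 2× figure, which is an interpretation rather than a documented detail:

```python
def fine_tuning_monthly_cost(training_cost: float, in_tokens: float,
                             out_tokens: float, base_in_price: float,
                             base_out_price: float) -> float:
    amortised_training = training_cost / 6           # 6-month window
    inference = 1.2 * (in_tokens * base_in_price     # 1.2x base model price
                       + out_tokens * base_out_price)
    retraining_reserve = (2 * training_cost) / 12    # 2x initial cost per year
    return amortised_training + inference + retraining_reserve
```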
Long-Context (monthly)
Per-query document tokens × gen model input price + output tokens × output price, minus prompt-cache savings (your cache hit rate × a 70% discount) and batch-API savings (your batch-eligible rate × a 50% discount).
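A sketch under two assumptions the page leaves open: the cache discount applies to input tokens only, and the batch discount applies to the full pre-discount total:

```python
def long_context_monthly_cost(queries: float, doc_tokens: float,
                              out_tokens: float, in_price: float,
                              out_price: float, cache_hit_rate: float,
                              batch_eligible_rate: float) -> float:
    input_cost = queries * doc_tokens * in_price
    output_cost = queries * out_tokens * out_price
    total = input_cost + output_cost
    cache_savings = input_cost * cache_hit_rate * 0.70       # assumed: input only
    batch_savings = total * batch_eligible_rate * 0.50       # assumed: full total
    return total - cache_savings - batch_savings
```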
Hybrid / RAFT (monthly)
All RAG costs + 60% of Fine-Tuning costs. RAFT requires both retrieval infrastructure and a training run, but its query-time inference is more efficient than pure RAG's.
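Composing the RAG and Fine-Tuning sketches above:

```python
def hybrid_monthly_cost(rag_cost: float, fine_tuning_cost: float) -> float:
    # Full retrieval stack plus a discounted share of the training costs.
    return rag_cost + 0.60 * fine_tuning_cost
```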
Vector DB pricing is tiered by corpus volume (the 1–5 scale maps to $70–$3,000/mo), based on observed pricing from pgvector, Pinecone, Weaviate, and Qdrant as of Q1 2026. LLM token prices are pulled live from our model database, falling back to conservative defaults ($3/1M input tokens, $12/1M output tokens) if the database is unavailable.
4. Confidence margin
Confidence is determined by the point margin between the winning class and the runner-up:
- High confidence: margin ≥ 25 points. One approach dominates clearly.
- Medium confidence: margin 10–24 points. A clear leader, but the runner-up is viable.
- Low confidence: margin < 10 points. Multiple approaches are closely matched; a proof-of-concept with both is recommended.
If the winning score is below 40, the engine also sets a "rescope flag" indicating that no single approach dominates; this is typically a sign that the use case scope should be narrowed before committing to infrastructure.
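Both thresholds in code form, continuing the earlier sketch:

```python
def confidence_and_flags(winner_score: int, margin: int) -> tuple[str, bool]:
    if margin >= 25:
        confidence = "high"      # one approach dominates clearly
    elif margin >= 10:
        confidence = "medium"    # clear leader, viable runner-up
    else:
        confidence = "low"       # closely matched; PoC both approaches
    rescope_flag = winner_score < 40  # no single approach dominates
    return confidence, rescope_flag
```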
5. Risk register
The engine evaluates seven risk triggers against your inputs and the winning recommendation. Each risk has a severity level (high, medium, or low) and a mitigation recommendation; the trigger conditions are listed here and sketched in code after the list:
- Hallucinated Citations Risk (high): Fine-Tuning recommended + citations ≥ 3.
- Budget Ceiling at Risk (medium): Estimated cost > 90% of your stated ceiling.
- Data Residency Violation Risk (high): EU residency or high sensitivity + Long-Context recommended.
- ML Capability Gap (medium): Capability ≤ 2 + Fine-Tuning or Hybrid recommended.
- Stale Pricing Data (low): Vector DB pricing data is older than 90 days.
- Corpus Drift Risk (medium): Freshness ≤ 2 + Fine-Tuning recommended.
- Latency Budget at Risk (high): Latency SLA < 500 ms + RAG or Hybrid recommended.
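One way to express the trigger table. Input field names are assumptions mirroring the list above, and "high sensitivity" is read as a score of 4 or above, per the Data Sensitivity dimension:

```python
# Hypothetical trigger table. Each entry: (name, severity, predicate over
# the user inputs `i` and the recommended class `rec`).
RISKS = [
    ("Hallucinated Citations Risk", "high",
     lambda i, rec: rec == "fine_tuning" and i["citations"] >= 3),
    ("Budget Ceiling at Risk", "medium",
     lambda i, rec: i["estimated_cost"] > 0.9 * i["budget_ceiling"]),
    ("Data Residency Violation Risk", "high",
     lambda i, rec: (i["eu_residency"] or i["sensitivity"] >= 4)
                    and rec == "long_context"),
    ("ML Capability Gap", "medium",
     lambda i, rec: i["ml_capability"] <= 2 and rec in ("fine_tuning", "hybrid")),
    ("Stale Pricing Data", "low",
     lambda i, rec: i["pricing_age_days"] > 90),
    ("Corpus Drift Risk", "medium",
     lambda i, rec: i["freshness"] <= 2 and rec == "fine_tuning"),
    ("Latency Budget at Risk", "high",
     lambda i, rec: i["latency_sla_ms"] < 500 and rec in ("rag", "hybrid")),
]

def evaluate_risks(inputs: dict, rec: str) -> list[tuple[str, str]]:
    """Return (name, severity) for every risk whose trigger fires."""
    return [(name, sev) for name, sev, fires in RISKS if fires(inputs, rec)]
```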
6. Limitations and assumptions
- Cost estimates are indicative only. Actual costs depend on provider, model size, infrastructure configuration, and negotiated pricing.
- The scoring model is intentionally opinionated and based on observed production patterns at Buzzi clients as of Q1 2026. It is not a substitute for architectural review by an experienced ML engineer.
- The engine does not model multi-tenancy, A/B testing overhead, evaluation pipeline cost, or data-labelling cost for fine-tuning.
- Hybrid / RAFT cost assumes a single retraining cycle per 6-month window. Teams that retrain more frequently should shorten the amortisation window (i.e. reduce the divisor) accordingly.