Architecture Pattern

Hybrid / Retrieval-Augmented Fine-Tuning (RAFT)

Hybrid (also called RAFT) combines RAG's real-time retrieval with fine-tuning's domain adaptation. The model is trained to reason over retrieved documents — significantly reducing hallucination compared to RAG alone while preserving the ability to incorporate new information. It is the highest-performance option but also the most expensive and operationally complex. Recommended only when capability ≥ 3 and budget ≥ $15 K/mo.


Cost model

  • All RAG costs (embeddings, vector DB, retrieval inference)
  • Fine-tuning training run ($5 K–$25 K amortised)
  • Hosted fine-tuned + RAG inference (30–40% premium over RAG alone)
  • RAFT dataset construction (often 2–4 weeks of ML engineering time)
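As a rough guide, the cost components above can be combined into a monthly estimate. The sketch below is illustrative only: the function name, the amortisation window, and the example inputs are assumptions; only the $5 K–$25 K training range and the 30–40% inference premium come from the cost model above.

```python
# Illustrative monthly-cost sketch for the Hybrid/RAFT pattern.
# Only the quoted ranges (training run, inference premium) come from
# the cost model above; everything else is an assumed input.

def raft_monthly_cost(
    rag_monthly: float,               # embeddings + vector DB + retrieval inference ($/mo)
    training_run: float,              # one-off fine-tuning run, $5K-$25K per the cost model
    amortise_months: int,             # months to spread the training cost over (assumption)
    inference_premium: float = 0.35,  # 30-40% premium over RAG-alone inference
) -> float:
    """Rough monthly cost of Hybrid/RAFT relative to RAG alone."""
    amortised_training = training_run / amortise_months
    inference_surcharge = rag_monthly * inference_premium
    return rag_monthly + amortised_training + inference_surcharge

# e.g. an $8K/mo RAG baseline, $15K training run amortised over 12 months:
cost = raft_monthly_cost(8_000, 15_000, 12)  # 8000 + 1250 + 2800 = 12050
```

Note that the amortised training cost shrinks as query volume grows, which is why the pattern only pays off at high volume.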

When to pick this pattern

  • Both citation accuracy and domain vocabulary are critical
  • Query volume > 1 M/month (justifies the training investment)
  • Strong in-house ML team (capability ≥ 3)
  • Budget ≥ $15 K/month
  • Regulatory requirement for both provenance and specialised terminology

When to avoid it

  • ML team capability < 3
  • Budget < $10 K/month
  • Time-to-production < 4 weeks
  • Corpus changes faster than weekly (retraining cadence cannot keep up)
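The pick and avoid criteria above amount to a deterministic rule: every "pick" condition must hold and no "avoid" condition may fire. A minimal sketch, with parameter names and the 0–5 capability scale assumed for illustration:

```python
def raft_is_a_fit(
    capability: int,                 # in-house ML team capability (assumed 0-5 scale)
    budget_per_month: float,         # USD
    queries_per_month: int,
    weeks_to_production: int,
    corpus_changes_per_week: float,  # how often the document corpus is updated
    needs_citations_and_vocab: bool, # both citation accuracy and domain vocabulary critical
) -> bool:
    """True only if every 'pick' criterion holds and no 'avoid' criterion fires."""
    if capability < 3 or budget_per_month < 15_000:
        return False
    if weeks_to_production < 4:
        return False
    if corpus_changes_per_week > 1:  # faster than a weekly retraining cadence
        return False
    return needs_citations_and_vocab and queries_per_month > 1_000_000
```

Because the "pick" budget floor ($15 K) is stricter than the "avoid" ceiling ($10 K), the sketch only checks the stricter bound.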

Common pitfalls

  • RAFT dataset construction is labour-intensive — requires paired (document, query, answer) triples
  • Two failure modes to debug instead of one (retrieval failures + model failures)
  • Training run must be repeated when base model is deprecated
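To make the first pitfall concrete, a single RAFT training example pairs a query and answer with the retrieved documents the model should ground its reasoning in, usually including distractors. The record below is purely illustrative: the field names, document IDs, and contents are assumptions, not a standard schema.

```python
import json

# One illustrative (document, query, answer) training record.
# Field names and contents are assumptions, not a standard RAFT schema.
record = {
    "query": "What is the standard maintenance interval for pump model Z-40?",
    "documents": [
        {"id": "doc-412", "text": "The Z-40 requires inspection every 500 hours...",
         "relevant": True},
        {"id": "doc-087", "text": "The Z-10 series uses a sealed bearing...",
         "relevant": False},  # distractor: trains the model to ignore near-misses
    ],
    "answer": "Every 500 operating hours, per doc-412.",
}

line = json.dumps(record)  # datasets are commonly stored as one JSON object per line
```

Building thousands of such triples (with good distractors) is what consumes the 2–4 weeks of ML engineering time noted above.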
