Architecture Pattern

Retrieval-Augmented Generation

RAG wins when your data changes daily or faster, citations are mandatory, and your ML team is early-stage or non-existent. It keeps the base model frozen, embeds your corpus into a vector store, and fetches only the relevant chunks at query time, giving you verifiable outputs and straightforward data governance without a training run.
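
The query path is short enough to sketch end to end. The snippet below is a minimal illustration, not any specific library's API: `embed` is a toy stand-in for a real embedding model and `llm_complete` for the generation call.

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: hash character bigrams into a fixed-size unit vector.
    # Swap in a real embedding model in practice.
    vec = [0.0] * 64
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def llm_complete(prompt: str) -> str:
    # Stub for the hosted or local generation model.
    return f"[completion for {len(prompt)}-char prompt]"

# One-time indexing step: embed every chunk of the corpus.
corpus = [
    "Invoices are archived after 90 days.",
    "Refunds above $500 require manager approval.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity reduces to a dot product on unit vectors.
    q = embed(query)
    ranked = sorted(index,
                    key=lambda cv: sum(a * b for a, b in zip(q, cv[1])),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def answer(query: str) -> str:
    # Ground the prompt in retrieved chunks so the output can cite sources.
    context = "\n".join(retrieve(query))
    return llm_complete(f"Answer using only this context:\n{context}\n\nQ: {query}\nA:")

print(answer("When are invoices archived?"))
```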

Cost model

  • Embeddings (one-time, amortised over ~6 months)
  • Vector DB ($70–$500/mo depending on corpus size)
  • Retrieval + generation inference
  • Operational overhead: ~15% on top of the above (worked example below)
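
Put together, the model is simple arithmetic. A back-of-envelope sketch, with every figure a placeholder to be replaced by your own token counts and provider rates:

```python
# Back-of-envelope monthly RAG cost; all dollar figures are placeholders.
embedding_one_time = 1_200.0   # $ to embed the whole corpus once
amortisation_months = 6        # written off over ~6 months
vector_db_monthly = 250.0      # somewhere in the $70-$500/mo band
queries_per_month = 100_000
cost_per_query = 0.004         # retrieval + generation inference, $/query

base = (
    embedding_one_time / amortisation_months
    + vector_db_monthly
    + queries_per_month * cost_per_query
)
total = base * 1.15            # ~15% operational overhead

print(f"monthly cost ~ ${total:,.0f}")  # $978 with these placeholder numbers
```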

When to pick this pattern

  • ✓ Data updates daily or faster
  • ✓ Audit-grade citations are required
  • ✓ Corpus exceeds 10K documents
  • ✓ ML team is early-stage or non-existent
  • ✓ Latency budget ≥ 500 ms end-to-end
  • ✓ Monthly budget ceiling ≀ $10K at launch
  • ✓ Data residency must stay on-premises or in EU
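
Because every criterion is a yes/no check, the checklist collapses into a deterministic score. A sketch only: the `Workload` fields and the scoring are illustrative, not a published rubric.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    # Illustrative fields mirroring the seven checklist items above.
    update_interval_days: float
    needs_citations: bool
    corpus_docs: int
    ml_team_maturity: str          # "none", "early", or "mature"
    latency_budget_ms: int
    monthly_budget_usd: int
    must_stay_onprem_or_eu: bool

def rag_fit_score(w: Workload) -> int:
    # One point per satisfied criterion; 7/7 is a clean "pick RAG".
    checks = [
        w.update_interval_days <= 1,
        w.needs_citations,
        w.corpus_docs > 10_000,
        w.ml_team_maturity in ("none", "early"),
        w.latency_budget_ms >= 500,
        w.monthly_budget_usd <= 10_000,
        w.must_stay_onprem_or_eu,
    ]
    return sum(checks)
```

A 7/7 score is an unambiguous fit; anything lower should be weighed against the avoid-list below.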

When to avoid it

  • ✗ Highly specialised vocabulary with no ground-truth corpus
  • ✗ Hard latency SLA < 200 ms
  • ✗ Static corpus with very high query volume (long-context may be cheaper; crossover sketch below)
  • ✗ Air-gapped environment with no hosted-API access
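
The long-context crossover is worth computing rather than guessing. With a static corpus you re-send the same corpus tokens on every query, which provider-side prompt caching can discount heavily once volume keeps the cache warm; RAG instead pays a fixed index cost plus a small per-query token overhead. A sketch with purely illustrative rates:

```python
# RAG vs long-context cost curves for a static corpus.
# Every rate below is a placeholder; plug in your provider's real prices.
TOKEN_PRICE = 0.003 / 1000        # $ per input token
CACHED_PRICE = TOKEN_PRICE / 10   # assumed prompt-cache discount

CORPUS_TOKENS = 30_000            # small, static corpus sent whole
RAG_FIXED = 250.0                 # $/mo: vector DB + amortised embeddings
RAG_TOKENS_PER_QUERY = 4_000      # retrieved chunks + prompt scaffolding

def long_context_cost(queries: int) -> float:
    # Heavy volume keeps the cache warm, so the discounted rate applies.
    return queries * CORPUS_TOKENS * CACHED_PRICE

def rag_cost(queries: int) -> float:
    return RAG_FIXED + queries * RAG_TOKENS_PER_QUERY * TOKEN_PRICE

for q in (1_000, 10_000, 100_000):
    print(f"{q:>7,} queries/mo: long-context ${long_context_cost(q):>7,.0f}"
          f" vs RAG ${rag_cost(q):>7,.0f}")
```

With these particular placeholders the long-context side stays cheaper; a larger corpus or no cache discount flips the curves, which is exactly why the comparison is worth running with your own numbers.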

Common pitfalls

  • ⚠ Re-ranker layers can blow the latency budget
  • ⚠ Naive chunking breaks table schemas and code blocks (see the chunking sketch below)
  • ⚠ Embedding drift after 6+ months of model updates
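
The chunking pitfall is the cheapest to avoid: split on structural boundaries instead of fixed character windows, so tables and fenced code travel as whole units. A heuristic sketch for Markdown-ish text, not a full parser, and it assumes code fences are balanced:

```python
import re

def chunk_preserving_blocks(text: str, max_chars: int = 1_500) -> list[str]:
    # Treat fenced code blocks as atomic; split the rest on blank lines,
    # which keeps Markdown table rows (no internal blank lines) together.
    blocks: list[str] = []
    for part in re.split(r"(```.*?```)", text, flags=re.DOTALL):
        if part.startswith("```"):
            blocks.append(part)                      # whole code fence
        else:
            blocks.extend(p for p in part.split("\n\n") if p.strip())

    # Greedily pack blocks into chunks; an oversize block stays whole
    # rather than being cut mid-table or mid-fence.
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks
```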
