Architecture Pattern

Retrieval-Augmented Generation

RAG wins when your data changes daily or faster, citations are mandatory, and your ML team is early-stage or non-existent. It keeps the base model frozen, embeds your corpus into a vector store, and fetches only the relevant chunks at query time, giving you verifiable outputs and straightforward data governance without a training run.
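
The query path is short enough to sketch end to end. The snippet below is a minimal illustration, not any specific library's API: `embed` is a toy stand-in for a real embedding model and `llm_complete` for the generation call.

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: hash character bigrams into a fixed-size unit vector.
    # Swap in a real embedding model in practice.
    vec = [0.0] * 64
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def llm_complete(prompt: str) -> str:
    # Stub for the hosted or local generation model.
    return f"[completion for {len(prompt)}-char prompt]"

# One-time indexing step: embed every chunk of the corpus.
corpus = [
    "Invoices are archived after 90 days.",
    "Refunds above $500 require manager approval.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity reduces to a dot product on unit vectors.
    q = embed(query)
    ranked = sorted(index,
                    key=lambda cv: sum(a * b for a, b in zip(q, cv[1])),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def answer(query: str) -> str:
    # Ground the prompt in retrieved chunks so the output can cite sources.
    context = "\n".join(retrieve(query))
    return llm_complete(f"Answer using only this context:\n{context}\n\nQ: {query}\nA:")

print(answer("When are invoices archived?"))
```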

Cost model

  • Embeddings (one-time, amortised over ~6 months)
  • Vector DB ($70–$500/mo depending on corpus size)
  • Retrieval + generation inference
  • Operational overhead: ~15% on top of the above (worked example below)
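
Put together, the model is simple arithmetic. A back-of-envelope sketch, with every figure a placeholder to be replaced by your own token counts and provider rates:

```python
# Back-of-envelope monthly RAG cost; all dollar figures are placeholders.
embedding_one_time = 1_200.0   # $ to embed the whole corpus once
amortisation_months = 6        # written off over ~6 months
vector_db_monthly = 250.0      # somewhere in the $70-$500/mo band
queries_per_month = 100_000
cost_per_query = 0.004         # retrieval + generation inference, $/query

base = (
    embedding_one_time / amortisation_months
    + vector_db_monthly
    + queries_per_month * cost_per_query
)
total = base * 1.15            # ~15% operational overhead

print(f"monthly cost ~ ${total:,.0f}")  # $978 with these placeholder numbers
```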

When to pick this pattern

  • ✓ Data updates daily or faster
  • ✓ Audit-grade citations are required
  • ✓ Corpus exceeds 10K documents
  • ✓ ML team is early-stage or non-existent
  • ✓ Latency budget ≥ 500 ms end-to-end
  • ✓ Monthly budget ceiling ≀ $10K at launch
  • ✓ Data residency must stay on-premises or in EU
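
Because every criterion is a yes/no check, the checklist collapses into a deterministic score. A sketch only: the `Workload` fields and the scoring are illustrative, not a published rubric.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    # Illustrative fields mirroring the seven checklist items above.
    update_interval_days: float
    needs_citations: bool
    corpus_docs: int
    ml_team_maturity: str          # "none", "early", or "mature"
    latency_budget_ms: int
    monthly_budget_usd: int
    must_stay_onprem_or_eu: bool

def rag_fit_score(w: Workload) -> int:
    # One point per satisfied criterion; 7/7 is a clean "pick RAG".
    checks = [
        w.update_interval_days <= 1,
        w.needs_citations,
        w.corpus_docs > 10_000,
        w.ml_team_maturity in ("none", "early"),
        w.latency_budget_ms >= 500,
        w.monthly_budget_usd <= 10_000,
        w.must_stay_onprem_or_eu,
    ]
    return sum(checks)
```

A 7/7 score is an unambiguous fit; anything lower should be weighed against the avoid-list below.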

When to avoid it

  • ✗ Highly specialised vocabulary with no ground-truth corpus
  • ✗ Hard latency SLA < 200 ms
  • ✗ Static corpus with very high query volume (long-context may be cheaper; crossover sketch below)
  • ✗ Air-gapped environment with no hosted-API access
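
The long-context crossover is worth computing rather than guessing. With a static corpus you re-send the same corpus tokens on every query, which provider-side prompt caching can discount heavily once volume keeps the cache warm; RAG instead pays a fixed index cost plus a small per-query token overhead. A sketch with purely illustrative rates:

```python
# RAG vs long-context cost curves for a static corpus.
# Every rate below is a placeholder; plug in your provider's real prices.
TOKEN_PRICE = 0.003 / 1000        # $ per input token
CACHED_PRICE = TOKEN_PRICE / 10   # assumed prompt-cache discount

CORPUS_TOKENS = 30_000            # small, static corpus sent whole
RAG_FIXED = 250.0                 # $/mo: vector DB + amortised embeddings
RAG_TOKENS_PER_QUERY = 4_000      # retrieved chunks + prompt scaffolding

def long_context_cost(queries: int) -> float:
    # Heavy volume keeps the cache warm, so the discounted rate applies.
    return queries * CORPUS_TOKENS * CACHED_PRICE

def rag_cost(queries: int) -> float:
    return RAG_FIXED + queries * RAG_TOKENS_PER_QUERY * TOKEN_PRICE

for q in (1_000, 10_000, 100_000):
    print(f"{q:>7,} queries/mo: long-context ${long_context_cost(q):>7,.0f}"
          f" vs RAG ${rag_cost(q):>7,.0f}")
```

With these particular placeholders the long-context side stays cheaper; a larger corpus or no cache discount flips the curves, which is exactly why the comparison is worth running with your own numbers.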

Common pitfalls

  • ⚠ Re-ranker layers can blow the latency budget
  • ⚠ Naive chunking breaks table schemas and code blocks (see the chunking sketch below)
  • ⚠ Embedding drift after 6+ months of model updates
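
The chunking pitfall is the cheapest to avoid: split on structural boundaries instead of fixed character windows, so tables and fenced code travel as whole units. A heuristic sketch for Markdown-ish text, not a full parser, and it assumes code fences are balanced:

```python
import re

def chunk_preserving_blocks(text: str, max_chars: int = 1_500) -> list[str]:
    # Treat fenced code blocks as atomic; split the rest on blank lines,
    # which keeps Markdown table rows (no internal blank lines) together.
    blocks: list[str] = []
    for part in re.split(r"(```.*?```)", text, flags=re.DOTALL):
        if part.startswith("```"):
            blocks.append(part)                      # whole code fence
        else:
            blocks.extend(p for p in part.split("\n\n") if p.strip())

    # Greedily pack blocks into chunks; an oversize block stays whole
    # rather than being cut mid-table or mid-fence.
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks
```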
