Architecture Pattern

Retrieval-Augmented Generation

RAG wins when your data changes daily or weekly, citations are mandatory, and your ML team is early-stage. It keeps the base model frozen, embeds your corpus into a vector store, and fetches only the relevant chunks at query time, giving you verifiable outputs and straightforward data governance without a training run.
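The frozen-model flow described above can be sketched end to end. This is a minimal, self-contained illustration using a toy bag-of-words embedding in place of a real embedding model; the corpus, function names, and similarity scoring are all illustrative assumptions, not a specific library's API.

```python
import math
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: normalised bag-of-words counts over a fixed vocab.
    A real system would call an embedding model here instead."""
    counts = [tokenize(text).count(w) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# "Vector store": embed the corpus once; the base model stays frozen.
corpus = [
    "Invoices are archived for seven years.",
    "Refunds are processed within 14 days.",
    "The API rate limit is 100 requests per minute.",
]
vocab = sorted({t for doc in corpus for t in tokenize(doc)})
index = [(chunk, embed(chunk, vocab)) for chunk in corpus]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Fetch only the top-k relevant chunks at query time."""
    q = embed(query, vocab)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The retrieved chunk is prepended to the prompt, so the answer can
# cite a verifiable source instead of relying on model memory.
context = retrieve("How long do refunds take?")[0]
prompt = f"Context: {context}\nQuestion: How long do refunds take?"
```

Updating the corpus only requires re-embedding the changed chunks, which is why the pattern suits data that changes frequently.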


Cost model

  • Embeddings (one-time, amortised over ~6 months)
  • Vector DB ($70–$500/mo depending on corpus size)
  • Retrieval + generation inference
  • Operational overhead ~15%
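The line items above combine into a simple monthly estimate. The sketch below uses the stated 6-month amortisation and ~15% overhead; all dollar inputs in the example are illustrative placeholders, not vendor quotes.

```python
def rag_monthly_cost(
    embedding_one_time: float,    # one-time corpus embedding run
    vector_db_monthly: float,     # e.g. $70-$500 depending on corpus size
    inference_monthly: float,     # retrieval + generation inference
    amortisation_months: int = 6, # amortise the embedding run over ~6 months
    overhead_rate: float = 0.15,  # ~15% operational overhead
) -> float:
    """Back-of-envelope monthly cost for the RAG line items."""
    base = (
        embedding_one_time / amortisation_months
        + vector_db_monthly
        + inference_monthly
    )
    return base * (1 + overhead_rate)

# Example: $600 embedding run, $200/mo vector DB, $1,500/mo inference.
monthly = rag_monthly_cost(600, 200, 1500)  # -> 2070.0
```

Because the embedding cost is amortised, frequent full re-embeds shorten the effective amortisation window and can dominate the estimate.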

When to pick this pattern

  • Data updates daily or faster
  • Audit-grade citations are required
  • Corpus exceeds 10K documents
  • ML team is early-stage or non-existent
  • Latency budget ≥ 500 ms end-to-end
  • Monthly budget ceiling ≤ $10K at launch
  • Data residency must stay on-premises or in EU

When to avoid it

  • Highly specialised vocabulary with no ground-truth corpus
  • Hard latency SLA < 200 ms
  • Static corpus with very high query volume (long-context may be cheaper)
  • Air-gapped environment with no hosted-API access
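The pick/avoid criteria above amount to a small rule set. The function below encodes the thresholds exactly as stated in the two lists; treating the "avoid" items as hard disqualifiers and the "pick" items as a conjunction is an illustrative simplification, not a definitive decision procedure.

```python
def recommend_rag(
    update_interval_days: float,   # how often the data changes
    needs_citations: bool,         # audit-grade citations required?
    corpus_docs: int,              # corpus size in documents
    latency_budget_ms: float,      # end-to-end latency budget
    monthly_budget_usd: float,     # budget ceiling at launch
    has_ground_truth_corpus: bool = True,
    air_gapped: bool = False,      # no hosted-API access
) -> bool:
    """Rule-based sketch of the 'when to pick / avoid' checklists."""
    # Hard disqualifiers from "When to avoid it".
    if latency_budget_ms < 200:
        return False
    if not has_ground_truth_corpus:
        return False
    if air_gapped:
        return False
    # Positive signals from "When to pick this pattern".
    return (
        update_interval_days <= 1        # data updates daily or faster
        and needs_citations
        and corpus_docs > 10_000
        and latency_budget_ms >= 500
        and monthly_budget_usd <= 10_000
    )
```

A real evaluation would weigh these signals rather than AND them together, but the thresholds are the ones the lists give.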

Common pitfalls

  • Re-ranker layers can blow the latency budget
  • Naive chunking breaks table schemas and code blocks
  • Embedding drift after 6+ months of model updates
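The chunking pitfall above is concrete enough to sketch: a naive fixed-size splitter will happily cut a fenced code block or table in half. The hypothetical splitter below defers the split until it is outside a code fence; it is a sketch of the idea, not a specific library's chunker.

```python
def chunk_markdown(text: str, max_lines: int = 20) -> list[str]:
    """Split text into chunks of ~max_lines, never inside a code fence."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        current.append(line)
        if line.strip().startswith("```"):
            in_fence = not in_fence
        # Only split at a boundary when we are outside a code fence,
        # so a fenced block is never torn across two chunks.
        if len(current) >= max_lines and not in_fence:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks
```

The same gate works for tables (track whether the current line is a table row) at the cost of occasionally oversized chunks.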
