Architecture Pattern

Retrieval-Augmented Generation

RAG wins when your data changes weekly or faster, citations are mandatory, and your ML team is early-stage. It keeps the base model frozen, embeds your corpus into a vector store, and fetches only the relevant chunks at query time, giving you verifiable outputs and straightforward data governance without a training run.
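The retrieve-then-generate loop can be sketched with a toy bag-of-words embedding standing in for a real embedding model and vector store. The corpus, query, and `embed`/`retrieve` helpers here are illustrative, not any specific library's API:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a real system calls an embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Fetch the k chunks most similar to the query; only these go into the prompt."""
    q = embed(query)
    return sorted(corpus, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)[:k]

corpus = [
    "Invoices are archived after 90 days.",
    "The refund window is 30 days from purchase.",
    "Support tickets are triaged within 4 hours.",
]
top = retrieve("How long do I have to request a refund?", corpus, k=1)
```

In production the toy embedding is replaced by a hosted or self-hosted embedding model, and the linear scan by a vector database's approximate nearest-neighbour index; the frozen base model only ever sees the retrieved chunks.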


Cost model

  • Embeddings (one-time, amortised over ~6 months)
  • Vector DB ($70–$500/mo depending on corpus size)
  • Retrieval + generation inference (per query)
  • Operational overhead (~15% on top of the above)
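The line items above combine into a back-of-envelope monthly total. The dollar figures in the example call are hypothetical placeholders, not quoted prices:

```python
def monthly_cost(embed_one_time: float, vector_db_monthly: float,
                 queries_per_month: int, inference_per_query: float,
                 amortise_months: int = 6, overhead: float = 0.15) -> float:
    """Amortised embedding cost + vector DB + inference, plus ~15% operational overhead."""
    base = (embed_one_time / amortise_months
            + vector_db_monthly
            + queries_per_month * inference_per_query)
    return round(base * (1 + overhead), 2)

# Hypothetical figures: $1,200 one-time embedding run, $200/mo vector DB,
# 100,000 queries/month at $0.002 per query.
total = monthly_cost(1_200, 200, 100_000, 0.002)  # → 690.0
```

The amortised embedding term is why re-embedding cadence matters: halving the amortisation window doubles that line item.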

When to pick this pattern

  • ✓ Data updates daily or faster
  • ✓ Audit-grade citations are required
  • ✓ Corpus exceeds 10K documents
  • ✓ ML team is early-stage or non-existent
  • ✓ Latency budget ≥ 500 ms end-to-end
  • ✓ Monthly budget ceiling ≤ $10K at launch
  • ✓ Data residency must stay on-premises or in the EU

When to avoid it

  • ✗ Highly specialised vocabulary with no ground-truth corpus
  • ✗ Hard latency SLA < 200 ms
  • ✗ Static corpus with very high query volume (long-context may be cheaper)
  • ✗ Air-gapped environment with no hosted-API access
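The two checklists can be collapsed into a rough screening function. This is a simplified sketch whose thresholds mirror the bullets above, not the full decision wizard:

```python
def rag_recommended(update_interval_days: float, citations_required: bool,
                    corpus_docs: int, latency_budget_ms: float,
                    air_gapped_no_hosted_api: bool = False) -> bool:
    """Screen a workload against the pick/avoid bullets (simplified)."""
    # Hard disqualifiers from the "avoid" list.
    if latency_budget_ms < 200 or air_gapped_no_hosted_api:
        return False
    # Positive signals from the "pick" list.
    return (update_interval_days <= 1   # data updates daily or faster
            or citations_required       # audit-grade citations required
            or corpus_docs > 10_000)    # corpus exceeds 10K documents
```

A real assessment would also weigh the budget ceiling and vocabulary fit, which this sketch omits.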

Common pitfalls

  • ⚠ Re-ranker layers can blow the latency budget
  • ⚠ Naive chunking breaks table schemas and code blocks
  • ⚠ Embedding drift after 6+ months of model updates
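The chunking pitfall is commonly addressed by making the splitter fence-aware. A minimal sketch for Markdown-style ``` fences (production splitters also respect tables and headings; `chunk_lines` and its parameters are illustrative):

```python
def chunk_lines(text: str, max_lines: int = 8) -> list[str]:
    """Split text every max_lines lines, but never inside a ``` fenced code block."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence
        current.append(line)
        # Only cut at a boundary when we are outside any code fence.
        if len(current) >= max_lines and not in_fence:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks

# A document whose code block straddles the naive 8-line boundary.
doc = "\n".join(["prose"] * 6 + ["```", "x = 1", "y = 2", "z = 3", "```"] + ["prose"] * 3)
chunks = chunk_lines(doc)
```

A splitter that cut blindly at line 8 would slice the fence in half; here the cut is deferred until the fence closes, so every chunk contains balanced fences.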
