Architecture patterns

Four patterns, one right answer for your use case.

Each pattern has distinct cost drivers, operational requirements, and failure modes. The assistant above evaluates all four against your specific inputs.

RAG·For fresh data + cited answers

Retrieval-Augmented Generation

RAG wins when your data changes weekly or faster, citations are mandatory, and your ML team is early-stage. It keeps the base model frozen, embeds your corpus into a vector store, and fetches only the relevant chunks at query time — giving you verifiable outputs and straightforward data governance without a training run.
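
As a rough sketch (not the assistant's implementation), the retrieval half can be as small as an embedding model plus cosine similarity. The snippet below assumes `sentence-transformers` and leaves the final LLM call to whichever endpoint you already use:

```python
# Minimal RAG sketch: embed a corpus once, retrieve top-k chunks per query.
# The corpus snippets are illustrative; the LLM call itself is left out.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Policy 12.4: refunds are processed within 14 days.",
    "Policy 3.1: enterprise contracts renew annually.",
    "Release notes 2024-06: the export API now supports CSV.",
]
corpus_vectors = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus chunks most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vectors @ q               # cosine similarity (normalized vectors)
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

query = "How long do refunds take?"
context = retrieve(query)
prompt = "Answer using only these sources and cite them:\n" + "\n".join(
    f"[{i + 1}] {c}" for i, c in enumerate(context)
) + f"\n\nQuestion: {query}"
# The prompt is then sent to a frozen base model; no training run is involved.
print(prompt)
```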

Pick this when

  • Data updates daily or faster
  • Audit-grade citations are required
  • Corpus exceeds 10 K documents
Fine-Tune·For tight latency + domain voice

Parameter-Efficient Fine-Tuning (LoRA / QLoRA)

Fine-tuning shines when your domain has highly specialised vocabulary, a strict output format, or latency requirements below 300 ms. LoRA and QLoRA adapt only a small fraction of model weights, keeping training costs manageable ($1 K–$25 K per run). The resulting model is faster at inference and requires no retrieval hop, but it cannot incorporate new information without a retraining cycle.
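
A minimal LoRA configuration with Hugging Face's `peft` library looks roughly like this; the base checkpoint and hyperparameters are illustrative, not a recommended recipe:

```python
# LoRA setup sketch with `peft`: only small adapter matrices are trained,
# so the bulk of the base model's weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"          # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,                                    # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # typically well under 1% of weights
# Training then proceeds with your usual Trainer / SFT loop on domain data.
```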

Pick this when

  • Domain vocabulary is highly specialised (medical, legal, financial jargon)
  • Consistent output format or tone is required
  • Latency SLA < 300 ms and retrieval hop is unacceptable
Long-Ctx·For small, static corpora

Long-Context Prompting

Long-context prompting stuffs your entire relevant document set into the model's context window — 200 K tokens with Claude 3.5 and up to 1 M with Gemini 1.5 Pro. It requires zero training, zero vector infrastructure, and delivers an answer in a single API call. It is the right default for small corpora (< 500 documents) with low query volumes, where simplicity outweighs per-query token cost.
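
A sketch of the whole pattern, assuming a local folder of Markdown files and the Anthropic SDK (the model id is a placeholder for whatever long-context model you choose):

```python
# Long-context sketch: concatenate the corpus into one prompt, make one call.
from pathlib import Path
import anthropic

docs = sorted(Path("docs").glob("*.md"))                  # assumed local corpus
corpus = "\n\n---\n\n".join(p.read_text() for p in docs)

client = anthropic.Anthropic()                            # reads ANTHROPIC_API_KEY
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",                   # placeholder model id
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{corpus}\n\nUsing only the documents above, "
                   "answer: what changed in the June release?",
    }],
)
print(response.content[0].text)
# No embeddings, no vector store, no training: every query simply pays for the
# full corpus in input tokens.
```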

Pick this when

  • Corpus fits in < 200 K tokens (a few hundred documents)
  • Query volume < 50 K/month
  • No ML team — zero setup beyond an API key
Hybrid·For accuracy + consistent style

Hybrid / Retrieval-Augmented Fine-Tuning (RAFT)

Hybrid (also called RAFT) combines RAG's real-time retrieval with fine-tuning's domain adaptation. The model is trained to reason over retrieved documents — significantly reducing hallucination compared to RAG alone while preserving the ability to incorporate new information. It is the highest-performance option but also the most expensive and operationally complex. Recommended only when capability ≥ 3 and budget ≥ $15 K/mo.
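
As an illustration of the training-data side, a RAFT-style example pairs each question with retrieved documents, including distractors, so the model learns to cite the right source. The helper and field names below are assumptions, not a published recipe:

```python
# RAFT-style training-data sketch: one golden source plus distractors per
# question, with the answer citing the correct document after shuffling.
import json
import random

def build_raft_example(question: str, answer_template: str,
                       golden_doc: str, distractors: list[str]) -> dict:
    """Assemble one chat-format supervised example with a shuffled context."""
    docs = [golden_doc] + random.sample(distractors, k=2)
    random.shuffle(docs)                      # the model must find the source
    cite = docs.index(golden_doc) + 1         # citation matches shuffled order
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return {
        "messages": [
            {"role": "user",
             "content": f"{context}\n\nQuestion: {question}\n"
                        "Answer and cite the document you used."},
            {"role": "assistant", "content": answer_template.format(cite=cite)},
        ]
    }

example = build_raft_example(
    question="How long do refunds take?",
    answer_template="Refunds are processed within 14 days [{cite}].",
    golden_doc="Policy 12.4: refunds are processed within 14 days.",
    distractors=["Policy 3.1: enterprise contracts renew annually.",
                 "Policy 7.2: support is available 24/7.",
                 "Release notes 2024-06: the export API now supports CSV."],
)
print(json.dumps(example, indent=2))
# At inference time, the same retrieval pipeline feeds the fine-tuned model.
```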

Pick this when

  • Both citation accuracy and domain vocabulary are critical
  • Query volume > 1 M/month (justifies the training investment)
  • Strong in-house ML team (capability ≥ 3)

FAQ

Frequently asked questions

Common questions about how the decision engine works and how to interpret your recommendation.

How does the decision engine work?

It asks nine questions about your data freshness, query volume, citation needs, latency SLA, data sensitivity, domain specificity, ML team capability, and budget, then returns a deterministic recommendation — RAG, Fine-Tuning, Long-Context, or Hybrid — plus a four-way cost comparison, an architecture diagram, a risk register, and a CFO-ready PDF.
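
For illustration only, a deterministic recommendation can be expressed as a plain rule-based function of the questionnaire answers; the inputs and thresholds below are invented placeholders, not the engine's actual scoring:

```python
# Illustrative-only sketch of a deterministic rule-based recommendation.
from dataclasses import dataclass

@dataclass
class Inputs:
    data_changes_weekly: bool      # data freshness
    needs_citations: bool          # audit-grade citations
    latency_slo_ms: int            # latency SLA
    monthly_queries: int           # query volume
    ml_capability: int             # 0 (none) to 5 (strong)
    monthly_budget_usd: int

def recommend(x: Inputs) -> str:
    """Same inputs always yield the same answer; no model call involved."""
    if (x.ml_capability >= 3 and x.monthly_budget_usd >= 15_000
            and x.needs_citations and x.monthly_queries > 1_000_000):
        return "Hybrid (RAFT)"
    if x.latency_slo_ms < 300 and x.ml_capability >= 2:
        return "Fine-Tune (LoRA/QLoRA)"
    if x.data_changes_weekly or x.needs_citations:
        return "RAG"
    return "Long-Context"

print(recommend(Inputs(True, True, 800, 200_000, 1, 5_000)))  # -> "RAG"
```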

Get help deciding

Want a second opinion on the recommendation?

Book a 20-minute architecture review with our team. We will verify the scoring against your constraints and share practical implementation notes.