Architecture Pattern

Long-Context Prompting

Long-context prompting stuffs your entire relevant document set into the model's context window: up to 1 M tokens with Gemini 1.5 Pro, or 200 K with Claude 3.5 Sonnet. It requires zero training and zero vector infrastructure, and it delivers an answer in a single API call. It is the right default for small corpora (< 500 documents) with low query volumes, where simplicity outweighs the per-query token cost.
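
A minimal sketch of the pattern, using the Anthropic Python SDK as one example of a hosted long-context API; the model id, the XML-style document delimiters, and the rough 4-characters-per-token budget check are illustrative assumptions, not provider requirements.

```python
import anthropic
from pathlib import Path

MAX_CONTEXT_TOKENS = 200_000   # budget for the model you target; adjust to your provider
CHARS_PER_TOKEN = 4            # rough heuristic for English prose

def build_prompt(doc_dir: str, question: str) -> str:
    """Concatenate every document into one prompt, with simple delimiters."""
    parts = []
    for path in sorted(Path(doc_dir).glob("*.txt")):
        parts.append(f"<document name={path.name!r}>\n{path.read_text()}\n</document>")
    corpus = "\n\n".join(parts)
    if len(corpus) // CHARS_PER_TOKEN > MAX_CONTEXT_TOKENS:
        raise ValueError("corpus likely exceeds the context window; consider chunking or RAG")
    return f"{corpus}\n\nUsing only the documents above, answer:\n{question}"

def answer(doc_dir: str, question: str) -> str:
    client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",     # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(doc_dir, question)}],
    )
    return response.content[0].text
```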


Cost model

  • Per-query input token cost (often 5–20× higher than RAG per query)
  • Cache hit discount (50–75% off cached input tokens)
  • Batch API discount (50% off for async workloads)
  • No infrastructure cost beyond the LLM API
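
To see where the crossover with RAG lands for a given workload, here is a back-of-the-envelope sketch of the model above; the per-million-token price, cache discount, and infrastructure figure are placeholder assumptions to replace with your provider's actual rates.

```python
def long_context_monthly_cost(
    corpus_tokens: int,
    queries_per_month: int,
    price_per_m_input: float = 3.00,   # $/1M input tokens; placeholder, check your provider
    cache_hit_rate: float = 0.8,       # fraction of queries that reuse the cached corpus
    cache_discount: float = 0.75,      # cached input tokens billed at 25% of list price
) -> float:
    """Approximate monthly input-token spend for long-context prompting."""
    full_price = corpus_tokens / 1_000_000 * price_per_m_input
    cached_price = full_price * (1 - cache_discount)
    per_query = cache_hit_rate * cached_price + (1 - cache_hit_rate) * full_price
    return per_query * queries_per_month

def rag_monthly_cost(
    retrieved_tokens_per_query: int,
    queries_per_month: int,
    price_per_m_input: float = 3.00,
    infra_per_month: float = 500.0,    # vector DB + embedding pipeline; placeholder
) -> float:
    """Approximate monthly spend for a RAG pipeline on the same provider."""
    per_query = retrieved_tokens_per_query / 1_000_000 * price_per_m_input
    return per_query * queries_per_month + infra_per_month

if __name__ == "__main__":
    lc = long_context_monthly_cost(corpus_tokens=150_000, queries_per_month=20_000)
    rag = rag_monthly_cost(retrieved_tokens_per_query=4_000, queries_per_month=20_000)
    print(f"long-context: ${lc:,.0f}/mo   RAG: ${rag:,.0f}/mo")
```

With these placeholder numbers the long-context per-query cost is roughly 15× the RAG per-query cost, but RAG only wins overall once query volume outweighs its fixed infrastructure cost.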

When to pick this pattern

  • Corpus fits in < 200 K tokens (a few hundred documents)
  • Query volume < 50 K/month
  • No ML team — zero setup beyond an API key
  • Prototype or proof-of-concept with a 1-week deadline
  • High cache hit rate expected (same documents reused across queries)
  • Data residency allows hosted-API model providers

When to avoid it

  • Corpus exceeds 500 K tokens (cost becomes prohibitive)
  • Query volume > 200 K/month (RAG becomes cheaper)
  • On-premises or air-gapped requirement (no hosted API access)
  • Strict latency SLA < 500 ms (a long context increases time to first token, TTFT)
  • Output requires precise citations (model may misquote)
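
The two checklists above can be collapsed into a simple rule of thumb. This sketch uses the thresholds listed on this page; treating residency, latency, and citation needs as hard blockers is an assumed weighting, not a universal rule.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    corpus_tokens: int
    queries_per_month: int
    needs_on_prem: bool = False
    latency_sla_ms: int = 2_000
    needs_precise_citations: bool = False

def long_context_fit(w: Workload) -> tuple[bool, list[str]]:
    """Return (recommended, reasons against) using the thresholds listed above."""
    reasons = []
    if w.corpus_tokens > 500_000:
        reasons.append("corpus exceeds 500 K tokens")
    if w.queries_per_month > 200_000:
        reasons.append("query volume above 200 K/month favours RAG")
    if w.needs_on_prem:
        reasons.append("on-premises / air-gapped rules out hosted long-context APIs")
    if w.latency_sla_ms < 500:
        reasons.append("sub-500 ms SLA conflicts with long-context TTFT")
    if w.needs_precise_citations:
        reasons.append("precise citations are unreliable over very long contexts")
    return (not reasons, reasons)

ok, why = long_context_fit(Workload(corpus_tokens=150_000, queries_per_month=30_000))
print("long-context recommended" if ok else f"consider RAG: {'; '.join(why)}")
```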

Common pitfalls

  • "Lost in the middle" — models attend poorly to content in the centre of very long contexts
  • Costs can surprise teams when query volume spikes unexpectedly
  • Context window limits may force chunking anyway for large corpora
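
A common mitigation for the first pitfall is to reorder documents so the most relevant material sits at the beginning and end of the prompt, where attention is strongest. The sketch below assumes you already have per-document relevance scores from a cheap keyword or embedding pass.

```python
def order_for_long_context(docs: list[str], scores: list[float]) -> list[str]:
    """Place the most relevant documents at the edges of the context
    and the least relevant ones in the middle, where attention is weakest."""
    ranked = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)]
    front, back = [], []
    for i, doc in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]   # best docs first and last, weakest in the centre
```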

Frequently asked questions

Is Long-Context Prompting right for your workload?

Answer 9 questions to get a deterministic recommendation, cost crossover chart, and PDF report.

Run the full decision wizard