Architecture Pattern

Long-Context Prompting

Long-context prompting stuffs your entire relevant document set into the model's context window: up to 1 M tokens with Gemini 1.5 Pro, or 200 K with Claude 3.5 Sonnet. It requires zero training and zero vector infrastructure, and it delivers an answer in a single API call. It is the right default for small corpora (< 500 documents) with low query volumes, where simplicity outweighs the per-query token cost.
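
A minimal sketch of the pattern, using the Anthropic Python SDK as one example of a hosted long-context API; the model id, the XML-style document delimiters, and the rough 4-characters-per-token budget check are illustrative assumptions, not provider requirements.

```python
import anthropic
from pathlib import Path

MAX_CONTEXT_TOKENS = 200_000   # budget for the model you target; adjust to your provider
CHARS_PER_TOKEN = 4            # rough heuristic for English prose

def build_prompt(doc_dir: str, question: str) -> str:
    """Concatenate every document into one prompt, with simple delimiters."""
    parts = []
    for path in sorted(Path(doc_dir).glob("*.txt")):
        parts.append(f"<document name={path.name!r}>\n{path.read_text()}\n</document>")
    corpus = "\n\n".join(parts)
    if len(corpus) // CHARS_PER_TOKEN > MAX_CONTEXT_TOKENS:
        raise ValueError("corpus likely exceeds the context window; consider chunking or RAG")
    return f"{corpus}\n\nUsing only the documents above, answer:\n{question}"

def answer(doc_dir: str, question: str) -> str:
    client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",     # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(doc_dir, question)}],
    )
    return response.content[0].text
```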


Cost model

  • Per-query input token cost (often 5–20× higher than RAG per query)
  • Cache hit discount (50–75% off cached input tokens)
  • Batch API discount (50% off for async workloads)
  • No infrastructure cost beyond the LLM API
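
To see where the crossover with RAG lands for a given workload, here is a back-of-the-envelope sketch of the model above; the per-million-token price, cache discount, and infrastructure figure are placeholder assumptions to replace with your provider's actual rates.

```python
def long_context_monthly_cost(
    corpus_tokens: int,
    queries_per_month: int,
    price_per_m_input: float = 3.00,   # $/1M input tokens; placeholder, check your provider
    cache_hit_rate: float = 0.8,       # fraction of queries that reuse the cached corpus
    cache_discount: float = 0.75,      # cached input tokens billed at 25% of list price
) -> float:
    """Approximate monthly input-token spend for long-context prompting."""
    full_price = corpus_tokens / 1_000_000 * price_per_m_input
    cached_price = full_price * (1 - cache_discount)
    per_query = cache_hit_rate * cached_price + (1 - cache_hit_rate) * full_price
    return per_query * queries_per_month

def rag_monthly_cost(
    retrieved_tokens_per_query: int,
    queries_per_month: int,
    price_per_m_input: float = 3.00,
    infra_per_month: float = 500.0,    # vector DB + embedding pipeline; placeholder
) -> float:
    """Approximate monthly spend for a RAG pipeline on the same provider."""
    per_query = retrieved_tokens_per_query / 1_000_000 * price_per_m_input
    return per_query * queries_per_month + infra_per_month

if __name__ == "__main__":
    lc = long_context_monthly_cost(corpus_tokens=150_000, queries_per_month=20_000)
    rag = rag_monthly_cost(retrieved_tokens_per_query=4_000, queries_per_month=20_000)
    print(f"long-context: ${lc:,.0f}/mo   RAG: ${rag:,.0f}/mo")
```

With these placeholder numbers the long-context per-query cost is roughly 15× the RAG per-query cost, but RAG only wins overall once query volume outweighs its fixed infrastructure cost.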

When to pick this pattern

  • Corpus fits in < 200 K tokens (a few hundred documents)
  • Query volume < 50 K/month
  • No ML team — zero setup beyond an API key
  • Prototype or proof-of-concept with a 1-week deadline
  • High cache hit rate expected (same documents reused across queries)
  • Data residency allows hosted-API model providers

When to avoid it

  • Corpus exceeds 500 K tokens (cost becomes prohibitive)
  • Query volume > 200 K/month (RAG becomes cheaper)
  • On-premises or air-gapped requirement (no hosted API access)
  • Strict latency SLA < 500 ms (a long context increases time to first token, TTFT)
  • Output requires precise citations (model may misquote)
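
The two checklists above can be collapsed into a simple rule of thumb. This sketch uses the thresholds listed on this page; treating residency, latency, and citation needs as hard blockers is an assumed weighting, not a universal rule.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    corpus_tokens: int
    queries_per_month: int
    needs_on_prem: bool = False
    latency_sla_ms: int = 2_000
    needs_precise_citations: bool = False

def long_context_fit(w: Workload) -> tuple[bool, list[str]]:
    """Return (recommended, reasons against) using the thresholds listed above."""
    reasons = []
    if w.corpus_tokens > 500_000:
        reasons.append("corpus exceeds 500 K tokens")
    if w.queries_per_month > 200_000:
        reasons.append("query volume above 200 K/month favours RAG")
    if w.needs_on_prem:
        reasons.append("on-premises / air-gapped rules out hosted long-context APIs")
    if w.latency_sla_ms < 500:
        reasons.append("sub-500 ms SLA conflicts with long-context TTFT")
    if w.needs_precise_citations:
        reasons.append("precise citations are unreliable over very long contexts")
    return (not reasons, reasons)

ok, why = long_context_fit(Workload(corpus_tokens=150_000, queries_per_month=30_000))
print("long-context recommended" if ok else f"consider RAG: {'; '.join(why)}")
```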

Common pitfalls

  • "Lost in the middle" — models attend poorly to content in the centre of very long contexts
  • Costs can surprise teams when query volume spikes unexpectedly
  • Context window limits may force chunking anyway for large corpora
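
A common mitigation for the first pitfall is to reorder documents so the most relevant material sits at the beginning and end of the prompt, where attention is strongest. The sketch below assumes you already have per-document relevance scores from a cheap keyword or embedding pass.

```python
def order_for_long_context(docs: list[str], scores: list[float]) -> list[str]:
    """Place the most relevant documents at the edges of the context
    and the least relevant ones in the middle, where attention is weakest."""
    ranked = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)]
    front, back = [], []
    for i, doc in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]   # best docs first and last, weakest in the centre
```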

Frequently asked questions

Is Long-Context Prompting right for your workload?

Answer 9 questions to get a deterministic recommendation, cost crossover chart, and PDF report.

Run the full decision wizard