Large Language Model Development: The $10M–$100M Cost Reality Check
Large language model development isn’t a “weekend project.” See what drives $10M–$100M+ costs—and smarter options like fine-tuning and RAG.

If your plan is to “build our own LLM,” your first deliverable isn’t a model—it’s a capital allocation memo. In 2025, large language model development from scratch routinely lands in the $10M–$100M+ band once you count compute, data, and the team needed to make it reliable.
We’re seeing real board and executive pressure to have a “proprietary” model. Sometimes that’s a legitimate moat. Often it’s an identity project: “We’re an AI company, therefore we must own the weights.” The economics don’t care about that story.
There’s also a category error hiding in plain sight. “Build an LLM” can mean pretraining a foundation model, continuing pretraining on your domain, fine-tuning for behavior, or deploying retrieval-augmented generation (RAG) on top of existing models. These are different projects with radically different budgets, risks, and timelines.
In this guide we’ll lay out a CFO-grade cost model—compute, data, talent, infrastructure, and governance—then use it to make build-vs-buy decisions that don’t require hype or vendor theater. At Buzzi.ai, we build AI agents and enterprise implementations grounded in outcomes and total cost of ownership (TCO), not wishful thinking.
By the end, you’ll have three viable paths: buy/host foundation models, fine-tune selectively, or combine models with retrieval and workflow automation to ship value in months—not years.
What “large language model development” really means (and why it’s confusing)
The phrase “large language model development” is overloaded. That’s not an accident. When categories blur, budgets blur too—and suddenly the conversation shifts from “Can we justify this spend?” to “Can we get a demo by next month?”
Let’s define terms once, then use them consistently.
Three different projects that get labeled “build an LLM”
1) From-scratch pretraining (new base model) is what most people imagine: you start with random weights, train on a massive corpus, and produce a new foundation model. You own the weights and the roadmap, but you also own the full risk surface: data rights, safety, evaluation, and years of iteration.
2) Domain adaptation / continued pretraining means you take an existing base model and keep pretraining it on domain-specific text (or a mixture). This can improve vocabulary, style, and domain reasoning—especially when you have large, high-quality proprietary text. It’s materially cheaper than from-scratch, but it still looks like “real training,” with real iteration costs.
3) Fine-tuning (SFT/LoRA) is about behavior: instruction-following, tone, formats, and task performance for a defined distribution. Techniques like LoRA/adapters reduce compute and let you upgrade the underlying foundation models later without redoing everything. You get faster time-to-value, but you still need curated data and rigorous evaluation to avoid “works in demo, fails in production.”
Vendors blur these categories because “custom LLM” sounds like from-scratch ownership but is often fine-tuning plus a prompt template. That can be totally fine—if you price it like fine-tuning, and if you measure it like a product.
A quick vignette we see often: a bank asks for a “proprietary LLM.” When you unpack the request, they actually need governed Q&A and drafting inside a controlled domain (policies, product docs, compliance checklists) with citations, audit logs, and access controls. That’s a RAG + workflow problem, not a pretraining problem.
In practice, most enterprise “custom LLM” work uses foundation models (hosted or on-prem) plus a layer of retrieval, tools, and guardrails. Sometimes you add fine-tuning on top. Rarely do you start from scratch.
The two cost centers people forget: iteration and reliability
People budget for the “one big training run.” Reality is that you pay per training run, per failed run, and per evaluation cycle. Large language model development cost doesn’t come from a single line item; it comes from the loop.
Reliability work also gets underestimated because it doesn’t look like research. It looks like a product team: evaluation harnesses, regression tests, safety gates, incident response, and ongoing monitoring. In regulated workflows, the reliability bar is not “it answered correctly most of the time.” It’s “we can predict and bound failure modes.”
Example: if hallucinations in a regulated workflow must be below 1% on an internal eval set, you may need multiple rounds of data improvements, prompt changes, retrieval adjustments, and fine-tuning iterations. Each loop has compute costs, but it also has human costs: SMEs labeling edge cases, QA running suites, and engineers instrumenting the system.
The hidden multiplier in LLM projects is not “model size.” It’s how many times you have to iterate before you can trust the system in production.
The CFO view: the five line items that drive LLM cost
Every serious LLM initiative—whether you buy, fine-tune, or pretrain—spends money in five buckets. The trick is that only one of them (compute) is visible on day one.
Compute: pretraining vs fine-tuning vs inference
Compute splits into two fundamentally different bills: training compute (bursty, experimental, capital-intensive) and inference (recurring, scales with usage). The former scares CFOs up front; the latter surprises them after success.
Training is measured in GPU compute hours. What matters more than list price is utilization: how often your GPUs sit idle due to data bottlenecks, failed jobs, scheduling, or people waiting to review eval results. “Cheap GPUs” at 35% utilization are expensive.
The experimentation tax is real. You rarely do one training run. You do ablations, change tokenizers, fix data leaks, restart after instability, and rerun evaluation cycles. In a CFO model, assume multiple runs unless you have unusually mature data and evaluation discipline.
Inference is where costs scale with adoption: context length, response length, concurrency, latency targets, redundancy, and safety checks. A successful internal assistant can become an ongoing infrastructure product with a recurring cloud bill unless you optimize routing, caching, and context management.
Data: token budget, cleaning, rights, and labeling
For cost modeling, “data” isn’t gigabytes. It’s a token budget with quality constraints. Training on low-quality tokens can be worse than not training at all: you pay compute to learn noise, and the model becomes brittle or memorizes sensitive content.
Data costs show up as:
- Acquisition or licensing (including “free” datasets that require legal review)
- Deduplication, filtering, and quality scoring
- PII detection and removal for GDPR-compliant AI development
- Labeling for instruction tuning and evaluation sets
One common misconception: “We have tons of PDFs.” In most enterprises, that’s not a pretraining dataset. It’s a retrieval corpus. Unless you curate it, normalize it, and establish provenance, those PDFs are better used via retrieval-augmented generation than as training data.
Talent: research, engineering, and the unglamorous ops roles
Team composition drives both burn rate and velocity. A pretraining-heavy effort needs more research depth. A fine-tune+RAG product skews toward product engineering and reliability. Either way, the headcount required to build a production system is larger than most first drafts.
A realistic roster includes LLM/application engineers, data engineers, MLOps, security/compliance, product, QA/evals, and domain SMEs. Those last two—evaluation and SMEs—often get treated as “nice to have.” They’re the measurement system. Without them, you don’t know whether you’re improving or just changing.
Also: hiring friction is part of the economics. If a project requires scarce roles, your timeline stretches, your opportunity cost increases, and your total burn rises even if the model costs are unchanged.
Infrastructure & tooling: storage, pipelines, evaluation, monitoring
GPUs are the headline, but the project runs on everything around them: storage, networking, orchestration, experiment tracking, evaluation harnesses, model registries, and monitoring. In RAG systems, you add vector search, indexing pipelines, and document governance.
Even if you never touch pretraining, you still need MLOps for LLMs: prompt/version management, test suites, canary deployments, rollback plans, and observability that can answer “What did the model see?” and “Why did it respond that way?” when an incident happens.
Risk & governance: security, compliance, and safety as budget lines
In enterprise settings, “risk” is not a footnote. It’s a schedule and budget line. Governance includes access control, audit trails, data residency, vendor risk reviews, prompt injection defenses, and policy enforcement.
Safety work includes red teaming, jailbreak testing, and human-in-the-loop design for high-stakes actions. The more your agent can do—send emails, update records, initiate refunds—the more you pay to make it safe.
Scenario: a support agent tool uses RAG over internal knowledge plus a “search customer account” tool. A prompt injection embedded in a ticket could trick the system into revealing unrelated customer data. Mitigations include content sanitization, tool permissioning, strict retrieval filters, and auditing. None are free. All are cheaper than a breach.
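The "tool permissioning" mitigation above can be made concrete with a small sketch. This is illustrative, not a real agent framework API: the idea is that every tool call is checked against the ticket's scope before execution, so an injected "search customer account" instruction can't reach data outside the current customer.

```python
# Illustrative tool-permissioning check for an agent handling a support
# ticket. Names (ToolCall, authorize) are hypothetical, not a framework API.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    customer_id: str

def authorize(call: ToolCall, ticket_customer_id: str,
              allowed_tools: frozenset[str]) -> bool:
    # Deny any tool outside the agent's allow-list, and deny lookups of
    # any customer other than the one who opened the ticket -- this is
    # what blunts a prompt-injected "search customer account" request.
    return call.tool in allowed_tools and call.customer_id == ticket_customer_id

allowed = frozenset({"search_customer_account", "draft_reply"})
print(authorize(ToolCall("search_customer_account", "C-123"), "C-123", allowed))
print(authorize(ToolCall("search_customer_account", "C-999"), "C-123", allowed))
```

The design choice worth noting: authorization lives outside the model. The LLM can be tricked into *requesting* an out-of-scope call; a deterministic gate decides whether it *executes*.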
For a governance baseline, the NIST AI Risk Management Framework (AI RMF) is a useful north star—especially for aligning security, compliance, and product teams around shared controls.
Compute requirements in plain numbers: what changes at 7B vs 70B parameters
“How much does it cost to develop a large language model?” usually turns into a parameter-count debate. Parameter count matters, but it’s a proxy for budget, not a goal. What you want is performance on your tasks at an acceptable risk level.
Parameter count is a proxy for budget, not a goal
Parameters correlate with training compute and inference costs. Bigger models can be more capable, but they’re also more expensive to train and to serve. And scaling has diminishing returns: you often get more enterprise value by improving data quality, retrieval, and workflow integration than by doubling parameters.
Contrast two products:
- A general-purpose chat assistant competing with top consumer models (hard problem, broad distribution).
- A contract Q&A assistant that must answer from your approved templates and policies, with citations (narrower scope, strong governance).
The second can succeed with smaller models plus retrieval-augmented generation and strong evaluation. The first is where parameter count becomes destiny.
Scaling laws (like DeepMind’s Chinchilla work) provide the intuition: more compute and more data help, but not linearly. The paper is here: Training Compute-Optimal Large Language Models.
How much GPU time are we talking about? (order-of-magnitude)
We’ll stay honest: any estimate is sensitive to architecture, sequence length, efficiency techniques, hardware, and how many times you restart. But CFOs don’t need false precision—they need correct orders of magnitude.
A simple structure is:
Cost ≈ (GPU-hours × $/GPU-hour) + overhead
Overhead includes data pipelines, storage/egress, orchestration, evaluation, and people time.
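To make the structure tangible, here is a back-of-envelope sketch of the compute term. Every number is an assumption (sustained FLOPs per GPU, utilization, blended hourly rate, run multiplier), and the 6 × parameters × tokens FLOPs rule of thumb is the standard dense-transformer approximation — treat the output as an order of magnitude, not a quote.

```python
# Back-of-envelope training compute estimate. All defaults are
# illustrative assumptions, not vendor pricing.
def training_cost_estimate(
    params: float,                      # model parameters, e.g. 7e9
    tokens: float,                      # training tokens, e.g. 1e11
    gpu_flops: float = 4e14,            # assumed sustained FLOPs/s per GPU
    utilization: float = 0.4,           # realistic cluster utilization, not peak
    dollars_per_gpu_hour: float = 3.0,  # assumed blended cloud rate
    run_multiplier: float = 3.0,        # restarts, ablations, failed runs
) -> dict:
    flops = 6 * params * tokens         # dense-transformer training FLOPs rule of thumb
    gpu_hours = flops / (gpu_flops * utilization) / 3600
    compute_cost = gpu_hours * dollars_per_gpu_hour * run_multiplier
    return {"gpu_hours_per_run": round(gpu_hours),
            "compute_cost_with_iteration": round(compute_cost)}

# 7B model, 100B tokens of continued pretraining
print(training_cost_estimate(params=7e9, tokens=1e11))
```

Notice that the run multiplier — the iteration loop — moves the answer as much as the model size does, which is exactly the point the formula is making.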
Here are three illustrative scenarios (assumptions: modern high-end GPUs, significant experimentation, and production-grade evaluation; numbers are deliberately ranges):
- 7B continued pretraining (domain adaptation): think tens of thousands to a few hundred thousand GPU-hours depending on token count and efficiency. At typical cloud GPU pricing, this can land in the high six figures to several million dollars for compute alone, then multiply with iteration and staffing.

- 13B from-scratch: the training run grows materially, and so does the iteration loop. You’re now in “few million to tens of millions” in direct compute spend across multiple runs, plus a team that looks more like a research org than a product team.
- 70B from-scratch: this is where budgets explode. You need cluster-scale training, careful engineering to avoid instability, and enough run capacity to iterate. It’s the difference between “we can experiment” and “every run is a board-level event.” This is how serious efforts reach the $10M–$100M+ band.
If you want grounding on hardware concepts (not pricing), NVIDIA’s platform documentation is a good reference point for understanding memory bandwidth, tensor cores, and why inference and training behave differently: NVIDIA Documentation.
Cloud GPUs vs owning GPUs: the real trade
Cloud gives you speed to start and flexibility. It also gives you high unit costs and occasional capacity constraints. On-prem gives you a lower unit cost if—and it’s a big if—you can keep utilization high and staff the operational burden.
A simple checklist that usually holds:
- If you can’t keep GPUs >60–70% utilized, buying rarely wins.
- If you lack deep infra talent, the ops burden becomes an unplanned “sixth line item.”
- Hybrid often wins: reserve stable capacity for inference; burst to cloud for training experiments.
For pricing reality checks, compare major cloud GPU pages (they change frequently): AWS EC2 pricing and Google Cloud GPU pricing.
Data is the second billion-dollar lever: tokens, quality, and legal rights
Compute gets headlines. Data determines whether you can ship. And in regulated industries, legal rights and provenance are what turns a prototype into a product.
How much data do you need—and what “quality” costs
The naive view is “more data is better.” The useful view is: you need diverse, clean, deduplicated data to avoid memorization and brittleness. Quality is a pipeline, not a property.
A typical pretraining dataset pipeline looks like:
- Acquire/crawl data
- Filter and deduplicate
- Remove or mask PII and sensitive content
- Tokenize and compute statistics (language mix, domain distribution)
- Quality scoring and sampling
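Two of those stages — deduplication and PII handling — can be sketched in a few lines. This is a deliberately minimal stand-in: production pipelines use fuzzy dedup (e.g. MinHash) and dedicated PII detectors rather than a single regex, and the function name here is hypothetical.

```python
# Minimal sketch of two pipeline stages: exact dedup and naive PII masking.
# Real pipelines use fuzzy dedup and proper PII detection; this regex is
# a stand-in for illustration only.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def dedupe_and_mask(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    out = []
    for doc in docs:
        # Normalize whitespace/case so trivially different copies collide
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        out.append(EMAIL_RE.sub("[EMAIL]", doc))  # mask emails before training
    return out

docs = ["Contact jane@example.com for help.",
        "Contact  jane@example.com for help.",   # duplicate after normalization
        "Policy v2 applies from March."]
print(dedupe_and_mask(docs))
```

Even this toy version makes the cost point: every stage is code that must be written, tested, and rerun each time the corpus changes — the pipeline, not the storage, is what you pay for.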
For fine-tuning, the work shifts from “billions of tokens” to “high-signal examples.” Turning messy customer support logs into an instruction-tuning dataset can require careful redaction, normalization of outcomes, and labeling of what “good” looks like. That work is often more labor than compute.
Licensing and provenance: the cost of being able to ship
Data rights are not abstract. They constrain what you can commercialize, which geographies you can serve, and how you respond to audits. Ownership vs licensing changes your downstream options.
Enterprises that need auditable lineage will find that procurement and legal become gating functions. If you can’t prove you have rights to train on a corpus—or if the corpus includes confidential client material without consent—you may block launch regardless of model performance.
This is where “AI total cost of ownership” becomes real: the cheapest dataset is the one you can confidently stand behind.
When your “data advantage” is actually a retrieval advantage
Most enterprises do have a proprietary “data advantage.” It’s just not a pretraining advantage. It’s a knowledge base: policies, manuals, contracts, tickets, and internal playbooks that change constantly. Training bakes in stale knowledge. Retrieval keeps it fresh.
Retrieval-augmented generation lets you keep data in place, update instantly, and control citations. For many teams, it’s the best alternative to in-house large language model development because it aligns with enterprise reality: dynamic documents, access controls, and the need to explain answers.
Team and burn rate: what you actually have to hire (and why)
There’s a temptation to view LLM work as “a few strong engineers plus a model.” That’s how demos happen. Production happens when you staff the unglamorous parts: data, evaluation, security, and operations.
Minimum viable team for a production LLM initiative
Here’s a practical minimum viable team for a 6-month MVP that ships a governed assistant (fine-tune + RAG or RAG-only). You can compress roles, but you can’t delete responsibilities.
- Product owner (FT): success metrics, scope control, stakeholder alignment
- LLM/app engineer (1–2 FT): prompts, tools, orchestration, UX integration
- Data engineer (1 FT): ingestion, cleaning, retrieval pipelines, access controls
- MLOps/Platform (0.5–1 FT): deployment, monitoring, cost controls, CI/CD
- Security/compliance (0.25–0.5 FT): risk review, audit trails, controls, vendor assessment
- QA/Evals (0.5–1 FT): evaluation sets, regression tests, failure mode tracking
- Domain SMEs (part-time): labeling edge cases, defining “correct” outputs
Notice what’s missing: a research team. For many enterprise use cases, you don’t need it. You need disciplined product engineering with rigorous evals.
Salary ranges and the ‘coordination tax’
Salary bands vary by region, but in most markets senior AI/ML and platform roles command a premium. The bigger cost is the coordination tax: as work becomes research-heavy, iteration slows, and each additional dependency (data, compute capacity, SMEs, compliance) increases cycle time.
A simple way to think about it: if two key hires slip by three months, your “6-month MVP” becomes a 9-month MVP. Even if your cash burn stays flat, you’ve missed a quarter of learning and a quarter of operational savings. Opportunity cost is real, even when it doesn’t show up as a line item.
Why MLOps is where LLM projects go to die (or live)
MLOps for LLMs is less about fancy pipelines and more about operational integrity: model registry, prompt/versioning, evaluation gates, monitoring, and rollback plans. Without this, teams ship “one-off intelligence” that degrades quietly until it becomes a reputational problem.
Example: a bad rollout has no feature flags, no regression tests, and no ability to attribute failures to data vs prompts vs retrieval. A safe rollout ships behind flags, runs eval gates before releases, logs tool calls with permissions, and can roll back within minutes.
Ongoing costs: inference, refresh cycles, and the ‘LLM maintenance tax’
Training is a project. Inference is a product. The moment your assistant becomes useful, your bill becomes recurring, your uptime expectations rise, and your security surface expands.
Inference is the recurring bill you can’t ignore
Model inference costs are driven by context length, output length, concurrency, latency targets, and redundancy. If you give the model 20 pages of context “just to be safe,” you’ll pay for it on every call. If you require low latency with high concurrency, you’ll provision for peak.
The best cost lever is not “negotiate pricing.” It’s engineering:
- Route simple queries to smaller/cheaper models, escalate complex ones
- Cache repeated answers and retrieved snippets
- Summarize long threads before re-injecting context
- Cap retrieval and enforce relevance thresholds
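Two of the levers above — routing and caching — can be sketched together. The model names, prices, and thresholds here are illustrative assumptions, and the LLM call itself is stubbed out; the point is that a cheap deterministic heuristic decides which bill you pay per request.

```python
# Sketch of model routing plus response caching. Model names and the
# complexity heuristic are illustrative assumptions, not vendor tiers.
from functools import lru_cache

SMALL, LARGE = "small-model", "large-model"

def pick_model(query: str, retrieval_confidence: float) -> str:
    # Escalate only when the query is long or retrieval looks weak;
    # everything else goes to the cheaper model.
    if len(query.split()) > 50 or retrieval_confidence < 0.6:
        return LARGE
    return SMALL

@lru_cache(maxsize=4096)  # repeated identical questions never hit a model twice
def answer(query: str, retrieval_confidence: float) -> tuple[str, str]:
    model = pick_model(query, retrieval_confidence)
    # A real call_llm(model, query) would go here; stubbed for the sketch.
    return model, f"[{model} response to: {query!r}]"

print(answer("What is our refund window?", 0.9)[0])
print(answer("short but poorly grounded question", 0.4)[0])
```

The economic effect compounds: if most traffic is short, well-grounded questions, the expensive model only sees the tail — which is how inference cost stays proportional to value.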
Refresh and retraining: how often, and why it’s rarely ‘set and forget’
LLM systems drift. Your policies change. Your products change. Attack patterns change. Refresh can mean updating the retrieval corpus, re-fine-tuning, or (rarely) re-pretraining. Each has different cost and time profiles.
Example: a new regulation requires policy updates. With RAG, you can update the source document and the assistant improves the same day (with proper indexing and review gates). With retraining, you’re looking at weeks—plus risk of regressions.
Security and compliance accumulate over time
Governance becomes steady-state operations: audit logs, access reviews, quarterly red teaming, vendor reviews, incident response playbooks. This is why CFOs should treat LLM capability as an operational function, not a one-time procurement.
Build vs buy vs ‘adapt’: a decision framework that prevents nine-figure mistakes
When someone asks for a build vs buy large language model cost comparison, they usually want a yes/no answer. The better answer is a framework: match your constraints to the cheapest path that still delivers differentiation.
When from-scratch LLM development makes economic sense
From-scratch large language model development economics only work if you can amortize the investment. That generally means:
- Massive scale (enough usage to spread fixed costs)
- Unique data at scale with clear rights
- Distribution to capture the value you create
- Multi-year R&D tolerance and strong infra capability
Archetypes include hyperscalers, consumer platforms at enormous scale, and certain defense-grade specialized organizations. If you’re an enterprise trying to improve internal workflows, pretraining is usually the most expensive way to learn what you actually need.
When fine-tuning wins (and what it typically costs)
Fine-tuning wins when you need consistent behavior: specific formats, tone, controlled task performance, or “how we do things here.” With LoRA/adapters, you can often reduce compute and preserve an upgrade path to new foundation models.
Ballpark: initial fine-tune + evaluation + deployment can land anywhere from $50k–$500k depending on data readiness, governance requirements, and integration scope. That range is wide because the expensive part is rarely the training run; it’s the data and reliability work.
If you’re evaluating custom LLM development pricing, ask what’s included: curated eval sets, regression testing, monitoring, and model routing are the difference between a pilot and an asset.
When RAG wins (and why it’s the default for enterprises)
RAG wins when your advantage is proprietary knowledge, freshness, and governance. You pay for search, orchestration, and controls instead of pretraining. You get citations. You can update sources without retraining. You can constrain what the model is allowed to know and do.
For enterprise AI adoption, this is the default because it matches the real job: answer questions from approved documents, draft using templates, and take actions through audited tools.
Example: a contract-aware support assistant that retrieves relevant policy clauses, cites them, and drafts a response for human approval. That’s a workflow system with an LLM inside—not an LLM pretending to be a workflow system.
A simple scorecard executives can use in a meeting
Here’s a text-only scorecard you can use in a single meeting. Score each dimension 1–5 (low to high). The pattern matters more than the exact numbers.
- Differentiation: does the model itself differentiate you, or the workflow?
- Scale: will usage be high enough to amortize fixed costs?
- Data rights: do you have clear rights to train and ship?
- Privacy/latency constraints: do you require on-prem or strict residency?
- Risk tolerance: can you tolerate novel failure modes?
- Time-to-value: do you need results this quarter?
Interpretation:
- Buy/host foundation models when time-to-value is high priority and differentiation is in product/workflow.
- RAG when your knowledge base is proprietary, changes often, and needs citations/governance.
- Fine-tune when behavior consistency matters and you can curate high-signal examples.
- From-scratch only when scale + rights + multi-year tolerance are all high.
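The interpretation above can be expressed as a small decision function. The thresholds are illustrative — the scorecard is a conversation tool, not a formal model — but encoding it makes the precedence explicit: from-scratch requires everything high, and RAG is the default when nothing else clearly wins.

```python
# Text-only scorecard turned into code. Scores are 1-5; thresholds are
# illustrative assumptions, not a formal decision model.
def recommend(s: dict[str, int]) -> str:
    # expected keys: differentiation, scale, data_rights, privacy,
    # risk_tolerance, time_to_value
    if min(s["scale"], s["data_rights"], s["risk_tolerance"]) >= 4:
        return "from-scratch (rare: scale, rights, and multi-year tolerance all high)"
    if s["time_to_value"] >= 4:
        return "buy/host a foundation model; differentiate in the workflow"
    if s["data_rights"] >= 3 and s["differentiation"] >= 3:
        return "fine-tune for behavior consistency"
    return "RAG over governed, frequently changing knowledge"

print(recommend({"differentiation": 2, "scale": 2, "data_rights": 3,
                 "privacy": 4, "risk_tolerance": 2, "time_to_value": 5}))
```

In a meeting, the value isn't the exact output string — it's that disagreements surface as disagreements about a specific score, not about the conclusion.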
If you want an economically grounded assessment, our AI discovery and cost-to-value assessment is designed to turn this scorecard into a budget bracket, risk plan, and implementation roadmap.
A pragmatic roadmap for enterprises that still want “LLM capability”
Some organizations genuinely need an internal LLM capability—not because they want to train a new foundation model, but because they want durable leverage: faster workflows, better consistency, and safer automation.
This is where large language model development services for enterprises should focus: building capability that survives model changes.
Start with a workflow, not a model
Pick a high-frequency workflow with measurable success metrics: time-to-output, cost-to-serve, error rate, escalation rate, and compliance violations. Decide what the system is allowed to do: read-only recommendations, draft for human approval, or take actions.
Good starting points include support triage, document extraction, sales assistant prep, internal policy Q&A, and billing/invoice workflows. The common thread is that you can measure “before vs after” without relying on vibes.
Build an ‘LLM layer’ you can swap models under
To avoid lock-in, build an abstraction layer that contains prompts, tools, retrieval, evaluation gates, and logging. You should be able to swap a foundation model without rewriting your product. This is especially important as model capabilities and pricing change.
Model routing is the quiet superpower here: use small/cheap models for most calls, escalate to larger models when retrieval confidence is low or the task is complex. That’s how you keep inference costs proportional to value.
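One way to sketch that LLM layer: a minimal interface where the model is injected, so the product owns prompts, retrieval, eval gates, and logging, and the model behind it can be swapped. Class and method names here are hypothetical — the pattern (dependency injection behind a small protocol) is the point, not the API.

```python
# Sketch of a swappable "LLM layer". Names are illustrative, not a
# real framework; the provider call is stubbed out.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class HostedModel:
    """Stand-in for any hosted or on-prem model behind one interface."""
    def __init__(self, name: str):
        self.name = name
    def complete(self, prompt: str) -> str:
        # A real implementation would call the provider's API here.
        return f"[{self.name}] {prompt[:20]}..."

class AssistantLayer:
    """Owns prompts, retrieval, eval gates, and logging; model is injected."""
    def __init__(self, model: ChatModel):
        self.model = model
    def ask(self, question: str) -> str:
        prompt = f"Answer from approved sources only.\n\nQ: {question}"
        return self.model.complete(prompt)

layer = AssistantLayer(HostedModel("model-a"))
print(layer.ask("What is the refund window?"))
layer.model = HostedModel("model-b")  # swap models without rewriting the product
print(layer.ask("What is the refund window?"))
```

This is the anti-lock-in move in one picture: when a better or cheaper foundation model ships, the change is one constructor argument, not a rewrite.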
This is also where AI agents become real: tool-using systems that call CRM/ERP functions with permissions and audit logs. We often implement this through AI agent development for workflow automation, because the ROI lives in reduced handling time and fewer escalations—not in the novelty of the model.
Operationalize: evaluation, monitoring, and cost controls from day one
Ship like you expect to maintain it. That means an eval set before production, safety tests (prompt injection, data leakage, jailbreak attempts), and regression tests for every release. Put it behind feature flags and roll out gradually.
Cost controls are not optional; they’re the difference between “successful pilot” and “unbudgeted platform.” Practical levers include caching, summarization, context management, retrieval limits, and batch processing for non-urgent tasks.
Operational governance should include access control, data retention, and audit trails. If you can’t explain what the model saw and what tools it called, you can’t operate it safely at enterprise scale.
Conclusion: the honest economics of LLM capability
From-scratch large language model development is typically a $10M–$100M+ commitment once you include iteration, data, and talent. Compute is only one line item; governance, evaluation, and ongoing inference are where budgets surprise teams.
Most enterprises get better ROI by using foundation models with retrieval-augmented generation and selective fine-tuning—shipping in months, not years. The right question isn’t “Do we own an LLM?” but “Can we own a durable capability that improves workflows safely and cheaply?”
If you’re being asked to “build our own LLM,” let’s turn that into a CFO-grade plan. Buzzi.ai can map your use cases to a realistic build-vs-buy budget, then implement the fastest path to production value (often RAG + agents + governance) starting with our AI discovery and cost-to-value assessment.
FAQ
What does large language model development mean—pretraining, fine-tuning, or RAG?
In practice, “large language model development” can refer to three different efforts: training a foundation model from scratch, adapting an existing base model with continued pretraining, or fine-tuning an existing model for specific behaviors and tasks. Many enterprise deployments add retrieval-augmented generation (RAG) on top, which uses your documents as a governed knowledge source. The economics vary by orders of magnitude, so clarify the category before you approve any budget.
How much does it cost to develop a large language model from scratch in 2025?
For serious from-scratch efforts, the full large language model development cost commonly falls in the $10M–$100M+ band once you include multiple training runs, data acquisition and cleaning, a specialized team, and governance. Smaller “toy” models can be cheaper, but they usually won’t meet enterprise reliability and safety requirements. The budget is also highly sensitive to how many iterations you need to reach acceptable quality.
Why do LLM training costs reach $10M–$100M+ for serious efforts?
The direct LLM training cost (compute) is only the visible part. Costs expand because you rarely do one run: you iterate on data filtering, architecture choices, hyperparameters, and evaluation cycles, and you pay for failed runs and restarts. Add in high-demand talent, data rights work, and ongoing reliability engineering, and budgets move from “project” to “program.”
How many GPU hours are required to train a 7B, 13B, or 70B parameter model?
There isn’t a single number, because GPU compute hours depend on tokens, sequence length, hardware, efficiency techniques, and how many times you rerun experiments. Order-of-magnitude, moving from 7B to 70B shifts you from “expensive but manageable” to “cluster-scale training where each iteration is a major event.” For most enterprises, this is a sign to consider fine-tuning or RAG rather than from-scratch training.
What are the biggest hidden costs in large language model development (MLOps, evals, compliance)?
The hidden costs usually live in MLOps for LLMs (monitoring, versioning, rollback), evaluation (curating test sets, SME labeling, red teaming), and compliance (audit trails, access controls, vendor reviews). These are “always-on” responsibilities, not one-time setup. If you skip them, you often get a demo that can’t survive real users and real adversarial inputs.
How do inference costs compare to training costs over a year of production usage?
Training is front-loaded and finite; model inference costs are recurring and scale with adoption. If your assistant becomes popular, the yearly inference bill can rival (or exceed) the initial build cost unless you optimize context length, routing, caching, and retrieval. This is why the best enterprise designs treat inference like a controllable unit economics problem, not a fixed expense.
Is it cheaper to train LLMs on cloud GPUs or buy on-premise GPU clusters?
Cloud GPU pricing is usually higher per hour, but it gives you speed, flexibility, and less operational burden—especially for bursty experimentation. On-prem can be cheaper if you can keep utilization high and you have the staffing to run it reliably (power, cooling, failures, scheduling, upgrades). Many organizations land on a hybrid approach: stable capacity for inference, cloud bursting for training runs.
How much proprietary data do you need before building your own LLM makes sense?
You need more than “lots of documents.” You need large quantities of high-quality tokens with clear usage rights, plus a plan to maintain provenance and audits. For many enterprises, the real advantage is retrieval: using proprietary documents via RAG with access controls and citations. If you want help deciding which bucket you’re actually in, start with a structured assessment like Buzzi.ai’s AI discovery and cost-to-value assessment.
When should an enterprise choose fine-tuning over retrieval-augmented generation?
Choose fine-tuning when you need consistent behavior: formats, tone, task-specific outputs, or “how we operate” preferences that aren’t contained in documents. Choose retrieval-augmented generation when correctness depends on up-to-date proprietary knowledge and you need citations and governance. In practice, many production systems combine them: RAG for facts and freshness, fine-tuning for behavior and reliability.
What is a practical build-vs-buy framework for LLM capability in the enterprise?
Use a scorecard across differentiation, scale, data rights, privacy constraints, risk tolerance, and time-to-value. If differentiation is in the workflow and time-to-value matters, buy/host a foundation model and invest in the LLM layer (retrieval, tools, evals, monitoring). Reserve from-scratch training for cases where you can amortize multi-year R&D across massive scale and you have unique data rights.


