AI Mobile App Development: The On‑Device vs Cloud Bet You Can’t Undo
AI mobile app development hinges on one hard-to-reverse choice: on-device vs cloud inference. Use this framework to optimize latency, privacy, and cost.

Most teams treat the model as the big decision. In AI mobile app development, the bigger—and harder to reverse—decision is where inference runs: on-device inference, cloud inference, or both.
This sounds like an implementation detail. It isn’t. Where inference lives determines your UX ceiling (and your UX floor), your privacy posture, your unit economics, and how quickly you can ship improvements without breaking trust or blowing up cost.
The common failure mode goes like this: you ship a cloud-first MVP because it’s fast. Then you discover that the feature people love is exactly the one that can’t tolerate mobile network variance—latency spikes, retries, and quiet failures. Or you over-invest in on-device from day one, spend weeks shaving milliseconds and megabytes, and only then learn the UX wasn’t compelling enough to justify the work.
We’ll fix that with a practical decision framework: the latency benchmarks you should target, a back-of-the-envelope cost model you can adapt, privacy and data residency implications, and migration paths that keep you from getting trapped. At Buzzi.ai, we build tailor-made AI agents and AI-enabled apps in production, and we’ve learned that architecture decisions compound—just like product decisions do.
Why on-device vs cloud inference is the “foundational” decision
Choosing between on-device and cloud AI for mobile apps is foundational because it silently defines your product’s constraints. If the model is the “brain,” inference placement is the “nervous system”: it decides how signals move, how fast, and what happens when the connection gets cut.
The irreversible parts: latency budgets, data paths, and ops surface area
The first thing you lock in is a latency budget. Cloud-first assumes the network is part of your app. Device-first assumes your app can stand on its own. That’s not a philosophical difference; it’s a set of engineering commitments that show up everywhere.
Early choices that become expensive to unwind include:
- Network dependency: does the core feature hard-fail without connectivity, or degrade gracefully?
- API contracts: request/response payload shapes, versioning, and how clients handle partial results.
- Observability stack: what you log, where you log it, and how you debug across device + backend.
- Model update mechanism: app-store releases, remote model download, or server swaps.
- Compliance posture: what data leaves the device, where it’s stored, and what “consent” actually means in practice.
Here’s the demo-to-reality gap: you build a feature on fast Wi‑Fi in a clean environment. It works. Then you launch in a market where users are on congested networks, switching towers, or moving between 4G and “almost 4G.” Suddenly the same feature feels unreliable. The model didn’t change; the data path did.
At a high level you end up with three mobile AI architecture options:
- Device-first: inference runs on the phone; cloud is optional (analytics, updates, fallback).
- Cloud-first: inference runs on servers; the phone is mostly a thin client.
- Hybrid: the phone handles the “hot path,” the cloud handles the heavy path and exceptions.
Unit economics: inference is a variable COGS line item
Traditional mobile apps have upfront engineering costs and then relatively predictable variable costs. AI mobile app development adds something new: every inference can be a billable event. That means your cost of goods sold is no longer mostly bandwidth and storage; it can become model usage.
Cloud-first makes it easy to start but potentially expensive to scale. On-device shifts the cost curve: you invest more in engineering, optimization, and QA across a device matrix, but your marginal inference cost approaches zero as usage grows.
A simple back-of-the-envelope model makes this obvious:
Monthly inference volume ≈ MAU × inferences/user/day × 30
If you have 200,000 MAU and users trigger 8 AI calls/day, that’s 48 million inferences/month. Multiply by your $/1k inferences (plus bandwidth and other services) and you get a real, variable bill. The success case is also the cost explosion case.
The metric that matters is $/active user/month for AI features. If that number rises with engagement, you’re building a tax on your own growth.
Privacy/compliance isn’t a policy doc—it’s an architecture choice
Teams often treat privacy as something Legal will “handle” later. But user privacy, data residency, and consent flows are consequences of what data crosses the network. In edge computing for mobile, the simplest privacy win is also the most concrete: keep raw data on the phone.
On-device inference can reduce data collection, simplify consent UX, and mitigate data residency constraints because less personal data leaves the device in the first place. Cloud can still be the right answer—especially when you need centralized controls, auditability, or monitoring for safety and abuse—but it increases the “sensitive surface area” of your product.
Example: imagine voice notes. If you do on-device preprocessing (keyword spotting, speech-to-text, or redaction), the server might only receive sanitized text or embeddings. That changes what you must store, what you must protect, and what you must explain to users.
“Data minimization by design” is a product advantage, not just a compliance tactic.
On-device AI for mobile apps: where it wins (and where it hurts)
On-device inference is having a moment because modern phones ship with serious compute. But the story isn’t “phones are fast now.” The story is “you can buy reliability and privacy by paying with engineering effort.” In privacy-focused, on-device AI mobile app development, that trade is often worth it.
What ‘good’ looks like: latency, offline UX, and reliability
Users don’t perceive average latency; they perceive hesitation. For real-time AI features, practical UX targets look like this:
- <100–200ms: feels instant for tap-to-result interactions (e.g., classify, rank, detect).
- <500ms: still “responsive,” especially with subtle UI feedback.
- >1s: you need progressive disclosure (streaming, skeleton states) or the feature feels flaky.
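If you want those thresholds to show up in your analytics rather than only in a design doc, you can bucket measured latency into UX tiers. A minimal sketch, where the tier names and exact cutoffs are assumptions to tune per feature:
```kotlin
// Hypothetical tier names; cutoffs mirror the targets above and should be tuned per feature.
enum class UxTier { INSTANT, RESPONSIVE, NEEDS_PROGRESSIVE_UI }

fun classifyLatency(endToEndMs: Long): UxTier = when {
    endToEndMs < 200 -> UxTier.INSTANT              // tap-to-result feels immediate
    endToEndMs < 500 -> UxTier.RESPONSIVE           // fine with subtle UI feedback
    else -> UxTier.NEEDS_PROGRESSIVE_UI             // stream, show skeletons, or rethink the UX
}
```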
Offline capability is the other superpower. Not “works in airplane mode” as a party trick, but “doesn’t collapse when the network is mediocre.” In many emerging markets, that reliability is directly tied to retention.
Great on-device use cases tend to be small, frequent, and latency-sensitive:
- On-device text classification for inbox triage (“important” vs “later”).
- On-device image quality checks (blur detection before upload).
- Wake-word or keyword spotting that must run reliably.
Constraints you can’t negotiate with: battery, thermals, and memory
Mobile resource constraints aren’t academic. They’re the platform’s way of enforcing reality: battery drains, phones heat up, and operating systems punish misbehavior.
The constraints show up as:
- Battery and thermals: sustained inference triggers thermal throttling; performance degrades non-linearly.
- Memory pressure: large models compete with app UI, caches, and OS processes.
- Background limits: iOS and Android restrict long-running background work; you can’t assume you’ll finish.
- App size budgets: users abandon downloads; app stores impose limits; enterprises have MDM constraints.
- Device fragmentation: GPU/NPU acceleration varies wildly; mid-range Android devices are the median user.
A classic “what broke in production” story: the model runs fine on a flagship device during testing. Then users on mid-tier Android devices try the feature after 10 minutes of camera usage and the phone gets warm. The model slows, UI janks, and you get negative reviews describing it as “buggy.” It’s not buggy; it’s physics.
Tech levers: quantization, compression, and mobile runtimes
The good news is you have real levers. The bad news is each lever is a trade.
Quantization (e.g., int8 or even int4) and compression techniques can dramatically reduce model size and latency. But you often give up some accuracy and sometimes stability across devices. Your goal isn’t “maximize benchmark score”; it’s “meet UX and quality constraints on target phones.”
Mobile runtimes matter because they determine how easily you ship, optimize, and accelerate models:
- Apple’s Core ML documentation is the anchor for iOS deployment and hardware acceleration.
- TensorFlow Lite documentation is common for Android and cross-platform on-device inference, especially for quantization workflows.
- ONNX Runtime Mobile documentation helps when you want portability across runtimes and execution providers.
Micro-example: you start with a 200MB model that’s accurate but too slow and too big. With quantization and pruning, you might ship a ~50MB version that runs 2–4× faster. The tradeoff is usually a small accuracy hit and more time spent validating edge cases across devices.
Model update strategy is the final lever: ship models inside the app (stable, but slow to update) versus remote model download (faster iteration, but requires versioning, integrity checks, and rollback plans). A/B testing is possible either way, but it’s more operationally complex when models live on-device.
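To make the remote-download option concrete, here is a minimal sketch of the bookkeeping it implies: a version manifest, an integrity check before activation, and a rollback path. The manifest fields and file naming are assumptions, not any particular vendor’s API.
```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical manifest your backend serves alongside each model artifact.
data class ModelManifest(val name: String, val version: Int, val url: String, val sha256: String)

class ModelStore(private val dir: File) {
    // Verify integrity before activating a downloaded model.
    fun verify(file: File, manifest: ModelManifest): Boolean {
        val digest = MessageDigest.getInstance("SHA-256").digest(file.readBytes())
        val hex = digest.joinToString("") { "%02x".format(it) }
        return hex.equals(manifest.sha256, ignoreCase = true)
    }

    // Keep the previous version on disk so rollback is a file swap, not a re-download.
    fun activate(downloaded: File, manifest: ModelManifest): File {
        val active = File(dir, "${manifest.name}-v${manifest.version}.bin")
        require(verify(downloaded, manifest)) { "Checksum mismatch; keep the current model" }
        downloaded.copyTo(active, overwrite = true)
        return active
    }

    fun rollback(manifest: ModelManifest): File? =
        dir.listFiles()
            ?.filter { it.name.startsWith(manifest.name) && !it.name.contains("v${manifest.version}") }
            ?.maxByOrNull { it.lastModified() }
}
```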
Cloud inference for AI mobile app development: the scaling story (and the hidden taxes)
Cloud inference is the default for a reason: it makes iteration fast, capability broad, and control centralized. But for AI mobile app development, the cloud also introduces two things users notice immediately: latency variance and dependency.
Cloud’s superpower: faster iteration and centralized control
Cloud-first lets you ship in days, not weeks. You can swap models, update prompts, tune pipelines, and deploy safety layers without waiting for app store approvals. That speed matters when you’re still searching for product-market fit.
Centralization also makes operations cleaner:
- Observability: consistent logs, traces, and quality monitoring.
- Rate limiting: protect yourself from abuse and accidental runaway usage.
- Policy enforcement: moderation, guardrails, and access controls.
Cloud is also the natural home for large models and complex multi-step systems. For example, server-side retrieval-augmented generation (RAG) that hits a vector database, applies policies, calls tools, and streams responses back is unrealistic to run fully on-device today.
The hidden taxes: latency variance, network reality, and uptime coupling
Teams tend to benchmark cloud inference on average latency. Mobile users experience the tail. p95 and p99 are where trust is made or lost.
Mobile networks have failure modes that desktop apps can ignore: packet loss, handovers between towers, captive portals, and sudden congestion. “Works on Wi‑Fi” is not a benchmark; it’s a best-case scenario.
As a rough illustration: you might see 300–600ms round trips on a good 4G connection for small payloads. In a congested environment, p95 can jump to multiple seconds with retries. Tools like Cloudflare Radar are useful for understanding the macro picture of network variability, but you still need to measure your own real user paths.
The deeper issue is uptime coupling. If inference is cloud-only, your AI feature’s SLA and uptime become your backend’s SLA and uptime. When the API is down, your product is down—at least where it matters.
Security and compliance: centralized can be simpler—until it’s not
Cloud can simplify security because controls are centralized: key management, audit trails, regional routing, and consistent policy enforcement. For enterprise buyers, that can be a selling point.
But centralization means more sensitive data in motion and at rest. Data residency requirements can force you into regional deployments and strict data flow documentation. If your feature moves images, voice, or identifiers, you need to assume it will be scrutinized.
Mitigations are usually architectural, not procedural:
- Encrypt in transit and at rest (table stakes).
- Redact on-device before upload (often overlooked).
- Route requests regionally to meet residency commitments.
- Minimize retention; avoid logging raw payloads by default.
Example: do on-device PII redaction before cloud inference. That single change reduces the blast radius of a breach, simplifies compliance reviews, and can improve user trust because your consent story becomes honest and simple.
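As a minimal sketch of what “redact on-device before upload” can look like: the patterns and replacement tokens below are illustrative assumptions, and production redaction should use locale-aware rules or an on-device NER model rather than two regexes.
```kotlin
// Illustrative-only patterns: real redaction needs locale-aware rules or an on-device NER model.
private val EMAIL = Regex("""[\w.+-]+@[\w-]+\.[\w.]+""")
private val PHONE = Regex("""\+?\d[\d\s().-]{7,}\d""")

fun redactForUpload(raw: String): String =
    raw.replace(EMAIL, "[email]")
       .replace(PHONE, "[phone]")

// The server only ever sees the sanitized text.
fun main() {
    val note = "Call me at +1 415 555 0100 or mail jane@example.com about the invoice."
    println(redactForUpload(note)) // Call me at [phone] or mail [email] about the invoice.
}
```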
Latency benchmarks to target (and how to measure them in a pilot)
Latency benchmarks are only useful if they reflect reality: real devices, real networks, real usage patterns. The goal isn’t to publish a number; it’s to hit a UX promise consistently.
Measure the right thing: p50 vs p95 vs ‘rage taps’
p50 is the median experience. p95 is the experience of your frustrated users. p99 is where your app gets labeled “unreliable.” For user experience latency, tail behavior is the product.
In a two-week pilot, instrument both analytics and logs. Capture metrics like:
- Time-to-first-result (TTFR)
- Time-to-final-result (total time to completion)
- p50/p95/p99 end-to-end latency
- Cancellation rate (user backs out before result)
- Retry count per request
- Error rate segmented by network type
- “Rage taps” or repeated taps within a short window
- Fallback rate (if you have hybrid routing)
If you’re unsure what to instrument first, an AI Discovery workshop is often the fastest way to define a pilot plan that maps metrics to decisions instead of collecting data for its own sake.
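As a sketch of how a pilot might turn raw samples into those tail numbers (field names are assumptions; the percentile math is the standard nearest-rank method):
```kotlin
import kotlin.math.ceil

// One record per AI request, as captured by pilot instrumentation (hypothetical field names).
data class InferenceSample(
    val ttfrMs: Long,          // time-to-first-result
    val totalMs: Long,         // time-to-final-result
    val networkType: String,   // "wifi", "4g", "congested", ...
    val cancelled: Boolean,
)

// Nearest-rank percentile over a list of latencies.
fun percentile(values: List<Long>, p: Double): Long {
    require(values.isNotEmpty() && p in 0.0..100.0)
    val sorted = values.sorted()
    val rank = ceil(p / 100.0 * sorted.size).toInt().coerceIn(1, sorted.size)
    return sorted[rank - 1]
}

fun report(samples: List<InferenceSample>) {
    val total = samples.map { it.totalMs }
    println("p50=${percentile(total, 50.0)}ms  p95=${percentile(total, 95.0)}ms  p99=${percentile(total, 99.0)}ms")
    println("p95 TTFR=${percentile(samples.map { it.ttfrMs }, 95.0)}ms")
    println("cancel rate=${samples.count { it.cancelled } * 100 / samples.size}%")
    println("requests by network=${samples.groupingBy { it.networkType }.eachCount()}")
}
```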
Back-of-envelope latency budget: where time actually goes
Think in budgets, not blame. An on-device inference path typically breaks down as: input capture → preprocessing → inference → post-processing → UI render. A cloud path adds: upload → queue → inference → download → render.
Voice is a good narrative walkthrough because it makes the pipeline concrete. The moment you tap “send,” you’re paying for audio capture, possibly on-device noise suppression, then either on-device transcription or upload, then server processing, then response download and UI updates. Your biggest knobs are payload size, streaming partial results, caching, and batching where appropriate.
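To make “think in budgets” literal, here is a tiny sketch with placeholder stage times; every number below is an assumption to replace with your own measurements, not a benchmark.
```kotlin
// Placeholder per-stage budgets for one interaction, in milliseconds.
val onDevicePath = linkedMapOf(
    "capture" to 50L, "preprocess" to 30L, "inference" to 120L, "postprocess" to 20L, "render" to 16L,
)
val cloudPath = linkedMapOf(
    "capture" to 50L, "preprocess" to 30L, "upload" to 150L, "queue" to 40L,
    "inference" to 80L, "download" to 80L, "render" to 16L,
)

fun total(path: Map<String, Long>): Long = path.values.sum()

fun main() {
    println("on-device budget: ${total(onDevicePath)}ms")   // 236ms
    println("cloud budget: ${total(cloudPath)}ms")          // 446ms, before any network variance
}
```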
Practical benchmark ranges (not promises)
Ranges depend on device tier and model size, but as a practical starting point:
- On-device small models can often land in the 20–200ms range on newer devices when warmed up.
- Cloud calls commonly land in the 300ms–2s range end-to-end depending on payload and network; p95 can be substantially worse.
To make these numbers actionable, build a reproducible benchmark suite:
- Fixed input sets (same prompts/images/audio clips)
- Warm vs cold start measurements
- Foreground vs background constraints
- Device tier coverage (flagship, mid-range, older)
- Geography and network coverage (Wi‑Fi, 4G, congested)
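A minimal sketch of that suite’s core loop, assuming you hide whichever runtime you use (Core ML, TensorFlow Lite, ONNX Runtime Mobile, or an HTTP call) behind a simple function; the wrapper and run counts here are illustrative.
```kotlin
import kotlin.system.measureTimeMillis

// Hypothetical wrapper around your actual runtime or cloud endpoint.
fun interface InferenceFn { fun run(input: ByteArray): ByteArray }

data class BenchResult(val coldMs: Long, val warmMs: List<Long>)

fun benchmark(model: InferenceFn, fixedInputs: List<ByteArray>, warmRuns: Int = 20): BenchResult {
    // Cold start: the first call pays model load, JIT, or delegate setup.
    val cold = measureTimeMillis { model.run(fixedInputs.first()) }
    // Warm runs over the same fixed input set, so results are comparable across devices and builds.
    val warm = (0 until warmRuns).map { i ->
        measureTimeMillis { model.run(fixedInputs[i % fixedInputs.size]) }
    }
    return BenchResult(cold, warm)
}
```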
Cost model: compare on-device vs cloud AI with your actual usage
Most architecture debates end when you quantify inference cost. Not because cost is the only thing that matters, but because it makes tradeoffs legible. If you don’t model cost at 10× usage, you’re effectively choosing to be surprised later.
Cloud cost formula (and what teams forget to include)
Cloud cost is rarely just $/1k inferences. The more realistic formula includes:
- Model usage: $/1k inferences or $/token
- Bandwidth/egress (especially for images/audio)
- Caching layers (to reduce repeated calls)
- Vector database and retrieval infrastructure (if using RAG)
- Observability/logging costs
- Ops overhead: on-call, incident response, rate limiting, abuse prevention
Scenario math (illustrative, use your own pricing):
100k MAU × 10 inferences/user/day × 30 ≈ 30M inferences/month. At $0.50 per 1k inferences, that’s ~$15k/month before egress, retrieval, and logging.
1M MAU at the same usage is ~300M inferences/month, or ~$150k/month on the same assumptions. If your AI feature becomes core, this is no longer a rounding error—it’s a strategic constraint.
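The arithmetic is worth keeping in a scratch function you re-run whenever pricing or usage assumptions change. A sketch using the same illustrative numbers (not real vendor pricing):
```kotlin
// Illustrative-only rates; plug in your own vendor pricing, egress, retrieval, and logging costs.
fun monthlyCloudCost(
    mau: Int,
    inferencesPerUserPerDay: Int,
    pricePer1kInferences: Double,
    egressPerInference: Double = 0.0,
): Double {
    val monthlyInferences = mau.toLong() * inferencesPerUserPerDay * 30
    return monthlyInferences / 1000.0 * pricePer1kInferences + monthlyInferences * egressPerInference
}

fun main() {
    val base = monthlyCloudCost(mau = 100_000, inferencesPerUserPerDay = 10, pricePer1kInferences = 0.50)
    println("100k MAU: \$${"%,.0f".format(base)}/month")                  // ~$15,000
    println("per active user: \$${"%.2f".format(base / 100_000)}/month")  // ~$0.15
    val scaled = monthlyCloudCost(mau = 1_000_000, inferencesPerUserPerDay = 10, pricePer1kInferences = 0.50)
    println("1M MAU: \$${"%,.0f".format(scaled)}/month")                  // ~$150,000
}
```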
On-device cost model: shifting cost to engineering + QA + support
On-device inference doesn’t eliminate costs; it relocates them. You pay in engineering effort (optimization, quantization), QA across devices, and ongoing regression testing as OS versions and chipsets evolve.
Typical cost components include:
- Optimization work (often 2–4 weeks for a first serious pass)
- Building a model delivery pipeline (download, versioning, rollback)
- Stability work: crashes/ANRs, memory spikes, background behavior
- Ongoing performance regression testing cadence (monthly/quarterly)
The payoff is that marginal inference cost approaches zero and your AI feature is less exposed to cloud API pricing changes or bandwidth limitations. This is why on-device investments often make sense after you’ve proven engagement: the ROI scales with usage.
Hybrid economics: pay cloud only when it’s worth it
A hybrid on-device/cloud AI architecture for mobile apps is a pragmatic answer because it lets you use the cloud selectively. You run the default path on-device and reserve cloud for the cases where it earns its keep: complex requests, low-confidence outputs, or premium tiers.
A simple routing rule set might look like:
- If offline or poor network → use on-device model.
- If device battery < 15% or thermal state is high → prefer cloud for heavy tasks.
- If on-device confidence < threshold → escalate to cloud.
- If request requires large context/tool access → cloud.
- If user opts into “privacy-first mode” → force on-device where possible.
This approach often reduces cloud calls dramatically while preserving quality in edge cases. It also gives you a path to iterate: you can measure which requests escalate and decide whether to optimize on-device coverage over time.
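Here is a minimal sketch of that rule set as code; the signals and thresholds are assumptions you would tune per feature, not a framework API. Note that the branch order is itself a product decision: here the user’s privacy preference wins over everything else.
```kotlin
enum class Route { ON_DEVICE, CLOUD }

// Hypothetical per-request signals your app already knows or can read cheaply.
data class RequestContext(
    val online: Boolean,
    val networkGood: Boolean,
    val batteryPct: Int,
    val thermalHigh: Boolean,
    val onDeviceConfidence: Double?,   // null before the local model has run
    val needsLargeContextOrTools: Boolean,
    val privacyFirstMode: Boolean,
)

fun route(ctx: RequestContext, confidenceThreshold: Double = 0.8): Route = when {
    ctx.privacyFirstMode -> Route.ON_DEVICE                                // user opted into local-only
    !ctx.online || !ctx.networkGood -> Route.ON_DEVICE                     // degrade gracefully offline
    ctx.needsLargeContextOrTools -> Route.CLOUD                            // heavy path lives server-side
    ctx.batteryPct < 15 || ctx.thermalHigh -> Route.CLOUD                  // spare the device
    (ctx.onDeviceConfidence ?: 1.0) < confidenceThreshold -> Route.CLOUD   // escalate low confidence
    else -> Route.ON_DEVICE
}
```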
Privacy, compliance, and data residency: architecture patterns that pass review
Privacy reviews go faster when your architecture is simple. That’s why “collect less” beats “protect more.” If you need an organizing principle, the NIST Privacy Framework overview is a solid reference for structuring decisions in a way security and compliance teams recognize.
Data minimization patterns: keep raw data on the phone
The strongest privacy pattern in AI-powered mobile apps is straightforward: keep raw user data on-device and only transmit derived artifacts when necessary.
Practical patterns include:
- On-device preprocessing: embeddings, redaction, summarization before upload.
- Ephemeral processing: avoid storing raw payloads server-side by default.
- Logging discipline: log metadata and quality signals, not raw content.
Examples that reviewers understand immediately: on-device face blurring before cloud computer vision; on-device speech-to-text and then send only text (or even just intent tags) to your backend.
When cloud is mandatory: auditability, centralized policy, and enterprise buyers
Sometimes cloud inference is not optional. Enterprise procurement often requires centralized audit logs, retention controls, and policy enforcement. Abuse monitoring and safety systems also tend to work better when centralized.
Data residency adds nuance: you may need regional routing, regional storage, and clear documentation of who is the data controller vs processor. For an MVP, the minimum viable posture is: encryption, regional deployment if required, least-privilege access, and a strong default of not storing raw content.
Federated learning and on-device training: useful, but not your first move
Federated learning can improve models while keeping training data on devices. Conceptually, devices train locally and only share gradients/updates for aggregation. Practically, it’s complex: distribution drift, uneven participation, testing difficulty, and even poisoning risk.
If you want the foundational reference, see “Communication-Efficient Learning of Deep Networks from Decentralized Data” on arXiv. But our pragmatic recommendation is: decide inference placement first. Federated learning is an optimization you add when the ROI is clear.
It tends to make sense when personalization matters and inputs are sensitive—keyboard-like prediction, local preference learning, or cohort-based edge improvements.
Decision framework: choose on-device, cloud, or hybrid in 30 minutes
If you want to choose between on-device and cloud AI for mobile app features quickly, start with UX, then constraints, then economics. Not the other way around. If you’re building a product (not a demo), the best architecture for AI-powered mobile apps is the one that remains viable as usage and expectations grow.
This is also where it helps to talk with a team that ships these systems end-to-end. Our AI-enabled mobile app development work is typically less about “adding AI” and more about choosing the right architecture so the feature survives contact with the real world.
Start with the UX requirement: real-time, offline, and ‘instant’ interactions
UX requirements are the only non-negotiables. Everything else is a workaround.
- If the feature must work offline → on-device or hybrid with offline mode.
- If the interaction is truly real-time (camera loop, voice turn-taking) → prioritize on-device; use cloud for heavy lifting.
- If the interaction is asynchronous (recommendations, summaries, back-office enrichment) → cloud is often fine.
Mini mappings:
- Barcode-like vision guidance while the camera is live → device-first.
- Customer support drafting that can take a second and benefits from context/tools → cloud-first.
- Nightly personalization updates → cloud, then cache results on-device.
Then apply constraints: privacy, data residency, and buyer expectations
This is where privacy-focused, on-device AI mobile app development becomes a strategic decision instead of a checkbox.
- If data is highly sensitive or residency is strict → on-device first, cloud second with minimization.
- If enterprise buyers require auditability and centralized controls → cloud or hybrid with strong logging and retention policies.
- If consumer trust is core → default-local where feasible and make consent UX honest.
Simple If/Then rules a PM can actually use:
- If the feature touches voice/images of private spaces → prefer on-device preprocessing.
- If you can’t explain the data flow in one sentence → simplify the architecture.
- If a breach would be catastrophic → minimize what leaves the phone.
Finally, model the economics: what happens at 10× usage?
Run your AI mobile app development cost model at current usage, target usage, and upside usage. Then ask: where does inference cost become uncomfortable? Where does it become impossible?
A typical break-even narrative looks like this:
Cloud-first makes sense at 50k MAU because it’s fast and cheap enough. Around 250k MAU, hybrid becomes attractive because you can offload the hot path and cut cloud calls. At 1M+ MAU, device-first for the core interaction can pay back quickly—especially if engagement is high and the AI feature is used many times per day.
What matters isn’t the exact MAU number. It’s whether your cloud bills scale with engagement in a way that compresses margins or limits product expansion.
Migration paths: how to change your mind without a rewrite
The best time to keep options open is before you need them. The second-best time is now. Migration is easiest when you design a shared contract: the same inputs, outputs, and quality signals regardless of where inference runs.
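A sketch of what that shared contract can look like, assuming one request/response shape and a version field; the names and stub values are illustrative.
```kotlin
// Same request/response shape whether inference runs locally or in the cloud (illustrative names).
data class InferenceRequest(val contractVersion: Int, val input: String)
data class InferenceResult(val output: String, val confidence: Double, val servedBy: String)

interface InferenceBackend {
    suspend fun infer(request: InferenceRequest): InferenceResult
}

// Two implementations behind one contract: swapping them is a wiring change, not a rewrite.
class OnDeviceBackend : InferenceBackend {
    override suspend fun infer(request: InferenceRequest) =
        InferenceResult(output = "stub", confidence = 0.90, servedBy = "device")
}

class CloudBackend : InferenceBackend {
    override suspend fun infer(request: InferenceRequest) =
        InferenceResult(output = "stub", confidence = 0.95, servedBy = "cloud")
}
```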
Cloud → on-device: carve out the ‘hot path’ first
If you started cloud-first (as many teams do), don’t try to move everything on-device at once. Move the high-frequency, latency-sensitive calls first—the ones users feel most often.
A step-by-step migration plan from cloud-only classification to on-device classification with cloud fallback:
- Identify the hot path: top 1–2 inference calls by frequency and UX sensitivity.
- Define a strict input/output contract and add versioning.
- Choose a mobile runtime and build a baseline on-device model.
- Optimize with model compression and quantization until you hit latency/size targets.
- Ship with a feature flag; compare quality and performance on real devices.
- Add remote model download, integrity checks, and rollback.
- Keep cloud as fallback for low-confidence cases until the on-device model is proven.
The outcome is not just lower latency. It’s less uptime coupling and more predictable unit economics.
On-device → cloud: when accuracy, safety, or capability demands it
Sometimes you outgrow on-device. Signals include: the model needs to be much larger, needs fresh server-side data, or requires complex tool use and safety layers that are hard to ship locally.
The trick is to move without a UX regression:
- Cache common results on-device to reduce repeated calls.
- Gracefully degrade when offline (queue, async completion, or simplified on-device fallback).
- Harden endpoints: rotate keys, rate limit, and monitor abuse.
Example: you start with on-device OCR for simple capture. Then you need robust document understanding for complex layouts and forms. Moving that heavy lift to a scalable AI backend makes sense—if you keep the “fast feel” with caching and good UI states.
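One way to keep that “fast feel” is a small on-device cache of cloud results, keyed by a hash of the (already sanitized) input. A sketch, under the assumption that identical inputs recur often enough to matter:
```kotlin
import java.security.MessageDigest

// Simple LRU cache of cloud results, keyed by a hash of the sanitized input.
class ResultCache(private val maxEntries: Int = 256) {
    private val cache = object : LinkedHashMap<String, String>(maxEntries, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<String, String>) =
            size > maxEntries
    }

    private fun key(input: String): String =
        MessageDigest.getInstance("SHA-256").digest(input.toByteArray())
            .joinToString("") { "%02x".format(it) }

    fun get(input: String): String? = cache[key(input)]
    fun put(input: String, result: String) { cache[key(input)] = result }
}
```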
Hybrid routing patterns that keep options open
Hybrid routing is the architecture that ages best because it gives you multiple dials to turn: cost, quality, privacy, and latency. You can treat cloud as an escalation path rather than the default.
A routing playbook:
- Confidence-based routing: on-device predicts; cloud handles low-confidence outputs.
- Network-aware routing: avoid cloud when latency benchmarks are unlikely to be met.
- Tier-based routing: premium users get heavier cloud features; free users stay mostly on-device.
- Async for heavy jobs: queue work, notify on completion, stream partials when possible.
What to log so you can tune the router: predicted confidence, device tier, thermal state, network type, end-to-end latency, fallback rate, and user cancellation rate. Over time you can make routing smarter and cheaper without changing the product surface.
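A sketch of that log record as one flat event per routed request (field names are assumptions), so routing decisions can be analyzed offline in whatever analytics pipeline you already run:
```kotlin
// One event per routed request; flat fields keep it easy to aggregate anywhere.
data class RoutingEvent(
    val requestId: String,
    val route: String,                // "device" or "cloud"
    val predictedConfidence: Double?,
    val deviceTier: String,           // "flagship", "mid", "low"
    val thermalState: String,
    val networkType: String,
    val endToEndLatencyMs: Long,
    val fellBackToCloud: Boolean,
    val userCancelled: Boolean,
)
```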
Conclusion: architecture beats model choice
In AI mobile app development, inference placement (on-device vs cloud) is the decision that shapes UX, cost, and privacy—and it’s the hardest to reverse. On-device wins on latency, offline reliability, and data minimization. Cloud wins on iteration speed, centralized control, and advanced capabilities. Hybrid is often the default for teams that need both.
Before you commit, benchmark p95 latency on real devices and real networks, and run a simple cost model at 10× usage. Then design contracts and model delivery so you can migrate without rewriting your app.
If you’re planning AI features for a mobile product, ask Buzzi.ai to pressure-test your assumptions with a benchmark-and-cost pilot—before you lock in the wrong architecture. Explore our AI-enabled mobile app development service to build and scale AI-powered mobile apps with the right architecture from day one.
FAQ
What is the difference between on-device and cloud AI in mobile app development?
On-device AI runs inference locally on the phone using the device CPU/GPU/NPU, so results can be fast and can work without a network. Cloud AI sends inputs to a server, runs inference there, and returns results to the app.
In practice, the difference is less about “where the compute happens” and more about what gets locked in: latency variance, offline capability, privacy exposure, and operational complexity. That’s why this decision is foundational in AI mobile app development.
How do I decide whether my AI mobile app should run inference on-device or in the cloud?
Start with the UX requirement: if the feature must be instant, real-time, or offline, you should lean on-device or hybrid. If it’s asynchronous or benefits from large models and centralized tools (RAG, tool calling), cloud may be the better fit.
Then apply constraints (privacy, data residency, enterprise audit needs) and run a cost model at 10× usage. The “right” answer is usually the one that still works when you’re successful.
What latency can I realistically expect from on-device AI vs cloud AI for mobile apps?
On-device inference for smaller models can often hit tens to a couple hundred milliseconds on newer devices, especially after warm-up. Mid-tier phones may be slower, and thermal throttling can degrade performance over time.
Cloud inference can be fast on good networks, but the problem is tail latency: p95 and p99 can jump to seconds on congested mobile connections. That’s why you should benchmark on your target devices and geographies rather than relying on lab results.
How does the choice between on-device and cloud AI affect mobile app user experience?
On-device tends to feel more reliable because it reduces network dependency and enables offline capability. That reliability often shows up as higher trust: users stop “double tapping” and learn the feature will respond.
Cloud can feel great when connectivity is strong, but it introduces latency variance and uptime coupling. If your AI is core to the experience, cloud-only can turn minor backend issues into major product issues.
What are the cost implications of on-device vs cloud AI for a high-usage mobile app?
Cloud inference is a variable cost that scales with usage: more engagement means higher inference cost, plus egress, logging, and retrieval infrastructure if you use it. That can compress margins in the success case.
On-device shifts cost to engineering, optimization, and QA across device tiers; once shipped, marginal inference cost approaches zero. Many teams end up with a hybrid approach to pay cloud only when it adds measurable value.
How does AI architecture impact privacy, compliance, and data residency for mobile apps?
Architecture determines what data leaves the device, which in turn shapes consent UX, storage obligations, and residency constraints. On-device preprocessing and inference can materially reduce sensitive data exposure by keeping raw inputs local.
Cloud architectures can satisfy enterprise requirements for audit logs and centralized policy, but they increase the amount of sensitive data in motion and at rest. If you need help mapping requirements to a workable system, Buzzi.ai’s AI Discovery workshop is designed to pressure-test these tradeoffs early.
When is a hybrid on-device and cloud AI architecture the best option for a mobile app?
Hybrid is best when you need on-device speed and offline reliability for the “hot path,” but still want cloud capability for complex requests, low-confidence cases, or features that require tools and fresh data.
It also future-proofs your roadmap: you can shift traffic between device and cloud as pricing changes, models improve, or user expectations evolve—without forcing a full rewrite.
What hardware constraints (GPU/NPU, memory, battery) limit on-device AI on phones?
Battery and thermals are the most punishing constraints because they change over time: a model that’s fast for the first minute can slow dramatically after sustained use. Memory pressure can also cause crashes or force the OS to kill background work.
Hardware acceleration varies by chipset and OS version, so you need to test across a representative device matrix. In AI mobile app development, “works on my phone” is not a performance strategy.


