How Far Can On-Device Inference Stay Free? Measuring the Line Between Foundation Models and Gemini
WWDC 2026 widened the free tier of Apple's Foundation Models, making on-device inference easier to reach for. But "free, so everything on-device" runs into cases where quality falls short. Here is how to decide, by measurement rather than guesswork, which requests you keep on-device and which you pass to cloud Gemini.
After WWDC 2026, I tried shifting a small app's summarization feature onto on-device inference. The free tier had widened, so I wondered whether I could move what I'd been paying the cloud for onto the device.
Summarizing short text completed on-device surprisingly cleanly. But when I handed it long text laced with jargon, or text where several topics tangled together, the summaries increasingly skimmed the surface and missed the core.
The initial hope of "free, so everything on-device" breaks down right here. Yet reverting to "quality feels risky, so everything to the cloud" means not using the free tier that just widened, paying a network round trip even for a short summary.
The answer is where to draw the line: how much to hold on-device, and from what point to pass to the cloud. Drawn by feel, this line always leans too far one way, so it has to be drawn by measurement.
See the dividing line with three rulers
Whether to use on-device or cloud can't be decided on a single criterion. I learned to look along three separate axes.
The first is quality: is the on-device model's output good enough to be usable for that task? This part is hard to put a number on, but the comparison I'll touch on later lets you build a proxy metric.
The second is latency. It may surprise you, but for short tasks on-device is often faster: with no network round trip, the first output arrives sooner. For long generation, the device's processing power tops out and the cloud can pull ahead. The intuition that matters is that the winner flips with task length.
The third is cost. On-device is effectively zero within the free tier, while cloud cost accumulates with every call. But making "free" the top priority while sacrificing quality drives users away and costs more in the end. Cost is an axis to view paired with quality, not alone.
Measure these three properly, once, against your app's representative requests, and the dividing line becomes visible with grounding rather than guesswork: "this kind of task is fine on-device," "this kind is better passed to the cloud."
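To make the cost axis concrete, here is a rough sketch of the arithmetic. The request volume and per-call price are placeholder values I made up for illustration, not actual Gemini pricing, so substitute your own numbers.

```swift
// Rough monthly cloud cost for a given share of requests routed to the cloud.
// All figures passed in below are hypothetical placeholders, not real pricing.
func monthlyCloudCost(requestsPerMonth: Int,
                      cloudShare: Double,
                      costPerCloudCall: Double) -> Double {
    // Only the cloud-routed share is billed; on-device calls are treated as free.
    let clamped = min(max(cloudShare, 0), 1)
    return Double(requestsPerMonth) * clamped * costPerCloudCall
}

// Example: 10,000 requests per month at a made-up price of 0.002 per cloud call.
let everythingCloud = monthlyCloudCost(requestsPerMonth: 10_000,
                                       cloudShare: 1.0,
                                       costPerCloudCall: 0.002)
let fortyPercentCloud = monthlyCloudCost(requestsPerMonth: 10_000,
                                         cloudShare: 0.4,
                                         costPerCloudCall: 0.002)
print("all cloud: \(everythingCloud), 40% cloud: \(fortyPercentCloud)")
```

The point of the sketch is only that cost scales with the cloud share, which is why it has to be read next to quality: a share of zero is the cheapest and may also be the worst.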
Measure it once, in your own app
The three axes click into place faster after one measurement than after any amount of thinking. No elaborate benchmarking needed. Prepare about ten representative requests, feed the same input to both on-device and cloud, and just record the processing time and output side by side.
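    import Foundation

    // A one-off measurement that records latency and output for both paths.
    // `onDevice` and `cloudFallback` are the app's own summarizer wrappers.
    func benchmark(_ inputs: [String]) async {
        for text in inputs {
            // Time the on-device path
            let t0 = Date()
            let local = try? await onDevice.summarize(text)
            let localMs = Date().timeIntervalSince(t0) * 1000

            // Time the cloud path with the same input
            let t1 = Date()
            let cloud = try? await cloudFallback.summarize(text)
            let cloudMs = Date().timeIntervalSince(t1) * 1000

            let faster = localMs < cloudMs ? "on-device" : "cloud"
            print(String(format: "len=%4d on-device=%.0fms cloud=%.0fms faster=%@",
                         text.count, localMs, cloudMs, faster))
            // Print both outputs so they can be read side by side for quality
            print("  on-device: \(local?.text ?? "(failed)")")
            print("  cloud:     \(cloud?.text ?? "(failed)")")
        }
    }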
When I ran this in my app, on-device was consistently faster while inputs were short, and cloud overtook it somewhere past roughly 800 characters. For quality, reading the ten outputs side by side and marking the on-device ones that missed the core is enough to sense from what length it gets shaky.
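If you'd rather not eyeball where the flip happens, a small helper can pull it out of the recorded rows. This is a minimal sketch: the `Sample` type, the `crossoverLength` function, and the sample numbers are all made up for illustration and are not my actual measurements.

```swift
// One row per benchmarked request (hypothetical record type).
struct Sample {
    let length: Int        // input length in characters
    let localMs: Double    // on-device latency
    let cloudMs: Double    // cloud latency
}

// Returns the shortest input length from which the cloud was faster on that
// sample and on every longer one, or nil if the cloud never leads for good.
func crossoverLength(_ samples: [Sample]) -> Int? {
    let sorted = samples.sorted { $0.length < $1.length }
    var crossover: Int? = nil
    for s in sorted {
        if s.cloudMs < s.localMs {
            // Cloud is ahead; remember where the current streak began.
            if crossover == nil { crossover = s.length }
        } else {
            // On-device won again, so any earlier streak doesn't count.
            crossover = nil
        }
    }
    return crossover
}

// Made-up measurements, only to show the shape of the data.
let samples = [
    Sample(length: 200, localMs: 300, cloudMs: 900),
    Sample(length: 500, localMs: 650, cloudMs: 950),
    Sample(length: 800, localMs: 1_100, cloudMs: 1_050),
    Sample(length: 1_500, localMs: 2_400, cloudMs: 1_300),
]
if let length = crossoverLength(samples) {
    print("cloud is faster from about \(length) characters")
} else {
    print("no stable crossover in these samples")
}
```

With only ten or so samples the result is a rough marker, not a precise boundary. It tells you which neighborhood to look at when you read the outputs.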
You run this measurement once before drawing the line, not every time. Keeping the numbers at hand turns later threshold tuning from a vague feeling into something grounded.
Once you can see the line, you translate it into implementation. What you want to avoid is hard-coding the route by task type. Even the same "summarize" is fine on-device for short, plain input and needs the cloud for long, complex input. You want to switch by the difficulty of each item, not by its type.
So you make it two-tier: try on-device first, and only when you can't trust the result, route to the cloud.
    import Foundation
    import FoundationModels

    // Try on-device; route to Gemini only when confidence is low.
    // Note: `summarize(_:)` and `confidence` are assumed to be helpers defined
    // elsewhere in the app (for example in an extension); treat them as placeholders.
    struct AdaptiveSummarizer {
        let onDevice: SystemLanguageModel
        let cloudFallback: GeminiClient

        func summarize(_ text: String) async throws -> Summary {
            // Entrance check first: skip on-device entirely for input clearly hard for the device
            if !needsCloud(text) {
                let local = try await onDevice.summarize(text)
                // High enough confidence: keep it (don't call the cloud)
                if local.confidence >= 0.7 {
                    return Summary(text: local.text, source: .onDevice)
                }
            }
            // Send to the cloud only inputs we can't trust, or that are clearly not device-friendly
            let cloud = try await cloudFallback.summarize(text)
            return Summary(text: cloud.text, source: .gemini)
        }

        // A light entrance check that rejects inputs clearly hard for the device
        private func needsCloud(_ text: String) -> Bool {
            let longEnough = text.count > 1_200
            let manyTopics = text.components(separatedBy: "\n\n").count > 6
            return longEnough && manyTopics
        }
    }

    enum SummarySource { case onDevice, gemini }
    struct Summary { let text: String; let source: SummarySource }
This combines two checks. needsCloud is a coarse entrance sieve that routes long, many-topic input to the cloud before even trying on-device, because running on-device inference once and then discarding the result for input that is clearly too hard for the device is wasteful. confidence is the finer check that follows: when the on-device result is unreliable, the request drops to the cloud.
Thanks to this two-tier setup, calls to the cloud narrow down to "requests the device genuinely couldn't handle." Short, plain requests complete without touching the network, and only the hard requests cross outside the free tier.
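The code above reads a confidence value but leaves open where that number comes from. One crude stand-in, offered purely as an assumption, is keyword coverage: the share of the source's most frequent longer words that also appear in the summary. The function name `coverageConfidence` and its cutoffs (words of five or more letters, top ten keywords) are hypothetical choices for this sketch, not part of FoundationModels, and the splitting logic assumes whitespace-delimited text such as English.

```swift
import Foundation

// Hypothetical confidence proxy: the share of the source's most frequent
// longer words that also appear in the summary. Not a FoundationModels API.
func coverageConfidence(source: String, summary: String, topN: Int = 10) -> Double {
    // Lowercase, split on anything that isn't a letter or digit, drop short words.
    func words(_ s: String) -> [String] {
        s.lowercased()
            .components(separatedBy: CharacterSet.alphanumerics.inverted)
            .filter { $0.count >= 5 }
    }
    // Count word frequency in the source.
    var counts: [String: Int] = [:]
    for w in words(source) { counts[w, default: 0] += 1 }
    // Keep the topN most frequent words; ties break alphabetically for stable results.
    let keywords = counts
        .sorted { $0.value != $1.value ? $0.value > $1.value : $0.key < $1.key }
        .prefix(topN)
        .map { $0.key }
    guard !keywords.isEmpty else { return 0 }
    // Score is the fraction of keywords the summary mentions.
    let summaryWords = Set(words(summary))
    let hits = keywords.filter { summaryWords.contains($0) }.count
    return Double(hits) / Double(keywords.count)
}

let score = coverageConfidence(
    source: "battery battery battery charging charging thermal",
    summary: "The battery and charging behavior"
)
print(score)
```

A proxy like this only catches summaries that drop the source's main terms. It says nothing about whether the summary is accurate, so validate it against the outputs you marked by hand before trusting any threshold on it.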
Review how much the free tier covered
Once the fallback is in, you'll want to confirm later that it works as intended. If the share routed to the cloud is too high, the on-device confidence threshold may be too strict; if it's too low and quality complaints arrive, the threshold may be too loose.
So record which path handled each summary and aggregate it periodically.
    import Foundation

    // Record fallback outcomes and review how much the free tier covered
    actor FallbackMeter {
        private var onDevice = 0
        private var cloud = 0

        // Count one handled request by the path that produced it
        func record(_ source: SummarySource) {
            switch source {
            case .onDevice: onDevice += 1
            case .gemini: cloud += 1
            }
        }

        // One-line summary of the split between on-device and cloud
        func report() -> String {
            let total = onDevice + cloud
            guard total > 0 else { return "no data yet" }
            let onDeviceRate = Double(onDevice) / Double(total) * 100
            return String(
                format: "on-device %.1f%% (%d) / cloud %.1f%% (%d)",
                onDeviceRate, onDevice, 100 - onDeviceRate, cloud
            )
        }
    }
In my small summarization feature, I tuned the threshold once while watching this number. I'd started strict at confidence >= 0.85, and the cloud side had swelled to nearly 40%. Reading the outputs side by side, many of the on-device summaries in the 0.7 range were plenty usable, so I lowered the threshold to 0.7. That cut cloud calls by roughly half, as a rough estimate rather than a precise figure. Quality complaints didn't rise.
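If you also log the confidence of each on-device result, you can estimate the effect of a threshold change before shipping it. The sketch below assumes such a log exists; the `cloudShare` helper and the ten scores are made up for illustration, and it ignores requests that the entrance check sends straight to the cloud.

```swift
// Share of requests that would fall back to the cloud at a given threshold,
// i.e. those whose logged on-device confidence is below it.
func cloudShare(confidences: [Double], threshold: Double) -> Double {
    guard !confidences.isEmpty else { return 0 }
    let fallbacks = confidences.filter { $0 < threshold }.count
    return Double(fallbacks) / Double(confidences.count)
}

// Made-up log of ten on-device confidence scores.
let logged: [Double] = [0.97, 0.95, 0.93, 0.91, 0.89, 0.87, 0.81, 0.76, 0.64, 0.52]

// Compare candidate thresholds side by side.
for threshold in [0.85, 0.7] {
    let share = cloudShare(confidences: logged, threshold: threshold)
    print("threshold \(threshold): \(Int((share * 100).rounded()))% to cloud")
}
```

The sweep only tells you how the traffic split would move. Whether the summaries that stay on-device are good enough still has to be checked by reading them.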
The important point is that once set, this threshold isn't fixed forever. On-device models change performance with OS updates, and the content you handle drifts over time too. Keeping the aggregation at hand lets you notice when the fallback rate quietly creeps up.
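Noticing a creep is easier if something compares periods for you. Here is a minimal sketch, assuming you snapshot the on-device and cloud counts per period (for example weekly); `PeriodCount`, `fallbackCreptUp`, and the 10-point margin are hypothetical names and values chosen for illustration.

```swift
// Per-period totals, e.g. a weekly snapshot of your own counters (hypothetical type).
struct PeriodCount {
    let onDevice: Int
    let cloud: Int

    // Cloud share in percent; nil when the period has no data.
    var cloudPercent: Double? {
        let total = onDevice + cloud
        guard total > 0 else { return nil }
        return Double(cloud) / Double(total) * 100
    }
}

// True when the latest cloud share exceeds the baseline by more than
// `marginPoints` percentage points. The default of 10 is an arbitrary placeholder.
func fallbackCreptUp(baseline: PeriodCount,
                     latest: PeriodCount,
                     marginPoints: Double = 10) -> Bool {
    guard let before = baseline.cloudPercent,
          let now = latest.cloudPercent else { return false }
    return now - before > marginPoints
}

// Made-up counts: 20% cloud in the baseline week, 35% in the latest.
let drifted = fallbackCreptUp(baseline: PeriodCount(onDevice: 80, cloud: 20),
                              latest: PeriodCount(onDevice: 65, cloud: 35))
print(drifted ? "fallback rate crept up; re-run the measurement" : "fallback rate is stable")
```

A flag like this doesn't say why the rate moved. It is a prompt to rerun the ten-request measurement and see whether the model, the content, or the threshold is the cause.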
The line isn't fixed; update it by measurement
The wider on-device free tier is plainly welcome news for an indie developer. But how much of that benefit you actually capture hinges on whether you can properly measure "how far to hold on-device."
Lean everything on-device and you drop quality; revert everything to the cloud and you throw away the free tier. Measure the dividing line somewhere between them on quality, latency, and cost; implement it with confidence-based fallback; review it by the fallback rate. Keep this loop and you can redraw the line even as on-device models evolve or cloud pricing shifts.
Making full use of the free tier and not compromising on quality don't conflict. With a mechanism that measures "is the device enough for this?" per item and switches, I believe you can hold both at once.
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.