Antigravity Lab
App Development · 2026-06-14 · Intermediate

How Far Can On-Device Inference Stay Free? Measuring the Line Between Foundation Models and Gemini

WWDC 2026 widened the free tier for Apple Foundation Models, which makes on-device inference easier to reach for. But a policy of "it's free, so run everything on-device" runs into cases where quality falls short. Here is how to decide, by measurement rather than guesswork, which requests stay on-device and which go to Gemini in the cloud.

Tags: Apple Foundation Models, iOS 27, on-device inference, Gemini, fallback, cost management, indie development, measurement

After WWDC 2026, I tried shifting a small app's summarization feature onto on-device inference. The free tier had widened, so I wondered whether I could move what I'd been paying the cloud for onto the device.

Summarizing short text completed on-device surprisingly cleanly. But when I handed it long text laced with jargon, or text where several topics tangled together, the summaries increasingly skimmed the surface and missed the core.

The initial hope of "free, so everything on-device" breaks down right here. Yet reverting to "quality feels risky, so everything to the cloud" means leaving the newly widened free tier unused and paying for a network round trip even on a short summary.

Here, too, the answer comes down to where to draw the line: how much to keep on-device, and at what point to hand off to the cloud. Drawn by feel, that line always leans too far one way. It has to be drawn by measurement.

See the dividing line with three rulers

Whether to use on-device or cloud can't be decided on a single criterion. I learned to look along three separate axes.

The first is quality: is the on-device model's output good enough to be usable for that task? This part is hard to put a number on, but the comparison I'll touch on later lets you build a proxy metric.

The second is latency. It may surprise you, but for short tasks on-device is often faster: with no network round trip, the first character comes back sooner. For long generation, the device's processing power becomes the limit and the cloud can be faster. The key intuition is that the faster path flips with task length.

The third is cost. On-device is effectively zero within the free tier, while cloud cost accumulates with every call. But making "free" the top priority at the expense of quality drives users away and costs more in the end. Cost is an axis to view together with quality, not alone.
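A rough back-of-the-envelope estimate can make the cost axis concrete. The sketch below assumes a flat per-call cloud price and treats on-device calls as free within the free tier. The function name `monthlyCloudCost` and every number in the example are hypothetical placeholders, not real Gemini pricing.

```swift
// Rough monthly cloud spend when only a share of requests leaves the device.
// All parameters are illustrative assumptions; substitute your own traffic
// numbers and your provider's actual pricing.
func monthlyCloudCost(requestsPerDay: Int,
                      cloudShare: Double,        // fraction of requests sent to the cloud, 0.0...1.0
                      pricePerCloudCall: Double, // placeholder flat price per call
                      daysPerMonth: Int = 30) -> Double {
    precondition((0.0...1.0).contains(cloudShare), "cloudShare must be between 0 and 1")
    let cloudCallsPerMonth = Double(requestsPerDay * daysPerMonth) * cloudShare
    return cloudCallsPerMonth * pricePerCloudCall
}

// Hypothetical example: 1,000 requests a day at a placeholder price of 0.002 per call.
let everythingInCloud = monthlyCloudCost(requestsPerDay: 1_000, cloudShare: 1.0, pricePerCloudCall: 0.002)
let longInputsOnly = monthlyCloudCost(requestsPerDay: 1_000, cloudShare: 0.25, pricePerCloudCall: 0.002)
print(everythingInCloud, longInputsOnly)
```

The point of the comparison is the ratio, not the absolute figures: the share of traffic that really needs the cloud scales the bill directly, which is why it is worth measuring that share rather than assuming it.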

Measure these three properly, once, against your app's representative requests, and the dividing line ("this kind of task is fine on-device," "this kind is better passed to the cloud") becomes visible with grounding rather than guesswork.

Measure it once, in your own app

The three axes click into place faster after one measurement than after any amount of thinking. No elaborate benchmarking needed. Prepare about ten representative requests, feed the same input to both on-device and cloud, and just record the processing time and output side by side.

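import Foundation

// A one-off measurement that records latency and output for both paths.
// `onDevice` and `cloudFallback` are the app's own summarizers, assumed here
// to expose `summarize(_:) async throws -> String`.
func benchmark(_ inputs: [String]) async {
    for text in inputs {
        // Time the on-device path. A failure becomes nil instead of aborting the run.
        let t0 = Date()
        let local = try? await onDevice.summarize(text)
        let localMs = Date().timeIntervalSince(t0) * 1000

        // Time the cloud path with the same input.
        let t1 = Date()
        let cloud = try? await cloudFallback.summarize(text)
        let cloudMs = Date().timeIntervalSince(t1) * 1000

        // Compare timings only when both paths produced output; a fast failure is not a win.
        let faster = (local == nil || cloud == nil) ? "n/a" : (localMs < cloudMs ? "on-device" : "cloud")
        print(String(format: "len=%4d on-device=%.0fms cloud=%.0fms faster=%@",
                     text.count, localMs, cloudMs, faster))
        // Print both outputs side by side for the manual quality check.
        print("  on-device: \(local ?? "<failed>")")
        print("  cloud:     \(cloud ?? "<failed>")")
    }
}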

When I ran this in my app, on-device was consistently faster while inputs were short, and cloud overtook it somewhere past roughly 800 characters. For quality, reading the ten outputs side by side and marking the on-device ones that missed the core is enough to sense from what length it gets shaky.
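Once the rows are recorded, both observations can be read off mechanically. The following is a minimal sketch, assuming each row is kept in a hypothetical `Sample` struct with a hand-marked `missedCore` flag. The type, the helper names, and the values in the usage lines are illustrative and are not measurements from this article.

```swift
// One benchmark row: input length, both latencies, and a manual quality mark.
// `Sample` and its fields are hypothetical names used for this sketch.
struct Sample {
    let length: Int       // input length in characters
    let localMs: Double   // on-device latency in milliseconds
    let cloudMs: Double   // cloud latency in milliseconds
    let missedCore: Bool  // marked by hand: did the on-device summary miss the core?
}

// Smallest input length from which the cloud is faster for every longer sample too.
// Returns nil when the longest sample is still faster on-device.
func latencyCrossover(_ samples: [Sample]) -> Int? {
    var crossover: Int? = nil
    // Walk from longest to shortest and stop at the first on-device win.
    for s in samples.sorted(by: { $0.length > $1.length }) {
        if s.cloudMs < s.localMs {
            crossover = s.length
        } else {
            break
        }
    }
    return crossover
}

// Shortest input length at which the on-device summary was marked as missing the core.
func firstQualityMiss(_ samples: [Sample]) -> Int? {
    samples.filter { $0.missedCore }.map { $0.length }.min()
}

// Made-up rows purely to show the call shape.
let rows = [
    Sample(length: 200, localMs: 300, cloudMs: 900, missedCore: false),
    Sample(length: 600, localMs: 800, cloudMs: 1000, missedCore: false),
    Sample(length: 900, localMs: 1500, cloudMs: 1200, missedCore: true),
    Sample(length: 1400, localMs: 2600, cloudMs: 1400, missedCore: true),
]
print(latencyCrossover(rows) as Any, firstQualityMiss(rows) as Any)
```

The lower of the two lengths is a reasonable first candidate for the hand-off point, since beyond it on-device loses on at least one axis. With only about ten rows this is a rough reading, not a statistic.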

You run this measurement once before drawing the line, not every time. Keeping the numbers at hand turns later threshold tuning from a vague feeling into something grounded.
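As one way to put the recorded numbers to use, the line can start out as a plain length threshold. This is a sketch under assumptions: the `Route` enum and `LengthRouter` type are hypothetical, and the default of 800 characters only echoes the crossover seen in one app. Your own measurement should supply the value.

```swift
// Which path a request should take. Hypothetical names for this sketch.
enum Route: Equatable {
    case onDevice
    case cloud
}

// Routes by input length only. The threshold is a tunable value that should
// come from your own benchmark, not a constant that holds for every app.
struct LengthRouter {
    var maxOnDeviceLength: Int = 800

    func route(for text: String) -> Route {
        text.count <= maxOnDeviceLength ? .onDevice : .cloud
    }
}

let router = LengthRouter()
print(router.route(for: "A short note to summarize."))
print(router.route(for: String(repeating: "x", count: 1_200)))
```

Length is only the simplest signal. Keeping the threshold in one place means it can be adjusted later when the measurement is repeated or when other signals are added.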

WHAT YOU'LL LEARN
A decision procedure that measures the line between on-device and cloud inference on three points: quality, latency, and cost
A confidence-based fallback implementation that sends only the requests on-device inference can't answer to Gemini
An aggregation script for reviewing afterward how much of your traffic the free tier actually covered