Making Apple Foundation Models and Gemini Interchangeable: A Three-Tier Abstraction for In-App AI
After WWDC26 opened Apple Foundation Models to qualifying developers and announced server-side Gemini integration, I redesigned my apps around a three-tier abstraction — on-device, Private Cloud Compute, and third-party APIs — behind a single Swift protocol.
The day after WWDC26 wrapped, I opened the source of one of my production apps and winced. The Gemini API client was being called directly from view-layer code in far more places than I remembered.
With Apple Foundation Models now opened free of charge to qualifying developers, and a server-side integration announced that lets you call Claude or Gemini through the same Swift API, hard-wiring a specific model deep into your codebase is a debt your future self will have to pay.
For an independent developer, switching model providers is not a contingency — it is an annual event. Over the past two years I have swapped the backends for summarization, translation, and image description more times than I care to count, and each swap meant editing app code. I wanted that to stop. What follows is the three-tier abstraction I settled on, with the Swift code to back it up.
Why provider-coupled code breaks down within a year
The problem with direct coupling is not that the code stops working. It keeps working — and becomes impossible to change.
Three stiffening points show up in practice.
Provider-specific request shapes leak into the UI layer. If your view models assemble Gemini request structs directly, a provider change becomes a UI-layer refactor
You cannot track pricing and eligibility changes. Apple's free tier reportedly draws the line at two million first-time downloads. Lines like that move — your app grows, policies get revised. Rewriting every call site each time eligibility shifts is not realistic
Fallbacks become unwritable. A cascade like "try on-device, then fall back to the cloud" is only implementable when every call goes through one unified entry point
In my own app, direct imports of the Gemini client were scattered across 14 files. Seeing that number is what finally pushed me to build the abstraction layer.
Think in three tiers from summer 2026 onward
After WWDC26, the AI execution environments available to an iOS app settle into three tiers.
Tier 1, on-device (the Foundation Models framework). Lowest latency, works offline, costs nothing extra. It cedes vocabulary and long-form coherence to the upper tiers, but it is plenty for classification, short generation, and keyword extraction
Tier 2, Private Cloud Compute. The tier covered by the newly announced free access. It accepts image input and, while off-device, runs on Apple's privacy infrastructure
Tier 3, third-party APIs such as the Gemini API. The highest performance ceiling, billed per use. The announced server-side integration is expected to expose this tier through the same Swift API as well
The key design move is not deciding which tier each feature uses. It is building a structure that can move between tiers first, so the decision itself stays swappable. The decision criteria come later in this article.
Define the single interface that every AI call in the app depends on. The problem this code solves: call sites should never know which provider they are talking to.
// Every AI call in the app depends on this protocol and nothing else
import Foundation  // For TimeInterval

protocol AITextClient: Sendable {
    var tier: AITier { get }
    func generate(_ request: AIRequest) async throws -> AIResponse
}

enum AITier: Int, Comparable, Sendable {
    case onDevice = 0
    case privateCloud = 1
    case thirdParty = 2

    static func < (lhs: Self, rhs: Self) -> Bool { lhs.rawValue < rhs.rawValue }
}

struct AIRequest: Sendable {
    let prompt: String
    let maxTokens: Int
    let privacy: PrivacyClass   // Nature of the input data; used for routing
    let deadline: TimeInterval  // How long the call site is willing to wait, in seconds
}

enum PrivacyClass: Sendable {
    case sensitive  // Contains personal data or user-created content
    case standard   // May leave the device, but not our managed boundary
    case open       // May be sent to any tier
}

struct AIResponse: Sendable {
    let text: String
    let servedBy: AITier  // Recording which tier answered pays off in ops analysis
}
Putting privacy and deadline on AIRequest is the heart of this design. Instead of declaring which model to use, the call site declares what this call can tolerate. Provider selection becomes the router's job, and call-site code does not change when the model does.
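To make "declare what this call can tolerate" concrete, here is a minimal sketch of call-site presets. The factory names `uiBlocking` and `background`, the token limits, and the exact deadline values are my assumptions for illustration; the types are condensed copies of the ones above so the snippet compiles on its own.

```swift
import Foundation

// Condensed copies of the article's types.
enum PrivacyClass: Sendable { case sensitive, standard, open }

struct AIRequest: Sendable {
    let prompt: String
    let maxTokens: Int
    let privacy: PrivacyClass
    let deadline: TimeInterval
}

// Hypothetical presets: the call site names a tolerance, never a model.
extension AIRequest {
    // For calls that block the UI: short output, sub-second wait,
    // and the strictest privacy class by default.
    static func uiBlocking(_ prompt: String) -> AIRequest {
        AIRequest(prompt: prompt, maxTokens: 64, privacy: .sensitive, deadline: 0.8)
    }

    // For background work: a longer wait, with the privacy class
    // stated explicitly by the caller.
    static func background(_ prompt: String, privacy: PrivacyClass) -> AIRequest {
        AIRequest(prompt: prompt, maxTokens: 512, privacy: privacy, deadline: 10)
    }
}
```

A view model would then write something like `AIRequest.uiBlocking("Suggest a tag")` and hand the result to the router. Nothing in that line mentions Gemini or Foundation Models, which is the point.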
The other practical payoff is testability. Write one stub conforming to the protocol and every screen's AI integration becomes unit-testable without touching the network. Branches I could previously only exercise through UI tests now run in tens of milliseconds.
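A stub of that kind might look like the sketch below. `StubClient` and `summaryTitle(for:using:)` are hypothetical names, and the types are condensed copies of the definitions above so the snippet stands alone; treat it as one possible shape, not the code from my app.

```swift
import Foundation

// Condensed copies of the article's types.
enum AITier: Int, Sendable { case onDevice, privateCloud, thirdParty }
enum PrivacyClass: Sendable { case sensitive, standard, open }

struct AIRequest: Sendable {
    let prompt: String
    let maxTokens: Int
    let privacy: PrivacyClass
    let deadline: TimeInterval
}

struct AIResponse: Sendable {
    let text: String
    let servedBy: AITier
}

protocol AITextClient: Sendable {
    var tier: AITier { get }
    func generate(_ request: AIRequest) async throws -> AIResponse
}

// Hypothetical stub: returns a canned answer, or throws to simulate a failing tier.
struct StubClient: AITextClient {
    struct Failure: Error {}

    let tier: AITier
    var cannedText: String? = "stub summary"  // nil means "always fail"

    func generate(_ request: AIRequest) async throws -> AIResponse {
        guard let cannedText else { throw Failure() }
        return AIResponse(text: cannedText, servedBy: tier)
    }
}

// Hypothetical view-model logic under test. It depends on the protocol only.
func summaryTitle(for note: String, using client: any AITextClient) async -> String {
    let request = AIRequest(prompt: "Summarize: \(note)", maxTokens: 64,
                            privacy: .sensitive, deadline: 1.0)
    do {
        return try await client.generate(request).text
    } catch {
        return "Untitled"  // The error branch that used to need a UI test
    }
}
```

Passing `StubClient(tier: .onDevice, cannedText: nil)` exercises the error branch without a network connection or a device model.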
Each provider then conforms to the protocol.
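struct GeminiClient: AITextClient {
    let tier: AITier = .thirdParty
    private let apiKey: String  // Injected from the Keychain, never hard-coded

    init(apiKey: String) { self.apiKey = apiKey }

    // GeminiPayload and GeminiResult are the app's own Codable types for the
    // request and response bodies; their definitions are not shown here.
    func generate(_ request: AIRequest) async throws -> AIResponse {
        var urlRequest = URLRequest(url: URL(string: "https://generativelanguage.googleapis.com/v1/models/gemini-pro:generateContent")!)
        urlRequest.httpMethod = "POST"
        urlRequest.setValue("application/json", forHTTPHeaderField: "Content-Type")
        // Send the key as a header rather than a query parameter so it stays out of logged URLs
        urlRequest.setValue(apiKey, forHTTPHeaderField: "x-goog-api-key")
        urlRequest.httpBody = try JSONEncoder().encode(
            GeminiPayload(prompt: request.prompt, maxOutputTokens: request.maxTokens))

        let (data, response) = try await URLSession.shared.data(for: urlRequest)
        // Surface HTTP errors as thrown errors so the router can fall back cleanly
        guard let http = response as? HTTPURLResponse,
              (200..<300).contains(http.statusCode) else {
            throw URLError(.badServerResponse)
        }
        let decoded = try JSONDecoder().decode(GeminiResult.self, from: data)
        return AIResponse(text: decoded.text, servedBy: tier)
    }
}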
Inject the API key at initialization from the Keychain rather than leaving a placeholder in source. As a side benefit, mocking the client in tests becomes trivial.
Fallback order is policy — keep it out of call sites
Start writing multi-tier fallback with do-catch at each call site and the same branching multiplies across your app. Containing it in a router is what holds up over years of operation.
enum AIRouterError: Error {
    case allTiersFailed
    case timedOut
}

struct AIRouter {
    let clients: [any AITextClient]  // Pass in ascending tier order

    func generate(_ request: AIRequest) async -> Result<AIResponse, AIRouterError> {
        for client in orderedClients(for: request) {
            do {
                // Each tier gets the full deadline, so the worst case is
                // deadline × number of eligible tiers
                let response = try await withTimeout(seconds: request.deadline) {
                    try await client.generate(request)
                }
                return .success(response)
            } catch {
                continue  // Give up on this tier and move to the next
            }
        }
        return .failure(.allTiersFailed)
    }

    private func orderedClients(for request: AIRequest) -> [any AITextClient] {
        switch request.privacy {
        case .sensitive:
            // Input containing personal data never leaves the device
            return clients.filter { $0.tier == .onDevice }
        case .standard:
            return clients.filter { $0.tier <= .privateCloud }
        case .open:
            return clients
        }
    }
}

func withTimeout<T: Sendable>(
    seconds: TimeInterval,
    _ work: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await work() }
        group.addTask {
            // Clamp to zero so a negative deadline cannot trap in the UInt64 conversion
            try await Task.sleep(nanoseconds: UInt64(max(0, seconds) * 1_000_000_000))
            throw AIRouterError.timedOut
        }
        // Cancel the losing task on every exit path. The group still waits for it
        // to finish, so clients must honor cancellation (URLSession's async APIs do).
        defer { group.cancelAll() }
        guard let result = try await group.next() else {
            throw AIRouterError.allTiersFailed
        }
        return result
    }
}
Why write it this way? Because the policy — fallback order, cutoff time, privacy boundary — lives only in AIRouter and orderedClients. When some tier's terms change next year, there is exactly one file to edit.
It is worth naming the trap that almost always bites in production: writing if onDeviceFailed { callGemini() } directly in each view model. The first screen is fine. By the third screen, timeout values and ordering drift apart between screens, and bug reports stop being reproducible. I lost several days to exactly this. Centralizing timeouts and ordering in the router eliminates this class of unreproducible bug structurally.
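Because the privacy boundary is policy, it is worth pinning down in a unit test so that a later edit cannot quietly widen it. The sketch below pulls the rule out into two pure functions. `highestAllowedTier(for:)` and `fallbackOrder(for:available:)` are hypothetical helpers that mirror the switch in `orderedClients`, not part of the router above, and the sort is an extra safeguard I added so the order no longer depends on how the clients were passed in.

```swift
// Condensed copies of the article's types.
enum AITier: Int, Comparable, Sendable {
    case onDevice = 0, privateCloud = 1, thirdParty = 2
    static func < (lhs: Self, rhs: Self) -> Bool { lhs.rawValue < rhs.rawValue }
}

enum PrivacyClass: Sendable { case sensitive, standard, open }

// Hypothetical helper: the highest tier a request of this privacy class may reach.
func highestAllowedTier(for privacy: PrivacyClass) -> AITier {
    switch privacy {
    case .sensitive: return .onDevice
    case .standard:  return .privateCloud
    case .open:      return .thirdParty
    }
}

// Tiers the router may try, lowest tier first, regardless of input order.
func fallbackOrder(for privacy: PrivacyClass, available: [AITier]) -> [AITier] {
    available
        .filter { $0 <= highestAllowedTier(for: privacy) }
        .sorted()
}
```

With all three tiers available, a `.sensitive` request yields only `.onDevice`, and a `.standard` request yields `.onDevice` followed by `.privateCloud`.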
Three criteria for choosing a tier
After running this in production, my routing decisions converged on three criteria.
Privacy. If the input includes user-written text or photos, I restrict it to on-device as a rule. Sending it to the cloud for convenience is an option — but only behind an explicit opt-in in the settings screen
Latency. Calls that block the UI (completions, classification) get a deadline under one second, which makes them effectively on-device-only. On my iPhone 16, short generations come back in roughly half a second; the same prompt routed through Tier 3 took 2–3 seconds, a 4–6x gap in practice. In my production app, about 80% of all AI calls are now served entirely by Tier 1. Background work like summarization gets a 10-second budget and is allowed to reach the upper tiers
Cost. The third-party tier is metered, so I never route features with unpredictable call volume to it. Counting calls just before the AIRouter and automatically capping at Tier 2 once the monthly budget is spent keeps things safe. When in doubt, I recommend defaulting conservatively to the lower tier
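The cost cap can be sketched as a small gate that sits just in front of the router. `ThirdPartyBudget` and its method names are hypothetical, and counting calls rather than tokens or currency is a simplifying assumption. The trick is that downgrading `.open` to `.standard` reuses the router's existing privacy switch, so the cap needs no new routing logic.

```swift
// Condensed copies of the article's types.
enum AITier: Int, Sendable { case onDevice, privateCloud, thirdParty }
enum PrivacyClass: Sendable { case sensitive, standard, open }

// Hypothetical gate placed just before the router.
actor ThirdPartyBudget {
    private let monthlyLimit: Int
    private(set) var used = 0

    init(monthlyLimit: Int) { self.monthlyLimit = monthlyLimit }

    // Once the budget is spent, treat .open requests as .standard,
    // which makes the router stop at Tier 2.
    func cap(_ requested: PrivacyClass) -> PrivacyClass {
        requested == .open && used >= monthlyLimit ? .standard : requested
    }

    // Call with response.servedBy after each successful request.
    func record(servedBy tier: AITier) {
        if tier == .thirdParty { used += 1 }
    }

    // Call when a new billing month starts.
    func resetForNewMonth() { used = 0 }
}
```

Two caveats with this sketch: requests already in flight when the limit is reached can overshoot it by a few calls, and `used` lives in memory only, so a real app would need to persist it across launches.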
When delegating to Antigravity, hand over the tier boundary first
Implementing this abstraction layer is a job that delegates well to an Antigravity agent — with one caveat about how you brief it.
Tell the agent only "add Gemini support" and it will, with high probability, sprout a client directly inside a view model. Agents take the shortest path through existing code. The briefing pattern that worked for me was handing over the protocol definition and the prohibitions before the task.
Include the AITextClient protocol definition file in the context first
State the negative constraint explicitly: no file other than AIRouter may import the concrete clients
Set "zero diff in existing call-site code" as an acceptance criterion
Since adopting these three points, the agent has almost never broken the design boundary. An abstraction layer turns out to be not just for humans — it is also the fence that makes delegating work to agents safe.
Wrapping up: start by auditing your import statements
Architecture talk stays theoretical until there is a concrete first step. There is exactly one to take here.
Run something like grep -r "import GoogleGenerativeAI" --include="*.swift" . across your project and count how many files depend directly on a model SDK. If that number has two digits, defining the protocol is worth doing before your next feature.
A week of back-to-back announcements from Apple and Google has shifted the ground under in-app AI. I hope this design serves you well when you rebuild on it.