AI Tools / 2026-04-22 / Intermediate

Memory Budget Design for Embedding Gemma 4 in Mobile Apps

When embedding Gemma 4 into a mobile app, peak memory during inference — not the model file size — becomes the real bottleneck. A memory budget design based on measured values, plus a device-tier switching strategy.


The first wall I hit after embedding Gemma 4 into a mobile app was not model size but peak memory. "It's a 1.8 GB quantized model, so surely it fits." Then you run inference on a real device and the OS kills the app within seconds. Many developers have had this exact experience.

On mobile, the on-disk file size and the peak runtime footprint are two different things. Only when the instantaneous total of model weights, KV cache, activations, and OS reserve fits within the device's envelope will the app run reliably. This article shares the memory budget design I use when shipping Gemma 4 to mobile, with the measured numbers that back it up.

The four memory regions you must account for

If you decompose what an inference-running app occupies in memory, it roughly splits into four regions.

  • Model weights: The quantized weight tensors themselves (static).
  • KV cache: The Key/Value buffers for past tokens (grows linearly with context length).
  • Activations: Intermediate tensors during inference (determined by layer count and hidden size).
  • OS / runtime reserve: GPU driver, ML framework, OS itself.
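To make the "grows linearly with context length" point concrete, here is a minimal sketch of the usual KV cache size estimate. It assumes every layer caches keys and values for the full context. The function name and the dimensions passed to it are hypothetical and chosen only for illustration; they are not Gemma 4's actual configuration.

```swift
/// Estimated KV cache size in bytes for a decoder-only transformer.
/// The leading factor of 2 covers the separate key and value tensors.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextLength: Int, bytesPerElement: Int) -> Int {
    return 2 * layers * kvHeads * headDim * contextLength * bytesPerElement
}

// Hypothetical dimensions for illustration only (not Gemma 4's real config).
let fullContext = kvCacheBytes(layers: 32, kvHeads: 4, headDim: 128,
                               contextLength: 4096, bytesPerElement: 2)
let halfContext = kvCacheBytes(layers: 32, kvHeads: 4, headDim: 128,
                               contextLength: 2048, bytesPerElement: 2)

// Halving the context halves the cache.
print(fullContext / 1_048_576, halfContext / 1_048_576)  // 256 128
```

Under these assumptions, context length is the only term in the product that the app can change at runtime, which is why it is the natural knob for a per-device budget.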

Judging "it fits" based on model weights alone guarantees you'll overrun in the other three regions. Here are my measured numbers (Gemma 4 4-bit quantized, 4096-token context, iOS via MLX).

  • Model weights: 1.85 GB
  • KV cache (4096 tokens): about 350 MB
  • Activations: about 220 MB
  • OS / runtime reserve: about 180 MB
  • Peak total: about 2.6 GB

On iPhone 14 (6 GB physical RAM) the app is allowed roughly 3 GB, so a 2.6 GB peak sits right at the edge of stability. On iPhone 12 (4 GB) you have to either shrink the context to 2048 tokens or switch to a lower-bit quantization variant.
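The budget check itself is simple arithmetic, and a small sketch shows how it could be written down. The `MemoryBudget` type is hypothetical. The four region values and the 3 GB app limit are the figures quoted above, expressed in MB, and the 500 MB headroom in the last line is an assumed safety margin rather than a measured value.

```swift
struct MemoryBudget {
    var weightsMB: Double
    var kvCacheMB: Double
    var activationsMB: Double
    var runtimeReserveMB: Double

    /// Sum of all four regions at the moment of peak usage.
    var peakMB: Double {
        weightsMB + kvCacheMB + activationsMB + runtimeReserveMB
    }

    /// True when the peak plus an optional safety margin stays within the app's limit.
    func fits(appLimitMB: Double, headroomMB: Double = 0) -> Bool {
        peakMB + headroomMB <= appLimitMB
    }
}

let measured = MemoryBudget(weightsMB: 1850, kvCacheMB: 350,
                            activationsMB: 220, runtimeReserveMB: 180)

print(measured.peakMB)                                   // 2600.0
print(measured.fits(appLimitMB: 3000))                   // true
print(measured.fits(appLimitMB: 3000, headroomMB: 500))  // false (assumed margin)
```

With no margin the configuration fits, and with a margin of a few hundred MB it does not, which is what "right at the edge of stability" looks like in numbers.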

Device-tier profiles I actually ship

Internally I classify devices into three tiers and serve a different model/config to each.

Tier 1: High-end (8 GB+ RAM) — iPhone 15 Pro, Pixel 9 Pro, etc.

  • Model: Gemma 4 4-bit, 4096-token context
  • KV cache: allow the full 4096 tokens
  • Streaming decode throughput: roughly 35 tokens/sec

Tier 2: Mid-range (6 GB RAM) — iPhone 14, Pixel 8, etc.

  • Model: Gemma 4 4-bit, 2048-token context
  • KV cache: capped at 2048, with sliding-window eviction
  • Streaming decode throughput: roughly 22 tokens/sec

Tier 3: Entry-level (4 GB RAM) — iPhone 12, older Androids

  • Model: Gemma 4 3-bit quantization, 1024-token context
  • KV cache: 1024, plus aggressive sliding window
  • Streaming decode throughput: roughly 14 tokens/sec

The mistake many developers make is shipping Tier 1 settings to every device. On entry-level devices the OS then terminates the app within seconds, and once iOS has flushed the model out of memory, the next "open app → ready to infer" cycle takes 15+ seconds. It is better to quietly degrade quality than to serve an unusable experience.

Auto-detecting device tier at startup (Swift)

On iOS you can combine ProcessInfo.processInfo.physicalMemory with activeProcessorCount to infer the tier. Don't rely on device model strings: any lookup table goes stale the moment a new device ships.

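import Foundation

enum DeviceTier {
    case high, mid, entry

    /// Infers the tier from installed RAM and active core count.
    static func detect() -> DeviceTier {
        // physicalMemory is reported in bytes; convert to GiB.
        let ramGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824.0
        let cores = ProcessInfo.processInfo.activeProcessorCount

        // Thresholds sit below the nominal 8 GB / 6 GB marks because the
        // reported value can come in slightly under the marketed capacity.
        if ramGB >= 7.5 && cores >= 6 {
            return .high
        } else if ramGB >= 5.5 {
            return .mid
        } else {
            return .entry
        }
    }

    /// Maximum context length (in tokens) allowed for this tier.
    var contextLength: Int {
        switch self {
        case .high: return 4096
        case .mid: return 2048
        case .entry: return 1024
        }
    }

    /// Name of the bundled model asset to load for this tier.
    var modelAssetName: String {
        switch self {
        case .high, .mid: return "gemma-4-4bit"
        case .entry: return "gemma-4-3bit"
        }
    }
}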

In production I also mix in current thermal state (ProcessInfo.processInfo.thermalState). When thermal state is .serious or .critical, I force Tier 3 settings regardless of hardware — this alone has dramatically reduced both user-reported lag and OS-triggered terminations.
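The thermal override described above could look roughly like the sketch below. The `Tier` enum and the `effectiveTier` function are hypothetical stand-ins so the example is self-contained. `ProcessInfo.ThermalState` and its four cases are the real Foundation API on Apple platforms.

```swift
import Foundation

// Hypothetical stand-in for the app's own tier type.
enum Tier { case high, mid, entry }

/// Downgrades to entry-level settings while the device is running hot.
func effectiveTier(hardware: Tier, thermal: ProcessInfo.ThermalState) -> Tier {
    switch thermal {
    case .serious, .critical:
        return .entry          // Force the lightest config regardless of hardware.
    case .nominal, .fair:
        return hardware
    @unknown default:
        return hardware        // Future states: fall back to the hardware tier.
    }
}

let current = effectiveTier(hardware: .high,
                            thermal: ProcessInfo.processInfo.thermalState)
print(current)
```

Because the thermal state changes over a session, one option is to re-evaluate this when ProcessInfo.thermalStateDidChangeNotification fires rather than only at launch.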

KV cache is where you have to be surgical

Among the four regions, KV cache is the one that grows noticeably during user interaction, so clamping its upper bound has the highest leverage. I combine two techniques.

The first is a strict context-length cap: if the input exceeds the tier's limit, I truncate the oldest tokens. The second is a sliding-window strategy for long-running sessions: once the cache hits the limit, I drop the oldest tokens after a protected prefix until the cache is back down to 75% of the limit, keeping the tokens closest to the current position.

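import MLX

struct SlidingKVCache {
    var keys: [MLXArray]
    var values: [MLXArray]
    let maxLength: Int
    let keepPrefix: Int  // Tokens to preserve for system prompt etc.

    mutating func appendAndTrim(newKeys: MLXArray, newValues: MLXArray) {
        keys.append(newKeys)
        values.append(newValues)

        // Axis 2 is taken to be the sequence (token) axis of each cached chunk.
        let total = keys.reduce(0) { $0 + $1.shape[2] }
        if total > maxLength {
            // Shrink back to 75% of the cap so trimming does not fire on every token.
            let dropCount = total - Int(Double(maxLength) * 0.75)
            // trimFromMiddle (not shown) evicts dropCount tokens right after the protected prefix.
            trimFromMiddle(dropCount: dropCount, preservePrefix: keepPrefix)
        }
    }
}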

The keepPrefix argument matters: if you always protect the first 128 tokens or so from eviction, the system prompt's instructions don't get erased mid-session. Without this, response style drifts 30+ turns in and the app starts feeling like "a different AI."
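To show the eviction rule in isolation, here is a minimal sketch that applies the same policy to a plain Swift array instead of MLX tensors. The `SlidingWindow` type is hypothetical, and a real cache would slice key/value tensors along the sequence axis rather than remove array elements. It assumes keepPrefix is smaller than maxLength.

```swift
struct SlidingWindow<Element> {
    private(set) var items: [Element] = []
    let maxLength: Int
    let keepPrefix: Int

    init(maxLength: Int, keepPrefix: Int) {
        self.maxLength = maxLength
        self.keepPrefix = keepPrefix
    }

    mutating func append(_ item: Element) {
        items.append(item)
        guard items.count > maxLength else { return }

        // Shrink back to 75% of the cap, but never touch the protected prefix.
        let target = Int(Double(maxLength) * 0.75)
        let evictable = items.count - keepPrefix
        let dropCount = min(items.count - target, evictable)
        guard dropCount > 0 else { return }

        // Evict the oldest entries that sit right after the prefix.
        items.removeSubrange(keepPrefix ..< keepPrefix + dropCount)
    }
}

var window = SlidingWindow<Int>(maxLength: 8, keepPrefix: 2)
for token in 0...8 { window.append(token) }
print(window.items)  // [0, 1, 5, 6, 7, 8]
```

The first two entries survive the trim while entries 2 through 4 are evicted. The same property keeps a system prompt intact in a long session.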

Observe memory pressure and respond proactively

Relying solely on iOS's UIApplication.didReceiveMemoryWarningNotification means reacting too late. I take two preventive actions.

1. Use os_signpost to log inference-time memory usage

import Foundation
import os.signpost

let log = OSLog(subsystem: "app.gemma", category: "inference")

/// Emits a signpost event carrying the process's current resident memory.
func trackMemory(label: String) {
    var info = mach_task_basic_info()
    // task_info expects the buffer size as a count of integer_t words.
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size / MemoryLayout<integer_t>.size)
    let result = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    if result == KERN_SUCCESS {
        // resident_size is in bytes; convert to MiB for readability.
        let usedMB = Double(info.resident_size) / 1_048_576.0
        // Swift Strings are passed as objects, so the format uses %@ rather than %s.
        os_signpost(.event, log: log, name: "memory", "%{public}@: %.1f MB", label, usedMB)
    }
}

I call this before inference, after weight load, every 100 decode steps, and at response completion, then inspect the signposts in Instruments. Being able to see which phase peaks allows much more targeted optimization.
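If you also want the per-phase peaks available in code, for example to attach them to a debug report, a small recorder is enough. The sketch below is an assumption on my part rather than part of the logging shown above: `PhasePeaks` is a hypothetical type, and the sample readings are invented purely to exercise it.

```swift
struct PhasePeaks {
    private(set) var peaks: [String: Double] = [:]

    init() {}

    /// Keeps the highest reading seen so far for each phase label.
    mutating func record(phase: String, usedMB: Double) {
        peaks[phase] = max(peaks[phase] ?? 0, usedMB)
    }

    /// The phase with the highest recorded reading, if any were recorded.
    var worstPhase: (phase: String, usedMB: Double)? {
        peaks.max { $0.value < $1.value }
             .map { (phase: $0.key, usedMB: $0.value) }
    }
}

var tracker = PhasePeaks()
// Invented sample readings, not measurements.
tracker.record(phase: "weights-loaded", usedMB: 2050)
tracker.record(phase: "prefill", usedMB: 2900)
tracker.record(phase: "decode", usedMB: 2600)
tracker.record(phase: "decode", usedMB: 2550)

if let worst = tracker.worstPhase {
    print("\(worst.phase): \(worst.usedMB) MB")  // prefill: 2900.0 MB
}
```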

2. Auto-purge the KV cache on memory warnings

import UIKit

// GemmaRunner is the app's own inference wrapper.
NotificationCenter.default.addObserver(
    forName: UIApplication.didReceiveMemoryWarningNotification,
    object: nil,
    queue: .main
) { _ in
    GemmaRunner.shared.pauseInference()
    GemmaRunner.shared.purgeKVCache()
    // Keep weights loaded: reloading them would be much more costly.
}

Purging only the KV cache is much cheaper than unloading the weights (reloading them takes 3-5 seconds), yet it frees several hundred MB immediately, so this alone recovers from most memory warnings. I only drop the weights in the most extreme cases.

Pitfalls I've actually been bitten by

Misreading OS reserve memory. When you embed an ML framework for the first time, the GPU driver and Metal Performance Shaders' default buffers balloon on initial load — often by 300-500 MB. I used to underestimate this until the first real-device profile made it obvious.

Forgetting that the peak occurs during prefill. Gemma 4's memory usage spikes during the initial prefill over the input prompt. With a 4096-token prompt it briefly exceeds steady-state decode usage by several hundred MB. Plan headroom for the prefill moment, not for steady-state inference.

Assuming Debug and Release builds have the same footprint. Memory footprints differ significantly between Debug builds and optimized Release builds. Always measure on a Release build. I once shipped based on Debug-build profiling and then received reports of production devices crashing during inference.

What to try next

If you're already running Gemma 4 on mobile, start by actually measuring on a Tier 3 device. Unless you have peak numbers from an iPhone 12 or a 4 GB Android device, the device-tier design above is hypothetical. Once you have the real numbers in hand, you'll see which region to tune first — and more often than not, the answer is "shrink the context length." That's where I'd start.
