Memory Budget Design for Embedding Gemma 4 in Mobile Apps

The first wall I hit after embedding Gemma 4 into a mobile app was not the model size — it was peak memory. "If it's a 1.8 GB quantized model, surely it fits." Then you run inference on a real device and the OS kills you within seconds. Many developers have had this exact experience.

On mobile, the on-disk file size and the peak runtime footprint are two different things. Only when the instantaneous total of model weights, KV cache, activations, and OS reserve fits within the device's envelope will the app run reliably. This article shares the memory budget design I use when shipping Gemma 4 to mobile, with the measured numbers that back it up.

The four memory regions you must account for

If you decompose what an inference-running app occupies in memory, it roughly splits into four regions.

Model weights: The quantized weight tensors themselves (static).
KV cache: The Key/Value buffers for past tokens (grows linearly with context length).
Activations: Intermediate tensors during inference (determined by layer count and hidden size).
OS / runtime reserve: GPU driver, ML framework, OS itself.

Judging "it fits" based on model weights alone guarantees you'll overrun in the other three regions. Here are my measured numbers (Gemma 4 4-bit quantized, 4096-token context, iOS via MLX).

Model weights: 1.85 GB
KV cache (4096 tokens): about 350 MB
Activations: about 220 MB
OS / runtime reserve: about 180 MB
Peak total: about 2.6 GB

On iPhone 14 (6 GB physical RAM) the app gets roughly 3 GB, so a 2.6 GB peak is right at the edge of stability. On iPhone 12 (4 GB) you have to either shrink the context to 2048 tokens or switch to a lower-bit quantization variant.
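The arithmetic above can be written down as a small budget check. This is a minimal sketch rather than shipping code: the `MemoryBudget` type and its member names are hypothetical, the figures are the measured numbers from this article with 1 GB taken as 1000 MB, and the linear KV scaling is only an approximation based on the cache growing with context length.

```swift
import Foundation

/// Hypothetical helper: sums the four regions and compares them with a per-app limit.
struct MemoryBudget {
    var weightsMB: Double
    var kvCacheMB: Double
    var activationsMB: Double
    var reserveMB: Double

    /// Instantaneous total of all four regions.
    var peakMB: Double { weightsMB + kvCacheMB + activationsMB + reserveMB }

    /// Rescales only the KV cache, assuming it grows linearly with context length.
    /// The other three regions are left unchanged in this sketch.
    func scaled(fromContext old: Int, toContext new: Int) -> MemoryBudget {
        var copy = self
        copy.kvCacheMB = kvCacheMB * Double(new) / Double(old)
        return copy
    }

    /// Positive means the peak fits under the limit; negative means it overruns.
    func headroomMB(appLimitMB: Double) -> Double { appLimitMB - peakMB }
}

// Measured figures from this article (4-bit, 4096-token context).
let measured = MemoryBudget(weightsMB: 1850, kvCacheMB: 350, activationsMB: 220, reserveMB: 180)
let halved = measured.scaled(fromContext: 4096, toContext: 2048)

print(measured.peakMB)                        // 2600.0
print(measured.headroomMB(appLimitMB: 3000))  // 400.0
print(halved.kvCacheMB)                       // 175.0
```

The 3000 MB limit is the rough per-app allocation quoted for iPhone 14; treat it as an input you measure per device, not a constant.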

Device-tier profiles I actually ship

Internally I classify devices into three tiers and serve a different model/config to each.

Tier 1: High-end (8 GB+ RAM) — iPhone 15 Pro, Pixel 9 Pro, etc.

Model: Gemma 4 4-bit, 4096-token context
KV cache: allow the full 4096 tokens
Streaming decode throughput: roughly 35 tokens/sec

Tier 2: Mid-range (6 GB RAM) — iPhone 14, Pixel 6a, etc.

Model: Gemma 4 4-bit, 2048-token context
KV cache: capped at 2048, with sliding-window eviction
Streaming decode throughput: roughly 22 tokens/sec

Tier 3: Entry-level (4 GB RAM) — iPhone 12, older Androids

Model: Gemma 4 3-bit quantization, 1024-token context
KV cache: 1024, plus aggressive sliding window
Streaming decode throughput: roughly 14 tokens/sec

The mistake many developers make is trying to ship Tier 1 settings to all devices. On entry-level devices the OS terminates the app within seconds, and once iOS has flushed the model out of memory, the next "open app → ready to infer" takes 15+ seconds. It is better to quietly degrade quality than to serve an unusable experience.

Auto-detecting device tier at startup (Swift)

On iOS you can combine ProcessInfo.processInfo.physicalMemory with ProcessInfo.processInfo.activeProcessorCount to infer the tier. Don't rely on device model strings; they fall apart the moment a new device ships.

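import Foundation

/// Coarse device classes used to pick a model variant and context length.
enum DeviceTier {
    case high, mid, entry

    /// Infers the tier from physical RAM and core count rather than model strings.
    static func detect() -> DeviceTier {
        // physicalMemory is reported in bytes; convert to GiB.
        let ramGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824.0
        let cores = ProcessInfo.processInfo.activeProcessorCount

        // Thresholds sit slightly below the nominal 8 GB and 6 GB sizes.
        if ramGB >= 7.5 && cores >= 6 {
            return .high
        } else if ramGB >= 5.5 {
            return .mid
        } else {
            return .entry
        }
    }

    /// Maximum context length (in tokens) allowed for this tier.
    var contextLength: Int {
        switch self {
        case .high: return 4096
        case .mid: return 2048
        case .entry: return 1024
        }
    }

    /// Name of the bundled model asset to load for this tier.
    var modelAssetName: String {
        switch self {
        case .high, .mid: return "gemma-4-4bit"
        case .entry: return "gemma-4-3bit"
        }
    }
}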

In production I also mix in current thermal state (ProcessInfo.processInfo.thermalState). When thermal state is .serious or .critical, I force Tier 3 settings regardless of hardware — this alone has dramatically reduced both user-reported lag and OS-triggered terminations.
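The thermal override can be sketched as a pure function layered on top of the hardware tier. This is an illustration under assumptions: `Tier` is a stand-in for the `DeviceTier` enum so the snippet compiles on its own, and `effectiveTier` is a hypothetical name. `ProcessInfo.ThermalState` and its four cases are the real Foundation API.

```swift
import Foundation

// Stand-in for the article's DeviceTier enum so this sketch is self-contained.
enum Tier { case high, mid, entry }

/// Hypothetical helper: forces entry-tier settings while the device is under thermal stress.
func effectiveTier(hardware: Tier, thermal: ProcessInfo.ThermalState) -> Tier {
    switch thermal {
    case .serious, .critical:
        return .entry          // degrade regardless of hardware
    case .nominal, .fair:
        return hardware        // keep the hardware-derived tier
    @unknown default:
        return hardware        // future states: fall back to the hardware tier
    }
}

// Evaluate against the current thermal state at startup.
let startupTier = effectiveTier(hardware: .high,
                                thermal: ProcessInfo.processInfo.thermalState)
```

Because thermal state changes during a session, one option is to re-run this check when ProcessInfo.thermalStateDidChangeNotification fires instead of only at launch.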

KV cache is where you have to be surgical

Among the four regions, KV cache is the one that grows noticeably during user interaction, so clamping its upper bound has the highest leverage. I combine two techniques.

The first is a strict context-length cap: if the input exceeds the tier's limit, I truncate the oldest tokens. The second is a sliding-window strategy for long-running sessions: once the cache exceeds the limit, I evict the oldest tokens until the cache is back down to 75% of the limit, keeping the tokens closest to the current position.

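import MLX

struct SlidingKVCache {
    var keys: [MLXArray]    // One chunk per append
    var values: [MLXArray]
    let maxLength: Int      // Tier's context-length cap, in tokens
    let keepPrefix: Int     // Tokens to preserve for system prompt etc.

    mutating func appendAndTrim(newKeys: MLXArray, newValues: MLXArray) {
        keys.append(newKeys)
        values.append(newValues)

        // Total cached tokens; assumes axis 2 is the sequence dimension.
        let total = keys.reduce(0) { $0 + $1.shape[2] }
        if total > maxLength {
            // Shrink back to 75% of the cap rather than to the cap itself.
            let dropCount = total - Int(Double(maxLength) * 0.75)
            // trimFromMiddle is a helper not shown here: it drops the oldest
            // tokens that come after the protected prefix.
            trimFromMiddle(dropCount: dropCount, preservePrefix: keepPrefix)
        }
    }
}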

The keepPrefix argument matters: if you always protect the first 128 tokens or so from eviction, the system prompt's instructions don't get erased mid-session. Without this, response style drifts 30+ turns in and the app starts feeling like "a different AI."
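Since the trimming helper is not shown, here is one way its index arithmetic could look. It is a sketch over plain token positions, not the actual tensor code: `positionsToKeep` is a hypothetical name, and a real cache would apply the resulting ranges as slices along the sequence axis of each key and value tensor.

```swift
/// Hypothetical sketch: returns the token-position ranges that survive a trim.
/// Keeps the first `keepPrefix` tokens plus the most recent tokens, so that the
/// total after trimming is 75% of `maxLength`.
func positionsToKeep(total: Int, maxLength: Int, keepPrefix: Int) -> [Range<Int>] {
    // Nothing to evict while the cache is within its cap.
    guard total > maxLength else { return [0..<total] }

    let target = Int(Double(maxLength) * 0.75)   // cache size after trimming
    let prefix = min(keepPrefix, target)         // protected system-prompt tokens
    let tailCount = target - prefix              // most recent tokens to keep
    return [0..<prefix, (total - tailCount)..<total]
}

// One token over a 2048 cap, with the first 128 tokens protected.
let kept = positionsToKeep(total: 2049, maxLength: 2048, keepPrefix: 128)
print(kept.map { $0.count })   // [128, 1408]
```

The two ranges add up to 1536 tokens, which is 75% of 2048, and everything between position 128 and position 641 is evicted.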

Observe memory pressure and respond proactively

Relying solely on iOS's didReceiveMemoryWarningNotification is too late. I take two preventive actions.

1. Use os_signpost to log inference-time memory usage

import Foundation
import os.signpost

let log = OSLog(subsystem: "app.gemma", category: "inference")

/// Emits a signpost event with the process's current resident memory.
func trackMemory(label: String) {
    var info = mach_task_basic_info()
    // task_info expects the buffer size as a count of 32-bit words.
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size / MemoryLayout<Int32>.size)
    let result = withUnsafeMutablePointer(to: &info) {
        // Rebind the struct to the integer_t buffer type that task_info takes.
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    if result == KERN_SUCCESS {
        // resident_size is in bytes; convert to MB for readability.
        let usedMB = Double(info.resident_size) / 1_048_576.0
        os_signpost(.event, log: log, name: "memory", "%{public}s: %.1f MB", label, usedMB)
    }
}

I call this before inference, after weight load, every 100 decode steps, and at response completion, then inspect the signposts in Instruments. Being able to see which phase peaks allows much more targeted optimization.
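If you also want the per-phase peaks available in code, for example to attach to a bug report, a small recorder is enough. This is a sketch with hypothetical names (`PhasePeaks`, `record`, `worst`), and the sample values below are made up for illustration, not measurements.

```swift
/// Hypothetical recorder: keeps the highest memory sample seen per phase label.
struct PhasePeaks {
    private(set) var peaksMB: [String: Double] = [:]

    /// Stores the sample only if it exceeds the current peak for that phase.
    mutating func record(phase: String, usedMB: Double) {
        peaksMB[phase] = max(peaksMB[phase] ?? 0, usedMB)
    }

    /// The phase with the largest recorded footprint, or nil if nothing was recorded.
    var worst: (phase: String, usedMB: Double)? {
        peaksMB.max { $0.value < $1.value }
            .map { (phase: $0.key, usedMB: $0.value) }
    }
}

// Illustrative values only.
var peaks = PhasePeaks()
peaks.record(phase: "weights-loaded", usedMB: 1900)
peaks.record(phase: "prefill", usedMB: 2700)
peaks.record(phase: "decode", usedMB: 2500)
peaks.record(phase: "decode", usedMB: 2450)   // lower sample, peak stays at 2500

if let worst = peaks.worst {
    print("\(worst.phase): \(worst.usedMB) MB")   // prefill: 2700.0 MB
}
```

In practice you would call `record` from the same places as `trackMemory`, passing the same label and the `usedMB` value it computes.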

2. Auto-unload on memory warnings

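import UIKit

// Keep the returned token so the observer can be removed later.
let memoryWarningObserver = NotificationCenter.default.addObserver(
    forName: UIApplication.didReceiveMemoryWarningNotification,
    object: nil,
    queue: .main
) { _ in
    // Stop generating first, then release the cheapest-to-rebuild memory.
    GemmaRunner.shared.pauseInference()
    GemmaRunner.shared.purgeKVCache()
    // Keep weights loaded; reloading them would be much more costly.
}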

Purging just the KV cache is much cheaper than unloading the weights (reloading them takes 3-5 seconds), and it still frees several hundred MB immediately, so this alone recovers from most memory warnings. I only drop the weights in the most extreme cases.

Pitfalls I've actually been bitten by

Misreading OS reserve memory. When you embed an ML framework for the first time, the GPU driver and Metal Performance Shaders' default buffers balloon on initial load — often by 300-500 MB. I used to underestimate this until the first real-device profile made it obvious.

Forgetting the peak is during prefill. Gemma 4's peak memory spikes during the initial prefill over the input prompt. With a 4096-token prompt this briefly exceeds steady-state peak by several hundred MB. Plan headroom for the prefill moment, not for steady-state inference.

Assuming Debug and Release builds have the same footprint. Memory footprints differ significantly between Debug builds and Release builds with optimizations on. Always take measurements on a Release build. I once shipped based on Debug-build profiling and then got reports of production devices crashing during inference.

What to try next

If you're already running Gemma 4 on mobile, start by actually measuring on a Tier 3 device. Unless you have peak numbers from an iPhone 12 or a 4 GB Android device, the device-tier design above is hypothetical. Once you have the real numbers in hand, you'll see which region to tune first — and more often than not, the answer is "shrink the context length." That's where I'd start.