Green in the Embedded Emulator, Broken on the First Real Device — Putting a Parity Gate on AI-Generated Compose Apps
AI Studio now generates Kotlin/Compose apps from a prompt, runs them in an embedded emulator, and pushes them to a real device over USB, all from one screen. Yet a screen that passed in the emulator can break the first time it lands on a real phone. As a solo developer running several apps, I describe here how I put a gate in place that catches device parity issues before they ship.
The first thing that broke was a settings screen in one of my wallpaper apps. I asked AI Studio to rearrange the settings into cards, checked it in the embedded emulator, and the spacing and corners looked exactly right. Reassured, I pushed it over USB to a real device — and only there did I notice the bottom card was half-hidden under the navigation bar.
Green in the emulator, red on the device. That combination is the worst one. Now that generation, execution, and device transfer all connect in a single screen, the round trip of checking has become remarkably fast. But what got faster was checking in the emulator, not checking on a device. For my first few weeks I underestimated that gap, and more than once I took the long way around — shipping to internal test and getting the problem reported back by a tester.
This article is about how to put a verification gate on Compose apps generated by AI Studio or Antigravity that gets ahead of the emulator-versus-device gap. The faster generation gets, the more deliberately slow and firm I keep verification. Here is where I draw that line, and the implementation behind it.
Why the emulator passes and the device breaks
Device differences look capricious, but their sources fall into roughly four buckets. Holding that map in advance tells you where to suspect the generated code.
System insets — notch and navigation bar
The first is system insets. Emulators often run with plain gesture navigation and no cutout, so a generated layout that ignores WindowInsets never shows the problem. Real devices have notches, punch holes, three-button nav, and rounded corners, and that is where elements first disappear. My sunken card was exactly this: the generated code never applied navigationBarsPadding().
Font scale — the weakness of fixed dp
The second is fonts and scale. Real users change display size and font scale. A fixed dp height that looks tidy only at the emulator default (scale 1.0) clips its text on a device at fontScale 1.3. AI-generated layouts tend to reach for fixed heights to make the visuals neat, and that is where they are weak.
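To see why a fixed height fails, it helps to run the numbers. The sketch below is plain Kotlin with no Compose dependency, and it assumes linear scaling where 1 sp equals fontScale dp (Android 14 and later scale large text non-linearly, so treat the result as illustrative). The function names and the row dimensions are made up for the example, not taken from any generated code.

```kotlin
// Illustrative only: assumes linear font scaling (1 sp == fontScale dp).
// All names and dimensions here are hypothetical.

/** Height in dp that the text plus padding needs at the given font scale. */
fun requiredHeightDp(
    lineHeightSp: Double,
    lines: Int,
    verticalPaddingDp: Double,
    fontScale: Double,
): Double = lineHeightSp * fontScale * lines + verticalPaddingDp

/** True when a container with a fixed dp height would clip its text. */
fun clips(
    fixedHeightDp: Double,
    lineHeightSp: Double,
    lines: Int,
    verticalPaddingDp: Double,
    fontScale: Double,
): Boolean = requiredHeightDp(lineHeightSp, lines, verticalPaddingDp, fontScale) > fixedHeightDp

fun main() {
    // A two-line row: 20 sp line height, 16 dp total vertical padding, fixed at 56 dp.
    for (scale in listOf(1.0, 1.3, 1.5)) {
        val needed = requiredHeightDp(20.0, 2, 16.0, scale)
        println("fontScale=$scale needs ${needed}dp, clips=${clips(56.0, 20.0, 2, 16.0, scale)}")
    }
}
```

The row fits exactly at scale 1.0 and overflows as soon as the scale rises, which is why I ask for minimum heights rather than fixed ones when text is involved.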
GPU rendering — how effects drop
The third is GPU rendering differences. The emulator uses the host GPU, so effects like blur, graphicsLayer, and RenderEffect come out smooth. On a specific device GPU and OS, the same effect can drop or turn into visible jank. For an app like a wallpaper browser that leans on blur, this gap is not negligible.
Locale and RTL — breaking on the flip
The fourth is locale and RTL. If you only ever check in your own language in the emulator, you never notice where elements flip and break under RTL languages like Arabic. Even for a Japanese-market app, store reviewers and overseas testers may well run it under an RTL locale.
With these four in mind, "check the generated screen on a device" becomes "hunt for device differences along four axes." That finds breakage far faster than staring at the screen vaguely.
Turn the axes into checks that don't disappear
Axes that live in your head slip away on a tired night. I decided to fix the four axes as checks that don't disappear — in code and in previews. The aim is to drop the assumption that a human has to remember them on every generation.
First, insets and font scale get surfaced in Compose previews. Placing extreme-condition previews next to the generated composable puts the breakage in front of you before you ever start the emulator.
```kotlin
// Pin "mean previews" next to the generated code to surface device gaps.
// Raise fontScale, squeeze into a short height, apply RTL.
@Preview(
    name = "Large font + short height",
    fontScale = 1.5f,
    heightDp = 360,
    showBackground = true,
)
@Preview(
    name = "RTL locale",
    locale = "ar",
    showBackground = true,
)
@Composable
private fun SettingsCardStressPreview() {
    AppTheme {
        // Inject a simulated system-bar region to make a missing inset visible.
        Box(Modifier.windowInsetsPadding(WindowInsets.systemBars)) {
            SettingsCardList(items = sampleSettingsItems)
        }
    }
}
```
Just applying fontScale = 1.5f and heightDp = 360 makes text crammed into a fixed dp height visibly clip in the preview. Adding locale = "ar" reveals where asymmetric padding flips and breaks. Right after generation, you can hand it back to the AI — "this preview is broken, make the layout follow fontScale" — and close the loop before opening the emulator.
Missing insets get sealed permanently on the layout side. For a scrolling list, convert the insets into padding so the end of the content never hides under the navigation bar.
```kotlin
// Flow insets into contentPadding so the list tail never sinks behind the system bar.
LazyColumn(
    contentPadding = WindowInsets.systemBars
        .add(WindowInsets(top = 8.dp, bottom = 8.dp))
        .asPaddingValues(),
) {
    items(settingsItems, key = { it.id }) { item ->
        SettingsCard(item)
    }
}
```
This makes the sunken-card accident structurally impossible for lists built this way. The point is not to request this fix after every generation, but to hold it in the template so it is built into the starting point of generation. You confine what you delegate to the AI to a foundation that resists breaking. I strongly recommend hardening this footing before you move into production operation. The pitfall is trusting that a human will recall the same caveats on every generation.
In 2026 the Android CLI reached its 1.0 stable release, letting agents run semantic analysis, render Compose previews, and execute UI tests without opening an IDE. I use it as an automatic gate between generation and human eyes. Simply by taking preview snapshots and diffing them against the last run, the machine catches most of the four-axis breakage before I do.
The idea is this. Bake those mean previews into bitmaps on CI and compare them to baseline images. If the diff exceeds a threshold, drop that generation into the human review queue. Don't trust the emulator's "green" — trust only the rendering diff under fixed conditions.
```bash
#!/usr/bin/env bash
# preview-parity-gate.sh: bake Compose previews and compare to baselines
set -euo pipefail

OUT_DIR="build/preview-shots"
BASELINE_DIR="screenshots/baseline"
THRESHOLD="${PARITY_THRESHOLD:-0.5}"  # max allowed differing-pixel ratio (%)

# Render previews via the Android CLI and write them out as PNGs.
android-cli compose render \
  --module app \
  --previews "com.example.settings.SettingsCardStressPreview" \
  --output "$OUT_DIR"

fail=0
for shot in "$OUT_DIR"/*.png; do
  name="$(basename "$shot")"
  base="$BASELINE_DIR/$name"

  if [ ! -f "$base" ]; then
    echo "⚠️ no baseline: $name (on first run, a human verifies and registers it)"
    fail=1
    continue
  fi

  # Count differing pixels with ImageMagick's AE metric, convert to a ratio.
  # compare exits non-zero when images differ, hence the `|| true`.
  diff_px=$(compare -metric AE "$base" "$shot" null: 2>&1 || true)

  # On a size mismatch compare prints an error instead of a count;
  # treat that as a failure rather than letting it parse as zero.
  case "$diff_px" in
    [0-9]*) ;;
    *) echo "❌ compare failed: $name ($diff_px) -> human review"; fail=1; continue ;;
  esac

  total_px=$(identify -format "%[fx:w*h]" "$base")
  ratio=$(awk -v d="$diff_px" -v t="$total_px" 'BEGIN { printf "%.3f", (d/t)*100 }')
  echo "$name: diff ${ratio}%"

  awk -v r="$ratio" -v th="$THRESHOLD" 'BEGIN { exit (r > th) ? 1 : 0 }' || {
    echo "❌ over threshold: $name (${ratio}% > ${THRESHOLD}%) -> human review"
    fail=1
  }
done

exit $fail
```
With this script as the first checkpoint after generation, every time the AI rebuilds a screen, "unintended visual changes" surface automatically. If the change was intended, update the baseline and let it pass; if it was unintended breakage, stop it on the spot. Moving the entry point of judgment from human to script means you stop missing visual regressions even during overnight generation.
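The comparison at the heart of the gate is small enough to sketch in plain Kotlin, which is handy if you would rather run it inside a JVM test than shell out to ImageMagick. This is a minimal sketch under assumptions: the screenshots are already decoded into same-sized ARGB `IntArray` buffers, and the names `diffPercent` and `passesGate` are hypothetical helpers of my own, not part of any library.

```kotlin
// Sketch of a differing-pixel gate over two same-sized ARGB pixel buffers.
// Hypothetical helper names; decoding PNGs into IntArray is left out.

/** Percentage (0..100) of pixels whose value differs between the two buffers. */
fun diffPercent(baseline: IntArray, candidate: IntArray): Double {
    require(baseline.isNotEmpty()) { "empty image" }
    require(baseline.size == candidate.size) {
        "size mismatch: a resized screenshot should go to human review"
    }
    val differing = baseline.indices.count { baseline[it] != candidate[it] }
    return differing * 100.0 / baseline.size
}

/** True when the screenshot may pass without human review. */
fun passesGate(
    baseline: IntArray,
    candidate: IntArray,
    thresholdPercent: Double = 0.5,
): Boolean = diffPercent(baseline, candidate) <= thresholdPercent

fun main() {
    val baseline = IntArray(1000) { 0xFFFFFF }   // 1000 identical pixels
    val candidate = baseline.copyOf()
    for (i in 0 until 10) candidate[i] = 0x000000 // change 10 of them
    println("diff: ${diffPercent(baseline, candidate)}%")
    println("passes: ${passesGate(baseline, candidate)}")
}
```

An exact per-pixel comparison is strict on purpose. If anti-aliasing noise between render hosts trips it, loosening the threshold is a decision I would rather make explicitly than hide inside a fuzzy comparison.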
GPU-dependent effects can't be caught by previews alone. That part I leave to UI tests on a real device. Run a measurement test from the Android CLI and cap the frame time of screens that lean on blur.
```kotlin
// Gate whether a blur-heavy screen stays smooth on a real device.
@Test
fun blurredWallpaperGrid_staysWithinFrameBudget() {
    benchmarkRule.measureRepeated(
        packageName = "com.example.wallpaper",
        metrics = listOf(FrameTimingMetric()),
        iterations = 5,
        setupBlock = { startActivityAndWait() },
    ) {
        device.findObject(By.res("wallpaper_grid")).fling(Direction.DOWN)
    }
    // Check the P95 of frameDurationCpuMs against a measured baseline.
    // If a generation added blur and introduced jank, this turns red.
}
```
You can't judge the lightness of an effect by eye, so build the gate out of numbers. If a generation adds blur and the P95 frame time crosses the baseline, the machine notices first.
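The budget check itself can be sketched separately from the benchmark. The version below is plain Kotlin and assumes the frame durations have already been exported as a list of milliseconds. The nearest-rank percentile, the 10% tolerance, and the function names are my own illustrative choices, not how the benchmark library computes or reports its P95.

```kotlin
import kotlin.math.ceil

/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
fun percentile(samples: List<Double>, p: Double): Double {
    require(samples.isNotEmpty()) { "no frame samples" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = samples.sorted()
    val rank = ceil(p * sorted.size / 100.0).toInt() // 1-based rank
    return sorted[rank - 1]
}

/** True when P95 stays within the measured baseline plus a tolerance. */
fun withinFrameBudget(
    frameMs: List<Double>,
    baselineP95Ms: Double,
    tolerancePercent: Double = 10.0, // hypothetical allowance for run-to-run noise
): Boolean = percentile(frameMs, 95.0) <= baselineP95Ms * (1 + tolerancePercent / 100.0)

fun main() {
    // Hypothetical frame durations: 18 smooth frames and two slow ones.
    val frames = List(18) { 8.0 } + listOf(30.0, 32.0)
    println("P95 = ${percentile(frames, 95.0)} ms")
    println("within budget: ${withinFrameBudget(frames, baselineP95Ms = 9.0)}")
}
```

Gating on P95 rather than the mean matters here: two slow frames out of twenty barely move the average, but they are exactly what a user feels as a stutter.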
Separate failures an AI may fix from failures a human decides
Once you have a gate, the next thing you want to settle is whether the AI may fix a failure when the gate turns red. I wrote this out as a contract. If this is left vague in an automated pipeline, you get the worst kind of evasion: the AI patches the look just to pass the numbers.
What it may fix are failures with a structurally clear cause: missing insets, fixed heights, layouts that do not follow fontScale, flipped RTL padding. The correct fix for each is well defined, so hand it back to the AI with the diff, regenerate, and run the gate again.
What a human decides are failures that carry a trade-off. When a GPU-dependent effect blows the frame budget, do you drop the blur, lower the resolution, or narrow the target devices? That touches how the work looks, so I decide it myself. How far to defend the app's expression is the last territory I don't want to surrender to automation.
| Cause of a red gate | Handling | Who decides |
| --- | --- | --- |
| Missing inset / fixed height / RTL flip | Regenerate via AI with the diff | Machine returns it; human just confirms the result |
| Clipping from no fontScale follow | AI fixes to a following layout | Machine returns it |
| Frame budget exceeded by GPU effect | Trade-off of expression vs. performance | Human decides |
| Intended diff against baseline | Pass by updating the baseline | Human approves |
With this split, you keep the speed of automated generation while leaving only the final responsibility for the look in your own hands. Fast is generation, firm is verification, non-negotiable is expression. Keeping those three speeds separate is where I landed.
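One way to keep the contract from drifting is to encode it, so the pipeline cannot quietly route a trade-off to the AI. The following is a minimal sketch in plain Kotlin. Every name in it (`GateFailure`, `Handler`, `route`) is hypothetical, and the categories simply mirror the contract described above.

```kotlin
// Hypothetical names throughout; the categories mirror the contract above.
enum class GateFailure {
    MISSING_INSET,
    FIXED_HEIGHT,
    RTL_FLIP,
    FONT_SCALE_CLIPPING,
    FRAME_BUDGET_EXCEEDED,
    INTENDED_BASELINE_DIFF,
}

enum class Handler { AI_REGENERATES, HUMAN_DECIDES, HUMAN_APPROVES_BASELINE }

/** Routes a red gate to whoever is allowed to resolve it. */
fun route(failure: GateFailure): Handler = when (failure) {
    // Structurally clear causes: the correct fix is determined.
    GateFailure.MISSING_INSET,
    GateFailure.FIXED_HEIGHT,
    GateFailure.RTL_FLIP,
    GateFailure.FONT_SCALE_CLIPPING -> Handler.AI_REGENERATES
    // Expression versus performance is a trade-off, so it stays with a human.
    GateFailure.FRAME_BUDGET_EXCEEDED -> Handler.HUMAN_DECIDES
    // An intended visual change passes only once a human updates the baseline.
    GateFailure.INTENDED_BASELINE_DIFF -> Handler.HUMAN_APPROVES_BASELINE
}

fun main() {
    GateFailure.values().forEach { println("$it -> ${route(it)}") }
}
```

Because the `when` has no `else` branch, adding a new failure category will not compile until someone decides who owns it, which is the property I want from a contract.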
Narrow down which real devices you verify on
Sealing device gaps doesn't mean checking on every phone you own every time. I deliberately narrow my verification devices toward "the phones my users actually use."
The basis for that is store statistics. The device breakdown in Google Play Console shows which screen sizes, OS versions, and manufacturers dominate among your users. For my wallpaper app, a handful of top models accounted for most usage, so I concentrated verification on those few. Weighting the high-probability cases heavily, rather than aiming for full coverage, suits a one-person operation better.
Combined with the four axes, the picture of which devices to pick comes into focus: one phone with a cutout to see inset differences, one with display size turned up for font scale, one budget device where blur tends to get heavy for GPU. A few deliberately biased phones pick up far more breakage than staring at a single brand-new high-end device.
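The usage-share half of that selection is mechanical enough to sketch. The code below is plain Kotlin and assumes you have copied per-model usage shares out of your store statistics by hand. The `DeviceShare` type, the `pickDevices` function, and all numbers are made up for illustration. It covers only the usage weighting; the axis-biased picks (cutout, large font, budget GPU) remain a manual choice.

```kotlin
// Hypothetical types and data; shares would come from your own store statistics.
data class DeviceShare(val model: String, val sharePercent: Double)

/** Picks the most-used models until their combined share reaches the target coverage. */
fun pickDevices(stats: List<DeviceShare>, targetCoveragePercent: Double): List<DeviceShare> {
    val picked = mutableListOf<DeviceShare>()
    var covered = 0.0
    for (device in stats.sortedByDescending { it.sharePercent }) {
        if (covered >= targetCoveragePercent) break
        picked += device
        covered += device.sharePercent
    }
    return picked
}

fun main() {
    // Made-up numbers for illustration, not real usage data.
    val stats = listOf(
        DeviceShare("Model D", 5.0),
        DeviceShare("Model A", 30.0),
        DeviceShare("Model C", 15.0),
        DeviceShare("Model B", 25.0),
        DeviceShare("Model E", 2.0),
    )
    println(pickDevices(stats, targetCoveragePercent = 60.0).map { it.model })
}
```

With these sample numbers, three models already cover the 60% target, which matches my experience that the long tail is not where a solo developer's verification time belongs.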
Build firmness ahead, on a footing of speed
What AI Studio connected in a single screen was the round-trip cost of reversible work. That speed is real, and it genuinely changed the tempo of my making. Which is exactly why I place one deliberately firm gate past the steps that got faster. Don't let the emulator's green be the final verdict; get ahead of the four-axis device gap with previews and on-device tests.
If you want to try it next, add a single set of mean previews to the one screen you most remember breaking. Just applying fontScale, a short height, and RTL puts the weakness — invisible in the emulator — in front of you before you run another generation. Harden the gate one notch at a time from there, and the speed of generation and the expression of the work can, I find, coexist.