When CLI, Desktop, and SDK Share One Agent Harness: Designing for Consistent Behavior
Now that Antigravity's CLI, desktop, and SDK share one agent harness, here is how to separate what stays consistent from what differs by environment, and how to align behavior with smoke tests and a version-tracking habit.
When Antigravity CLI consolidated as the successor to the Gemini CLI, one line in the official explanation caught my attention: because Antigravity CLI shares the same agent harness as the desktop version, improvements to the core agent automatically reach every surface.
That is convenient, but from a designer's seat it is also a change that calls for new care. As an indie developer, I often move a workflow I prototyped interactively on the desktop onto scheduled CLI runs, so I needed to be clear about how far I can expect identical behavior. Here is how I think about aligning the behavior of the CLI, desktop, and SDK in this shared-harness era.
What "sharing one harness" means
An agent harness is, roughly, the core behavior of how an agent reasons, how it calls tools, and how it judges results. The crux of this change is that this core becomes common across the CLI, desktop, and SDK.
So the quirks of agent judgment you observe on the desktop will, in principle, show up the same way in the CLI. Put the other way, you can carry behavior confirmed on the desktop into automated CLI runs to a meaningful degree.
But "in principle" is the catch. What is shared is the core behavior, not the environment around it. Conflate the two and you get the classic "it worked on the desktop but fails in the CLI."
The benefit: one improvement reaches every surface
The biggest benefit of a shared harness is that core improvements arrive on every surface automatically. The old split where "only the CLI is stuck on outdated behavior" becomes far less likely.
In my operation this pays off when I take a prompt structure I refined interactively and move it straight into a scheduled run. I confirm the agent's response tendencies on the desktop and carry those learnings into the CLI scheduled task. When the core is common, less is lost in that transfer.
Even when the surface for learning differs from the surface for production, a shared foundation keeps knowledge from going to waste. For indie work, where one person shuttles between experiment and operation, that is a welcome property.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦How to separate what a shared harness keeps consistent from what environment differences break
✦A lightweight smoke test that checks behavioral parity across surfaces
✦A habit for tracking versions when core improvements auto-propagate to every surface
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
The caution: testing on the assumption of sameness will burn you
On the other hand, over-trusting the shared harness will trip you up. Only the core behavior is aligned; the input/output path, authentication, and execution environment differ by surface.
Here are pitfalls I actually hit. An operation that asks for interactive approval on the desktop hangs waiting for approval in a non-interactive CLI run. A session that is logged in on the desktop gets rejected in the CLI because the auth token has expired. The agent's judgment is the same, but the foundation that runs it differs, so it stalls in production.
This is why skipping tests on the logic that "the core is the same, so every surface should behave the same" is dangerous. Trust the parts that align; verify per surface the parts that do not. That separation is the heart of test design in the shared-harness era.
What differs by surface is the environment, not the agent
Laying out what aligns and what does not lets you aim your tests.
Aspect
Aligned?
Verification needed?
Agent judgment and tool-call tendencies
Generally aligned
Per-surface checks can be light
Presence of interactive approval
Differs by surface
Must confirm it does not hang in non-interactive runs
Auth and token lifetime
Differs by surface
Must confirm expiry behavior under unattended runs
Execution environment and available tools
Differs by surface
Verify in the same environment as production
The "differs by surface" rows on the right are what tests should focus on. Precisely because the core is assumed common, you can concentrate your verification effort on the parts that differ by environment.
Design for aligning behavior
There are two concrete ways to align. One is to specify things explicitly rather than rely on implicit defaults. The other is to keep a cross-surface smoke test.
Explicit specification is simple. For items whose defaults likely differ by surface — model name, interactive mode, output format — state them all in the command or config. Leave them to defaults and the moment the surface changes, the default changes too, and behavior drifts.
The smoke test feeds the same input to each surface and checks a minimal sameness.
# Run the same minimal task on CLI and desktop (headless), compare the skeletonTASK="content/_fixtures/smoke_task.json"# Make non-interactive / unattended explicit to detect hangsagy run --non-interactive --no-prompt --task "$TASK" --out /tmp/smoke_cli.jsonagy desktop --headless --task "$TASK" --out /tmp/smoke_desktop.json# Compare only the success flag and the presence of an artifact (not full text)for f in /tmp/smoke_cli.json /tmp/smoke_desktop.json; do python3 -c "import json; d=json.load(open('$f')); print('ok' if d.get('status')=='success' and d.get('artifact') else 'FAIL', '$f')"done
Again, I do not demand a full-text match. I check only two things: did it complete without hanging in non-interactive mode, and does the artifact skeleton have the same shape on both surfaces. Since the core is shared, if this minimal check passes, I trust the remaining details.
Making version tracking a habit
With a shared harness, core improvements arrive on every surface automatically. That is a benefit, but the flip side is that there are moments when behavior changes even though you changed nothing.
I guard against this with two habits. One is to always wedge the smoke test above ahead of the scheduled run, so that if a core update shifts behavior, I catch the anomaly before production. The other is to diff config files and defaults at each update milestone.
There is no need to stop the updates themselves. The whole benefit of a shared harness is that the core keeps getting better, so refusing to follow along is lost opportunity. Rather than stopping, put a mechanism in place that absorbs change just before production. That, I think, is the safe way to live with updates that descend automatically.
The trend of CLI, desktop, and SDK sharing one foundation will likely go further. Trust the aligned core and shuttle knowledge across surfaces, verify the unaligned environment differences per surface, and catch change with a smoke test just before production. Make those three a habit and the shared harness becomes not a burden but a dependable foundation for indie work that loops between experiment and operation. If you also move between several surfaces, I hope this helps.
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.