When CLI, Desktop, and SDK Share One Agent Harness: Designing for Consistent Behavior

Now that Antigravity's CLI, desktop, and SDK share one agent harness, here is how to separate what stays consistent from what differs by environment, and how to align behavior with smoke tests and a version-tracking habit.

antigravity³⁸⁰ cli⁴ sdk⁵ testing¹² architecture¹⁷

✦ Premium Article

When Antigravity CLI consolidated as the successor to the Gemini CLI, one line in the official explanation caught my attention: because Antigravity CLI shares the same agent harness as the desktop version, improvements to the core agent automatically reach every surface.

That is convenient, but from a designer's seat it is also a change that calls for new care. As an indie developer, I often move a workflow I prototyped interactively on the desktop onto scheduled CLI runs, so I needed to be clear about how far I can expect identical behavior. Here is how I think about aligning the behavior of the CLI, desktop, and SDK in this shared-harness era.

What "sharing one harness" means

An agent harness is, roughly, the core behavior of how an agent reasons, how it calls tools, and how it judges results. The crux of this change is that this core becomes common across the CLI, desktop, and SDK.

So the quirks of agent judgment you observe on the desktop will, in principle, show up the same way in the CLI. Put the other way, you can carry behavior confirmed on the desktop into automated CLI runs to a meaningful degree.

But "in principle" is the catch. What is shared is the core behavior, not the environment around it. Conflate the two and you get the classic "it worked on the desktop but fails in the CLI."

The benefit: one improvement reaches every surface

The biggest benefit of a shared harness is that core improvements arrive on every surface automatically. The old split where "only the CLI is stuck on outdated behavior" becomes far less likely.

In my operation this pays off when I take a prompt structure I refined interactively and move it straight into a scheduled run. I confirm the agent's response tendencies on the desktop and carry those learnings into the CLI scheduled task. When the core is common, less is lost in that transfer.

Even when the surface for learning differs from the surface for production, a shared foundation keeps knowledge from going to waste. For indie work, where one person shuttles between experiment and operation, that is a welcome property.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦How to separate what a shared harness keeps consistent from what environment differences break

✦A lightweight smoke test that checks behavioral parity across surfaces

✦A habit for tracking versions when core improvements auto-propagate to every surface

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

The caution: testing on the assumption of sameness will burn you

On the other hand, over-trusting the shared harness will trip you up. Only the core behavior is aligned; the input/output path, authentication, and execution environment differ by surface.

Here are pitfalls I actually hit. An operation that asks for interactive approval on the desktop hangs waiting for approval in a non-interactive CLI run. A session that is logged in on the desktop gets rejected in the CLI because the auth token has expired. The agent's judgment is the same, but the foundation that runs it differs, so it stalls in production.

This is why skipping tests on the logic that "the core is the same, so every surface should behave the same" is dangerous. Trust the parts that align; verify per surface the parts that do not. That separation is the heart of test design in the shared-harness era.

What differs by surface is the environment, not the agent

Laying out what aligns and what does not lets you aim your tests.

Aspect	Aligned?	Verification needed?
Agent judgment and tool-call tendencies	Generally aligned	Per-surface checks can be light
Presence of interactive approval	Differs by surface	Must confirm it does not hang in non-interactive runs
Auth and token lifetime	Differs by surface	Must confirm expiry behavior under unattended runs
Execution environment and available tools	Differs by surface	Verify in the same environment as production

The "differs by surface" rows on the right are what tests should focus on. Precisely because the core is assumed common, you can concentrate your verification effort on the parts that differ by environment.

Design for aligning behavior

There are two concrete ways to align. One is to specify things explicitly rather than rely on implicit defaults. The other is to keep a cross-surface smoke test.

Explicit specification is simple. For items whose defaults likely differ by surface — model name, interactive mode, output format — state them all in the command or config. Leave them to defaults and the moment the surface changes, the default changes too, and behavior drifts.

The smoke test feeds the same input to each surface and checks a minimal sameness.

# Run the same minimal task on CLI and desktop (headless), compare the skeleton
TASK="content/_fixtures/smoke_task.json"
 
# Make non-interactive / unattended explicit to detect hangs
agy run --non-interactive --no-prompt --task "$TASK" --out /tmp/smoke_cli.json
agy desktop --headless --task "$TASK" --out /tmp/smoke_desktop.json
 
# Compare only the success flag and the presence of an artifact (not full text)
for f in /tmp/smoke_cli.json /tmp/smoke_desktop.json; do
  python3 -c "import json; d=json.load(open('$f')); print('ok' if d.get('status')=='success' and d.get('artifact') else 'FAIL', '$f')"
done

Again, I do not demand a full-text match. I check only two things: did it complete without hanging in non-interactive mode, and does the artifact skeleton have the same shape on both surfaces. Since the core is shared, if this minimal check passes, I trust the remaining details.

Making version tracking a habit

With a shared harness, core improvements arrive on every surface automatically. That is a benefit, but the flip side is that there are moments when behavior changes even though you changed nothing.

I guard against this with two habits. One is to always wedge the smoke test above ahead of the scheduled run, so that if a core update shifts behavior, I catch the anomaly before production. The other is to diff config files and defaults at each update milestone.

There is no need to stop the updates themselves. The whole benefit of a shared harness is that the core keeps getting better, so refusing to follow along is lost opportunity. Rather than stopping, put a mechanism in place that absorbs change just before production. That, I think, is the safe way to live with updates that descend automatically.

The trend of CLI, desktop, and SDK sharing one foundation will likely go further. Trust the aligned core and shuttle knowledge across surfaces, verify the unaligned environment differences per surface, and catch change with a smoke test just before production. Make those three a habit and the shared harness becomes not a burden but a dependable foundation for indie work that loops between experiment and operation. If you also move between several surfaces, I hope this helps.

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.