Antigravity Lab
Agents & Manager · 2026-06-17 · Advanced

After Generating Several Candidates, Which One Do You Adopt? Designing Best-of-N That Arbitrates by Verification

With Gemini 3.5 Flash's speed, generating several implementations of the same task has become practical. The hard part is no longer generation but arbitration. Here is the design and TypeScript implementation of a Best-of-N arbiter that picks the winner using verifiable signals only — not majority vote, not self-reported confidence.

Tags: antigravity, multi-agent, best-of-n, arbiter, verification, Gemini 3.5 Flash, operations design


I once handed the same fix to an Antigravity agent three times and got three different pieces of code back. All three looked plausible, yet one broke the tests, one failed type checking, and only the last actually ran. The problem was that a human had to read all of them every time to spot which one worked. Review took fifteen minutes, and the work I had supposedly automated quietly turned back into manual labor. For an indie developer running several Dolice sites alone, this step, where a human has to spot the right one every time, is the single biggest bottleneck to automation. Until I redesigned it, I spent every morning inspecting the output of the overnight batch.

On June 18, Gemini CLI merges into Antigravity CLI, with Gemini 3.5 Flash as the engine. Flash is reported to be several times faster than the larger models on comparable tasks, which makes generating three or five candidates in parallel for a single problem affordable in both cost and time. In other words, the center of gravity shifts from "trust one answer" to "choose among several."

The protagonist here is not the generator but the arbiter. How do you compare multiple candidates, and on what grounds do you narrow them to one? This article focuses solely on that design.

Why Majority Vote and Self-Reporting Cannot Be Trusted

The first idea is majority vote: adopt the answer that appears most often. But LLM errors are not independent. Generate from the same prompt and the same model, and the mistakes come out aligned too. If two of three candidates share the same bug, majority vote elects that bug as the "right answer."
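The failure mode is easy to reproduce on a toy problem. The following is a minimal sketch, not the article's implementation: three hypothetical candidates for "sum an array", two of which share the same off-by-one bug. All names (`Candidate`, `majorityVote`, `verified`) are made up for illustration.

```typescript
// Hypothetical candidates: "a" and "b" share a bug (they skip the first element).
type Candidate = { id: string; impl: (xs: number[]) => number };

const candidates: Candidate[] = [
  { id: "a", impl: (xs) => xs.slice(1).reduce((s, x) => s + x, 0) },
  {
    id: "b",
    impl: (xs) => {
      let s = 0;
      for (const x of xs.slice(1)) s += x;
      return s;
    },
  },
  // Only "c" is correct.
  { id: "c", impl: (xs) => xs.reduce((s, x) => s + x, 0) },
];

// Majority vote: group candidates by the output they produce, return the largest group.
function majorityVote(cs: Candidate[], input: number[]): string[] {
  const groups = new Map<number, string[]>();
  for (const c of cs) {
    const out = c.impl(input);
    groups.set(out, [...(groups.get(out) ?? []), c.id]);
  }
  const sorted = [...groups.values()].sort((x, y) => y.length - x.length);
  return sorted[0] ?? [];
}

// Verification: keep only candidates that pass a check independent of the candidates.
function verified(cs: Candidate[], input: number[], expected: number): string[] {
  return cs.filter((c) => c.impl(input) === expected).map((c) => c.id);
}

const voteWinners = majorityVote(candidates, [1, 2, 3]); // the two buggy candidates
const verifiedWinners = verified(candidates, [1, 2, 3], 6); // only the correct one
```

On the input `[1, 2, 3]` the vote selects the two candidates that agree on the wrong answer, while the independent check keeps only the correct one.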

The next idea is to ask the model itself for a confidence score. This is just as fragile. Self-reported confidence hardly correlates with actual correctness at all. A fluent, assertive wrong answer can even come back with higher confidence than a correct one.

The adoption criterion must be a verifiable signal that is independent of the generating model. For code: does it type-check, are the tests green, does it actually start? These are objective facts, unmoved by the model's mood. The arbiter's whole job is to run candidates through these facts and pick the survivor.
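One way to express "run candidates through these facts" is an ordered list of gates that stops at the first failure, so a candidate that does not type-check never reaches the test run. The sketch below assumes synchronous gates for brevity; `Gate`, `GateResult`, and `runGates` are hypothetical names, and the stub gates stand in for real checks that would typically spawn the project's type checker, test runner, and a start-up check.

```typescript
// Sketch only: these types are illustrative, not the article's implementation.
type GateResult = { gate: string; passed: boolean; detail: string };

interface Gate {
  name: string;
  // Returns true when the candidate checked out at `dir` passes this check.
  run(dir: string): boolean;
}

// Run gates in the given order and stop at the first failure.
// A gate that throws is recorded as a failure, with the error message as detail.
function runGates(dir: string, gates: Gate[]): GateResult[] {
  const results: GateResult[] = [];
  for (const gate of gates) {
    let passed = false;
    let detail = "";
    try {
      passed = gate.run(dir);
    } catch (err) {
      detail = err instanceof Error ? err.message : String(err);
    }
    results.push({ gate: gate.name, passed, detail });
    if (!passed) break; // later gates are skipped for this candidate
  }
  return results;
}

// Usage with stub gates (assumed outcomes, not real tool invocations).
const stubGates: Gate[] = [
  { name: "typecheck", run: () => true },
  { name: "tests", run: () => false },
  { name: "start", run: () => true },
];
const gateResults = runGates("/path/to/candidate", stubGates);
// gateResults holds two entries: typecheck passed, tests failed; "start" never ran.
```

Ordering the gates from cheapest to most expensive is a design choice rather than a requirement: it keeps the cost of rejecting a weak candidate low.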

The Overall Design

It helps to split the pipeline into three stages.

| Stage | Responsibility | Independence |
|---|---|---|
| Generation | Generate N candidates in parallel from one spec | Model-dependent |
| Verification | Run each candidate through objective gates and score it | Independent of the model |
| Arbitration | Decide on one adoption from score and budget | Rule-based |

The crucial point is to fully decouple verification and arbitration from generation. Whatever model the generator uses, however many candidates it produces, the verification and arbitration code stays the same. Hold that boundary and the evaluation criteria remain stable even when the model is swapped out.
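The boundary can be made concrete with types. In the sketch below, every name (`Generator`, `Verdict`, `Decision`, `arbitrate`) is hypothetical, and the tie-break rule (lower verification cost, then id) is an assumption chosen for determinism, not a rule taken from this article. The point is only that the arbiter consumes verdicts and never sees the model or the prompt.

```typescript
// Illustrative types for the three-stage split; not the article's implementation.
interface Candidate { id: string; code: string }

// Stage 1, model-dependent: any model can sit behind this interface.
interface Generator {
  generate(spec: string, n: number): Promise<Candidate[]>;
}

// Stage 2 output, model-independent: score here means "number of gates passed".
interface Verdict { candidateId: string; passedAll: boolean; score: number; costMs: number }

// Stage 3 output: either one adopted candidate or an explicit rejection.
type Decision =
  | { kind: "adopt"; candidateId: string }
  | { kind: "reject"; reason: string };

// Negative when `a` should be preferred over `b`.
function compareVerdicts(a: Verdict, b: Verdict): number {
  return (
    b.score - a.score ||                          // higher score first
    a.costMs - b.costMs ||                        // assumed tie-break: cheaper verification
    a.candidateId.localeCompare(b.candidateId)    // final tie-break: stable by id
  );
}

// Rule-based arbitration over verdicts only.
function arbitrate(verdicts: Verdict[]): Decision {
  const survivors = verdicts.filter((v) => v.passedAll);
  if (survivors.length === 0) {
    // Adopting nothing is a valid outcome; the caller decides whether to retry.
    return { kind: "reject", reason: "no candidate passed every gate" };
  }
  const best = survivors.reduce((acc, v) => (compareVerdicts(v, acc) < 0 ? v : acc));
  return { kind: "adopt", candidateId: best.candidateId };
}

// Usage: "c" scores highest but failed a gate, so it is never eligible.
const decision = arbitrate([
  { candidateId: "a", passedAll: true, score: 3, costMs: 100 },
  { candidateId: "b", passedAll: true, score: 3, costMs: 50 },
  { candidateId: "c", passedAll: false, score: 5, costMs: 10 },
]);
```

Because `arbitrate` depends only on `Verdict`, swapping the model behind `Generator` leaves this function and its tests untouched.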

WHAT YOU'LL LEARN

  • Get the full Arbiter implementation (TypeScript) that judges candidates by verifiable signals rather than majority vote or self-reporting
  • Learn the staged-gate evaluation order that runs type checks, tests, and smoke runs so weak candidates drop out early and cost stays bounded
  • Understand the degradation strategy for when every candidate fails, when there is a tie, and when the budget runs out, plus the pitfalls hit in production