After Generating Several Candidates, Which One Do You Adopt? Designing Best-of-N That Arbitrates by Verification
With Gemini 3.5 Flash's speed, generating several implementations of the same task has become practical. The hard part is no longer generation but arbitration. Here is the design and TypeScript implementation of a Best-of-N arbiter that picks the winner using verifiable signals only — not majority vote, not self-reported confidence.
I once handed the same fix to an Antigravity agent three times and got three different pieces of code back. All three looked plausible, yet one broke the tests, one failed type checking, and only the last actually ran. The problem was that a human had to read all of them every time to spot which one worked. Review took fifteen minutes, and the work I had supposedly automated quietly turned back into manual labor. As an indie developer running several Dolice sites alone, this "a human has to spot the right one every time" step is the single biggest bottleneck to automation. Until I redesigned it, I spent every morning inspecting the output of the overnight batch.
On June 18, Gemini CLI merges into Antigravity CLI, with Gemini 3.5 Flash as the engine. Flash is reported to be several times faster than the larger models on comparable tasks, which makes generating three or five candidates in parallel for a single problem affordable in both cost and time. In other words, the center of gravity shifts from "trust one answer" to "choose among several."
The protagonist here is not the generator but the arbiter. How do you compare multiple candidates, and on what grounds do you narrow them to one? This article focuses solely on that design.
Why Majority Vote and Self-Reporting Cannot Be Trusted
The first idea is majority vote: adopt the answer that appears most often. But LLM errors are not independent. Generate from the same prompt and the same model, and the mistakes come out aligned too. If two of three candidates share the same bug, majority vote elects that bug as the "right answer."
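To make this failure mode concrete, here is a minimal sketch. The task, the three candidate implementations, and the `majorityVote` helper are all hypothetical, made up for illustration. Two candidates share the same edge-case bug, as correlated samples from one prompt might, and the vote elects it while a single objective check does not.

```typescript
// Hypothetical task: return the last `n` items of an array.
type Impl = (xs: number[], n: number) => number[];

const candidates: Record<string, Impl> = {
  a: (xs, n) => xs.slice(-n), // bug: slice(-0) is slice(0), the whole array
  b: (xs, n) => xs.slice(-n, xs.length), // same bug in different clothes
  c: (xs, n) => xs.slice(Math.max(0, xs.length - n)), // handles n = 0
};

// Group candidate ids by identical output and return the largest group.
function majorityVote(outputs: Record<string, string>): string[] {
  const groups = new Map<string, string[]>();
  for (const [id, out] of Object.entries(outputs)) {
    groups.set(out, [...(groups.get(out) ?? []), id]);
  }
  return [...groups.values()].sort((x, y) => y.length - x.length)[0];
}

// One objective probe with a known expected answer.
const probe = { xs: [1, 2, 3], n: 0, expected: "[]" };

const outputs: Record<string, string> = {};
for (const [id, impl] of Object.entries(candidates)) {
  outputs[id] = JSON.stringify(impl(probe.xs, probe.n));
}

const votedIds = majorityVote(outputs); // the two buggy candidates agree with each other
const verifiedIds = Object.keys(outputs).filter((id) => outputs[id] === probe.expected);

console.log({ votedIds, verifiedIds });
```

The vote returns `a` and `b` because they agree with each other, not because they are right. The probe keeps only `c`.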
The next idea is to ask the model itself for a confidence score. This is just as fragile. Self-reported confidence barely correlates with actual correctness, and a fluent, assertive wrong answer can even come back with higher confidence.
The adoption criterion must be a verifiable signal that is independent of the generating model. For code: does it type-check, are the tests green, does it actually start? These are objective facts, unmoved by the model's mood. The arbiter's whole job is to run candidates through these facts and pick the survivor.
The Overall Design
It helps to split the pipeline into three stages.
| Stage | Responsibility | Independence |
| --- | --- | --- |
| Generation | Generate N candidates in parallel from one spec | Model-dependent |
| Verification | Run each candidate through objective gates and score it | Independent of the model |
| Arbitration | Pick one candidate to adopt based on score and budget | Rule-based |
The crucial point is to fully decouple verification and arbitration from generation. Whatever model the generator uses, however many candidates it produces, the verification and arbitration code stays the same. Hold that boundary and the evaluation criteria remain stable even when the model is swapped out.
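One way to picture that boundary is as three function types wired together. This is a sketch under assumptions: `Spec`, `Patch`, `Scored`, `bestOfN`, and the two stub generators are hypothetical names, and the stub verifier only inspects marker strings so the example can run without a real toolchain.

```typescript
type Spec = { task: string };
type Patch = { id: string; diff: string };
type Scored = { id: string; score: number; eliminated: boolean };

// Stage 1: model-dependent. The only stage that changes when the model is swapped.
type Generate = (spec: Spec, n: number) => Promise<Patch[]>;
// Stage 2: independent of the model. Objective gates only.
type Verify = (patch: Patch) => Promise<Scored>;
// Stage 3: rule-based. Sees scores, never the model.
type Arbitrate = (scored: Scored[]) => string | null;

async function bestOfN(
  spec: Spec,
  n: number,
  generate: Generate,
  verify: Verify,
  arbitrate: Arbitrate,
): Promise<string | null> {
  const patches = await generate(spec, n);
  const scored: Scored[] = [];
  // Verify one candidate at a time so candidates never share state.
  for (const p of patches) scored.push(await verify(p));
  return arbitrate(scored);
}

// Stub verifier (assumption): marker strings stand in for real gate results.
const verify: Verify = async (p) => ({
  id: p.id,
  eliminated: p.diff.includes("SYNTAX_ERROR"),
  score: p.diff.includes("FAILING_TEST") ? 70 : 100,
});

// Highest score among survivors; the earliest candidate wins a tie.
const arbitrate: Arbitrate = (scored) => {
  let best: Scored | null = null;
  for (const s of scored) {
    if (s.eliminated) continue;
    if (best === null || s.score > best.score) best = s;
  }
  return best === null ? null : best.id;
};

// Two stand-in "models". Swapping them touches nothing below stage 1.
const modelA: Generate = async (_spec, n) =>
  Array.from({ length: n }, (_, i) => ({ id: `A${i}`, diff: i === 0 ? "SYNTAX_ERROR" : "ok" }));
const modelB: Generate = async (_spec, n) =>
  Array.from({ length: n }, (_, i) => ({ id: `B${i}`, diff: i < 2 ? "FAILING_TEST" : "ok" }));
```

Calling `bestOfN(spec, 3, modelA, verify, arbitrate)` and then the same call with `modelB` reuses `verify` and `arbitrate` unchanged, which is the property the design is after.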
Running every gate on every candidate is wasteful. There is no point running tests on a candidate that fails type checking. Order the gates from cheapest to most expensive and eliminate failing candidates where they fall.
| Order | Gate | Rough cost | Decision |
| --- | --- | --- | --- |
| 1 | Static checks (lint / types) | hundreds of ms | Eliminate immediately on failure |
| 2 | Relevant tests (impacted scope only) | seconds | Deduct score by failure count |
| 3 | Smoke run (startup / main paths) | tens of seconds | Deduct score on exception |
With this order, obviously broken candidates disappear before reaching tests or the smoke run. Even if you generate five candidates with Flash, often only one or two actually reach the smoke stage, so total verification cost does not scale linearly with candidate count.
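A little arithmetic shows why staged gates keep cost from growing in step with N. The per-gate costs and the survivor counts below are illustrative assumptions picked to sit inside the rough ranges in the table, not measurements. `stagedCostMs` and `exhaustiveCostMs` are hypothetical helpers.

```typescript
type GateCost = { name: string; costMs: number };

// Assumed costs, cheapest gate first.
const gates: GateCost[] = [
  { name: "static", costMs: 300 },
  { name: "tests", costMs: 3000 },
  { name: "smoke", costMs: 10000 },
];

// reached[i] = how many candidates get as far as gate i.
function stagedCostMs(gateCosts: GateCost[], reached: number[]): number {
  return gateCosts.reduce((sum, g, i) => sum + g.costMs * (reached[i] ?? 0), 0);
}

// Every gate on every candidate, with no early elimination.
function exhaustiveCostMs(gateCosts: GateCost[], n: number): number {
  return n * gateCosts.reduce((sum, g) => sum + g.costMs, 0);
}

// Assumption: 5 candidates, 3 are eliminated by the static gate, 2 go on to tests and smoke.
const staged = stagedCostMs(gates, [5, 2, 2]);
const exhaustive = exhaustiveCostMs(gates, 5);

console.log(`staged: ${staged} ms, exhaustive: ${exhaustive} ms`);
```

Under these assumptions the staged run costs 27,500 ms against 66,500 ms for running everything, and most of the saving comes from the candidates that never reach the smoke run.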
Implementing the Arbiter
This is the core: run candidates through gates, score them, and return an adoption. Shown in TypeScript. Swap the verification functions for whatever your environment uses (type-check command, test runner).
```typescript
type Candidate = {
  id: string;
  apply: () => Promise<void>; // reflect the candidate into the worktree
  revert: () => Promise<void>; // roll the reflection back
};

type GateResult = { passed: boolean; penalty: number; note: string };

type Gate = {
  name: string;
  run: () => Promise<GateResult>;
  fatal: boolean; // true: eliminate on failure / false: deduct only
};

type Verdict =
  | { kind: "adopt"; id: string; score: number; trail: string[] }
  | { kind: "all_failed"; trail: string[] }
  | { kind: "budget_exhausted"; trail: string[] };

async function arbitrate(
  candidates: Candidate[],
  gates: Gate[],
  opts: { maxVerifyMs: number },
): Promise<Verdict> {
  const trail: string[] = [];
  const start = Date.now();
  let best: { id: string; score: number } | null = null;

  for (const c of candidates) {
    // The budget is checked between candidates, never in the middle of one.
    if (Date.now() - start > opts.maxVerifyMs) {
      trail.push(`budget: stopped before verifying ${c.id}`);
      return best
        ? { kind: "adopt", id: best.id, score: best.score, trail }
        : { kind: "budget_exhausted", trail };
    }

    let score = 100;
    let eliminated = false;
    await c.apply();
    try {
      for (const g of gates) {
        const r = await g.run();
        trail.push(`${c.id}/${g.name}: ${r.passed ? "ok" : "ng"} ${r.note}`);
        if (!r.passed && g.fatal) {
          eliminated = true;
          break;
        }
        if (!r.passed) score -= r.penalty;
      }
    } finally {
      // Roll back even if a gate throws, so the next candidate starts clean.
      await c.revert();
    }

    if (eliminated) continue;
    // Ties go to the first arrival (pass candidates in a stable order for determinism)
    if (!best || score > best.score) best = { id: c.id, score };
  }

  if (!best) return { kind: "all_failed", trail };
  return { kind: "adopt", id: best.id, score: best.score, trail };
}
```
There are three design points. First, always cycle apply → verify → revert per candidate so candidates do not pollute each other's state. Second, eliminate early with fatal gates to skip wasted verification. Third, record every decision into trail. Being able to reproduce later why a candidate was chosen (or dropped) is a lifeline in autonomous operation.
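To show what swapping in your own verification functions could look like, here is a sketch of a gate built around an external command. Everything beyond the `Gate` and `GateResult` shapes is an assumption: `RunCommand` is a hypothetical adapter you would back with your own process runner (Node's `child_process`, for example), the penalties are arbitrary, and `scripts/smoke.js` is a placeholder path. The flat per-gate penalty is a simplification of deducting by failure count.

```typescript
type GateResult = { passed: boolean; penalty: number; note: string };
type Gate = { name: string; run: () => Promise<GateResult>; fatal: boolean };

// Hypothetical adapter: run a command and report how it ended.
type CommandOutcome = { exitCode: number; output: string };
type RunCommand = (cmd: string, args: string[]) => Promise<CommandOutcome>;

function commandGate(
  name: string,
  cmd: string,
  args: string[],
  opts: { fatal: boolean; penalty: number },
  runCommand: RunCommand,
): Gate {
  return {
    name,
    fatal: opts.fatal,
    run: async () => {
      try {
        const { exitCode, output } = await runCommand(cmd, args);
        const passed = exitCode === 0;
        return {
          passed,
          penalty: passed ? 0 : opts.penalty,
          note: passed ? "" : output.slice(0, 200), // keep the trail readable
        };
      } catch (err) {
        // A gate that could not run counts as failed, never as passed.
        return { passed: false, penalty: opts.penalty, note: `runner error: ${String(err)}` };
      }
    },
  };
}

// Cheapest gate first; only the static check is fatal.
function buildGates(runCommand: RunCommand): Gate[] {
  return [
    commandGate("types", "npx", ["tsc", "--noEmit"], { fatal: true, penalty: 100 }, runCommand),
    commandGate("tests", "npx", ["vitest", "run"], { fatal: false, penalty: 20 }, runCommand),
    commandGate("smoke", "node", ["scripts/smoke.js"], { fatal: false, penalty: 30 }, runCommand),
  ];
}
```

Injecting `runCommand` keeps the gates testable with a fake runner, and it keeps the decision about timeouts and working directories in one place.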
How to Handle Ties, Total Failure, and Budget Exhaustion
The unhappy paths matter more in design than the happy one.
On a tie, you may be tempted to fall back to majority vote, but here we take the first arrival in stable order. Choosing at random makes the result change between runs for the same input, breaking reproducibility. Always pass candidates in the same order and keep arbitration deterministic.
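A stable order has to come from somewhere other than completion time, since parallel generation finishes in a different order on each run. The sketch below assumes sorting by candidate id is an acceptable definition of "stable"; `stableOrder` and `pickBest` are hypothetical helpers.

```typescript
type Scored = { id: string; score: number };

// Sort a copy by id so the order never depends on which generation finished first.
function stableOrder<T extends { id: string }>(items: T[]): T[] {
  return [...items].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}

// Strict ">" means a later candidate must beat the current best, so ties keep the earlier one.
function pickBest(items: Scored[]): Scored | null {
  let best: Scored | null = null;
  for (const item of stableOrder(items)) {
    if (best === null || item.score > best.score) best = item;
  }
  return best;
}

// The same candidates arriving in two different orders.
const run1: Scored[] = [
  { id: "c2", score: 90 },
  { id: "c0", score: 90 },
  { id: "c1", score: 80 },
];
const run2: Scored[] = [
  { id: "c0", score: 90 },
  { id: "c1", score: 80 },
  { id: "c2", score: 90 },
];

console.log(pickBest(run1)?.id, pickBest(run2)?.id); // c0 c0
```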
When every candidate fails (all_failed), do not auto-adopt. Push the candidate with the smallest penalty onto a "waiting for human review" queue and keep it out of the mainline. Forcing one through here leads to the worst accident: a broken change merged automatically.
Budget exhaustion (budget_exhausted) is insurance for when verification takes too long. Once maxVerifyMs is exceeded, take the provisional best so far, or abort if there is none. If you run parallel agents overnight, capping per-task verification is essential — otherwise one slow run stalls the whole batch.
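The three outcomes map naturally onto three downstream actions. This is a sketch of one possible routing, assuming the `Verdict` shape from the arbiter; the `Action` type and `route` function are hypothetical names, and what "merge" or "human review" means in practice depends on your pipeline.

```typescript
type Verdict =
  | { kind: "adopt"; id: string; score: number; trail: string[] }
  | { kind: "all_failed"; trail: string[] }
  | { kind: "budget_exhausted"; trail: string[] };

// Hypothetical downstream actions.
type Action =
  | { type: "merge"; id: string }
  | { type: "human_review"; reason: string; trail: string[] }
  | { type: "abort"; reason: string; trail: string[] };

function route(verdict: Verdict): Action {
  switch (verdict.kind) {
    case "adopt":
      return { type: "merge", id: verdict.id };
    case "all_failed":
      // Never auto-adopt here. The trail tells the reviewer how far each candidate got.
      return { type: "human_review", reason: "every candidate failed", trail: verdict.trail };
    case "budget_exhausted":
      // No provisional best existed when the budget ran out.
      return { type: "abort", reason: "verification budget exceeded", trail: verdict.trail };
    default: {
      // Compile-time guard: adding a new verdict kind forces this function to be updated.
      const unreachable: never = verdict;
      throw new Error(`unhandled verdict: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Only the `adopt` branch can reach the mainline, so the "no auto-adopt on total failure" rule is enforced by the shape of the code rather than by convention.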
How Many Candidates Should You Generate?
Larger N is not always better. Adoption quality shows diminishing returns while verification cost keeps climbing. Across the routine fix tasks I observed, the trend looked roughly like this.
| Candidates N | Share of tasks where at least one candidate passes every gate | Verification time per task |
| --- | --- | --- |
| 1 | ~60% | ~12s |
| 3 | ~88% | ~20s |
| 5 | ~93% | ~27s |
The jump from one to three candidates is large; three to five is small. I default to N=3 for routine tasks and only raise it to N=5 for task types that have failed often in the past. Because Flash keeps the generation side roughly flat, what really matters is not letting the verification side balloon — which is what staged gates are for.
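As a sanity check on those figures, it helps to compare them with what fully independent candidates would give: if each candidate passes with probability p, at least one of N passes with probability 1 − (1 − p)^N. The sketch below computes that baseline from the approximate N=1 rate and adds a hypothetical `chooseN` policy. The `TaskHistory` shape and the 20% threshold are assumptions, not values from my measurements.

```typescript
// P(at least one of n candidates passes) if candidates failed independently.
function independentPassRate(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Approximate observed shares from the table above.
const observed: Record<number, number> = { 1: 0.6, 3: 0.88, 5: 0.93 };

for (const n of [1, 3, 5]) {
  const baseline = independentPassRate(observed[1], n);
  console.log(
    `N=${n} independent=${(baseline * 100).toFixed(1)}% observed=${(observed[n] * 100).toFixed(0)}%`,
  );
}

// Hypothetical per-task-type history: how often every candidate failed.
type TaskHistory = { attempts: number; allFailed: number };

// Default to 3; raise to 5 only for task types that have failed often.
function chooseN(history: TaskHistory | undefined, failureThreshold = 0.2): number {
  if (!history || history.attempts === 0) return 3;
  return history.allFailed / history.attempts > failureThreshold ? 5 : 3;
}
```

The independent baseline comes out at 93.6% for N=3 and about 99.0% for N=5, above the observed ~88% and ~93%. That gap is consistent with candidate errors being correlated, which is one more reason extra candidates buy less than the formula suggests.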
Three Checks Before Putting It Into Operation
Before wiring this into an autonomous pipeline, I always confirm these three points.
Are the verification gates truly independent of generation? If you have the same agent write the type checks or tests themselves, independence collapses. Fix the gates by hand and let the agent handle only generation.
Does revert reliably roll back? If the worktree stays dirty when you apply the next candidate, the score gets dragged by the previous one. Isolating each candidate with git worktree is the solid choice.
Is the path designed so total failure never enters the mainline? This is the one place to give up on automation and hand off to a human. Keeping that rule intact is your safety valve for long-term operation.
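For the second check, isolating candidates with git worktree could look something like the following. The sketch only builds the command lines and leaves execution to your own runner. `worktreePlan`, the `.bestofn` directory, and the idea that each candidate arrives as a patch file are assumptions for illustration; the git subcommands themselves (`worktree add --detach`, `-C <dir> apply`, `worktree remove --force`) are standard.

```typescript
type Argv = string[];
type WorktreePlan = { dir: string; setup: Argv[]; teardown: Argv[] };

// Build the git commands that give one candidate its own checkout.
// patchPath should be absolute, because `git -C <dir>` resolves relative paths from <dir>.
function worktreePlan(
  candidateId: string,
  baseRef: string,
  patchPath: string,
  root = ".bestofn",
): WorktreePlan {
  // Refuse ids that could escape the root directory.
  if (!/^[A-Za-z0-9_-]+$/.test(candidateId)) {
    throw new Error(`unsafe candidate id: ${candidateId}`);
  }
  const dir = `${root}/${candidateId}`;
  return {
    dir,
    setup: [
      ["git", "worktree", "add", "--detach", dir, baseRef], // fresh checkout of the base
      ["git", "-C", dir, "apply", patchPath], // apply this candidate's change only
    ],
    // --force because the worktree is dirty by design after the patch is applied.
    teardown: [["git", "worktree", "remove", "--force", dir]],
  };
}

const plan = worktreePlan("c1", "main", "/tmp/c1.patch");
console.log(plan.setup.map((argv) => argv.join(" ")).join("\n"));
```

With this layout `apply` runs the setup commands and `revert` runs the teardown, and the gates execute inside `plan.dir`. Because every candidate starts from a fresh checkout of the same base, a failed rollback cannot leak into the next candidate's score.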
Rather than adding more candidates, build verification to be independent and staged. In an era where generation has become fast and cheap, I believe what pays off is the design of the side that chooses. I hope this gives a useful handhold to anyone wrestling with the same problem.