Antigravity × Lighthouse CI: Catching Web Performance Regressions Automatically with Budgets, PR Comments, and Progressive Blocking
Wire Antigravity's AI to Lighthouse CI inside GitHub Actions and stop performance regressions before they reach production. This guide covers budget design, PR comments, progressive blocking, RUM integration, and cost controls — all in a shape that holds up in real teams.
"After the release, the first paint somehow felt heavier" — anyone who has run a web product for long enough has lived this moment. I have, on sites I run myself. Watching Lighthouse scores by hand never lasts as a habit, and the result is a quiet backslide where Largest Contentful Paint creeps up by a few hundred milliseconds a month and nobody notices. By the time the CrUX report or your real-user monitoring dashboard surfaces the trend, you're trying to remember which release out of the last six was the actual culprit.
This guide builds a pipeline that combines Antigravity's AI with Lighthouse CI on GitHub Actions to structurally prevent that drift. We don't stop at running Lighthouse and writing the result to a dashboard. We let the AI read the delta and the PR diff, then write back a comment with hypotheses about the cause and a concrete next step. We also close the loop back to production by feeding RUM data into the budget, so the lab numbers and the field numbers stay in conversation.
Why "we just noticed it got slow" is unacceptable
Performance regressions are nasty precisely because nothing breaks. They never show up as a red CI light, users rarely report them, and yet bounce and exit rates quietly degrade. I once shipped a PR that swapped a hero image from WebP to PNG. No one caught it on review. A week later, CrUX field data was the first to surface the problem, and by then four other releases had landed on top of it. Untangling which change actually caused the regression took an afternoon I did not have.
You cannot solve this by "being careful." There are too many surfaces to watch — image formats, third-party scripts, font swaps, hydration costs, route bundle sizes, the new dependency someone added "just for this one feature" — and human attention does not scale across all of them. The realistic answer is to encode a budget into CI and stop the build when it is exceeded.
The reason to add Antigravity's AI on top is that bare numbers leave the reviewer asking "okay, but what do I do about it?" A red CI with a 300 ms TBT increase tells the reviewer nothing about which file or dependency caused the increase. Hand the AI the diff and the metrics, and let it write the hypothesis and the next move into the PR comment. Reviews start moving again, and the bot becomes something the team trusts rather than something they route around.
There is also a quieter benefit that becomes apparent only after a few months of running this: the AI comments themselves form a corpus. When a new engineer joins the team and asks "why do we always argue about font preloading?", you can point them at six months of PR comments where the bot explained the same regression mechanism and the team's response. The comment thread becomes institutional memory.
The big picture — Antigravity, Lighthouse CI, and GitHub Actions
This pipeline has three layers. Naming the responsibilities up front saves you from confusion later.
Lighthouse CI: runs Lighthouse three times against the preview deployment and collects metrics. We use the median to keep flakiness down. Lighthouse CI is also what enforces the pass/fail decision; there is no AI in this layer at all.
Performance Budget: the set of thresholds, declared in lighthouserc.js as assertions such as "LCP under 2,500 ms" and "TBT under 200 ms." This file is the contract between the team and the bot, and it lives in version control like any other contract.
Antigravity Agent: when the budget is exceeded, it reads the PR diff, the Lighthouse JSON, and the previous main numbers, and writes a hypothesis plus a suggestion back as a PR comment. The agent never decides whether the build passes; it only explains.
The split that matters: Lighthouse decides, AI explains. If you let the AI also decide pass/fail, the pipeline becomes non-deterministic and trust in CI evaporates. The same PR can pass on Monday and fail on Tuesday with no code change, simply because the model felt different that day. That is the fastest way to make engineers ignore the bot.
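To make the "Lighthouse decides, AI explains" split concrete, here is a minimal sketch of what a deterministic gate looks like. It is not Lighthouse CI's actual implementation; `median`, `evaluateBudget`, and the shape of the `budget` object are names I made up for illustration. The point is only that the verdict is a pure function of the numbers.

```javascript
// Median of a list of numbers (the representative value of several runs).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Deterministic gate: compares per-metric medians with the budget.
// `runs` is an array of { metricId: value } objects, one per Lighthouse run.
// `budget` maps metricId -> { level: "warn" | "error", max: number }.
// Both shapes are hypothetical, chosen for this sketch.
function evaluateBudget(runs, budget) {
  const findings = [];
  for (const [id, { level, max }] of Object.entries(budget)) {
    const value = median(runs.map(run => run[id]));
    if (value > max) findings.push({ id, level, value, max });
  }
  // Only "error" findings fail the build; "warn" findings are reported only.
  const passed = findings.every(f => f.level !== "error");
  return { passed, findings };
}

const verdict = evaluateBudget(
  [
    { "total-blocking-time": 280, "largest-contentful-paint": 2700 },
    { "total-blocking-time": 340, "largest-contentful-paint": 2400 },
    { "total-blocking-time": 310, "largest-contentful-paint": 2600 },
  ],
  {
    "total-blocking-time": { level: "error", max: 300 },
    "largest-contentful-paint": { level: "warn", max: 2500 },
  }
);
console.log(verdict.passed, verdict.findings.map(f => f.id));
```

The explanation layer would receive `findings` as input and has no way to change `passed`, which is what keeps the same PR from passing on Monday and failing on Tuesday.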
Step 1: Setting the budget — start from "the lot you already own"
The first hard part is choosing the threshold. If you start from textbook values (LCP < 2,500 ms), almost every PR will go red and within two weeks no one will look at the bot. The rule I use on my own sites is simple: take the current median and allow about 5% of headroom on top of it. That gives you a budget with the same shape as your current performance, and it puts pressure on regressions without pretending that you've already solved performance.
The minimum lighthouserc.js looks like this.
```js
// lighthouserc.js: Performance Budget
// Lighthouse CI reads this; non-zero exit when thresholds fail.
module.exports = {
  ci: {
    collect: {
      // Pass the preview URL (Vercel, Cloudflare Pages, etc.) via env.
      url: [process.env.PREVIEW_URL],
      numberOfRuns: 3, // median of three runs to suppress flakiness
      settings: {
        preset: "desktop", // run mobile as a separate job
        throttlingMethod: "simulate",
      },
    },
    assert: {
      assertions: {
        // Use "current median × 1.05" as your starting threshold.
        "categories:performance": ["error", { minScore: 0.85 }],
        "largest-contentful-paint": ["warn", { maxNumericValue: 2500 }],
        "total-blocking-time": ["error", { maxNumericValue: 300 }],
        "cumulative-layout-shift": ["error", { maxNumericValue: 0.1 }],
        // Bundle bloat shows up earlier than CLS or TBT regressions.
        "resource-summary:script:size": ["warn", { maxNumericValue: 350000 }],
      },
    },
    upload: {
      target: "temporary-public-storage", // gives a result URL we can paste into the PR comment
    },
  },
};
```
What this should produce: for any PR with a Performance score below 0.85, TBT above 300 ms, or CLS above 0.1, lhci autorun exits non-zero. LCP and script size only warn (they do not block), and the upload step prints a result URL that can be pasted into the PR comment. Mixing error and warn lets you onboard a team gradually instead of stopping every PR from day one.
Why mix in warn instead of going all-error? Because if you tighten the budget hard at the start, your team will mentally install a "Lighthouse CI is noise, ignore it" filter. Once that filter exists, the pipeline is effectively dead. Start with warn, let the comments build credibility, and promote to error after that. There's no rush — performance budgets are infrastructure, and infrastructure earns trust by being calm and predictable for a few months before it starts demanding things.
One more nuance: keep your bundle-size budgets a step ahead of your timing budgets. Bundle growth is a leading indicator. By the time TBT or LCP regress visibly, you've usually already shipped the extra JavaScript that caused it. A warn on resource-summary:script:size catches the regression a week before it shows up in user-perceived metrics, and that gives the team time to react with a refactor instead of a panic.
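The "current median × 1.05" starting point from the config comment can be computed rather than eyeballed. The sketch below assumes you have a handful of recent measurements from main per audit; `startingBudget`, `history`, and `headroom` are hypothetical names, and the rounding suits millisecond and byte values only (CLS would need decimal handling).

```javascript
// Turn recent measurements from main into starting thresholds.
// `history` maps an audit id to the values observed on recent main builds.
// `headroom` is the multiplier applied to the median (1.05 = 5% slack).
function startingBudget(history, headroom = 1.05) {
  const median = values => {
    const s = [...values].sort((a, b) => a - b);
    const m = Math.floor(s.length / 2);
    return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
  };
  const assertions = {};
  for (const [auditId, values] of Object.entries(history)) {
    // Everything starts as "warn"; promotion to "error" comes later.
    assertions[auditId] = [
      "warn",
      { maxNumericValue: Math.round(median(values) * headroom) },
    ];
  }
  return assertions;
}

const assertions = startingBudget({
  "largest-contentful-paint": [1800, 1900, 2000, 2100, 1850],
  "total-blocking-time": [140, 160, 152],
  "resource-summary:script:size": [300000, 310000, 305000],
});
console.log(assertions);
```

The returned object has the same shape as the `assertions` block in lighthouserc.js, so it can be pasted in or loaded from a generated file.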
Step 2: GitHub Actions — measure only after the preview is up
Run Lighthouse only after Vercel / Cloudflare Pages / Netlify finishes deploying. If you measure too early, the build is not ready and Lighthouse picks up a 404 — and the score for a 404 page is meaningless. Worse, the AI will dutifully analyze the regression of "the page got smaller and faster" and produce nonsense.
```yaml
# .github/workflows/lighthouse-ci.yml
name: Lighthouse CI

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  lhci:
    runs-on: ubuntu-latest
    # Skip PRs from forks; they run without access to repository secrets.
    if: github.event.pull_request.head.repo.fork == false
    steps:
      - uses: actions/checkout@v4

      # Wait for the preview to finish deploying. For Vercel, the
      # deployment_status event is the most reliable trigger.
      - name: Wait for Preview Deployment
        id: wait
        uses: patrickedqvist/wait-for-vercel-preview@v1.3.1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          max_timeout: 600 # up to 10 minutes
          environment: Preview

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Run Lighthouse CI
        env:
          PREVIEW_URL: ${{ steps.wait.outputs.url }}
        run: |
          npm install -g @lhci/cli@0.13.x
          lhci autorun || echo "LHCI_FAILED=true" >> "$GITHUB_ENV"

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: lhci-report
          path: .lighthouseci/
          # .lighthouseci is a dot-directory; recent v4 releases skip hidden
          # files unless this is set.
          include-hidden-files: true
          retention-days: 14

      - name: Trigger Antigravity AI Analysis
        if: env.LHCI_FAILED == 'true'
        # The AI analysis job we cover in detail in Step 4.
        # Dispatching a workflow requires a token with actions: write.
        run: gh workflow run antigravity-perf-analysis.yml -F pr=${{ github.event.number }}
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      # Re-raise the recorded failure so the check still goes red.
      - name: Fail the job when the budget was exceeded
        if: env.LHCI_FAILED == 'true'
        run: exit 1
```
The trick here is || echo "LHCI_FAILED=true" — we don't swallow the failure, we record the fact and use it to trigger a follow-up AI job. If you let a non-zero exit kill the whole pipeline at this step, the AI analysis never runs and the reviewer is left with "the number got worse" and nothing else. That is the worst possible UX for a CI bot. The reviewer either reads the raw Lighthouse JSON themselves (they won't) or merges anyway (they will).
The artifact upload is also load-bearing. By keeping .lighthouseci/ for fourteen days, you can re-run the AI analysis after the fact when someone says "wait, that comment was wrong" — without needing to redeploy or re-measure. Forensics matter when you're trying to build trust.
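The workflow above leans on a Vercel-specific action to wait for the preview. On hosts without a ready-made equivalent, the "measure only after the preview is up" rule can be approximated with a small poll loop. This is a sketch under assumptions: `waitForPreview` and its options are hypothetical names, and a 2xx response is treated as "ready", which will not catch a host that serves a placeholder page with status 200.

```javascript
// Poll a preview URL until it answers with a 2xx status, or give up.
// `fetchImpl` and `sleep` are injectable so the loop can be tested offline.
async function waitForPreview(url, {
  attempts = 30,
  intervalMs = 10000,
  fetchImpl = fetch,
  sleep = ms => new Promise(resolve => setTimeout(resolve, ms)),
} = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetchImpl(url, { redirect: "follow" });
      if (res.ok) return { ready: true, attempts: i };
    } catch {
      // Network errors are expected while the deployment is still starting.
    }
    // Do not sleep after the final attempt.
    if (i < attempts) await sleep(intervalMs);
  }
  return { ready: false, attempts };
}
```

If `ready` comes back false, the job should stop before Lighthouse runs, so that neither the budget check nor the AI ever sees numbers from a 404 page.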
Step 3: Snapshot main and compute the diff
"It got worse" is not enough information. To let the AI name a likely culprit, you need both the PR's numbers and main's numbers. Without a baseline, the AI cannot reason about deltas; it can only describe absolute states.
```bash
#!/usr/bin/env bash
# scripts/lhci-diff.sh: Build a diff table from PR vs main Lighthouse JSON
set -euo pipefail

PR_REPORT=".lighthouseci/lhr-pr.json"
MAIN_REPORT=".lighthouseci/lhr-main.json"

if [ ! -f "$PR_REPORT" ] || [ ! -f "$MAIN_REPORT" ]; then
  echo "❌ Required JSON files are missing" >&2
  exit 1
fi

# Pull four indicators with node and emit a CSV.
node -e '
const pr = require("./.lighthouseci/lhr-pr.json");
const base = require("./.lighthouseci/lhr-main.json");
const k = ["largest-contentful-paint", "total-blocking-time", "cumulative-layout-shift", "speed-index"];
// CLS is a unitless score well below 1, so keep three decimals for it;
// the other metrics are milliseconds and round to whole numbers.
const fmt = (id, n) =>
  typeof n !== "number" ? n : n.toFixed(id === "cumulative-layout-shift" ? 3 : 0);
console.log("metric,base,pr,delta");
for (const id of k) {
  const a = base.audits[id].numericValue;
  const b = pr.audits[id].numericValue;
  console.log(`${id},${fmt(id, a)},${fmt(id, b)},${fmt(id, b - a)}`);
}
' > .lighthouseci/diff.csv

cat .lighthouseci/diff.csv
```
Feeding the CSV — not the absolute numbers — to the AI shifts its reasoning to "what got worse" instead of "what is bad." Why does the diff matter so much? With absolute values alone, the AI conflates "this site has always been slow" with "this PR made it slow," and starts blaming unrelated files. Adding the diff lifts hypothesis quality by a noticeable margin, and the comments stop reading like generic performance lectures.
The MAIN_REPORT snapshot is worth a separate sentence. You need to be running Lighthouse against main on a schedule — daily is fine — and storing the result as the reference. A common mistake is to recompute the main baseline inline during the PR job, which doubles your CI time and means main's number depends on the runner's current load, not on main's actual state. Keep the baseline a separate, scheduled job and treat its output as ground truth.
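A scheduled baseline only counts as ground truth while it is reasonably fresh. Here is a minimal sketch of picking the reference snapshot, assuming the scheduled job stores one record per run; `pickBaseline`, the snapshot shape, and the two-day staleness limit are my assumptions, not something Lighthouse CI provides.

```javascript
// Pick the newest stored main snapshot and flag it when it is too old.
// `snapshots` is an array of { sha, takenAt (ISO string), metrics }.
function pickBaseline(snapshots, now = new Date(), maxAgeDays = 2) {
  if (snapshots.length === 0) return { baseline: null, stale: true };
  const newest = snapshots.reduce((a, b) =>
    Date.parse(a.takenAt) >= Date.parse(b.takenAt) ? a : b
  );
  const ageDays = (now.getTime() - Date.parse(newest.takenAt)) / 86_400_000;
  return { baseline: newest, stale: ageDays > maxAgeDays };
}
```

When `stale` is true, it is probably safer to post the metrics table without a delta column than to let the AI reason from an outdated comparison.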
Step 4: Let Antigravity AI explain the regression
This is the heart of the pipeline. Hand the diff CSV, the changed file list, and the package.json diff to the Antigravity Agent SDK (or to gemini-3-pro invoked via CLI inside Actions) and ask for hypotheses.
```js
// scripts/perf-explain.mjs: Ask the AI for hypotheses about the regression
// Called from CI; emits Markdown for the PR comment to stdout.
import fs from "node:fs/promises";
import { runAgent } from "@google/antigravity-agent"; // illustrative SDK wrapper

const diffCsv = await fs.readFile(".lighthouseci/diff.csv", "utf8");
const changedFiles = await fs.readFile(".changed-files.txt", "utf8");
const pkgDiff = await fs.readFile(".pkg-diff.txt", "utf8").catch(() => "");

const prompt = `You are a Web Performance expert. From the inputs below, propose up to three
hypotheses for what most likely caused the regression in this PR. Each
hypothesis MUST be grounded in a number from the diff CSV or in a specific
changed file. Skip hypotheses you cannot ground.

# Lighthouse delta (main → PR)
${diffCsv}

# Changed files
${changedFiles}

# package.json diff (dependency changes)
${pkgDiff || "(no changes)"}

# Output format
- Hypothesis 1: ...
  - Evidence: ...
  - Next step: ...
- Hypothesis 2: ...
(Up to three. If you only have one confident hypothesis, return only one.)`;

const result = await runAgent({
  model: "gemini-3-pro",
  prompt,
  temperature: 0.2, // keep CI inference low-variance
  maxOutputTokens: 1200, // keep PR comments short enough to read
});

// Don't swallow failures; surface them to the CI log.
if (!result.text) {
  console.error("❌ AI returned an empty response");
  process.exit(1);
}
console.log(result.text);
```
Expected output (the Markdown that becomes the PR comment) looks roughly like this:
```markdown
- Hypothesis 1: hero image swap inflated the initial paint
  - Evidence: LCP +820 ms, changed files include `public/hero.png`
  - Next step: convert to WebP/AVIF and verify `loading="eager"` is preserved
- Hypothesis 2: client JS bundle grew
  - Evidence: TBT +170 ms, `package.json` adds `chart.js`
  - Next step: dynamic import the chart code, or remove unused features
```
temperature: 0.2 because CI inference does not need flair — reviewers want predictable outputs they can pattern-match against past comments. maxOutputTokens: 1200 to prevent runaway cost; we revisit cost in Step 7.
The two phrases that earn the AI its keep are "Skip hypotheses you cannot ground" and "each hypothesis MUST be grounded in a number from the diff CSV." Without these instructions, the model will pad to three hypotheses every time, even when it only has one real lead, and reviewers learn to distrust the bot. With them, the model occasionally returns one well-grounded hypothesis instead of three speculations — and that is exactly the behavior you want from a CI assistant.
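Prompt instructions are a request, not a guarantee, so it can help to enforce the grounding rule after the fact as well. The sketch below is one possible post-filter, not part of any SDK: `filterGrounded` is a hypothetical helper that assumes the output format from the prompt above and keeps only hypotheses whose Evidence line mentions a changed file or a delta that really appears in the diff CSV. The number matching is deliberately crude and can produce false positives.

```javascript
// Keep only hypotheses whose Evidence line cites a changed file or a delta
// that actually appears in the diff CSV.
function filterGrounded(markdown, diffCsv, changedFiles) {
  // Deltas from the CSV (fourth column), skipping the header row.
  const deltas = diffCsv.trim().split("\n").slice(1)
    .map(line => line.split(",")[3])
    .filter(d => d !== undefined && Number(d) !== 0)
    .map(d => String(Math.abs(Number(d))));

  // Split on top-level "- Hypothesis N:" bullets, keeping each block intact.
  const blocks = markdown
    .split(/^(?=- Hypothesis \d+:)/m)
    .filter(b => b.trim());

  return blocks.filter(block => {
    const evidence = (block.match(/^\s*- Evidence:(.*)$/m) || [])[1] || "";
    const citesFile = changedFiles.some(f => evidence.includes(f));
    const numbers = evidence.match(/\d+(?:\.\d+)?/g) || [];
    const citesDelta = numbers.some(n => deltas.includes(String(Number(n))));
    return citesFile || citesDelta;
  });
}
```

If the filter removes everything, posting the metrics table alone is a reasonable fallback; an empty hypothesis section is better than an ungrounded one.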
Step 5: Designing the PR comment — return a next step, not just a number
This is more about design than code. To stop the AI from becoming "that annoying bot," I follow three rules without exception.
One comment per PR, edit-in-place: use gh pr comment --edit-last or Octokit's issues.updateComment to update the existing comment. If you append a new comment per push, the review surface fills with AI noise and human comments get drowned.
Always order it metrics table → hypotheses → next step: with the numbers at the top, humans actually read the comment instead of scrolling past. Hypotheses without numbers above them feel like opinion; numbers without hypotheses below them feel like blame without analysis.
Never comment on metrics that didn't fail: "LCP was fine!" is what the CI status badge is for. AI bots earn trust by talking only about the negative. Compliments from the bot dilute the signal of its complaints.
The comment template I use:
```markdown
## ⚠️ Lighthouse Performance Regression Detected

| Metric | main | this PR | Δ |
|---|---|---|---|
| LCP (ms) | 1820 | 2640 | **+820** |
| TBT (ms) | 140 | 310 | **+170** |

### 🔍 AI Hypotheses (2)

(output of `perf-explain.mjs` goes here)

---
<sub>Generated by Antigravity AI. Comment `/perf rerun` to re-analyze.</sub>
```
Leaving an entry point like /perf rerun matters. When a reviewer doubts the AI's call, they can re-run on demand, and that turns a one-way bot into a small dialogue. The mental model I aim for is "CI is not just a gatekeeper, it's a conversational partner." Most CI bots fail this test — they shout once and disappear, leaving no recourse for the reviewer except to merge or close.
If you want to take the comment design further, add a <details> section at the bottom that includes the raw Lighthouse trace URL and the model's reasoning steps. Most reviewers will never expand it, but the few who do — usually the engineers who care most about performance — will spot edge cases the model missed and feed those corrections back into your prompt. That feedback loop is how the bot gets sharper over time.
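The three rules from this step (one comment, numbers first, silence on passing metrics) fit in a small render function. This is a sketch with assumed names: `renderComment`, `findBotComment`, the row shape, and the hidden `<!-- perf-bot -->` marker are all mine. The marker is one common way to find the bot's earlier comment when you update through the API instead of relying on --edit-last.

```javascript
// Hidden marker that lets a later run find and update the same comment.
const MARKER = "<!-- perf-bot -->";

// Build the comment body: metrics table first, then hypotheses, then footer.
// `rows` is [{ label, base, pr, failed }]; only failed metrics are shown.
function renderComment(rows, hypothesesMd) {
  const failed = rows.filter(r => r.failed);
  if (failed.length === 0) return null; // nothing failed: stay silent
  const table = [
    "| Metric | main | this PR | Δ |",
    "|---|---|---|---|",
    ...failed.map(r => {
      const delta = r.pr - r.base;
      const sign = delta > 0 ? "+" : "";
      return `| ${r.label} | ${r.base} | ${r.pr} | **${sign}${delta}** |`;
    }),
  ].join("\n");
  return [
    MARKER,
    "## ⚠️ Lighthouse Performance Regression Detected",
    "",
    table,
    "",
    "### 🔍 AI Hypotheses",
    "",
    hypothesesMd.trim(),
    "",
    "---",
    "<sub>Generated by Antigravity AI. Comment `/perf rerun` to re-analyze.</sub>",
  ].join("\n");
}

// Given the PR's existing comments, return the one this bot wrote earlier.
function findBotComment(comments) {
  return comments.find(c => typeof c.body === "string" && c.body.includes(MARKER)) || null;
}
```

When `findBotComment` returns a comment, its id goes to Octokit's `issues.updateComment`; when it returns null, the first comment is created instead.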
Step 6: Progressive blocking — warning, then required, then blocker
Don't make budgets a merge blocker on day one. I run a three-stage rollout.
Stage 1 (first 2–4 weeks): everything is warn. Only the AI comment shows up. The team gets used to "if numbers regress, a comment appears." This is a culture-building stage; the goal is not to enforce anything, only to make the bot's existence visible and predictable.
Stage 2 (next 1–2 months): only TBT and CLS are promoted to error. Start with the metrics that most directly hurt user experience. TBT directly tracks "did the page feel responsive when I clicked?" and CLS directly tracks "did stuff move under my finger?" These are visceral; LCP regressions are easier to argue away.
Stage 3: full Performance score becomes an error. At the same time, introduce an "exception ticket — pay back within 3 business days" workflow for the rare cases where you need to merge anyway.
Why staged? If you go straight to "block on day one" before the review culture has formed, your team invents an opt-out label and quietly normalizes its use. In every team I've watched roll this out, the sites that gave themselves a month of "get used to the bot" experience ended up enforcing budgets far more cleanly later. The teams that tried to enforce on day one ended up with permanent escape hatches that nobody was willing to remove.
A small but useful detail: during Stage 1, post the comment with a clearly visible "This is informational. Merge is not blocked." line near the top. People read comments differently when they know whether they have to act on them.
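One way to keep the rollout honest is to make the stage a single value that lighthouserc.js reads, so promoting a metric is a one-line, reviewable change. A minimal sketch follows, reusing the Step 1 thresholds; `assertionsForStage` and the stage table are hypothetical, and leaving LCP at warn in Stage 3 is my reading of the stages above rather than something they state outright.

```javascript
// Thresholds reuse the Step 1 numbers; only the levels change per stage.
const THRESHOLDS = {
  "categories:performance": { minScore: 0.85 },
  "largest-contentful-paint": { maxNumericValue: 2500 },
  "total-blocking-time": { maxNumericValue: 300 },
  "cumulative-layout-shift": { maxNumericValue: 0.1 },
};

// Which audits are promoted to "error" at each stage; the rest stay "warn".
const BLOCKING = {
  1: [],
  2: ["total-blocking-time", "cumulative-layout-shift"],
  3: ["total-blocking-time", "cumulative-layout-shift", "categories:performance"],
};

// Build an LHCI-style assertions object for the given rollout stage.
function assertionsForStage(stage) {
  const blocking = BLOCKING[stage];
  if (!blocking) throw new Error(`Unknown rollout stage: ${stage}`);
  return Object.fromEntries(
    Object.entries(THRESHOLDS).map(([id, options]) => [
      id,
      [blocking.includes(id) ? "error" : "warn", options],
    ])
  );
}

console.log(assertionsForStage(2));
```

In lighthouserc.js this could be wired up as `assertions: assertionsForStage(Number(process.env.PERF_STAGE || 1))`, where `PERF_STAGE` is an assumed environment variable.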
Step 7: Cost and rate limiting — keep tokens from exploding
Whenever you put AI into CI, think about cost first. On a busy repo, you can suddenly stare at a four-figure monthly bill. The defenses I rely on:
Cap to five runs per PR per day: GitHub Actions concurrency groups plus a tiny KV store (GitHub Variables works as a substitute) handle the counting. The number "five" is not magic; it's "enough that a flaky run doesn't lock anyone out, few enough that a force-push loop doesn't bankrupt you."
Skip AI on tiny PRs: if fewer than three files changed, post the metrics table only. This kills "AI ran on a typo PR" waste.
Always set maxOutputTokens: 1,200–1,500 tokens is the sweet spot for readable PR comments.
Default to a flash-tier model: this kind of inference is not heavy; Gemini Flash is enough. Reserve pro as a second-stage rocket only for cases where Flash didn't land the right hypothesis.
```js
// scripts/should-run-ai.mjs: Decide whether to invoke the AI step
import { execFileSync } from "node:child_process";

const baseSha = process.env.BASE_SHA;
const prSha = process.env.PR_SHA;

// execFileSync passes the SHAs as arguments, so they are never shell-interpreted.
const changedFiles = execFileSync("git", ["diff", "--name-only", baseSha, prSha])
  .toString()
  .trim()
  .split("\n")
  .filter(Boolean);

// Asset-only PRs (images / fonts) are still worth running AI on, even when
// they touch a single file, so check for them before the size threshold.
const onlyAssets =
  changedFiles.length > 0 &&
  changedFiles.every(f => /\.(png|jpe?g|webp|avif|woff2?)$/.test(f));
if (onlyAssets) {
  console.log("RUN_AI=true");
  console.log("REASON=assets-only");
  process.exit(0);
}

// Threshold: skip AI on PRs that touch fewer than three files.
if (changedFiles.length < 3) {
  console.log("SKIP_AI=true");
  process.exit(0);
}

console.log("RUN_AI=true");
```
Run this at the top of the workflow from Step 2 and short-circuit the AI job when SKIP_AI=true. Combined with maxOutputTokens, this took my real-world bill down to about a quarter of the naive setup.
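The five-runs-per-PR-per-day cap from the list above needs only a tiny counter. The sketch below keeps the logic pure and leaves persistence to you; `tryConsumeRun` and the key format are hypothetical, and where the `counters` object lives (a repository variable, a KV store) is an assumption outside this snippet.

```javascript
// Decide whether one more AI run is allowed for this PR today.
// `counters` is a plain object persisted between runs, keyed by
// "<pr>:<YYYY-MM-DD>" (UTC).
function tryConsumeRun(counters, prNumber, now = new Date(), limit = 5) {
  const day = now.toISOString().slice(0, 10); // UTC date
  const key = `${prNumber}:${day}`;
  const used = counters[key] || 0;
  if (used >= limit) return { allowed: false, counters };
  // Return a new object rather than mutating the caller's state.
  return { allowed: true, counters: { ...counters, [key]: used + 1 } };
}
```

Two jobs reading and writing the counter at the same moment can still race, which is where the concurrency group comes in.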
One more cost trick: cache the AI prompt. If two consecutive pushes produce identical diff CSVs and identical changed-file lists (which happens more often than you'd think — re-running CI on a flake, for example), you can return the previous AI comment without invoking the model. A simple SHA over the inputs as a cache key is enough.
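A minimal sketch of that cache key, assuming the inputs are the diff CSV text and an array of changed file paths. `promptCacheKey` is a hypothetical helper; it sorts the file list so ordering differences do not cause a cache miss, and it uses a dynamic import only so the snippet works from both CommonJS and ES modules.

```javascript
// SHA-256 over the normalized AI inputs, returned as a hex string.
async function promptCacheKey(diffCsv, changedFiles) {
  const { createHash } = await import("node:crypto");
  const normalized = JSON.stringify({
    diff: diffCsv.trim(),
    files: [...changedFiles].sort(), // order of the file list should not matter
  });
  return createHash("sha256").update(normalized).digest("hex");
}
```

If the prompt text or the model changes, fold a version string into the hashed object as well; otherwise an old comment would be reused for a prompt that no longer matches.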
Step 8: Closing the loop — feeding RUM data back into the budget
The pipeline so far is closed: Lighthouse measures, AI explains, humans review. But Lighthouse is lab data, and lab data drifts from field data over time. Six months in, you'll discover that your CI says everything is fine while your real users are reporting jank. The fix is to let Real User Monitoring (RUM) data back into the budget periodically.
The simplest implementation is a weekly job that pulls Web Vitals from CrUX (or your RUM provider: Sentry, SpeedCurve, DebugBear, whatever you use), takes the 75th percentile, and proposes updated lighthouserc.js thresholds via a PR. The PR is then reviewed by humans, not auto-merged, because budgets are a contract, and contract changes deserve a human signature.
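```js
// scripts/sync-budget-from-rum.mjs: Weekly job that proposes budget updates
import fs from "node:fs/promises";

// Pull last 7 days of field data (illustrative; replace with your RUM API).
const res = await fetch("https://api.example-rum.com/web-vitals?days=7", {
  headers: { Authorization: `Bearer ${process.env.RUM_TOKEN}` },
});
// fetch() does not reject on HTTP errors, so check the status explicitly.
if (!res.ok) {
  console.error(`❌ RUM API returned ${res.status}`);
  process.exit(1);
}
const fieldData = await res.json();

const fieldLcp = fieldData.percentiles.p75.lcp; // p75 is the standard CrUX threshold
// TBT is a lab metric that CrUX does not report; this field only exists if
// your RUM provider collects an equivalent.
const fieldTbt = fieldData.percentiles.p75.tbt;

// Read current budget and propose a 5% tighter threshold than current field p75.
// Assumption: lighthouserc.js loads lcpBudget / tbtBudget from this JSON file
// instead of hard-coding them; Lighthouse CI itself does not read these keys.
const current = JSON.parse(
  await fs.readFile("lighthouserc.json", "utf8").catch(() => "{}")
);
const proposed = {
  ...current,
  lcpBudget: Math.round(fieldLcp * 0.95),
  tbtBudget: Math.round(fieldTbt * 0.95),
};
await fs.writeFile("lighthouserc.json", JSON.stringify(proposed, null, 2));
console.log("✅ Proposed new budget:", proposed);
```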
Run it weekly via a scheduled GitHub Action and have it open a PR titled "chore(perf): refresh budget from RUM (week of YYYY-MM-DD)." A human reviewer accepts or rejects the change. Over time, your budget stays anchored to reality instead of to a number you guessed in week one.
This step is the difference between a static budget that decays and a living budget that improves. Without it, every team eventually hits a moment where "CI passed but users are unhappy" — and once trust in the budget is broken, the whole pipeline loses its weight.
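If your RUM provider hands back raw samples rather than a ready-made p75, the proposal can be computed locally. The sketch below uses a nearest-rank percentile and the same 5% factor as the script above. `percentile` and `proposeThreshold` are hypothetical helpers, and refusing to loosen the budget automatically is my own design choice, in keeping with the "human signature" rule.

```javascript
// Nearest-rank percentile: the smallest sample such that at least a fraction
// p of the samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

// Propose a new threshold from field samples, 5% tighter than field p75.
// The proposal never exceeds the current threshold; when the field data
// would call for loosening, that is only flagged for a human to decide.
function proposeThreshold(currentMax, fieldSamples, factor = 0.95) {
  const p75 = percentile(fieldSamples, 0.75);
  const candidate = Math.round(p75 * factor);
  return {
    p75,
    proposed: Math.min(currentMax, candidate),
    wouldLoosen: candidate > currentMax,
  };
}

console.log(proposeThreshold(2500, [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400]));
```

A `wouldLoosen: true` result is worth calling out in the PR description, since it usually means the lab budget has drifted below what real users experience.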
Common pitfalls
A few traps I've actually fallen into.
1. Preview-deployment caching makes scores look unrealistically good.
Vercel previews aggressively cache at the edge, and Lighthouse can come back with implausibly fast results. Append a cache-buster like ?_v=${PR_SHA} to the Lighthouse URL. I lost a week before realizing "wait, why is the PR faster than main?" was actually a caching artifact.
2. CI network noise causes flakiness.
Even with numberOfRuns: 3 and the median, runner contention can bounce timings by a hundred milliseconds. Bake a 5% buffer into your thresholds. Don't try to push them to the absolute edge of the current performance.
3. The AI blames unrelated files when given no diff.
If you only pass the changed-files list and skip the diff CSV, the AI will happily claim "you changed README.md, that's why it slowed down." Don't drop Step 3.
4. Multi-comment spam.
Mixing gh pr comment and gh pr comment --edit-last between runs ends with a PR full of AI noise. Standardize on one — --edit-last or Octokit's update API — and stick to it.
5. Forgetting the mobile preset.
A desktop-only run hides regressions that only show up on phones. Run a parallel job that omits preset: "desktop"; Lighthouse emulates a mobile device by default, and there is no "mobile" preset to select. On many sites most real users are on mobile, and the gap between desktop and mobile widens over time as device fragmentation continues.
6. Treating LCP as the only metric that matters.
LCP is famous because it's in the Core Web Vitals branding, but TBT and CLS regressions hurt user trust faster. A 100 ms TBT increase on a list view feels worse than a 300 ms LCP increase on a static page, and your budget should reflect that with stricter TBT thresholds even when LCP is the headline metric in your dashboards.
7. Letting the AI prompt drift over time.
Add a comment in the prompt file with a "last reviewed" date, and put a reminder on your team's calendar to re-read the prompt every quarter. Prompts decay just like code, and a prompt that worked six months ago may now be producing irrelevant comments because your codebase has changed shape.
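For pitfall 1, string-concatenating ?_v=... breaks as soon as the preview URL already carries a query string. A small sketch using the standard URL API avoids that; `withCacheBuster` is a hypothetical helper name, and `_v` is the parameter name used in the pitfall above.

```javascript
// Append a cache-busting query parameter without clobbering existing ones.
function withCacheBuster(previewUrl, sha) {
  const url = new URL(previewUrl);
  url.searchParams.set("_v", sha);
  return url.toString();
}

console.log(withCacheBuster("https://preview.example.com/pricing?ref=pr", "abc1234"));
// prints "https://preview.example.com/pricing?ref=pr&_v=abc1234"
```

The result is what goes into PREVIEW_URL for the Lighthouse run.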
Closing — start with a "warn-only bot" this weekend
Don't try to ship the entire pipeline at once. The smallest useful version is "a warn-only Lighthouse CI you stand up this weekend" — no AI, no comments, just the Performance score showing on the PR.
Run that for two weeks. You'll learn the realistic shape of your own site: the LCP median, the per-PR variance, the pages that regress most often. Once that ground truth is in place, layer in Step 4 onward and you'll avoid the over-engineering trap.