When Nobody Reads Your AI Code Reviewer Anymore — Field Notes on Measuring Actioned-Rate
Our production AI code-review agent quietly went hollow over six months. When the team started silently resolving every comment, we instrumented actioned-rate and false-positive rate to bring it back. These are the field notes.
When we first shipped the agent, everyone replied to its review comments. Six months later I opened a PR one morning and froze. All eleven comments the agent had left were marked resolved, in silence.
No discussion. No fix. No reply. Just quietly closed. The comments still looked as reasonable as they had six months earlier, yet nobody was reading them.
This was not a broken tool. It was a team that had been trained, by sheer volume, to close everything on sight. Even a team of one, a solo indie developer, runs into this. It is a quiet kind of hollowing-out that you cannot see without numbers. These are the field notes on catching that signal through measurement and bringing the agent back to life.
"Running" and "Working" Are Different Metrics
For a long time we judged the health of the review agent by whether it was running. Green CI, comments posted, therefore healthy. Or so we thought.
What actually matters is whether a posted comment led to action. Unless you separate these, hollowing-out stays invisible forever.
| Dimension | "Running" metric | "Working" metric |
|---|---|---|
| Uptime | CI success rate, comment count | - |
| Acceptance | - | Actioned-rate (comments that led to a fix or discussion) |
| Precision | - | False-positive rate (closed as wontfix / not-applicable) |
| Load | - | Median comments per PR |
You can inflate comment count endlessly. The more you inflate it, the lower the actioned-rate falls. Miss that inverse relationship and one day the whole team stops reading.
Harvesting Actioned-Rate From PR Events
The first thing you need is a way to follow what happened to a comment afterward. GitHub review comments retain their resolution state and any replies. We harvest those.
Starting from each comment the agent posted, we classify the human behavior that followed into three buckets. A follow-up commit means actioned, a reply thread means discussed, and resolution with neither means ignored.
# collect_actioned_rate.py
# Measure whether the agent's review comments led to action
import os
import requests
from collections import Counter

REPO = os.environ["REPO"]        # e.g. "owner/name"
BOT = os.environ["BOT_LOGIN"]    # the review agent's account
TOKEN = os.environ["GITHUB_TOKEN"]
H = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}

def paged(url, params=None):
    # Follow the Link header until there is no "next" page
    params = dict(params or {}, per_page=100)
    while url:
        r = requests.get(url, headers=H, params=params, timeout=30)
        r.raise_for_status()
        yield from r.json()
        url = r.links.get("next", {}).get("url")
        params = None  # the next URL already carries the query

def classify(comment):
    # actioned:  a change commit touched the same file after the comment
    # discussed: a human reply exists in the thread
    # ignored:   neither, but resolved
    # reply_count and path_touched_after are derived fields attached during
    # harvesting; they are not part of the raw API payload.
    if comment["reply_count"] > 0:
        return "discussed"
    if comment["path_touched_after"]:
        return "actioned"
    return "ignored"
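classify reads a reply_count that has to be computed somewhere. Here is a minimal sketch of one way to derive it, assuming each harvested comment is a dict shaped like a GitHub REST review comment (id, user.login, and in_reply_to_id on replies) and that every reply points at the first comment of its thread. The function name attach_reply_counts is hypothetical, and resolution state and the follow-up-commit check are left out.

```python
from collections import defaultdict


def attach_reply_counts(comments, bot_login):
    """Return the bot's top-level comments, each with a human reply_count.

    Assumes replies carry in_reply_to_id pointing at the thread's first
    comment, and that top-level comments have no such field (or None).
    """
    human_replies = defaultdict(int)
    for c in comments:
        parent = c.get("in_reply_to_id")
        # Only replies written by someone other than the bot count as discussion
        if parent is not None and c["user"]["login"] != bot_login:
            human_replies[parent] += 1

    roots = []
    for c in comments:
        if c["user"]["login"] == bot_login and c.get("in_reply_to_id") is None:
            roots.append({**c, "reply_count": human_replies[c["id"]]})
    return roots
```

Counting only non-bot replies matters: if the agent answers itself, a thread should not start looking "discussed".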
The key is not to condemn ignored on sight. An INFO-level note deserves to be ignored. It only means something once tied to severity. That is what the next section pulls apart.
The ignored bucket contains two very different things: comments that were wrong (false positives) and comments that were valid but went unread (fatigue). These demand opposite fixes.
A false positive is a quality problem. A fatigue dismissal is a volume problem. Look at them together and you cannot tell whether to chase precision or cut quantity.
To separate them we added one lightweight label at resolution time. Writing a reply is too heavy, so we mapped it to a single reaction emoji.
# Read a resolution reaction as a false-positive signal
FALSE_POSITIVE_REACTIONS = {"-1", "confused"}  # thumbs-down = wrong / confused = off-target

def is_false_positive(comment):
    reactions = {r["content"] for r in comment["reactions"]}
    return bool(reactions & FALSE_POSITIVE_REACTIONS)

def bucket(comment):
    if comment["state"] != "resolved":
        return "open"
    if is_false_positive(comment):
        return "false_positive"   # quality problem
    if comment["reply_count"] == 0 and not comment["path_touched_after"]:
        return "fatigue_ignored"  # volume problem
    return "actioned"
Our numbers at six months: false-positive rate sat at an acceptable 7%, yet fatigue dismissals made up 58% of all comments. The quality was fine. There were simply too many. The moment that became clear, our fix switched from "precision tuning" to "volume reduction."
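The decision between "precision tuning" and "volume reduction" can be sketched as a small tally over bucket labels. This is an illustration, not our production code: bucket_shares and diagnose are hypothetical names, the 0.30 fatigue cutoff is the one used in the weekly report rule, and the 0.08 overall false-positive cutoff is an assumption borrowed from the MEDIUM tolerance.

```python
from collections import Counter

BUCKETS = ("actioned", "false_positive", "fatigue_ignored", "open")


def bucket_shares(labels):
    """Share of each bucket among all comments; all zeros for an empty list."""
    counts = Counter(labels)
    total = max(len(labels), 1)
    return {name: counts[name] / total for name in BUCKETS}


def diagnose(shares, max_false_positive=0.08, max_fatigue=0.30):
    """Return the fixes the shares call for (possibly none, possibly both)."""
    fixes = []
    if shares["false_positive"] > max_false_positive:
        fixes.append("precision_tuning")   # quality problem
    if shares["fatigue_ignored"] > max_fatigue:
        fixes.append("volume_reduction")   # volume problem
    return fixes
```

With a 7% false-positive share and a 58% fatigue share, diagnose returns only "volume_reduction", which matches the call we made.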
Hold a Health Threshold Per Severity
A single overall actioned-rate lets the INFO flood dilute the value of HIGH out of sight. Holding a separate yardstick per severity was the practical move.
Here are the thresholds we run with. Adjust the numbers to your team's tolerance, but they make a useful starting point.
| Severity | Expected actioned-rate | Tolerated false-positive rate | Cap per PR |
|---|---|---|---|
| HIGH (must fix) | 85%+ | under 3% | none (always surface it) |
| MEDIUM (recommended) | 50%+ | under 8% | 5 |
| LOW (optional) | 20%+ | - | 3 |
| INFO (reference) | not measured | - | on request only |
When HIGH actioned-rate drops below threshold, that is a serious alert: comments that genuinely need fixing are going unread. A low LOW actioned-rate, by contrast, is normal. Comments safe to ignore should be ignored.
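One way to turn the thresholds into a check is to encode them as data and compare measured rates against them. The sketch below assumes a measured dict keyed by severity with actioned and false_positive rates; that shape and the name check_severity_health are illustrative, not part of our agent.

```python
# Thresholds copied from the table; None means "not measured".
THRESHOLDS = {
    "HIGH":   {"min_actioned": 0.85, "max_false_positive": 0.03},
    "MEDIUM": {"min_actioned": 0.50, "max_false_positive": 0.08},
    "LOW":    {"min_actioned": 0.20, "max_false_positive": None},
    "INFO":   {"min_actioned": None, "max_false_positive": None},
}


def check_severity_health(measured):
    """Return (severity, problem) pairs for every rate outside its threshold.

    measured maps severity -> {"actioned": float, "false_positive": float}.
    Severities missing from measured are skipped.
    """
    problems = []
    for severity, limits in THRESHOLDS.items():
        rates = measured.get(severity)
        if rates is None:
            continue
        floor = limits["min_actioned"]
        if floor is not None and rates["actioned"] < floor:
            problems.append((severity, "actioned_rate_below_threshold"))
        ceiling = limits["max_false_positive"]
        # "under 3%" means a rate of exactly 3% already counts as too high
        if ceiling is not None and rates["false_positive"] >= ceiling:
            problems.append((severity, "false_positive_rate_too_high"))
    return problems
```

How loudly to react is a separate decision: a HIGH entry in the result deserves an immediate alert, while a LOW entry is usually just a note.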
The first thing that happened after adopting this table: hiding INFO by default alone lifted HIGH actioned-rate from 72% to 89%. We removed no comments. The buried HIGH ones simply became visible again.
Throttle Comment Volume in Stages
Cutting volume all at once backfired. Sudden silence made the team worry the agent had broken. Throttling in stages is what stuck.
There is an order to it. Throttle the noisy INFO/LOW first, measure for a sprint or two, then touch the MEDIUM cap. HIGH stays untouched to the end.
# review-agent.config.yml: settings for staged throttling
severity_gates:
  HIGH:   { max_per_pr: null, always_visible: true }
  MEDIUM: { max_per_pr: 5, sort_by: confidence }   # top 5 by confidence
  LOW:    { max_per_pr: 3, collapse: true }        # collapsed display
  INFO:   { on_demand: true }                      # only on /review --verbose

# Never repeat the same finding within one PR (main source of duplicate fatigue)
dedupe:
  by: [rule_id, file]
  keep: highest_severity
dedupe had the largest effect. Flag the same rule violation twenty times in one file and people stop reading, however correct it is. Folding each rule's repeats within a file into a single highest-severity comment cut total comments by 40% and visibly restored the actioned-rate.
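For readers who want to see what those settings might do to a list of findings, here is a rough sketch of the dedupe and per-severity caps. It assumes each finding is a dict with rule_id, file, severity, and an optional confidence. The function names are hypothetical, and sorting LOW by confidence as well as MEDIUM is my assumption, since the config only specifies it for MEDIUM.

```python
SEVERITY_RANK = {"INFO": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3}
CAPS = {"HIGH": None, "MEDIUM": 5, "LOW": 3}  # None = no cap


def dedupe_findings(findings):
    """Keep one finding per (rule_id, file): the highest-severity one.

    Ties keep the earliest finding; output preserves first-seen key order.
    """
    best = {}
    for f in findings:
        key = (f["rule_id"], f["file"])
        kept = best.get(key)
        if kept is None or SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[kept["severity"]]:
            best[key] = f
    return list(best.values())


def apply_caps(findings, verbose=False):
    """Apply per-severity caps; INFO is included only when verbose is set."""
    shown = []
    for severity in ("HIGH", "MEDIUM", "LOW"):
        group = [f for f in findings if f["severity"] == severity]
        # Highest confidence first; findings without a confidence sort last
        group.sort(key=lambda f: f.get("confidence", 0.0), reverse=True)
        cap = CAPS[severity]
        shown.extend(group if cap is None else group[:cap])
    if verbose:
        shown.extend(f for f in findings if f["severity"] == "INFO")
    return shown
```

Running dedupe_findings before apply_caps keeps duplicates from eating into the MEDIUM and LOW budgets.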
After throttling, always check the numbers. The actioned-rate should rise, and HIGH escapes (bugs a human later finds) should not increase. Only when both hold can you conclude you haven't over-throttled.
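That two-part check is small enough to write down. A minimal sketch, assuming you keep a before and after snapshot with an actioned_rate and a count of high_escapes for comparable periods (both field names are hypothetical):

```python
def throttle_is_safe(before, after):
    """True only when actioned-rate rose and HIGH escapes did not increase."""
    rate_rose = after["actioned_rate"] > before["actioned_rate"]
    escapes_held = after["high_escapes"] <= before["high_escapes"]
    return rate_rose and escapes_held
```

If it returns False because escapes went up, the last throttling step is the first thing to roll back.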
A One-Page Weekly Health Report
Measurement is meaningless unless it continues. We run this aggregation weekly in CI and post a one-page summary to Slack. What I quietly enjoy is that the agent runs it about itself.
# weekly_health.py: generate the weekly summary
# bucket() is the resolution classifier shown earlier;
# median_comments_per_pr() is a helper that is not shown here.
def summarize(comments):
    total = len(comments)
    actioned = sum(1 for c in comments if bucket(c) == "actioned")
    fp = sum(1 for c in comments if bucket(c) == "false_positive")
    fatigue = sum(1 for c in comments if bucket(c) == "fatigue_ignored")
    high = [c for c in comments if c["severity"] == "HIGH"]
    high_rate = sum(1 for c in high if bucket(c) == "actioned") / max(len(high), 1)
    return {
        "actioned_rate": round(actioned / max(total, 1), 3),
        "false_positive_rate": round(fp / max(total, 1), 3),
        "fatigue_ignored_rate": round(fatigue / max(total, 1), 3),
        "high_actioned_rate": round(high_rate, 3),
        "median_per_pr": median_comments_per_pr(comments),
    }

# Rule: high_actioned_rate < 0.85   -> alert immediately
#       fatigue_ignored_rate > 0.30 -> enter the volume-reduction phase
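summarize calls median_comments_per_pr without showing it. A plausible version, assuming each comment dict carries the number of the PR it belongs to under a pr_number key (that key is my assumption, not something the harvesting code above guarantees):

```python
from collections import Counter
from statistics import median


def median_comments_per_pr(comments):
    """Median number of agent comments per pull request (0 when empty).

    Only PRs that received at least one comment are counted.
    """
    per_pr = Counter(c["pr_number"] for c in comments)
    return median(per_pr.values()) if per_pr else 0
```

Because PRs with zero comments never appear in the input, this median describes the load on PRs the agent actually commented on.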
With this one page, hollowing-out changed from something "you notice one morning" into something "whose numbers tilted two weeks ago." Act at the first tilt and you never let it reach the state where everyone closes in silence.
Looking Back
The real enemy of a code-review agent was never the wrong comment itself. It was the indifference that accumulates from too many comments. People stop reading even correct feedback when there is too much of it. And because that shows up as silence, watching comment count alone will never reveal it.
Hold three separate yardsticks: actioned-rate, false-positive rate, and fatigue-ignored rate. Vary the threshold by severity. Throttle in stages. Those three practices brought our agent back from "a tool everyone closes in silence" to "a tool everyone responds to on HIGH."
If your team's review comments are being quietly closed right now, start by measuring the actioned-rate. The numbers usually turn what we already half-sensed into something we can act on. I hope these notes help, and thank you for reading.