When the Tech-Debt Score Drops but the Same Files Keep Breaking — Field Notes on Instrumenting Fan-in and Churn

Letting Antigravity's architecture agent score technical debt is not enough — bugs often recur in the same files after refactoring. Here is how we instrumented the fan-in times churn that static complexity misses, and reconciled the score against real incidents.

The score got healthier, but the fires stayed in the same place

On one project I had Antigravity's architecture agent score technical debt, then worked down the list from the top. Average cyclomatic complexity dropped, and the "critical" files in the report steadily thinned out. On paper, the codebase was getting healthier.

Yet the file names lined up in the month-end incident review barely changed. The recurring failures were landing in files with low scores — the supposedly clean ones.

After seeing that gap clearly once, I stopped letting static complexity alone decide the refactoring order. The score is a necessary signal, but on its own it does not tell you where things actually break. Here are the notes on how I measured that blind spot and filled it, with the code I actually used.

Why complexity scores miss the target

Cyclomatic and cognitive complexity capture how hard a file is to read the moment you open it. Deeply nested, branch-heavy functions really are more accident-prone. No argument there.

The problem is that complexity says nothing about how often a file will be touched, or how many modules a change ripples into. A complex function that is rarely edited keeps running quietly, complexity and all. Meanwhile, a file that looks simple but gets rewritten weekly and is referenced by many modules opens a small tear on every change. That is where the incidents happen.

In our incident data, real risk tracked the product of change frequency (churn) and inbound references (fan-in) far more closely than it tracked complexity. Neither of those falls out of static analysis alone: churn lives in Git history, and fan-in lives in the dependency graph. The score we hand the agent needs both baked in from the start.

Multiply fan-in by churn to surface hotspots

First, pull fan-in (how many modules reference a file) from the dependency graph, pull each file's change count from Git history, and multiply them. The goal is to surface files that are used from many places and rewritten often.

// scripts/hotspot-score.ts
// Surface risk hotspots via fan-in (inbound refs) x churn (change frequency).
// The point is to catch the "simple but frequently broken" files that
// static complexity alone lets sink.
 
import madge from "madge";
import { execSync } from "child_process";
import { writeFileSync, mkdirSync } from "fs";
 
interface Hotspot {
  file: string;
  fanIn: number;        // number of modules referencing this file
  churn: number;        // commits in the last 90 days that touched it
  linesChanged: number; // added + removed lines over the same window
  hotspotScore: number; // log-compressed fanIn x churn
}
 
// Count, per file, how many commits it appeared in over the window
function collectChurn(sinceDays = 90): Map<string, { commits: number; lines: number }> {
  const since = `--since=${sinceDays}.days.ago`;
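  // --numstat emits "added\tremoved\tpath"; binaries show "-" for both counts
  // and are counted as 0 lines below. --no-renames reports a rename as a plain
  // delete + add, so paths never take the "old => new" form.
  const raw = execSync(`git log ${since} --no-renames --numstat --pretty=format:__COMMIT__`, {
    encoding: "utf8",
    maxBuffer: 64 * 1024 * 1024,
  });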
 
  const churn = new Map<string, { commits: number; lines: number }>();
  const seenInCommit = new Set<string>();
 
  for (const line of raw.split("\n")) {
    if (line === "__COMMIT__") {
      seenInCommit.clear();
      continue;
    }
    const m = line.match(/^(\d+|-)\t(\d+|-)\t(.+)$/);
    if (!m) continue;
    const added = m[1] === "-" ? 0 : parseInt(m[1], 10);
    const removed = m[2] === "-" ? 0 : parseInt(m[2], 10);
    const path = m[3];
    if (!/\.(ts|tsx)$/.test(path)) continue;
 
    const cur = churn.get(path) ?? { commits: 0, lines: 0 };
    // avoid double-counting a file within the same commit
    if (!seenInCommit.has(path)) {
      cur.commits += 1;
      seenInCommit.add(path);
    }
    cur.lines += added + removed;
    churn.set(path, cur);
  }
  return churn;
}
 
export async function computeHotspots(srcPath = "./src"): Promise<Hotspot[]> {
  const result = await madge(srcPath, {
    fileExtensions: ["ts", "tsx"],
    excludeRegExp: [/\.test\./, /\.spec\./, /__tests__/],
    tsConfig: "./tsconfig.json",
  });
  const graph = result.obj();
 
  // Aggregate fan-in (how often each dependency is referenced)
  const fanIn = new Map<string, number>();
  for (const deps of Object.values(graph)) {
    for (const dep of deps) fanIn.set(dep, (fanIn.get(dep) ?? 0) + 1);
  }
 
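  const churn = collectChurn(90);

  // madge keys are normally relative to the analyzed directory, while git
  // reports paths relative to the repository root. Assuming the script runs
  // from the repo root, prefix the source directory so both sides use the
  // same keys (and so hotspots.json matches paths from git log).
  const srcPrefix = srcPath.replace(/^\.\//, "").replace(/\/+$/, "");
  const toRepoPath = (p: string): string =>
    srcPrefix === "" || srcPrefix === "." ? p : `${srcPrefix}/${p}`;

  const hotspots: Hotspot[] = [];
  for (const [file, fi] of fanIn.entries()) {
    const repoPath = toRepoPath(file);
    const c = churn.get(repoPath);
    if (!c) continue; // untouched for 90 days -> out of scope this round
    // both fan-in and churn are long-tailed; compress with log before multiplying
    const score = Math.log2(fi + 1) * Math.log2(c.commits + 1);
    hotspots.push({
      file: repoPath,
      fanIn: fi,
      churn: c.commits,
      linesChanged: c.lines,
      hotspotScore: Number(score.toFixed(2)),
    });
  }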
 
  hotspots.sort((a, b) => b.hotspotScore - a.hotspotScore);
  mkdirSync(".architecture-analysis/scores", { recursive: true });
  writeFileSync(
    ".architecture-analysis/scores/hotspots.json",
    JSON.stringify(hotspots.slice(0, 50), null, 2)
  );
  return hotspots;
}
 
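computeHotspots()
  .then((h) => {
    console.log("Top hotspots:");
    for (const s of h.slice(0, 10)) {
      console.log(`  ${s.file}  fanIn=${s.fanIn} churn=${s.churn} score=${s.hotspotScore}`);
    }
  })
  .catch((err) => {
    // Surface git or madge failures instead of leaving an unhandled rejection
    console.error("hotspot scoring failed:", err);
    process.exitCode = 1;
  });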

The log compression matters because both fan-in and churn concentrate extreme values in a few giant files, so the raw product explodes in magnitude. Ranking by the raw product, some incidental huge config file always sits at number one, and the order stops matching intuition. With the log in place, the top ten started to line up with the gut feeling of "yes, this one scares us every time."
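To make the effect of the log concrete, here is a small toy comparison. The three file names and their fan-in and churn values are invented for illustration; only the scoring formula matches the script above.

```typescript
// Toy comparison of the raw fanIn x churn product against the
// log-compressed score. All file names and numbers are hypothetical.
interface Sample {
  file: string;
  fanIn: number;
  churn: number;
}

const samples: Sample[] = [
  { file: "config/constants.ts", fanIn: 400, churn: 3 },
  { file: "billing/invoice.ts", fanIn: 25, churn: 30 },
  { file: "auth/session.ts", fanIn: 12, churn: 45 },
];

// Raw product: dominated by whichever factor is extreme.
const rawScore = (s: Sample): number => s.fanIn * s.churn;

// Same formula as hotspot-score.ts: compress each factor, then multiply.
const logScore = (s: Sample): number =>
  Math.log2(s.fanIn + 1) * Math.log2(s.churn + 1);

// Return file names ordered from highest to lowest score.
function rank(score: (s: Sample) => number): string[] {
  return [...samples].sort((a, b) => score(b) - score(a)).map((s) => s.file);
}

console.log("raw:", rank(rawScore));
console.log("log:", rank(logScore));
```

With these made-up numbers, the raw product puts the widely imported but rarely edited config file first, while the log-compressed score moves it to last place behind the two files that are both referenced and actively changing.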

Calling git log --numstat exactly once for the whole window is also a practical detail. The __COMMIT__ marker tells the parser where each commit starts, which is what lets it count a file at most once per commit. Running git log per file takes minutes on a project with thousands of files. Parsing a single log pass finishes in seconds.

Reconcile the score against real incidents

Producing hotspots is one thing; whether they actually correlate with incidents is another. Skip that check and you have merely added yet another plausible-but-wrong score. So I inserted a reconciliation step: compare the top of the ranking against the files that past incidents actually patched.

The incident source can be anything. In our case, incident-response commits carried a fix: prefix and an incident/<id> trailer, so the touched files are easy to pull.

// scripts/reconcile-score-vs-incidents.ts
// Measure how well the top-N of a ranking captures the files that
// incidents actually patched. Compares the static-complexity ranking
// and the hotspot ranking on the same footing.
 
import { execSync } from "child_process";
import { readFileSync } from "fs";
 
// Extract the set of files that incident-response commits touched
function incidentFiles(sinceDays = 180): Set<string> {
  const raw = execSync(
    `git log --since=${sinceDays}.days.ago --grep='incident/' -i --name-only --pretty=format:__C__`,
    { encoding: "utf8", maxBuffer: 64 * 1024 * 1024 }
  );
  const files = new Set<string>();
  for (const line of raw.split("\n")) {
    if (line === "__C__" || line.trim() === "") continue;
    if (/\.(ts|tsx)$/.test(line)) files.add(line);
  }
  return files;
}
 
// Recall@N: of the incident files, how many appear in the ranking's top N
function recallAtN(ranking: string[], truth: Set<string>, n: number): number {
  const topN = new Set(ranking.slice(0, n));
  let hit = 0;
  for (const f of truth) if (topN.has(f)) hit++;
  return truth.size === 0 ? 0 : Number((hit / truth.size).toFixed(3));
}
 
const truth = incidentFiles(180);
 
const hotspots: { file: string }[] = JSON.parse(
  readFileSync(".architecture-analysis/scores/hotspots.json", "utf8")
);
const complexity: { filePath: string }[] = JSON.parse(
  readFileSync(".architecture-analysis/scores/debt-scores.json", "utf8")
);
 
const hotspotRanking = hotspots.map((h) => h.file);
const complexityRanking = complexity.map((c) => c.filePath);
 
for (const n of [10, 20, 30]) {
  console.log(
    `Recall@${n}  complexity=${recallAtN(complexityRanking, truth, n)}  ` +
      `hotspot=${recallAtN(hotspotRanking, truth, n)}`
  );
}

The first time I ran this, the complexity ranking's Recall@20 sat around 0.2 — only about a fifth of the files that actually failed had made the complexity top 20. The fan-in-times-churn hotspots cleared 0.5 at the same Recall@20. If you are going to have the agent say "fix this first," that number is what decided the hotspot ranking should be the spine.

The important part is not treating this recall as a one-off check. Re-measure it each quarter in production, and you can see whether refactoring is working (hotspot recall drops because incidents fall where you fixed) or whether the fire simply moved elsewhere.
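One way to keep that quarterly view is to store each run's numbers and diff them against the previous quarter. The sketch below assumes a hypothetical record shape (QuarterRecall) and uses invented sample values; it only computes the deltas and leaves the interpretation to the review meeting.

```typescript
// Hypothetical quarterly log of reconciliation results. The record shape
// and the sample values are illustrative assumptions, not measured data.
interface QuarterRecall {
  quarter: string;            // label such as "Q1"
  incidentFiles: number;      // size of the incident file set that quarter
  hotspotRecall20: number;    // Recall@20 of the hotspot ranking
  complexityRecall20: number; // Recall@20 of the complexity ranking
}

interface QuarterDelta {
  quarter: string;
  incidentFilesDelta: number;
  hotspotRecallDelta: number;
}

// Compare each quarter with the one immediately before it.
function quarterOverQuarter(history: QuarterRecall[]): QuarterDelta[] {
  const deltas: QuarterDelta[] = [];
  for (let i = 1; i < history.length; i++) {
    const prev = history[i - 1];
    const cur = history[i];
    deltas.push({
      quarter: cur.quarter,
      incidentFilesDelta: cur.incidentFiles - prev.incidentFiles,
      hotspotRecallDelta: Number(
        (cur.hotspotRecall20 - prev.hotspotRecall20).toFixed(3)
      ),
    });
  }
  return deltas;
}

const history: QuarterRecall[] = [
  { quarter: "Q1", incidentFiles: 30, hotspotRecall20: 0.5, complexityRecall20: 0.2 },
  { quarter: "Q2", incidentFiles: 22, hotspotRecall20: 0.41, complexityRecall20: 0.18 },
];

for (const d of quarterOverQuarter(history)) {
  console.log(
    `${d.quarter}: incident files ${d.incidentFilesDelta}, ` +
      `hotspot Recall@20 ${d.hotspotRecallDelta}`
  );
}
```

Tracking the size of the incident set next to the recall is a deliberate choice here: recall on its own cannot distinguish "fewer incidents overall" from "the same number of incidents in different files".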

Signal	What it sees	Weakness alone	Operational role
Cyclomatic/cognitive complexity	Readability, branch density	Blind to frequency and ripple	Effort estimate at start
Fan-in	Inbound refs (ripple breadth)	Stable, untouched modules rank high too	Weighting impact radius
Churn	Recent change frequency	Spikes on big new drops	Filtering to "live" debt
Fan-in x churn	Broad-ripple, active files	Independent of incident history	Spine of the repay order
Incident reconciliation (Recall@N)	Whether the score caught real failures	Depends on past record quality	Validating the score itself

Correct the agent's prioritization with incident history

When you hand the scoring to Antigravity's architecture agent via AGENTS.md, feeding it static scores alone makes it obediently order proposals by "highest complexity first." Add the hotspot and reconciliation results as extra context, and the repay order it proposes moves closer to reality.

Concretely, I spelled out two things in the agent's instructions: "use the hotspot score, not the complexity score, as the primary key, and treat complexity only as a tiebreaker," and "for any quarter where the hotspot side falls below the complexity side on Recall@20, propose revisiting the reconciliation logic before proposing code changes."

# Architecture Analyst Agent (excerpt, revised)
 
## Prioritization Rules
- Primary key for refactoring order is hotspotScore in hotspots.json.
- Complexity (debt-scores.json) is a tiebreaker only; never orders alone.
- Before proposing, read the reconcile report; if hotspot Recall@20 is
  below complexity's, propose fixing the measurement logic before code.
- Attach "fanIn / churn / recent-incident?" as evidence for each item.
 
## Output
- Repay roadmap -> .architecture-analysis/reports/refactor-roadmap.md
- Each item includes an effort estimate and the dependency-ripple file list

What changed with this revision is that the agent started putting "files whose fixing reduces incidents" at the top, rather than "files that feel satisfying to clean." The same tool produces the score; the quality of the proposal depends heavily on what you make the primary key.
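The two rules are simple enough to state as code, which can help when checking that the agent's roadmap actually follows them. This is a minimal sketch under assumptions: the Candidate shape, the complexity field name, and the action labels are hypothetical and not part of Antigravity or of the scripts above.

```typescript
// Sketch of the two prioritization rules as plain functions.
// Field names other than hotspotScore are hypothetical.
interface Candidate {
  file: string;
  hotspotScore: number;
  complexity: number; // assumed to come from debt-scores.json
}

type NextAction = "fix-measurement" | "propose-refactors";

// Rule 1: hotspotScore is the primary key; complexity only breaks ties.
function byRepayOrder(a: Candidate, b: Candidate): number {
  if (b.hotspotScore !== a.hotspotScore) return b.hotspotScore - a.hotspotScore;
  return b.complexity - a.complexity;
}

// Rule 2: if the hotspot ranking recalls fewer incident files than the
// complexity ranking, revisit the measurement before proposing code changes.
function nextAction(hotspotRecall20: number, complexityRecall20: number): NextAction {
  return hotspotRecall20 < complexityRecall20 ? "fix-measurement" : "propose-refactors";
}

// Invented sample data: a.ts and c.ts tie on hotspotScore.
const candidates: Candidate[] = [
  { file: "a.ts", hotspotScore: 20.4, complexity: 6 },
  { file: "b.ts", hotspotScore: 23.3, complexity: 4 },
  { file: "c.ts", hotspotScore: 20.4, complexity: 11 },
];

const ordered = [...candidates].sort(byRepayOrder).map((c) => c.file);
console.log(ordered);                 // b.ts first, then c.ts ahead of a.ts on the tie
console.log(nextAction(0.5, 0.2));    // propose-refactors
console.log(nextAction(0.15, 0.2));   // fix-measurement
```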

What running it taught me to watch for

Churn is a double-edged sword. Right after a burst of new feature work, that area's churn spikes and dominates the hotspots. But that only means development is active, not that the code is fragile. I apply a temporary decay factor to a region for the first two weeks after a big feature drop, then return it to normal weighting once it settles. Without this, the agent proposes refactoring code you are still actively building, which clashes with what the team is actually doing.
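A decay like this can be sketched as a small wrapper around the commit count. The two-week window comes from the text; the 0.5 factor, the FeatureDrop shape, and matching regions by path prefix are assumptions made for illustration.

```typescript
// Dampen churn for a region while it settles after a big feature drop.
// SETTLE_DAYS follows the text; DECAY_FACTOR and FeatureDrop are assumptions.
interface FeatureDrop {
  pathPrefix: string; // region affected, e.g. "src/checkout/"
  landedAt: Date;     // when the drop was merged
}

const SETTLE_DAYS = 14;
const DECAY_FACTOR = 0.5; // assumed value; tune against your own incidents

function effectiveChurn(
  file: string,
  commits: number,
  drops: FeatureDrop[],
  now: Date
): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  // A file is "settling" if it sits under a region whose drop is still recent.
  const settling = drops.some((d) => {
    const ageDays = (now.getTime() - d.landedAt.getTime()) / msPerDay;
    return file.startsWith(d.pathPrefix) && ageDays >= 0 && ageDays < SETTLE_DAYS;
  });
  return settling ? commits * DECAY_FACTOR : commits;
}

const drops: FeatureDrop[] = [
  { pathPrefix: "src/checkout/", landedAt: new Date("2025-03-01T00:00:00Z") },
];

const oneWeekLater = new Date("2025-03-08T00:00:00Z");
const threeWeeksLater = new Date("2025-03-20T00:00:00Z");

console.log(effectiveChurn("src/checkout/cart.ts", 20, drops, oneWeekLater));    // 10
console.log(effectiveChurn("src/checkout/cart.ts", 20, drops, threeWeeksLater)); // 20
console.log(effectiveChurn("src/auth/session.ts", 20, drops, oneWeekLater));     // 20
```

In the hotspot script, the result would stand in for c.commits inside the Math.log2 term, so a settling region is damped rather than removed from the ranking.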

Incident-reconciliation accuracy is bound to the quality of your past commit hygiene. Failures from before you started adding incident/ trailers simply cannot be picked up. Early on, watch the quarter-over-quarter trend rather than the absolute recall. The more records accumulate, the better reconciliation works.

And I treat the score as a starting point for a conversation, nothing more. Open the top hotspot files and confirm the team nods that "yes, these really do scare us every time." If the numbers disagree with the instinct of the people who work in the code, the thing to doubt is the measurement, not the people. Running several services in parallel as an indie developer, I am always tempted to start from "the red items in the report," so I make a point of periodically inspecting how the red gets assigned in the first place.

As a next step, run git log --numstat once on the project in front of you and write out the top ten files by fan-in multiplied by the last 90 days of churn. See how much those ten overlap with the files raised in your most recent incident review. That overlap is the first answer to whether your score is catching real risk at all.
