Your Code Gets Refactored While You Sleep — Building an Overnight Automation System with Antigravity Background Agent

One morning I opened my project terminal and found code I'd written the day before looking noticeably cleaner. Japanese comments had been added throughout, missing error handling was filled in, and oversized functions were split apart. I hadn't done any of it. The Background Agent had been running while I was asleep.

It didn't go smoothly at first. My initial attempt ended with the agent refactoring things in directions I hadn't intended, and I spent my first hour running git reset --hard. After rethinking the design, I now wake up every morning to a "reviewed code" branch waiting for my approval.

Why Background Agent Is Built for Overnight Work

Antigravity's Background Agent runs in a separate window from your active coding session. Most tutorials show it being used to write tests while you're working on a PR, but I've found it delivers its real value as an overnight worker. Here's why.

Interaction cost drops to zero. Background Agent works like email — you hand off a task and check results in the morning. No mid-process babysitting required.

It doesn't compete with your active session. Background Agent runs in its own sandbox, so overnight jobs don't touch your development environment while you sleep.

Context stays clean. Starting a fresh session means the agent evaluates your code without inheriting any baggage from a complex afternoon debugging session.

Understanding How Background Agent Actually Works

Before designing overnight automation, it's worth having an accurate mental model of how Background Agent behaves.

# Check running Background Agent sessions via Antigravity CLI
antigravity sessions list --status=running
 
# Example output:
# SESSION_ID         STATUS    STARTED     TASK
# bg-a1b2c3d4        running   02:14 JST   overnight-review
# bg-e5f6g7h8        idle      -           -

One important limitation: if you ask an agent to process too many files in a single session, it will hit the context limit and stop mid-way. This was the root cause of most of my early failures.

Task Design: The Right Granularity

The most consequential design decision in overnight automation is task granularity.

"Review the whole project" doesn't work. Beyond the context limit problem, the agent has no clear definition of "done" and may either run indefinitely or stop at an arbitrary point.

My solution: one task = one file or one functional module.

# .antigravity/tasks/nightly-review.md
 
## Overview
Improve code quality in the specified file and push changes to the review branch.
 
## Inputs
- FILE_PATH: {{target_file}}
- REVIEW_TYPE: {{review_type}}  # "comment" | "refactor" | "test"
 
## Steps
 
### When review_type = "comment"
1. Read the file
2. Evaluate the code from these angles:
   - Do function and variable names clearly express their intent?
   - Are complex sections explained with comments?
   - Is error handling present and appropriate?
3. Add explanatory comments (do not modify existing logic)
4. Stage the changes (git add)
 
### When review_type = "refactor"
1. Read the file
2. Identify the following issues:
   - Functions exceeding 50 lines
   - Duplicated logic
   - Magic numbers (candidates for named constants)
3. For each issue, list: original code / improved code / reason for change
4. Before applying any changes, verify:
   - Existing tests pass (run `npm test`)
   - Total changed lines are under 20
5. Only commit changes where tests pass
 
## Hard Constraints
- Do not delete files
- Do not add new import statements (suggest in comments only)
- If changes would exceed 50 lines, stop and produce a report only

The Hard Constraints section is the most important part. Agents aim for their best interpretation of your instructions, but that best guess can diverge from your intent. Stating what you don't want upfront prevents unintended large-scale changes.
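
Prompt-level constraints are advisory, so it can help to check the size limit outside the agent as well. The sketch below is one possible way to do that, assuming the agent's changes are still uncommitted in the working tree. It sums the added and deleted columns of `git diff --numstat` output and compares the total with the 50-line limit from the task definition. The function names (`total_changed_lines`, `exceeds_limit`, `working_tree_numstat`) are illustrative and not part of Antigravity.

```python
import subprocess


def total_changed_lines(numstat_output: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a numstat row
        # Binary files show "-" in both count columns; they add nothing here
        total += sum(int(p) for p in parts[:2] if p.isdigit())
    return total


def exceeds_limit(numstat_output: str, limit: int = 50) -> bool:
    """True when the diff is larger than the limit in the task definition."""
    return total_changed_lines(numstat_output) > limit


def working_tree_numstat() -> str:
    """Read the numstat for uncommitted changes (requires a git repository)."""
    result = subprocess.run(
        ["git", "diff", "--numstat", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Demo on a hand-written sample instead of a live repository
sample = "12\t3\tsrc/app.py\n-\t-\tassets/logo.png\n40\t0\tsrc/util.ts\n"
print(total_changed_lines(sample))  # 55
print(exceeds_limit(sample))        # True
```

In a real run, `exceeds_limit(working_tree_numstat())` could gate the commit step, so an oversized change is caught even if the agent ignored the written rule.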

Implementing the Automation Script

With the task definition in place, the next step is a script that automatically schedules Background Agent sessions for changed files.

#!/usr/bin/env python3
"""
nightly_review.py
Run at the end of a dev session to schedule Background Agent on changed files.
 
Usage:
  python nightly_review.py --review-type comment
  python nightly_review.py --review-type refactor --since "24h"
"""
 
import subprocess
import sys
import json
import argparse
from pathlib import Path
from datetime import datetime
 
 
def get_changed_files(since: str = "24h") -> list[str]:
    """Get Python/TypeScript/JavaScript files changed in the most recent commit.

    NOTE: `since` is accepted for CLI compatibility but is not applied yet;
    the diff below only compares HEAD~1 with HEAD.
    """
    try:
        result = subprocess.run(
            ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
            capture_output=True,
            text=True,
            check=True
        )
        files = result.stdout.strip().split("\n")

        target_extensions = {".py", ".ts", ".tsx", ".js", ".jsx"}
        filtered = [
            f for f in files
            if Path(f).suffix in target_extensions
            and Path(f).exists()       # exclude deleted files
            # Match on the file name, not the full path, so tests in subdirectories are skipped too
            and not Path(f).name.startswith("test_")
            and "generated" not in f   # skip auto-generated files
        ]
        return filtered

    except subprocess.CalledProcessError as e:
        print(f"❌ git diff failed: {e.stderr}", file=sys.stderr)
        return []
 
 
def schedule_background_agent(file_path: str, review_type: str) -> dict:
    """
    Schedule a Background Agent session for the given file.
    
    Returns:
        {"session_id": str, "status": "scheduled" | "failed", "file": str}
    """
    task_prompt = f"""
    Please run a '{review_type}' review on the following file.
    
    File: {file_path}
    Task definition: refer to .antigravity/tasks/nightly-review.md
    
    When done, output JSON in this format:
    {{
      "file": "{file_path}",
      "changes_made": true/false,
      "summary": "Summary of changes",
      "issues_found": ["issue 1", "issue 2"],
      "skipped_reasons": []
    }}
    """
 
    try:
        result = subprocess.run(
            [
                "antigravity", "agent", "run",
                "--background",
                "--task", task_prompt,
                "--workspace", str(Path.cwd()),
                "--output-format", "json"
            ],
            capture_output=True,
            text=True,
            timeout=10
        )
 
        if result.returncode != 0:
            return {
                "session_id": None,
                "status": "failed",
                "file": file_path,
                "error": result.stderr
            }
 
        session_info = json.loads(result.stdout)
        return {
            "session_id": session_info.get("session_id"),
            "status": "scheduled",
            "file": file_path
        }
 
    except subprocess.TimeoutExpired:
        return {"session_id": None, "status": "failed", "file": file_path, "error": "timeout"}
    except json.JSONDecodeError as e:
        return {"session_id": None, "status": "failed", "file": file_path, "error": f"JSON parse error: {e}"}
 
 
def main():
    parser = argparse.ArgumentParser(description="Schedule overnight code review")
    parser.add_argument("--review-type", choices=["comment", "refactor", "test"], default="comment")
    parser.add_argument("--since", default="24h")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
 
    files = get_changed_files(args.since)
 
    if not files:
        print("📭 No changed files — skipping schedule")
        return
 
    print(f"📋 Target files: {len(files)}")
    for f in files:
        print(f"  - {f}")
 
    if args.dry_run:
        print("\n🔍 dry-run mode — no sessions will be scheduled")
        return
 
    # Cap sessions per run to avoid exhausting the quota overnight;
    # files beyond the cap are not scheduled tonight
    MAX_CONCURRENT = 3
    results = []
 
    for i, file_path in enumerate(files[:MAX_CONCURRENT]):
        print(f"\n🤖 Scheduling ({i+1}/{min(len(files), MAX_CONCURRENT)}): {file_path}")
        result = schedule_background_agent(file_path, args.review_type)
        results.append(result)
 
        if result["status"] == "scheduled":
            print(f"  ✅ Session: {result['session_id']}")
        else:
            print(f"  ❌ Failed: {result.get('error', 'unknown error')}")
 
    # Save report for morning review
    report_path = Path(".antigravity/nightly-report.json")
    report_path.parent.mkdir(exist_ok=True)
    with open(report_path, "w") as f:
        json.dump({
            "scheduled_at": datetime.now().isoformat(),
            "review_type": args.review_type,
            "sessions": results
        }, f, indent=2)
 
    print(f"\n📄 Report saved: {report_path}")
 
 
if __name__ == "__main__":
    main()

The key design choice here is capping concurrent sessions at 3. Spinning up too many Background Agents overnight will drain your Antigravity quota before your morning coffee. Three sessions per night leaves enough quota for productive daytime development.
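
One side effect of the cap is that the script above schedules only the first three files and silently drops the rest. A minimal sketch of one way to handle that, assuming you are willing to carry the leftover files into the next night's run: `split_batch` and `next_queue` are hypothetical helpers, not part of the original script.

```python
def split_batch(files: list[str], limit: int = 3) -> tuple[list[str], list[str]]:
    """Return (tonight, deferred), preserving order and dropping duplicates."""
    unique = list(dict.fromkeys(files))
    return unique[:limit], unique[limit:]


def next_queue(deferred_last_night: list[str], changed_today: list[str]) -> list[str]:
    """Put older deferred files first so new changes do not starve them."""
    return list(dict.fromkeys(deferred_last_night + changed_today))


tonight, deferred = split_batch(["a.py", "b.ts", "a.py", "c.py", "d.py"])
print(tonight)   # ['a.py', 'b.ts', 'c.py']
print(deferred)  # ['d.py']
print(next_queue(deferred, ["e.py", "a.py"]))  # ['d.py', 'e.py', 'a.py']
```

The deferred list could be stored alongside the session results in the nightly report (for example under a `deferred` key, which is an assumption here) and read back on the next run.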

Safety Mechanisms: Two-Stage Rollback Design

The nightmare scenario in overnight automation is waking up to unfamiliar code that's somehow already committed. Two layers of protection prevent this.

Layer 1: Task definition constraints

The hard constraints in nightly-review.md act as guardrails. The 50-line change limit was the single most effective rule I added — after introducing it, large-scale unintended refactors stopped entirely.

Layer 2: Automated verification and rollback

#!/bin/bash
# verify_nightly_changes.sh
# Verify overnight changes and roll back if tests fail.
 
set -e
 
REPORT_FILE=".antigravity/nightly-report.json"
REVIEW_BRANCH="nightly-review-$(date +%Y%m%d)"
 
echo "🔍 Verifying overnight changes..."
 
if [ ! -f "$REPORT_FILE" ]; then
    echo "📭 No report found — no overnight tasks ran"
    exit 0
fi
 
CHANGED_FILES=$(git diff --name-only HEAD)
 
if [ -z "$CHANGED_FILES" ]; then
    echo "📭 No changes detected"
    exit 0
fi
 
echo "📝 Changed files:"
echo "$CHANGED_FILES"
 
echo ""
echo "🧪 Running tests..."
 
if ! npm test --silent 2>&1; then
    echo ""
    echo "❌ Tests failed, rolling back overnight changes"

    # Stash instead of discarding, so you can still review what was proposed
    STASH_NAME="nightly-review-$(date +%Y%m%d)-failed"
    git stash push -m "$STASH_NAME"

    # Merge the result into the report with jq; appending a second object
    # to the file would leave it as invalid JSON for morning_report.py
    jq --arg at "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --arg stash "$STASH_NAME" \
        '. + {verification: "FAILED", rollback_at: $at, stash_ref: $stash,
              message: "Rolled back due to failing tests. Run git stash pop to restore changes for manual review."}' \
        "$REPORT_FILE" > "$REPORT_FILE.tmp" && mv "$REPORT_FILE.tmp" "$REPORT_FILE"
 
    echo "💾 Changes saved to git stash"
    echo "   To review: git stash pop"
    exit 1
fi
 
echo ""
echo "✅ Tests passed — committing to review branch"
 
git checkout -b "$REVIEW_BRANCH" 2>/dev/null || git checkout "$REVIEW_BRANCH"
git add -A
git commit -m "chore: nightly auto-review $(date +%Y/%m/%d)
 
Automated review by Background Agent
 
$(git diff --cached --stat)
 
Generated by: nightly_review.py
Review type: $(jq -r '.review_type' "$REPORT_FILE" 2>/dev/null || echo 'unknown')
"
 
echo ""
echo "🎉 Done!"
echo "   Branch: $REVIEW_BRANCH"
echo "   Review: git checkout $REVIEW_BRANCH && git diff main"

Note that failed changes are stashed, not discarded. Even when tests don't pass, the agent's suggestions are worth reading — you might agree with the direction while rejecting the specific implementation.

Common Pitfalls and How to Debug Them

Three months of running this system taught me which problems recur most often.

Pitfall 1: Ambiguous task instructions

The most frequent cause of unexpected behavior. "Improve the code" gives the agent room to do almost anything — add imports, restructure files, change architecture. The fix is task definitions written as concrete action lists starting with verbs, with explicit judgment criteria included.

Pitfall 2: Context window overflow

Large files (500+ lines) processed in a single session will hit the context limit mid-way, leaving changes in an incomplete state.

The solution is a pre-flight check before scheduling:

def is_suitable_for_overnight_review(file_path: str) -> tuple[bool, str]:
    """
    Returns (is_suitable, reason)
    """
    path = Path(file_path)
 
    try:
        content = path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as e:
        # OSError covers missing files and permission errors; files that are not UTF-8 are skipped too
        return False, f"File read error: {e}"
 
    lines = content.split("\n")
    if len(lines) > 400:
        return False, f"Too long ({len(lines)} lines, limit 400)"
 
    # Skip auto-generated files
    auto_gen_markers = [
        "# DO NOT EDIT",
        "// This file is auto-generated",
        "// Code generated by",
        "/* eslint-disable */"
    ]
    header = "\n".join(lines[:10])
    for marker in auto_gen_markers:
        if marker in header:
            return False, f"Auto-generated file ({marker})"
 
    # Skip config files
    config_patterns = ["config", "settings", "constants"]
    if any(p in path.name.lower() for p in config_patterns):
        return False, "Config file"
 
    return True, "Suitable"
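
To wire a check like this into the scheduling step, one option is to split the candidate files into accepted and rejected sets before any session is started, keeping the rejection reasons for the morning report. The sketch below assumes a check function with the same `(is_suitable, reason)` return shape as above; `partition_files` and `demo_check` are hypothetical names used only for illustration.

```python
from typing import Callable

# Any function that returns (is_suitable, reason) for a path
SuitabilityCheck = Callable[[str], tuple[bool, str]]


def partition_files(
    files: list[str], check: SuitabilityCheck
) -> tuple[list[str], dict[str, str]]:
    """Split files into those to schedule and those skipped, with reasons."""
    accepted: list[str] = []
    rejected: dict[str, str] = {}
    for path in files:
        ok, reason = check(path)
        if ok:
            accepted.append(path)
        else:
            rejected[path] = reason
    return accepted, rejected


def demo_check(path: str) -> tuple[bool, str]:
    """Stand-in for the real pre-flight check, so the demo needs no files."""
    if "config" in path:
        return False, "Config file"
    return True, "Suitable"


accepted, rejected = partition_files(["src/api.py", "src/config.py"], demo_check)
print(accepted)  # ['src/api.py']
print(rejected)  # {'src/config.py': 'Config file'}
```

Passing the real pre-flight function in place of `demo_check` would keep unsuitable files from consuming one of the nightly session slots.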

Pitfall 3: Changes committed directly to main

Early on I was running agents on the main branch, so their changes went straight to production history. Now I enforce review-specific branches in the task definition:

## Branch Policy
- Create branch `nightly-review-{YYYYMMDD}` before starting
- Direct commits to main are prohibited
- All changes must land on the review branch

Pitfall 4: Quota exhaustion

In my first week, running 5 concurrent sessions overnight used up the entire day's quota before I even opened my laptop. Capping at 3 concurrent sessions and scoping targets to files I personally wrote in the last 24 hours solved the problem entirely.

GitHub Actions Integration

Running these scripts manually every evening defeats the purpose. Here's the GitHub Actions workflow that automates the scheduling:

# .github/workflows/nightly-review.yml
name: Nightly Code Review
 
on:
  schedule:
    # 14:00 UTC = 23:00 JST — after the typical end of a dev session
    - cron: '0 14 * * *'
  workflow_dispatch:
 
jobs:
  nightly-review:
    runs-on: ubuntu-latest
 
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 2
 
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
 
      - name: Check for changes
        id: check-changes
        run: |
          CHANGED=$(git diff --name-only HEAD~1 HEAD -- '*.py' '*.ts' '*.tsx' '*.js' '*.jsx' | wc -l)
          echo "changed_count=$CHANGED" >> $GITHUB_OUTPUT
          echo "skip=$([ "$CHANGED" -eq 0 ] && echo true || echo false)" >> $GITHUB_OUTPUT
 
      - name: Schedule Background Agent
        if: steps.check-changes.outputs.skip == 'false'
        env:
          ANTIGRAVITY_API_KEY: ${{ secrets.ANTIGRAVITY_API_KEY }}
        run: |
          python nightly_review.py \
            --review-type comment \
            --since 24h
 
      - name: Write summary
        if: always()
        run: |
          if [ -f ".antigravity/nightly-report.json" ]; then
            echo "## Nightly Review Results" >> $GITHUB_STEP_SUMMARY
            python -c "
          import json, sys
          data = json.load(open('.antigravity/nightly-report.json'))
          sessions = data.get('sessions', [])
          print(f'Scheduled: {len([s for s in sessions if s[\"status\"] == \"scheduled\"])}')
          print(f'Failed: {len([s for s in sessions if s[\"status\"] == \"failed\"])}')
          " >> $GITHUB_STEP_SUMMARY
          fi

Set ANTIGRAVITY_API_KEY to your Antigravity API key in GitHub Secrets (placeholder: YOUR_ANTIGRAVITY_API_KEY).

30 Days of Production Data

After running this system for 30 days, patterns emerged clearly.

Where the agent consistently adds value: adding explanatory comments, flagging variable name improvements (I apply the actual renames myself after reviewing), identifying duplicate logic, and surfacing missing error handling. These are incremental but compound meaningfully over weeks.

Where the agent falls short: architecture-level decisions, anything involving business logic, performance tuning, and refactoring test code itself. These require human judgment.

The most unexpected finding was that taking one or two nights off per week actually improved productivity. When a review branch lands every single morning, the review overhead starts to feel like a chore. Front-loading automation early in the week, then spending the second half reading your own code at a slower pace, turned out to be a better rhythm.

For the foundational Background Agent concepts, see the practical Background Agent guide. For more advanced orchestration patterns, the advanced Background Agent guide covers parallel execution and CI/CD integration in depth.

Your Next Step

Overnight automation is not about having AI write your code — it's about arriving at work in the morning with better context for the day ahead.

Start with one experiment tonight: pick one file you worked on today, run antigravity agent run --background with the prompt "add explanatory comments to this file," and see what you find tomorrow morning.

That small test will tell you more about where this fits into your workflow than any amount of reading.

Deep Dive: Designing Effective Task Prompts for Overnight Use

After 30 days of testing different prompt structures, I arrived at a consistent pattern that yields reliable results. The key insight is that Background Agent performs best when given concrete judgment criteria rather than abstract goals.

The ACRE Framework for Task Prompts

I call the approach I settled on "ACRE": Action, Constraint, Result format, Exit condition.

# ACRE Task Template
 
## Action
[Precise verb + specific target]
Example: "Add inline comments to functions longer than 10 lines in {{FILE_PATH}}"
 
## Constraints
[What NOT to do — be exhaustive]
- Do not modify function signatures
- Do not change variable names
- Do not add new dependencies
- If the change would exceed 20 lines, stop and report instead
 
## Result format
[Exact output expected]
Output a JSON object:
{
  "file": "path/to/file",
  "comments_added": 5,
  "functions_annotated": ["functionA", "functionB"],
  "skipped": [],
  "concerns": ["optional list of things a human should review"]
}
 
## Exit conditions
[When to stop]
- Tests fail after a change → revert that change and continue to next item
- File exceeds 400 lines → skip with reason "file_too_large"
- Any required import not already in the file → skip and report

The concerns field in the result format is something I added after a few weeks. Agents often notice issues they can't fix within the constraints — naming inconsistencies, potential null pointer situations, logic that looks like it might have a bug. Having a structured place to surface these observations turned out to be one of the most valuable parts of the system.
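
Since the agent's output is free text until proven otherwise, it may be worth validating it against the result format before trusting any field. The sketch below checks the five fields from the ACRE template above using only the standard library; `validate_result` and `EXPECTED_FIELDS` are hypothetical names, and the type expectations are inferred from the example JSON rather than from any Antigravity specification.

```python
import json
from typing import Optional

# Field names and types taken from the ACRE result format example
EXPECTED_FIELDS = {
    "file": str,
    "comments_added": int,
    "functions_annotated": list,
    "skipped": list,
    "concerns": list,
}


def validate_result(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse agent output and return (data, problems); data is None if unparseable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]

    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly for counts
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return data, problems


raw = '{"file": "src/app.py", "comments_added": 5, "functions_annotated": [], "skipped": []}'
data, problems = validate_result(raw)
print(problems)  # ['missing field: concerns']
```

An empty `problems` list means the output matches the template; anything else can be shown in the morning report instead of being silently ignored.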

Writing Constraints That Actually Stick

Constraints work better when written as specific examples rather than abstract rules.

Compare these two versions:

❌ Less effective: "Don't make large changes."

✅ More effective: "If the total number of added or removed lines would exceed 20, stop processing this file and output: { 'skipped': true, 'reason': 'change_too_large', 'proposed_change_size': N }"

The second version eliminates ambiguity about what "large" means and tells the agent exactly what to output when the limit is hit. Agents are much better at following rules when the expected output for the failure case is specified alongside the rule itself.
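
One way to keep the rule and its check in sync is to generate the constraint text from the same number the verifier uses. This is a small sketch under that assumption: `line_limit_constraint` renders the wording shown above (with the failure output written as valid JSON), and `is_consistent_skip` checks that a skip report from the agent actually matches the rule. Both names are hypothetical.

```python
def line_limit_constraint(limit: int) -> str:
    """Render the line-limit rule together with its expected failure output."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        f"If the total number of added or removed lines would exceed {limit}, "
        "stop processing this file and output: "
        '{"skipped": true, "reason": "change_too_large", "proposed_change_size": N}'
    )


def is_consistent_skip(payload: dict, limit: int) -> bool:
    """True when a skip report has the agreed shape and a size above the limit."""
    size = payload.get("proposed_change_size")
    return (
        payload.get("skipped") is True
        and payload.get("reason") == "change_too_large"
        and isinstance(size, int)
        and not isinstance(size, bool)
        and size > limit
    )


print(line_limit_constraint(20))
report = {"skipped": True, "reason": "change_too_large", "proposed_change_size": 34}
print(is_consistent_skip(report, 20))  # True
```

A skip report claiming a size at or below the limit is a sign that the agent misread the rule, which is worth surfacing rather than accepting.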

Monitoring and Observability: Knowing What Happened While You Slept

A system that runs overnight without any observability is a black box — and black boxes are hard to trust. I built a simple morning dashboard that aggregates the results from Background Agent sessions.

#!/usr/bin/env python3
"""
morning_report.py
Summarize overnight Background Agent activity. Run this first thing in the morning.
"""
 
import json
import subprocess
from pathlib import Path
from datetime import datetime, date
from typing import Optional
 
 
def load_nightly_report() -> Optional[dict]:
    """Load the nightly report generated by nightly_review.py"""
    report_path = Path(".antigravity/nightly-report.json")
    if not report_path.exists():
        return None
    try:
        return json.loads(report_path.read_text())
    except json.JSONDecodeError:
        return None
 
 
def get_session_results(session_id: str) -> Optional[dict]:
    """
    Fetch the output of a completed Background Agent session.
    Returns None if the session hasn't completed or output isn't parseable JSON.
    """
    try:
        result = subprocess.run(
            ["antigravity", "sessions", "output", session_id, "--format", "json"],
            capture_output=True,
            text=True,
            timeout=5
        )
        if result.returncode != 0:
            return None
        return json.loads(result.stdout)
    except (subprocess.TimeoutExpired, json.JSONDecodeError):
        return None
 
 
def summarize_git_changes() -> dict:
    """Summarize what changed in the git history overnight."""
    try:
        # Get commits since midnight
        since_midnight = datetime.combine(date.today(), datetime.min.time()).isoformat()
        log_result = subprocess.run(
            ["git", "log", f"--since={since_midnight}", "--oneline"],
            capture_output=True,
            text=True,
            check=True
        )
        commits = [l for l in log_result.stdout.strip().split("\n") if l]
 
        # Get changed files
        diff_result = subprocess.run(
            ["git", "diff", "--stat", "HEAD~1", "HEAD"],
            capture_output=True,
            text=True,
            check=True
        )
 
        return {
            "commits_overnight": len(commits),
            "commit_messages": commits,
            "diff_summary": diff_result.stdout.strip()
        }
    except subprocess.CalledProcessError:
        return {"commits_overnight": 0, "commit_messages": [], "diff_summary": ""}
 
 
def main():
    print("=" * 60)
    print(f"🌅 Morning Report — {date.today().strftime('%Y/%m/%d')}")
    print("=" * 60)
 
    report = load_nightly_report()
 
    if report is None:
        print("\n📭 No overnight activity found.")
        print("   Either no changes were detected, or the workflow didn't run.")
        return
 
    scheduled_at = report.get("scheduled_at", "unknown")
    review_type = report.get("review_type", "unknown")
    sessions = report.get("sessions", [])
 
    print(f"\n⏰ Scheduled at: {scheduled_at}")
    print(f"📋 Review type: {review_type}")
    print(f"🤖 Sessions: {len(sessions)}")
 
    successful = [s for s in sessions if s["status"] == "scheduled"]
    failed = [s for s in sessions if s["status"] == "failed"]
 
    if failed:
        print(f"\n⚠️  {len(failed)} session(s) failed to start:")
        for s in failed:
            print(f"   - {s['file']}: {s.get('error', 'unknown error')}")
 
    # Fetch and display session results
    print("\n📊 Session Results:")
    all_concerns = []
 
    for session in successful:
        session_id = session.get("session_id")
        file_path = session.get("file", "unknown")
 
        result = get_session_results(session_id) if session_id else None
 
        if result is None:
            print(f"\n  [{file_path}]")
            print(f"    Status: Session {session_id} — output not yet available")
            continue
 
        print(f"\n  [{file_path}]")
        print(f"    Changes made: {result.get('changes_made', '?')}")
 
        if summary := result.get("summary"):
            print(f"    Summary: {summary}")
 
        if issues := result.get("issues_found", []):
            print(f"    Issues found: {len(issues)}")
            for issue in issues[:3]:  # Show top 3
                print(f"      • {issue}")
 
        if concerns := result.get("concerns", []):
            all_concerns.extend([(file_path, c) for c in concerns])
 
    # Surface concerns that need human attention
    if all_concerns:
        print("\n🔍 Items needing human review:")
        for file_path, concern in all_concerns:
            print(f"   [{file_path}] {concern}")
 
    # Git activity summary
    git_summary = summarize_git_changes()
    if git_summary["commits_overnight"] > 0:
        print(f"\n📝 Git activity ({git_summary['commits_overnight']} commit(s)):")
        for msg in git_summary["commit_messages"]:
            print(f"   {msg}")
 
    print("\n" + "=" * 60)
    print(f"Next step: git checkout nightly-review-{date.today():%Y%m%d} && git diff main")
    print("=" * 60)
 
 
if __name__ == "__main__":
    main()

Running python morning_report.py as the first command of the day takes about 10 seconds and gives a complete picture of what happened overnight. The concerns field is particularly valuable — it surfaces things the agent noticed but couldn't fix within its constraints.

Advanced Pattern: Tiered Overnight Processing

After running the basic system for about three weeks, I introduced tiered processing: different types of review on different nights of the week.

The weekly schedule I settled on:

Monday night — "comment sweep": The lightest pass. Agents add or improve comments on everything I wrote that day. No code changes, just documentation. Lowest risk, nearly always correct.

Tuesday and Wednesday nights — "refactor candidates": Agents identify (but don't apply) refactoring opportunities and output them as suggestions. I review the suggestions over breakfast and apply the ones I agree with manually.

Thursday night — "test coverage check": Agents look for functions without test coverage and output a list of what's missing. They don't write the tests themselves — that turned out to produce unreliable results — but flagging the gaps is genuinely useful.

Friday and weekends — no automation: I read my own code on Fridays. The manual reading pass often catches things the agents miss entirely, and it helps me maintain an accurate mental model of the codebase.

Implementing the Weekly Schedule

The tiered approach is straightforward to implement in GitHub Actions:

# .github/workflows/nightly-tiered.yml
name: Tiered Nightly Review
 
on:
  schedule:
    - cron: '0 14 * * 1'    # Monday 23:00 JST — comment sweep
    - cron: '0 14 * * 2,3'  # Tue/Wed — refactor candidates
    - cron: '0 14 * * 4'    # Thursday — test coverage check
  workflow_dispatch:
    inputs:
      review_type:
        description: 'Review type'
        required: true
        default: 'comment'
        type: choice
        options:
          - comment
          - refactor
          - test
 
jobs:
  tiered-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2
 
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
 
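      - name: Determine review type from schedule
        id: review-type
        run: |
          # Manual runs use the selected input; scheduled runs derive the type from the weekday
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            echo "type=${{ github.event.inputs.review_type }}" >> $GITHUB_OUTPUT
            exit 0
          fi
          DAY=$(date +%u)  # 1=Mon, 2=Tue, ..., 7=Sun
          if [ "$DAY" = "1" ]; then
            echo "type=comment" >> $GITHUB_OUTPUT
          elif [ "$DAY" = "2" ] || [ "$DAY" = "3" ]; then
            echo "type=refactor" >> $GITHUB_OUTPUT
          elif [ "$DAY" = "4" ]; then
            echo "type=test" >> $GITHUB_OUTPUT
          else
            echo "type=skip" >> $GITHUB_OUTPUT
          fi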
 
      - name: Run nightly review
        if: steps.review-type.outputs.type != 'skip'
        env:
          ANTIGRAVITY_API_KEY: ${{ secrets.ANTIGRAVITY_API_KEY }}
        run: |
          python nightly_review.py \
            --review-type ${{ steps.review-type.outputs.type }} \
            --since 24h

The workflow_dispatch input lets you trigger any review type manually — useful when you want to run a refactor pass on a Thursday without waiting for the automated schedule.
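
If you run nightly_review.py from a local scheduler instead of GitHub Actions, the same weekday mapping can live in Python. A minimal sketch, assuming the weekly schedule described above; `SCHEDULE` and `review_type_for` are hypothetical names.

```python
from datetime import date
from typing import Optional

# date.weekday(): Monday is 0, Sunday is 6
SCHEDULE = {
    0: "comment",   # Monday: comment sweep
    1: "refactor",  # Tuesday: refactor candidates
    2: "refactor",  # Wednesday: refactor candidates
    3: "test",      # Thursday: test coverage check
}


def review_type_for(day: date) -> Optional[str]:
    """Return the review type for a given day, or None when automation is off."""
    return SCHEDULE.get(day.weekday())


print(review_type_for(date(2024, 1, 1)))  # comment (a Monday)
print(review_type_for(date(2024, 1, 5)))  # None (a Friday)
```

A `None` result corresponds to the `skip` branch in the workflow: Friday and weekend runs exit without scheduling anything.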

Calibrating the System to Your Codebase

No two codebases are the same, and the task definitions that work well for a TypeScript monorepo may not translate directly to a Python data pipeline. Here are the calibration steps I'd recommend for a new project.

Week 1: Run comment-only reviews with --dry-run first, then with actual changes. Observe what kind of comments the agent adds. Are they useful? Too verbose? Focused on the wrong things? Adjust the task definition accordingly.

Week 2: Introduce refactor candidates — but have the agent output proposals only, not apply them. Review 10-20 proposals manually. This calibrates your expectations and helps you tighten the constraints.

Week 3: If you're confident in the constraints, turn on automatic application of refactor changes (with the safety rollback script in place).

Week 4 onward: Tune the file selection criteria. You'll notice patterns — certain file types or modules that consistently produce good suggestions, and others that don't. Narrowing the target set improves both quality and quota efficiency.

When to Not Use Overnight Automation

This system has real limitations, and I'd rather be honest about them than oversell the approach.

Don't use it on files undergoing active design changes. If you're in the middle of a major architectural shift, automated overnight refactoring on those files will generate noise at best and conflicts at worst. Disable automation on in-flight modules using a simple flag file:

# In get_changed_files(), add this check:
AUTOMATION_SKIP_FILE = Path(".antigravity/skip-automation")
if AUTOMATION_SKIP_FILE.exists():
    # Drop blank lines: an empty pattern is a substring of every path and would skip all files
    skip_patterns = [
        p.strip() for p in AUTOMATION_SKIP_FILE.read_text().splitlines() if p.strip()
    ]
    filtered = [f for f in filtered if not any(p in f for p in skip_patterns)]

Don't expect it to catch security vulnerabilities. The agent is good at code style and structural issues. Security review requires specialized prompting and should be a separate, explicitly designed task — not a side effect of the nightly comment sweep.

Don't rely on it for business logic review. The agent doesn't know your product requirements, your users, or your domain constraints. It can tell you that a function is complex; it cannot tell you whether the complexity is justified.

The clearest mental model I've found: Background Agent overnight automation is like having a very thorough code linter that can also write in prose. Invaluable for what it covers. Not a substitute for a human reviewer who understands the product.

Wrapping up

The most important thing I've learned from building and running this system is that the setup work is front-loaded, but the ongoing maintenance is minimal. Spend an afternoon designing your task definitions carefully, test them thoroughly with --dry-run, and the system largely runs itself.

If there's one thing worth getting right before you ship it to GitHub Actions, it's the rollback mechanism. The stash-rather-than-discard approach has saved me multiple times — not because the agent made catastrophic changes, but because stashed changes often contain useful observations I'd otherwise lose.

For more on Background Agent fundamentals, the practical guide is the best starting point. For production-scale patterns including parallel agent orchestration, see the advanced guide.

Start tonight with a single file and a single --background session. That's all it takes to see whether this fits your workflow.