ANTIGRAVITY LAB
AI Tools · 2026-06-14 · Intermediate

Routing thinking_level by Task in Gemini 3.5 Flash — Re-measuring My Token Spend

After Gemini 3.5 Flash became Antigravity's default, my thinking tokens crept up quietly. Here's how I measured thinking_level per task type and landed on a setup that cuts spend without losing accuracy.

Tags: Gemini 3.5 Flash, Antigravity, thinking_level, Token Optimization, Indie Developer, Cost Optimization

The week Gemini 3.5 Flash became the default Flash model in Antigravity 2.0, my monthly token-usage graph started climbing a little earlier than usual.

The coding felt just as fast. But when I opened the breakdown, the share of "thinking tokens" was clearly higher than before. I had been using Flash as my mental shorthand for "the fast model," so that mismatch nagged at me.

Running four sites in parallel as an indie developer, I fire agents hundreds of times a day. Even a tiny per-call difference adds up to a real number by month's end. So I sat down to re-measure thinking_level on a task-by-task basis. This article is that record — the measurements, and the setup I eventually settled on.

Tuning thinking_level Pays Off Most on Tasks That Don't Need Thinking

Introduced with the Gemini 3 line, thinking_level sets, in discrete steps, how much internal reasoning the model does before it answers. The two main values are low and high, and you can set them from Antigravity's model settings or through thinking_config in the SDK. The 2.5-era thinking_budget (a direct token count) still exists alongside it, and I switch between the two depending on the case.

The easy thing to miss: thinking tokens are billed as output tokens. Even if the final answer is three lines, if the model "thought" for 2,000 tokens beforehand, that lands on the bill.
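
To make that arithmetic concrete, here is a minimal sketch. The price constant and the 60-token size for a "three-line answer" are placeholders chosen for illustration, not published rates or measured values.

```python
def billed_output_tokens(thinking_tokens: int, answer_tokens: int) -> int:
    """Thinking tokens are billed together with the visible answer tokens."""
    return thinking_tokens + answer_tokens


def output_cost_usd(thinking_tokens: int, answer_tokens: int,
                    usd_per_million_output: float) -> float:
    """Output-side cost of one call at an assumed per-million-token price."""
    total = billed_output_tokens(thinking_tokens, answer_tokens)
    return total * usd_per_million_output / 1_000_000


# Hypothetical numbers: a short answer (~60 tokens) after 2,000 thinking tokens.
ASSUMED_PRICE = 3.00  # USD per 1M output tokens; placeholder, not a real rate
print(billed_output_tokens(2000, 60))
print(output_cost_usd(2000, 60, ASSUMED_PRICE))
```

In this example nearly all of the billed output comes from thinking rather than from the answer itself, which is why the thinking share is the number worth watching.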

So the biggest room for optimization isn't in hard tasks. It's in tasks that barely need thought at all — "rename these variables," "sort the import statements" — where the model over-reasons anyway. Nudging those toward low drops spend without touching quality.

Run the Same Task Across Levels and Measure

Gut feel isn't reproducible, so I took three representative task types and ran each one, with identical input, at both low and high, recording thinking tokens and candidate (answer) tokens. I pull the numbers from usage_metadata in the google-genai SDK.

from google import genai
from google.genai import types
 
client = genai.Client()  # reads GEMINI_API_KEY from the environment
 
def measure(prompt: str, level: str) -> dict:
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        ),
    )
    usage = response.usage_metadata
    return {
        "level": level,
        "thinking_tokens": usage.thoughts_token_count or 0,
        "output_tokens": usage.candidates_token_count or 0,
        "text": response.text,
    }
 
# measure the same prompt at two levels
for lv in ("low", "high"):
    r = measure("Normalize the variable names in this function to camelCase", lv)
    print(lv, r["thinking_tokens"], r["output_tokens"])
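
A single run is noisy, so the figures are more useful as averages over several runs. The following is a minimal sketch of one way to do that, not necessarily the harness used for the numbers in this article. The names average_runs and fake_measure are hypothetical. It takes the measuring function as a parameter, so it can be tried offline with a stand-in before pointing it at measure().

```python
from statistics import mean
from typing import Callable


def average_runs(measure_fn: Callable[[str, str], dict],
                 prompt: str, level: str, runs: int = 5) -> dict:
    """Call measure_fn(prompt, level) `runs` times and average the token counts."""
    results = [measure_fn(prompt, level) for _ in range(runs)]
    return {
        "level": level,
        "runs": runs,
        "avg_thinking_tokens": mean(r["thinking_tokens"] for r in results),
        "avg_output_tokens": mean(r["output_tokens"] for r in results),
    }


# Offline stand-in so the sketch runs without an API key.
# Its token counts are made up; swap in measure() for real calls.
def fake_measure(prompt: str, level: str) -> dict:
    thinking = 100 if level == "low" else 1600
    return {"level": level, "thinking_tokens": thinking,
            "output_tokens": 50, "text": ""}


print(average_runs(fake_measure, "rename variables", "low", runs=3))
```

Passing measure in place of fake_measure gives per-level averages for a real prompt, at the cost of one API call per run.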

Averaging repeated runs in my environment, I landed roughly here (treat these as reference points — they shift with input and model updates):

For a mechanical formatting task (normalizing variable names), low averaged about 110 thinking tokens and high about 1,650 — a 15x gap, yet the emitted code was identical. Using high on a task that needs no thinking is almost pure waste.

For a mid-weight refactor (a rewrite that splits responsibilities), low averaged about 340 thinking tokens and high about 1,900, a gap of roughly 5.6x. Here the output diverged: low produced a coarser split, and about one run in two needed a manual fix, while high landed close to intent on the first pass.

For a task involving a design judgment (where to draw the boundary of a data-access layer), low stayed surface-level, while high spent about 2,400 thinking tokens and came back with an answer that referenced the preconditions. Here the extra spend is clearly worth it.
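
The gaps above can be restated as ratios and per-call differences with a few lines of arithmetic. This sketch uses only the averages quoted in this section. The design-judgment task is left out because only its high figure is given, and the calls-per-day value is an illustrative assumption, not a measured one.

```python
# Average thinking tokens reported above, as (low, high).
AVERAGES = {
    "mechanical formatting": (110, 1650),
    "mid-weight refactor": (340, 1900),
}


def gap(low: int, high: int) -> tuple[float, int]:
    """Return (high/low ratio, extra thinking tokens per call at high)."""
    return high / low, high - low


def monthly_extra_tokens(low: int, high: int,
                         calls_per_day: int, days: int = 30) -> int:
    """Extra thinking tokens per month if every such call runs at high."""
    return (high - low) * calls_per_day * days


for task, (low, high) in AVERAGES.items():
    ratio, extra = gap(low, high)
    print(f"{task}: {ratio:.1f}x, +{extra} thinking tokens per call")

# 100 calls/day is an assumption for illustration only.
print(monthly_extra_tokens(110, 1650, calls_per_day=100))
```

The per-call difference is similar for both task types, so what separates them is whether the extra tokens change the output. Only the refactor gets a better result for that spend.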

The conclusion is simple: the spend-versus-accuracy tradeoff is mostly decided by task type, and a one-size-fits-all level loses in both directions.
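
One way to act on that conclusion is a small router that picks the level from the task type. What follows is a rough sketch under stated assumptions, not the setup described elsewhere in this article. The tier names, the keyword lists, and the functions classify and route_thinking_level are all hypothetical, and keyword matching is only the simplest possible classifier.

```python
# Tier -> thinking_level, following the measurements above: low only where the
# output was identical, high where low needed manual fixes or stayed shallow.
LEVEL_BY_TIER = {
    "mechanical": "low",
    "refactor": "high",
    "design": "high",
}

# Hypothetical keyword lists; tune these to your own prompts.
# Checked in this order, so design wins over refactor, refactor over mechanical.
TIER_KEYWORDS = {
    "design": ("architecture", "boundary", "design", "trade-off"),
    "refactor": ("refactor", "split", "extract", "restructure"),
    "mechanical": ("rename", "sort", "format", "camelcase", "import"),
}


def classify(prompt: str) -> str:
    """Return the first tier with a matching keyword; default to 'design'."""
    text = prompt.lower()
    for tier, keywords in TIER_KEYWORDS.items():
        if any(word in text for word in keywords):
            return tier
    return "design"  # unknown tasks get high rather than risk a shallow answer


def route_thinking_level(prompt: str) -> str:
    """Map a prompt to the thinking_level string for its tier."""
    return LEVEL_BY_TIER[classify(prompt)]


print(route_thinking_level("Normalize the variable names to camelCase"))
print(route_thinking_level("Where should the data-access layer boundary go?"))
```

The returned string can be passed as the thinking_level argument in the measure() call shown earlier. Defaulting unknown prompts to high trades some spend for safety, and flipping that default is the first thing to revisit once the classifier is trusted.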
