Routing thinking_level by Task in Gemini 3.5 Flash — Re-measuring My Token Spend
After Gemini 3.5 Flash became Antigravity's default, my thinking tokens crept up quietly. Here's how I measured thinking_level per task type and landed on a setup that cuts spend without losing accuracy.
The week Gemini 3.5 Flash became the default Flash model in Antigravity 2.0, my monthly token-usage graph started climbing a little earlier than usual.
The coding felt just as fast. But when I opened the breakdown, the share of "thinking tokens" was clearly higher than before. I had been using Flash as my mental shorthand for "the fast model," so that mismatch nagged at me.
Running four sites in parallel as an indie developer, I fire agents hundreds of times a day. Even a tiny per-call difference adds up to a real number by month's end. So I sat down to re-measure thinking_level on a task-by-task basis. This article is that record — the measurements, and the setup I eventually settled on.
thinking_level Pays Off on Tasks That Don't Need Thinking
Introduced in the Gemini 3 line, thinking_level specifies, in steps, how much budget the model spends on internal reasoning before answering. It centers on low and high, controllable from Antigravity's model settings or the SDK's thinking_config. The 2.5-era thinking_budget (a direct token count) still coexists, and I switch between them depending on the case.
The easy thing to miss: thinking tokens are billed as output tokens. Even if the final answer is three lines, if the model "thought" for 2,000 tokens beforehand, that lands on the bill.
So the biggest room for optimization isn't in hard tasks. It's in tasks that barely need thought at all — "rename these variables," "sort the import statements" — where the model over-reasons anyway. Nudging those toward low drops spend without touching quality.
Run the Same Task Across Levels and Measure
Talking from gut feel isn't reproducible, so I took three representative task types and ran each, with identical input, through both low and high, recording thinking tokens and candidate tokens. I pull the numbers from usage_metadata in the google-genai SDK.
from google import genaifrom google.genai import typesclient = genai.Client() # reads GEMINI_API_KEY from the environmentdef measure(prompt: str, level: str) -> dict: response = client.models.generate_content( model="gemini-3.5-flash", contents=prompt, config=types.GenerateContentConfig( thinking_config=types.ThinkingConfig(thinking_level=level), ), ) usage = response.usage_metadata return { "level": level, "thinking_tokens": usage.thoughts_token_count or 0, "output_tokens": usage.candidates_token_count or 0, "text": response.text, }# measure the same prompt at two levelsfor lv in ("low", "high"): r = measure("Normalize the variable names in this function to camelCase", lv) print(lv, r["thinking_tokens"], r["output_tokens"])
Averaging repeated runs in my environment, I landed roughly here (treat these as reference points — they shift with input and model updates):
For a mechanical formatting task (normalizing variable names), low averaged about 110 thinking tokens and high about 1,650 — a 15x gap, yet the emitted code was identical. Using high on a task that needs no thinking is almost pure waste.
For a mid-weight refactor (a rewrite that splits responsibilities), low was about 340 and high about 1,900. Here the output diverged. low produced a coarser split, and one in two runs needed a manual fix. high landed close to intent on the first pass.
For a task with a design judgment (where to draw the boundary of a data-access layer), low stayed surface-level, while high spent about 2,400 tokens and came back referencing the preconditions. Here the extra spend is clearly worth it.
The conclusion is simple: the spend-versus-accuracy tradeoff is mostly decided by task type, and a one-size-fits-all level loses in both directions.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦Measured thinking-token gap between low and high (roughly 12–15x in my setup)
✦A Python router that auto-assigns thinking_level across three task tiers
✦Production pitfalls I hit (missing thought_signatures, low-level misses) and how I worked around them
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Split Tasks Into Three Tiers and Assign Level Automatically
Picking the level by hand every time doesn't last. So I sorted my tasks into three tiers and let a router assign the level. The axis is "how expensive is the rework if the model gets it wrong."
Tier 1: Cheap to Miss (fixed low)
Formatting, renaming, boilerplate generation, adding comments. Things I can eyeball on the spot and catch in a second if wrong. Speed dominates the experience here, so I keep thinking minimal.
Tier 2: Annoying to Fix by Hand (middle)
Small refactors across multiple files, adding tests, tidying type definitions. The zone where low needs a fix half the time and high is overkill. I set an explicit middle thinking_budget (around 512 tokens) to dampen the swing.
Tier 3: Expensive to Walk Back (fixed high)
Architectural boundaries, state-management decisions, changes touching data integrity. Skimp on thinking tokens here and implementation proceeds on a wrong premise — burning dozens of times more in the end.
If you use Antigravity's editor alone, you don't even need the SDK — just nudging "the default for light operations" toward low in the workflow settings already helps. The SDK matters when the task type is known in advance, like scheduled runs or an auto-publishing pipeline. In my case that's the larger share, so I lean on the router approach.
Pitfalls I Hit in Production
Two things stayed invisible during measurement and only surfaced once it was running.
First, mishandling thought_signatures in a multi-turn agent loop breaks the continuity of reasoning on later turns. In the Gemini 3 line, thinking state returns as a signature you're meant to feed back into the next request. Writing a router that switches level dynamically means rebuilding the config object each turn, and it's easy to forget to carry the signature through. The symptom is agent accuracy drifting down slowly, and it took me a while to trace. For sessions that continue a conversation, I now stack the full response (signature included) into history and swap only the config.
Second, pushing Tier 1 too far toward low occasionally produced a "fix one spot when asked to fix several" miss. Even in formatting, minimal thinking starts to matter when the targets are scattered widely. As a fix, I added a branch: if a Tier 1 task is judged to "span three or more files," that run alone is promoted to the Tier 2 budget. One exception that bumps a fixed profile up a notch by input size was enough to stabilize it.
Editor Alone vs. SDK — Where to Put the Routing
When you're working interactively in Antigravity's editor, switching level per operation isn't practical. Here I keep it to a single move — nudging "the default for light operations" toward low in the workflow settings — and bump up to high by hand only where I hesitate.
For work where the task type is fixed ahead of time, like scheduled runs or auto-publishing, a router assigning it mechanically is steadier. I lean all my recurring jobs toward the latter, and in interactive sessions I barely think about routing at all.
Same thinking_level, but whether that decision lives on the editor side or in code changes how the whole thing feels to operate. If you touch things by hand a lot, start from the editor default; if you automate heavily, start from the router. Either entry point eases you in without friction.
The Setup I Settled On
In the end, nudging Tier 1 — which dominated my daily agent runs — toward low brought total thinking tokens down by roughly 40% by feel. Meanwhile, protecting Tier 3 with high actually reduced design-level rework. My total bill went down while the felt quality went up — a rare both-ways win.
My rule when I'm unsure: "if this output were wrong, how many seconds until I'd notice?" If I'd catch it in seconds, low; if it takes minutes and the rework is heavy, high. Keeping that one sentence in the back of my mind makes the level choice almost mechanical.
The numbers shift with every model update, so don't take the values here on faith — pick three of your own representative tasks and measure them at low and high first. Fifteen minutes of measurement should keep your monthly spend lighter for a long time afterward. Thanks for reading.
Share
Thank You for Reading
Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.