Antigravity Lab · AI Tools · 2026-06-24 · Advanced

When an Overnight Local Agent Crawls by Dawn — Keeping Ollama's Latency Flat by Working Backward from Context Length

Why each step of a long-running local agent gets heavier toward the end, how to measure it from Ollama's timing fields, and how a fixed num_ctx plus a rolling summary keep per-step latency flat.

As an indie developer, I run a local agent overnight to tidy up the store metadata for the apps I maintain. It runs Gemma 4 through Ollama, entirely on my own machine. Nothing leaves the box, nothing costs money, and it quietly works through the backlog during off-peak hours. For a while it felt like a dependable companion.

Then one morning, scrolling through the log, I stopped short.

The first few steps had returned in a few hundred milliseconds, but each step grew heavier toward the end, and by dawn I was waiting several seconds per turn. The same number of records. The same model. The only thing that had changed was the length of the context piling up in the conversation.

The longer you intend to run an unattended agent, the more this "heavier toward the end" behavior trips you up. Here is how I traced it to its real cause and flattened the per-step latency.

Why the dawn step is heavier than the first one

A single local-LLM response splits into two costs: the prompt-eval time, where the model reads the whole prompt and builds its internal state, and the generation time, where it emits new tokens one at a time.

Generation time is set mostly by how long the output is. Prompt-eval time, on the other hand, scales with the length of the prompt you hand over. As conversation history and tool output pile up, every prompt-eval gets heavier.

An unattended agent resubmits "all history so far plus the new instruction" on every step. So a late step is re-reading many times the context of an early one, every single time. Even when the output length is identical, the wait time swells in silence. That was the source of the dawn seconds.
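To make that growth concrete, here is a rough sketch under assumed numbers: a 500-token starting prompt and 300 tokens of history added per step. Both figures are illustrative, not measurements from my agent, and the model assumes the whole prompt is re-read on every step.

```python
def prompt_tokens_at(step, base=500, per_step=300):
    """Prompt size at a 1-indexed step when all earlier turns are resent.

    base and per_step are assumed, illustrative values.
    """
    return base + per_step * (step - 1)


def total_tokens_read(steps, base=500, per_step=300):
    """Sum of prompt sizes over the whole run."""
    return sum(prompt_tokens_at(s, base, per_step) for s in range(1, steps + 1))


for step in (1, 10, 40):
    print(step, prompt_tokens_at(step))
print("total over 40 steps:", total_tokens_read(40))
```

Under these assumptions, step 40 reads roughly 24 times as many tokens as step 1, and the total read over the run grows quadratically with the number of steps.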

Splitting the latency into prompt-eval and generation

Fortunately, Ollama returns a timing breakdown with each response. Look at prompt_eval_count (tokens read), prompt_eval_duration (time spent on them), eval_count (tokens generated), and eval_duration (generation time), and you can tell which half the latency lives in.

The duration fields are reported in nanoseconds; the count fields are plain token counts. Convert the durations to milliseconds and work with those.

import time
import requests

# Ollama's chat endpoint on the default local port.
OLLAMA = "http://localhost:11434/api/chat"

def chat_once(model, messages, num_ctx=8192):
    """Send one non-streaming chat request and return its timing breakdown."""
    t0 = time.perf_counter()
    r = requests.post(OLLAMA, json={
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0.2},
    }, timeout=600)
    r.raise_for_status()
    d = r.json()
    # Round-trip time as seen by the client, including anything Ollama does not itemize.
    wall_ms = (time.perf_counter() - t0) * 1000
    prompt_tokens = d.get("prompt_eval_count", 0)
    prompt_ms = d.get("prompt_eval_duration", 0) / 1e6   # ns -> ms
    gen_tokens = d.get("eval_count", 0)
    gen_ms = d.get("eval_duration", 0) / 1e6              # ns -> ms
    return {
        "text": d["message"]["content"],
        "prompt_tokens": prompt_tokens,
        "prompt_ms": round(prompt_ms, 1),
        "gen_tokens": gen_tokens,
        "gen_ms": round(gen_ms, 1),
        "wall_ms": round(wall_ms, 1),
        # Generation throughput; 0.0 when no generation time was reported.
        "tok_per_s": round(gen_tokens / (gen_ms / 1000), 1) if gen_ms else 0.0,
    }

Keeping wall_ms (the measured round-trip) in the return value helps you notice anything the breakdown misses, such as model load time.
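One way to use that, sketched here as a pair of hypothetical helpers that are not part of the original code: subtract the two itemized phases from wall_ms and see which bucket dominates. The sample dictionary is made up for illustration and only mimics the shape that chat_once returns.

```python
def unaccounted_ms(m):
    """Wall-clock time not explained by prompt-eval or generation."""
    return round(m["wall_ms"] - m["prompt_ms"] - m["gen_ms"], 1)


def dominant_phase(m):
    """Name the bucket that takes the largest share of the round trip."""
    parts = {
        "prompt_eval": m["prompt_ms"],
        "generation": m["gen_ms"],
        "other": max(unaccounted_ms(m), 0.0),
    }
    return max(parts, key=parts.get)


# Made-up numbers shaped like a cold first call, where most of the wait is unitemized.
sample = {"prompt_ms": 1200.0, "gen_ms": 900.0, "wall_ms": 5600.0}
print(unaccounted_ms(sample), dominant_phase(sample))
```

A large "other" share on the first step and a small one afterward would point at one-off costs such as loading the model, rather than at context growth.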

Record every step

Splitting once tells you nothing. Log the breakdown on every step of the loop, then read context length and prompt-eval growth side by side.

import json
import pathlib

# One JSON object per line, appended on every step.
LOG = pathlib.Path("agent_timing.jsonl")

def run_step(model, history, user_msg, step):
    """Run one agent step with chat_once (defined above) and log its timings."""
    history.append({"role": "user", "content": user_msg})
    m = chat_once(model, history)
    history.append({"role": "assistant", "content": m["text"]})
    record = {"step": step, **{k: m[k] for k in
              ("prompt_tokens", "prompt_ms", "gen_tokens", "gen_ms", "wall_ms")}}
    # Explicit UTF-8, since ensure_ascii=False can emit non-ASCII characters.
    with LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return m["text"]

Pile history up naively like this, and the log shows prompt_tokens climbing step by step, dragging prompt_ms up with it.

WHAT YOU'LL LEARN
Instrumentation that splits each step's latency into prompt-eval vs. generation using Ollama's prompt_eval_duration and eval_count
A measured curve where prompt-eval time roughly triples as context grows from 2k to 12k tokens, and how to pin num_ctx on purpose
A BoundedContext that keeps the last N turns, folds older ones into a rolling summary, and caps tool output to flatten per-step latency by about 45%