AI Tools / 2026-06-24 / Advanced

When an Overnight Local Agent Crawls by Dawn — Keeping Ollama's Latency Flat by Working Backward from Context Length

Why each step of a long-running local agent gets heavier toward the end, how to measure it from Ollama's timing fields, and how a fixed num_ctx plus a rolling summary keep per-step latency flat.


As an indie developer, I run a local agent overnight to tidy up the store metadata for the apps I maintain. It runs Gemma 4 through Ollama, entirely on my own machine. Nothing leaves the box, nothing costs money, and it quietly works through the backlog during off-peak hours. For a while it felt like a dependable companion.

Then one morning, scrolling through the log, I stopped short.

The first few steps had returned in a few hundred milliseconds, but each step grew heavier toward the end, and by dawn I was waiting several seconds per turn. The same number of records. The same model. The only thing that had changed was the length of the context piling up in the conversation.

The longer you intend to run an unattended agent, the more this "heavier toward the end" behavior trips you up. Here is how I traced it to its real cause and flattened the per-step latency.

Why the dawn step is heavier than the first one

A single local-LLM response splits into two costs: the prompt-eval time, where the model reads the whole prompt and builds its internal state, and the generation time, where it emits new tokens one at a time.

Generation time is set mostly by how long the output is. Prompt-eval time, on the other hand, scales with the length of the prompt you hand over. As conversation history and tool output pile up, every prompt-eval gets heavier.

An unattended agent resubmits "all history so far plus the new instruction" on every step. So a late step is re-reading many times the context of an early one, every single time. Even when the output length is identical, the wait time swells in silence. That was the source of the dawn seconds.
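To make the "re-reading everything, every time" cost concrete, here is a small back-of-the-envelope sketch. The figures (500 tokens of fixed prompt, 100 tokens added per step, 120 steps) are made up for illustration and chosen only so the last prompt lands near 12k tokens; they are not measurements.

```python
def prompt_tokens_per_step(steps, base_tokens, tokens_added_per_step):
    """Prompt size at each step when the full history is resubmitted every time."""
    return [base_tokens + i * tokens_added_per_step for i in range(steps)]


# Hypothetical figures, for illustration only.
sizes = prompt_tokens_per_step(steps=120, base_tokens=500, tokens_added_per_step=100)

print(sizes[0], sizes[-1])  # 500 12400
print(sum(sizes))           # 774000 tokens read across the whole run
```

The per-step prompt grows linearly, so the total number of tokens read over the run grows roughly with the square of the step count. That is why the slowdown feels mild at first and painful by the end.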

Splitting the latency into prompt-eval and generation

Fortunately, Ollama returns a timing breakdown with each response. Look at prompt_eval_count (tokens read), prompt_eval_duration (time spent on them), eval_count (tokens generated), and eval_duration (generation time), and you can tell which half the latency lives in.

The duration fields are in nanoseconds (the count fields are plain token counts). Convert the durations to milliseconds and work with those.

import time, requests
 
OLLAMA = "http://localhost:11434/api/chat"
 
def chat_once(model, messages, num_ctx=8192):
    """Send one non-streaming chat request; return the reply text plus timings in ms."""
    t0 = time.perf_counter()
    r = requests.post(OLLAMA, json={
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0.2},
    }, timeout=600)
    r.raise_for_status()
    d = r.json()
    wall_ms = (time.perf_counter() - t0) * 1000
    prompt_tokens = d.get("prompt_eval_count", 0)
    prompt_ms = d.get("prompt_eval_duration", 0) / 1e6   # ns -> ms
    gen_tokens = d.get("eval_count", 0)
    gen_ms = d.get("eval_duration", 0) / 1e6
    return {
        "text": d["message"]["content"],
        "prompt_tokens": prompt_tokens,
        "prompt_ms": round(prompt_ms, 1),
        "gen_tokens": gen_tokens,
        "gen_ms": round(gen_ms, 1),
        "wall_ms": round(wall_ms, 1),
        "tok_per_s": round(gen_tokens / (gen_ms / 1000), 1) if gen_ms else 0.0,
    }

Keeping wall_ms (the measured round-trip) in the return value helps you notice anything the breakdown misses, such as model load time.
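As a sketch of how to see what the breakdown misses, the helper below subtracts the reported phases from the measured round-trip. It reads load_duration, which Ollama includes in the same response alongside the fields above. The helper name overhead_breakdown and the sample values are hypothetical, and it takes the raw response JSON rather than the dictionary chat_once returns.

```python
def overhead_breakdown(response, wall_ms):
    """Split a measured round-trip into Ollama's reported phases plus a remainder."""
    def ms(key):
        # Ollama reports durations in nanoseconds; a missing field counts as 0.
        return response.get(key, 0) / 1e6

    load_ms = ms("load_duration")
    prompt_ms = ms("prompt_eval_duration")
    gen_ms = ms("eval_duration")
    return {
        "load_ms": round(load_ms, 1),
        "prompt_ms": round(prompt_ms, 1),
        "gen_ms": round(gen_ms, 1),
        # HTTP, JSON handling, and anything else not covered by the three phases.
        "other_ms": round(wall_ms - load_ms - prompt_ms - gen_ms, 1),
    }


# Made-up sample: a cold start where loading the model dominates the round-trip.
sample = {
    "load_duration": 2_000_000_000,
    "prompt_eval_duration": 310_000_000,
    "eval_duration": 1_100_000_000,
}
print(overhead_breakdown(sample, wall_ms=3450.0))
```

A large load_ms on the first step and near zero afterward usually just means the model was loaded once. A large load_ms in the middle of a run suggests the model was unloaded and reloaded between steps.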

Record every step

A single measurement tells you nothing about the trend. Log the breakdown on every step of the loop, then read context length and prompt-eval growth side by side.

import json, pathlib
 
LOG = pathlib.Path("agent_timing.jsonl")
 
def run_step(model, history, user_msg, step):
    """Run one agent step on the full history and append its timings to the JSONL log."""
    history.append({"role": "user", "content": user_msg})
    m = chat_once(model, history)
    history.append({"role": "assistant", "content": m["text"]})
    record = {"step": step, **{k: m[k] for k in
              ("prompt_tokens", "prompt_ms", "gen_tokens", "gen_ms", "wall_ms")}}
    with LOG.open("a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return m["text"]

Pile history up naively like this, and the log shows prompt_tokens climbing step by step, dragging prompt_ms up with it.


Measuring how context length drives prompt-eval

Here is a measurement from my setup (Gemma 4, 12B quantized, an Apple Silicon laptop), varying context length against the same amount of generation. The numbers shift with the environment, but the shape of the slope held.

Prompt tokens	prompt-eval P50	prompt-eval P90	generation (held constant)
~2,000	0.31 s	0.38 s	~1.1 s
~6,000	0.74 s	0.92 s	~1.1 s
~12,000	1.02 s	1.34 s	~1.1 s

Generation time is set by output volume, so it stays flat. Prompt-eval, by contrast, swelled roughly 3x at P50 (0.31 s to 1.02 s) as context grew from 2k to 12k. The whole reason late steps are heavy lives in that one column.
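If you want P50 and P90 columns like these from your own log, a sketch along the following lines works on records shaped like the JSONL entries above. Grouping into 2,000-token buckets and the "inclusive" quantile method are assumptions for illustration, not necessarily how the table above was produced.

```python
from collections import defaultdict
from statistics import quantiles


def percentiles_by_bucket(records, bucket=2000):
    """Group prompt_ms samples by prompt size and report P50/P90 per group."""
    groups = defaultdict(list)
    for r in records:
        start = (r["prompt_tokens"] // bucket) * bucket
        groups[start].append(r["prompt_ms"])

    out = {}
    for start, samples in sorted(groups.items()):
        if len(samples) < 2:
            continue  # too few samples in this bucket to compute quantiles
        cuts = quantiles(samples, n=10, method="inclusive")  # nine decile cut points
        out[start] = {"p50": cuts[4], "p90": cuts[8], "n": len(samples)}
    return out
```

With only a handful of samples per bucket, P90 is noisy, so it is worth collecting a few dozen steps per context size before trusting the slope.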

What matters here is that I confirmed with numbers that the culprit was prompt-eval, not generation. Swapping to a lighter model to speed up generation would not solve this. The lever is on the "context re-read every time" side.

Don't leave num_ctx to the default

Ollama's num_ctx (the context window) often ships with a modest default, and that is the first pitfall. Tokens past the window are not raised as an error — they are silently dropped. When an agent suddenly behaves as if it forgot "what we just decided," suspect this silent truncation.
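Because the truncation is silent, a cheap guard is to warn when a step's token usage approaches the window. This is a sketch under an assumption: that a prompt pressing against (or cut down to) the window shows up as prompt_eval_count close to num_ctx. How a truncated prompt is reported may differ between Ollama versions, so treat the check as a hint rather than proof. The name near_window and the 90% margin are hypothetical.

```python
def near_window(prompt_tokens, gen_tokens, num_ctx, margin=0.9):
    """Return True when prompt plus generated tokens use most of the context window."""
    used = prompt_tokens + gen_tokens
    return used >= margin * num_ctx


# Example with made-up counts against an 8192-token window.
print(near_window(prompt_tokens=7800, gen_tokens=300, num_ctx=8192))  # True
print(near_window(prompt_tokens=2000, gen_tokens=300, num_ctx=8192))  # False
```

Logging this flag next to each step makes it easier to line up "the agent forgot something" with "the window was nearly full."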

Going the other way and making the window far larger than needed is also wasteful. Ollama reserves internal buffers for the whole window, which pressures memory even when most of the window stays empty. Pushing it to extremes can crash on out-of-memory (I wrote up how to isolate that symptom in a separate article, "Isolating Gemma 4 out-of-memory under Ollama").

I measure the peak context that actually accumulates in real work, then pin the window a little above that. To avoid forgetting the option on some call, bake it into a Modelfile.

# Pin a window that fits the work, avoiding the default's silent truncation.
cat > Modelfile <<'MF'
FROM gemma4:12b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
MF
ollama create gemma4-agent -f Modelfile
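One way to turn "a little above the measured peak" into a number is sketched below. The helper name pick_num_ctx, the 15% headroom, and rounding up to a multiple of 1024 are illustrative choices, not measured values. The window has to hold the generated tokens as well as the prompt, so both are counted.

```python
import math


def pick_num_ctx(peak_prompt_tokens, max_gen_tokens, headroom=0.15, step=1024):
    """Suggest a context window: measured peak plus headroom, rounded up to `step`."""
    need = (peak_prompt_tokens + max_gen_tokens) * (1 + headroom)
    return math.ceil(need / step) * step


# Hypothetical peak taken from a timing log: 6,200 prompt tokens, 400 generated.
print(pick_num_ctx(peak_prompt_tokens=6200, max_gen_tokens=400))  # 8192
```

Re-run the measurement after changing prompts or tools, since the peak moves with them.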

Once the window is pinned, the next decision is what to keep inside it.

Keeping the working context bounded — last N turns plus a rolling summary

Even with a wide window, prompt-eval grows again if context grows without bound. The real fix is to keep the context the agent reads each step bounded.

Here is how I structure it. Keep the last N turns verbatim, fold anything older into a one-paragraph summary, and cap tool output by character count since it tends to run long. When prompt_eval_count crosses a soft cap, move the older turns into the summary. Wrap all three in one object.

class BoundedContext:
    """Last keep_turns turns stay verbatim; older ones fold into a summary.
    Tool output is capped at tool_cap chars to curb context growth."""
    def __init__(self, system, summarize, keep_turns=6, tool_cap=1200, soft_tokens=6000):
        self.system = {"role": "system", "content": system}
        self.summarize = summarize        # folds old history into one paragraph
        self.keep_turns = keep_turns
        self.tool_cap = tool_cap
        self.soft_tokens = soft_tokens
        self.summary = ""
        self.turns = []                   # [{role, content}, ...]
 
    def add(self, role, content):
        if role == "tool" and len(content) > self.tool_cap:
            content = content[: self.tool_cap] + "\n...(truncated)"
        self.turns.append({"role": role, "content": content})
 
    def fold_if_needed(self, last_prompt_tokens):
        # Move older turns into the summary once the soft cap is crossed.
        if last_prompt_tokens <= self.soft_tokens or len(self.turns) <= self.keep_turns:
            return
        old, self.turns = self.turns[:-self.keep_turns], self.turns[-self.keep_turns:]
        self.summary = self.summarize(self.summary, old)
 
    def messages(self):
        msgs = [self.system]
        if self.summary:
            msgs.append({"role": "system",
                         "content": "Summary of the story so far:\n" + self.summary})
        msgs.extend(self.turns)
        return msgs

The summarize function is itself an inference, so calling it every step does not pay off. By folding only when the soft cap is crossed, as fold_if_needed does, you spread the summary cost across several steps.
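For reference, a summarize function with the shape BoundedContext expects could look like the sketch below. It is an assumption-laden example, not the version from my batch: make_summarizer, the prompt wording, and the 120-word limit are all hypothetical. The chat argument is any callable that takes a list of messages and returns the reply text, for example a thin wrapper around chat_once.

```python
def make_summarizer(chat, max_words=120):
    """Build a summarize(prev_summary, old_turns) function around a chat callable."""
    def summarize(prev_summary, old_turns):
        # Flatten the turns being folded into a simple "role: content" transcript.
        transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in old_turns)
        prompt = (
            f"Update the running summary in at most {max_words} words. "
            "Keep decisions and open tasks; drop small talk.\n\n"
            f"Current summary:\n{prev_summary or '(none)'}\n\n"
            f"New turns to fold in:\n{transcript}"
        )
        return chat([{"role": "user", "content": prompt}]).strip()

    return summarize


# Wiring it to the article's chat_once (not executed here):
#     summarize = make_summarizer(lambda msgs: chat_once("gemma4-agent", msgs)["text"])
```

Passing the previous summary back in keeps the summary a single rolling paragraph instead of a growing list of summaries.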

Capping tool output is humble but effective. When a full listing or log lands in history verbatim, a single step eats thousands of tokens. Keeping just the head and cutting the rest curbed the context growth considerably.
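To show how the pieces in this section fit into the agent loop, here is a minimal sketch of one step. The function name bounded_step is hypothetical. It assumes ctx is a BoundedContext (or anything with the same add, messages, and fold_if_needed methods) and chat is a callable that returns a dictionary with "text" and "prompt_tokens", which is what chat_once returns.

```python
def bounded_step(ctx, chat, user_msg):
    """Run one agent step against a bounded context and fold afterward if needed."""
    ctx.add("user", user_msg)
    reply = chat(ctx.messages())          # only summary + recent turns are sent
    ctx.add("assistant", reply["text"])
    # Use the prompt size the server actually reported to decide whether to fold.
    ctx.fold_if_needed(reply["prompt_tokens"])
    return reply["text"]


# Wiring it to the article's helpers (not executed here):
#     text = bounded_step(ctx, lambda msgs: chat_once("gemma4-agent", msgs), "next record")
```

Folding after the reply means the current step pays no summary cost; the shorter context takes effect from the next step on.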

The effect — flattening per-step latency

Here is the same overnight batch run through the naive history-piling version and the BoundedContext version.

Measurement	Naive piling	BoundedContext
Early step (prompt-eval)	0.31 s	0.33 s
Late step (prompt-eval)	1.02 s	0.46 s
Per-step median	0.71 s	0.39 s
120-step batch total	~21 min	~12 min

The early steps are roughly even. The gap opens at the end: late-step prompt-eval fell from 1.02 s to 0.46 s (about 55%), and the per-step median came down by about 45% (0.71 s to 0.39 s). In wall-clock terms, a batch that used to overflow the off-peak window now finishes before morning with room to spare.

More valuable than the numbers was that the latency became predictable. When a step's cost stays flat even late in the run, you can estimate up front how many steps take how many minutes. For unattended runs, that predictability is its own reassurance.
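Once the per-step cost is flat, the up-front estimate is simple arithmetic. The sketch below projects the remaining time from the median of recent wall_ms values; the function name is hypothetical, and the 6,000 ms sample is just the table's ~12 minutes divided by 120 steps.

```python
from statistics import median


def estimate_remaining_minutes(recent_wall_ms, steps_left):
    """Project remaining run time from the median wall-clock cost of recent steps."""
    return median(recent_wall_ms) * steps_left / 1000 / 60


# ~12 min over 120 steps works out to about 6,000 ms per step.
print(estimate_remaining_minutes([6000, 6000, 6000], steps_left=120))  # 12.0
```

This projection only holds while latency stays flat. With naive history piling, the median of early steps underestimates the late ones.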

What I watch in production

Folding into a summary loses detail, of course. If a later step needs the exact figures or identifiers from an old turn, it can misread them after folding. I keep these "facts I will need precisely later" out of the summary and in a small ledger on the system-message side instead.
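A ledger like that can be very small. The sketch below is one possible shape, with the class name FactLedger, the rendered wording, and the sample key and value all hypothetical. The rendered text would be appended to the system message so it is never passed through summarize.

```python
class FactLedger:
    """Exact values (IDs, figures) kept verbatim, outside the rolling summary."""

    def __init__(self):
        self.facts = {}

    def record(self, key, value):
        # Later writes to the same key overwrite earlier ones.
        self.facts[key] = str(value)

    def render(self):
        """Return a block of text for the system message, or '' when empty."""
        if not self.facts:
            return ""
        lines = [f"- {key}: {value}" for key, value in self.facts.items()]
        return "Exact facts (quote verbatim, do not paraphrase):\n" + "\n".join(lines)


ledger = FactLedger()
ledger.record("bundle_id", "com.example.notes")  # made-up identifier
print(ledger.render())
```

Keep the ledger short. Every line in it is read on every step, so it counts against the same prompt-eval budget as everything else.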

The other trap is shrinking num_ctx too far. A smaller window makes prompt-eval lighter, but if even the last N turns no longer fit, the agent drops the most recent instruction. Set the window's floor to whatever reliably holds "last N turns plus summary plus system."
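The floor can be written down as a simple sum. This sketch assumes you already have rough token counts for each part (from prompt_eval_count on representative steps, for example); the function name window_floor and the sample numbers are hypothetical.

```python
def window_floor(system_tokens, summary_tokens, keep_turns, max_turn_tokens, max_gen_tokens):
    """Smallest window that still holds system + summary + last N turns + the reply."""
    return system_tokens + summary_tokens + keep_turns * max_turn_tokens + max_gen_tokens


# Made-up sizes: 300-token system message, 200-token summary,
# 6 kept turns of up to 400 tokens each, and up to 400 generated tokens.
floor = window_floor(300, 200, keep_turns=6, max_turn_tokens=400, max_gen_tokens=400)
print(floor)  # 3300
```

Any num_ctx below that floor risks dropping recent turns, so it is a lower bound to check against, not a target.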

For the broader setup of putting a local LLM under an agent, see my write-up on wiring Ollama's local LLM into Antigravity. Read together, the two trace the line from measurement to operation.

Where to start

If you have a loop that feels heavy, try just these three steps first.

1. Drop in chat_once and log each step's prompt_tokens and prompt_ms to JSONL
2. Read the log and split the latency into prompt-eval versus generation
3. If prompt-eval is the culprit, pin num_ctx to the measured value and fold everything but the last N turns into a summary

Touching the window or the model without measuring usually means the long way around. I burned a few nights trying to lighten the model before I finally landed on this one column. I hope it saves you a night.

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.