When an Overnight Local Agent Crawls by Dawn — Keeping Ollama's Latency Flat by Working Backward from Context Length
Why each step of a long-running local agent gets heavier toward the end, how to measure it from Ollama's timing fields, and how a fixed num_ctx plus a rolling summary keep per-step latency flat.
As an indie developer, I run a local agent overnight to tidy up the store metadata for the apps I maintain. It runs Gemma 4 through Ollama, entirely on my own machine. Nothing leaves the box, nothing costs money, and it quietly works through the backlog during off-peak hours. For a while it felt like a dependable companion.
Then one morning I stopped scrolling the log.
The first few steps had returned in a few hundred milliseconds, but each step grew heavier toward the end, and by dawn I was waiting several seconds per turn. The same number of records. The same model. The only thing that had changed was the length of the context piling up in the conversation.
The longer you intend to run an unattended agent, the more this "heavier toward the end" behavior trips you up. Here is how I traced it to its real cause and flattened the per-step latency.
Why the dawn step is heavier than the first one
A single local-LLM response splits into two costs: the prompt-eval time, where the model reads the whole prompt and builds its internal state, and the generation time, where it emits new tokens one at a time.
Generation time is set mostly by how long the output is. Prompt-eval time, on the other hand, scales with the length of the prompt you hand over. As conversation history and tool output pile up, every prompt-eval gets heavier.
An unattended agent resubmits "all history so far plus the new instruction" on every step. So a late step is re-reading many times the context of an early one, every single time. Even when the output length is identical, the wait time swells in silence. That was the source of the dawn seconds.
Splitting the latency into prompt-eval and generation
Fortunately, Ollama returns a timing breakdown with each response. Look at prompt_eval_count (tokens read), prompt_eval_duration (time spent on them), eval_count (tokens generated), and eval_duration (generation time), and you can tell which half the latency lives in.
The two duration fields are in nanoseconds; the two count fields are token counts. Convert the durations to milliseconds and work with those.
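The chat_once helper that the later steps rely on could look roughly like this. It is a minimal sketch, assuming Ollama's /api/chat endpoint at its default local address with streaming turned off. The returned key names (text, prompt_tokens, prompt_ms, gen_tokens, gen_ms, wall_ms) match what the logging code expects; OLLAMA_URL and split_timing are names introduced here for illustration.

```python
import json
import time
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/chat"

NS_PER_MS = 1_000_000


def split_timing(resp, wall_ms):
    """Turn Ollama's nanosecond timing fields into a per-step breakdown in ms."""
    # Timing fields can be absent from a response, so default each one to 0.
    return {
        "text": resp.get("message", {}).get("content", ""),
        "prompt_tokens": resp.get("prompt_eval_count", 0),
        "prompt_ms": resp.get("prompt_eval_duration", 0) / NS_PER_MS,
        "gen_tokens": resp.get("eval_count", 0),
        "gen_ms": resp.get("eval_duration", 0) / NS_PER_MS,
        "wall_ms": wall_ms,
    }


def chat_once(model, messages, url=OLLAMA_URL):
    """Send one non-streaming chat request and return text plus timing."""
    body = json.dumps(
        {"model": model, "messages": messages, "stream": False}
    ).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as r:
        resp = json.loads(r.read().decode("utf-8"))
    # Round-trip time as seen by the caller, independent of Ollama's own fields.
    wall_ms = (time.perf_counter() - start) * 1000
    return split_timing(resp, wall_ms)
```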
Keeping wall_ms (the measured round-trip) in the return value helps you notice anything the breakdown misses, such as model load time.
Record every step
Splitting once tells you nothing. Log the breakdown on every step of the loop, then read context length and prompt-eval growth side by side.
import json
import pathlib

LOG = pathlib.Path("agent_timing.jsonl")

def run_step(model, history, user_msg, step):
    history.append({"role": "user", "content": user_msg})
    m = chat_once(model, history)
    history.append({"role": "assistant", "content": m["text"]})
    record = {"step": step, **{k: m[k] for k in ("prompt_tokens", "prompt_ms", "gen_tokens", "gen_ms", "wall_ms")}}
    with LOG.open("a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return m["text"]
Pile history up naively like this, and the log shows prompt_tokens climbing step by step, dragging prompt_ms up with it.
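To read context length and prompt-eval side by side, something like the following could group the JSONL records by prompt size. It is only a sketch: summarize_log and the 2,000-token bucket width are arbitrary choices made here, and the record keys are the ones run_step writes.

```python
import json
import statistics
from collections import defaultdict


def summarize_log(lines, bucket=2000):
    """Group step records by prompt size; return median prompt_ms and gen_ms per bucket."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        # Bucket start: 4,300 prompt tokens land in the 4,000 bucket.
        groups[(rec["prompt_tokens"] // bucket) * bucket].append(rec)

    rows = []
    for start in sorted(groups):
        recs = groups[start]
        rows.append({
            "bucket": start,
            "steps": len(recs),
            "prompt_ms_p50": statistics.median(r["prompt_ms"] for r in recs),
            "gen_ms_p50": statistics.median(r["gen_ms"] for r in recs),
        })
    return rows
```

Passing an open handle on agent_timing.jsonl works, since the function only iterates over lines. If prompt_ms_p50 climbs from bucket to bucket while gen_ms_p50 stays put, the latency lives in prompt-eval.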
Here is a measurement from my setup (Gemma 4, 12B quantized, an Apple Silicon laptop), varying context length against the same amount of generation. The numbers shift with the environment, but the shape of the slope held.
| Prompt tokens | prompt-eval P50 | prompt-eval P90 | generation (held constant) |
|---|---|---|---|
| ~2,000 | 0.31 s | 0.38 s | ~1.1 s |
| ~6,000 | 0.74 s | 0.92 s | ~1.1 s |
| ~12,000 | 1.02 s | 1.34 s | ~1.1 s |
Generation time is set by output volume, so it stays flat. Prompt-eval, by contrast, swelled roughly 3x at P50 (0.31 s to 1.02 s) as context grew from 2k to 12k tokens. The whole reason late steps are heavy lives in that one column.
What matters here is that I confirmed with numbers that the culprit was prompt-eval, not generation. Swapping to a lighter model to speed up generation would not solve this. The lever is on the "context re-read every time" side.
Don't leave num_ctx to the default
Ollama's num_ctx (the context window) often ships with a modest default, and that is the first pitfall. Tokens past the window are not raised as an error — they are silently dropped. When an agent suddenly behaves as if it forgot "what we just decided," suspect this silent truncation.
Going the other way and making the window far larger than needed is also wasteful. Ollama reserves internal buffers for the whole window, which pressures memory whether or not the context ever fills. Pushing it to extremes can crash on out-of-memory (I wrote up how to isolate that symptom in isolating Gemma 4 out-of-memory under Ollama).
I measure the peak context that actually accumulates in real work, then pin the window a little above that. To avoid forgetting the option on some call, bake it into a Modelfile.
# Pin a window that fits the work, avoiding the default's silent truncation.
cat > Modelfile <<'MF'
FROM gemma4:12b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
MF
ollama create gemma4-agent -f Modelfile
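The number to pin can come straight from the timing log. The helper below is a rough sketch, not something Ollama provides: pick_num_ctx, the 25% headroom, and rounding up to a multiple of 1,024 are all assumptions made for illustration. It takes the largest prompt-plus-generation total seen in the log, pads it, and rounds up.

```python
import math


def pick_num_ctx(records, headroom=0.25, step=1024):
    """Suggest a context window from logged steps: peak usage plus headroom, rounded up."""
    if not records:
        raise ValueError("no timing records to measure")
    # Prompt and generated tokens share the window, so size for their sum.
    peak = max(r["prompt_tokens"] + r["gen_tokens"] for r in records)
    padded = peak * (1 + headroom)
    # Round up to the next multiple of `step`.
    return math.ceil(padded / step) * step
```

With a peak of 6,200 tokens this suggests 8192. Measure on a run that reflects real work, since a log from an already truncated run understates the peak.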
Once the window is pinned, the next decision is what to keep inside it.
Keeping the working context bounded — last N turns plus a rolling summary
Even with a wide window, prompt-eval grows again if context grows without bound. The real fix is to keep the context the agent reads each step bounded.
Here is how I structure it. Keep the last N turns verbatim, fold anything older into a one-paragraph summary, and cap tool output by character count since it tends to run long. When prompt_eval_count crosses a soft cap, move the older turns into the summary. Wrap all three in one object.
class BoundedContext:
    """Last keep_turns turns stay verbatim; older ones fold into a summary.
    Tool output is capped at tool_cap chars to curb context growth."""

    def __init__(self, system, summarize, keep_turns=6, tool_cap=1200, soft_tokens=6000):
        self.system = {"role": "system", "content": system}
        self.summarize = summarize  # folds old history into one paragraph
        self.keep_turns = keep_turns
        self.tool_cap = tool_cap
        self.soft_tokens = soft_tokens
        self.summary = ""
        self.turns = []  # [{role, content}, ...]

    def add(self, role, content):
        if role == "tool" and len(content) > self.tool_cap:
            content = content[: self.tool_cap] + "\n...(truncated)"
        self.turns.append({"role": role, "content": content})

    def fold_if_needed(self, last_prompt_tokens):
        # Move older turns into the summary once the soft cap is crossed.
        if last_prompt_tokens <= self.soft_tokens or len(self.turns) <= self.keep_turns:
            return
        old, self.turns = self.turns[:-self.keep_turns], self.turns[-self.keep_turns:]
        self.summary = self.summarize(self.summary, old)

    def messages(self):
        msgs = [self.system]
        if self.summary:
            msgs.append({"role": "system", "content": "Summary of the story so far:\n" + self.summary})
        msgs.extend(self.turns)
        return msgs
The summarize function is itself an inference, so calling it every step does not pay off. By folding only when the soft cap is crossed, as fold_if_needed does, you spread the summary cost across several steps.
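BoundedContext expects a summarize callable that takes the previous summary and the list of older turns and returns a new paragraph. One possible shape is sketched below, assuming a chat function with the same signature and return keys as chat_once. make_summarizer, the max_chars limit, and the prompt wording are illustrative choices, not the original implementation.

```python
def make_summarizer(chat, model, max_chars=4000):
    """Build a summarize(prev_summary, old_turns) callable around a chat function.

    `chat(model, messages)` is assumed to return a dict with a "text" key,
    the same shape run_step reads from chat_once.
    """
    def summarize(prev_summary, old_turns):
        # Flatten the turns being retired into a plain transcript.
        transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in old_turns)
        # Bound the summarizer's own prompt too, keeping the most recent part.
        transcript = transcript[-max_chars:]
        prompt = (
            "Update the running summary with the new turns. "
            "Reply with one short paragraph and keep identifiers and figures exact.\n\n"
            f"Current summary:\n{prev_summary or '(none)'}\n\n"
            f"New turns:\n{transcript}"
        )
        reply = chat(model, [{"role": "user", "content": prompt}])
        return reply["text"].strip()

    return summarize
```

Injecting chat rather than calling it directly keeps the summarizer testable without a running model, and lets a smaller model handle summaries if that turns out to be good enough.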
Capping tool output is humble but effective. When a full listing or log lands in history verbatim, a single step eats thousands of tokens. Keeping just the head and cutting the rest curbed the context growth considerably.
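Wired into the agent loop, the pieces might fit together as below. This is a sketch under assumptions: ctx is any object with the add, messages, and fold_if_needed methods of BoundedContext, chat follows chat_once's return shape, and run_bounded_step is a name made up here.

```python
def run_bounded_step(ctx, chat, model, user_msg):
    """Run one agent step against a bounded context, folding afterwards if needed."""
    ctx.add("user", user_msg)
    m = chat(model, ctx.messages())
    ctx.add("assistant", m["text"])
    # prompt_tokens is what the model actually read this step, so it drives
    # the decision to fold older turns before the next step.
    ctx.fold_if_needed(m["prompt_tokens"])
    return m
```

Folding after the reply rather than before the request means the step that crosses the soft cap still pays the larger prompt once; the following steps read the shortened context.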
The effect — flattening per-step latency
Here is the same overnight batch run through the naive history-piling version and the BoundedContext version.
| Measurement | Naive piling | BoundedContext |
|---|---|---|
| Early step (prompt-eval) | 0.31 s | 0.33 s |
| Late step (prompt-eval) | 1.02 s | 0.46 s |
| Per-step median | 0.71 s | 0.39 s |
| 120-step batch total | ~21 min | ~12 min |
The early steps are roughly even. The gap opens at the end: late-step prompt-eval fell from 1.02 s to 0.46 s, and the per-step median came down by about 45% (0.71 s to 0.39 s). In wall-clock terms, a batch that used to overflow the off-peak window now finishes before morning with room to spare.
More valuable than the numbers was that the latency became predictable. When a step's cost stays flat even late in the run, you can estimate up front how many steps take how many minutes. For unattended runs, that predictability is its own reassurance.
What I watch in production
Folding into a summary loses detail, of course. If a later step needs the exact figures or identifiers from an old turn, it can misread them after folding. I keep these "facts I will need precisely later" out of the summary and in a small ledger on the system-message side instead.
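A ledger of that kind can be very small. The sketch below is one hypothetical shape (FactLedger and its methods are not from the original code): exact values are stored by key and rendered into a system message that sits alongside the summary, so they are never paraphrased.

```python
class FactLedger:
    """Exact values that must survive folding, rendered as a system message."""

    def __init__(self):
        self.facts = {}  # key -> exact string value, in insertion order

    def record(self, key, value):
        # Store the value verbatim; a later write to the same key overwrites it.
        self.facts[key] = str(value)

    def as_message(self):
        """Return a system message listing every fact, or None when empty."""
        if not self.facts:
            return None
        lines = [f"- {k}: {v}" for k, v in self.facts.items()]
        return {
            "role": "system",
            "content": "Exact facts (do not alter):\n" + "\n".join(lines),
        }
```

The message would be appended next to the summary when building the prompt. The ledger costs tokens on every step, so it pays to keep it to the handful of values that truly need to stay exact.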
The other trap is shrinking num_ctx too far. A smaller window makes prompt-eval lighter, but if even the last N turns no longer fit, the agent drops the most recent instruction. Set the window's floor to whatever reliably holds "last N turns plus summary plus system."
For the broader setup of putting a local LLM under an agent, I wrote up wiring Ollama's local LLM into Antigravity. Read together, the two pieces cover the path from measurement to operation.
Where to start
If you have a loop that feels heavy, try just these three steps first.
1. Drop in chat_once and log each step's prompt_tokens and prompt_ms to JSONL
2. Read the log and split the latency into prompt-eval versus generation
3. If prompt-eval is the culprit, pin num_ctx to the measured value and fold everything but the last N turns into a summary
Touching the window or the model without measuring usually means the long way around. I burned a few nights trying to lighten the model before I finally landed on this one column. I hope it saves you a night.