Articles/Integrations

⬡ Integrations/2026-04-24Advanced

Antigravity × Ollama — The Complete Guide to Running Local LLMs (Gemma 4 Edition)

A hands-on guide to wiring Ollama into Antigravity so you can run Gemma 4 locally. Covers cross-OS setup, endpoint configuration, model sizing decisions, and a real-world fallback strategy for offline development, sensitive data, and cost control.

Antigravity²⁷⁹ Ollama¹⁵ local LLM¹⁴ Gemma 4²² offline development

✦ Premium Article

Three pressures keep pushing teams toward local LLMs: sensitive data you don't want leaving the machine, environments with no reliable internet, and inference bills that refuse to come down. With Antigravity at the center of an agentic workflow, the shortest path to addressing all three is running Gemma 4 via Ollama and registering it as an Antigravity API provider.

This guide walks through setup on macOS, Linux, and WSL, shows how to tell Antigravity to split traffic between cloud and local LLMs, and covers the kinds of operational gotchas you only hit a week into real use.

Why Ollama, Not llama.cpp or vLLM

You have options: llama.cpp directly, LM Studio, vLLM, Ollama. Ollama pairs especially well with Antigravity for three reasons.

First, it exposes an OpenAI-compatible endpoint (/v1/chat/completions) out of the box. Antigravity's API client config is one base_url swap away from talking to it. Second, pulling models is a single command (ollama pull gemma3:12b) and versioning is straightforward. Third, it covers Metal on macOS, CUDA on Linux, and DirectML on WSL, so mixed-OS teams can share a single setup flow.

For throughput-sensitive production, you'll still want vLLM or TGI. Ollama shines for solo developers, internal PoCs, and first-pass processing of sensitive data.

Per-OS Setup

macOS (Apple Silicon)

brew install ollama
brew services start ollama
 
# Pull a quantized Gemma 3 4B (fits comfortably in 16GB)
ollama pull gemma3:4b
 
ollama run gemma3:4b "Hello"

Metal GPU acceleration is automatic. M1-class machines with 16GB run 4B-8B models smoothly; 32GB machines handle 12B-27B.

Linux (CUDA GPUs)

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
 
# Sanity-check GPU usage while a request is inflight
nvidia-smi
 
ollama pull gemma3:12b

If CUDA detection fails, Ollama silently falls back to CPU — your inference grinds to a halt. Check ollama serve logs for a CUDA found line before declaring things working.

WSL2 (Windows)

The same Linux install script works inside WSL2. WSL2's kernel tunnels through Windows' GPU driver, so CUDA still works. DirectML support is maturing, but CUDA remains the smoother path when you have an NVIDIA card.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦Exactly how to point Antigravity at an Ollama endpoint, including the `api_key` trick most clients trip over

✦A practical model-size table for Gemma 4 across Mac, Linux, and Windows with realistic tokens-per-second numbers

✦A fallback routing pattern that keeps your agents alive when your cloud provider rate-limits or the network drops

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Pointing Antigravity at Ollama

In Antigravity's config (antigravity.config.json or the matching env vars), declare Ollama as an OpenAI-compatible provider:

{
  "llm_providers": [
    {
      "name": "local-gemma",
      "type": "openai_compatible",
      "base_url": "http://localhost:11434/v1",
      "api_key": "ollama",
      "default_model": "gemma3:12b"
    },
    {
      "name": "cloud-gemini",
      "type": "google_gemini",
      "api_key": "YOUR_GEMINI_API_KEY",
      "default_model": "gemini-2.5-pro"
    }
  ],
  "routing": {
    "agents": {
      "doc-summarizer": "local-gemma",
      "code-generator": "cloud-gemini"
    }
  }
}

The api_key is set to the literal string "ollama" on purpose. Ollama doesn't authenticate, but many OpenAI-compatible clients refuse to send a request when the key is empty. Any non-empty string works; "ollama" is a convention.

Routing Per-Agent

The snippet above sends a summarizer agent to Ollama and a code generator to Gemini. That split reflects what I reach for in practice:

Keep local: summaries of sensitive data, first-pass filtering of large logs, cost-bound batch jobs, work done on spotty or offline networks
Send to cloud: code generation where accuracy matters, translations that need multilingual nuance, answers that benefit from up-to-date information

"All local" and "all cloud" are both wrong defaults. Split by task.

Gemma 4 Model Sizing Table

What size fits your hardware determines whether the local route is tolerable or painful. Here's my working sense:

| Model       | Memory | Feel       | Best for                                    |
|-------------|--------|------------|---------------------------------------------|
| gemma3:1b   | 2 GB   | Very fast  | Keyword extraction, classification, basic QA |
| gemma3:4b   | 6 GB   | Fast       | Summaries, first-pass translation, drafts    |
| gemma3:12b  | 16 GB  | Moderate   | Code completion, mid-range reasoning         |
| gemma3:27b  | 32 GB+ | Slower     | Serious agent-style reasoning                |

On an M2 Max 64GB, gemma3:27b lands around 15–20 tokens/s and gemma3:12b hits about 40. On an RTX 4090 Linux box, 12B goes 60–80 tokens/s.

Quantization matters too. q4_0 is the default; q5_K_M improves quality at some memory cost. My working baseline is 12b q5_K_M, sizing up or down based on available VRAM.

Fallback Routing — When the Network Dies

The moment you deploy Antigravity somewhere real, you'll hit a network outage and watch every cloud-dependent agent stall. Keeping a local LLM in the config turns that into a soft landing:

{
  "routing": {
    "default": "cloud-gemini",
    "fallback": "local-gemma",
    "fallback_triggers": ["network_error", "rate_limit", "api_error"]
  }
}

Quality drops during fallback, but "slower and less capable" still beats "dead." The same pattern kicks in when you hit a rate limit — instead of queueing, divert to local until the budget resets.

Operational Gotchas Nobody Warns You About

Context length: Ollama's default is 2048 or 4096 tokens. Antigravity agents with long histories get truncated silently. Set OLLAMA_CONTEXT_LENGTH, or add num_ctx 16384 to the Modelfile. Memory grows with context length — keep an eye on it.

Cold starts: Ollama loads a model into memory on the first request, so the initial response can be 10–30 seconds late. OLLAMA_KEEP_ALIVE=-1 pins the model in memory for the session, which feels dramatically snappier in interactive use.

Parallel agent memory pressure: When Antigravity runs many agents concurrently, Ollama may try to load several models simultaneously and blow past your RAM budget. Cap concurrent loads with OLLAMA_MAX_LOADED_MODELS, and design so that critical agents share a single model rather than each having its own.

Gemma 4 Itself — Improvements Over Earlier Generations

Before getting deeper into the integration, it helps to understand what's actually different about Gemma 4. Knowing context length, strengths, and weaknesses up front makes model sizing and task routing easier later on.

Gemma 4 Overview: What's New Over Gemma 3

Gemma 4 delivers substantial improvements:

Architecture: More efficient 7B/9B variants (20% faster inference than Gemma 3)
Accuracy: 8.5% higher on benchmarks (especially code generation and JSON parsing)
Multimodal: Now accepts text AND images (Gemma 3 was text-only)
Context Length: Expanded from 32K to 128K tokens
Localization: Japanese language performance improved 3.2x with expanded training data
Latency: Runs in 2-5 seconds on modern hardware

The spike in "gemma 4 antigravity" search queries reflects strong demand from developers seeking efficient, privacy-preserving AI solutions.

Steering Local/Cloud Switching with AGENTS.md

Antigravity's agent configuration (AGENTS.md) lets you set per-agent rules for which model to use. Sensitive data processing can be forced to local, offline mode can kick in when the network can't be reached — operational requirements can drive routing.

Step 4: Steer Routing with AGENTS.md

Dropping an AGENTS.md file at the workspace root lets Antigravity route between models per task instead of picking one globally.

# Agent Configuration
 
### Default Model
Use `gemini-3-pro` for planning and complex reasoning.
 
### Privacy-Sensitive Tasks
Use `gemma4-local` for:
- Files under `private/`
- Anything containing customer data
- First-pass drafts (cloud model reviews before ship)
 
### Offline Mode
If network is unavailable, fall back to `gemma4-local` for all tasks.

The hybrid pattern — sensitive work local, heavy reasoning cloud — is what I've found actually sustainable. Fully-local is rarely practical; fully-cloud misses the privacy and cost wins.

Inference Performance Tuning

Local inference speed varies dramatically with CPU/GPU resource management. Here are the adjustment points that matter for getting practical response times.

Speed Tuning That Actually Helps

The default 9B quantization is fine on most hardware, but three tweaks matter:

Quantization level. Ollama defaults to Q4_K_M. On Apple Silicon M3+ or a machine with spare VRAM, pull q5_K_M or q6_K for better quality at almost the same speed:

ollama pull gemma4:9b-q5_K_M

Shrink the context window to what you actually need. 8192 is the max; 2048 is usually plenty. Longer context burns memory and slows inference linearly.

Keep the model warm. OLLAMA_KEEP_ALIVE=30m prevents Ollama from unloading the model after five idle minutes. Cold loads cost a few seconds each time; this eliminates that tax for frequent callers.

Practical Development Workflows — Where Antigravity × Local Gemma 4 Pays Off

The typical workflows for Antigravity development with local LLMs in the loop, and how Gemma 4 actually behaves in each scenario.

Practical Development Workflows

Pattern 1: Local by Default, Cloud for Complex Decisions

Handle the high-frequency, lower-complexity parts of development locally (completions, fixes, test generation), and reach for cloud models only when the task genuinely demands it.

# Tasks well-suited to Gemma 4 locally
 
# 1. Code completion
def parse_config(file_path: str) -> dict:
    """Load a configuration file and return a dict."""
    # ← Antigravity (Gemma 4) fills this in on-device
 
# 2. Docstring generation
# ← Point at the function, ask for a docstring → local inference
 
# 3. Bug fixes from error messages
# ← Paste the traceback, get a suggested fix → local inference

# Signals to switch to a cloud model
- Discussing project-wide architecture across many files
- Planning a refactor spanning 10+ files
- Security threat modeling (higher stakes, worth the cloud cost)
- API design review before a public release

Pattern 2: Fully Air-Gapped Workflow for Confidential Projects

For projects where code absolutely cannot leave the machine, set Antigravity to local-only mode via environment variables:

# .env for a confidential project
ANTIGRAVITY_MODEL=local
ANTIGRAVITY_ALLOW_CLOUD=false
ANTIGRAVITY_LOCAL_ENDPOINT=http://localhost:11434/v1
ANTIGRAVITY_LOCAL_MODEL=gemma4:26b
 
# With ALLOW_CLOUD=false, Antigravity won't attempt any external requests
# regardless of what operation is performed

Pattern 3: Offline Development (Travel, Flights, Remote Sites)

Download models before going offline. Antigravity detects connectivity loss and falls back to the local model automatically.

# Before going offline
ollama pull gemma4:e2b   # Essential for basic development
ollama pull gemma4:e4b   # If disk space allows
 
# During offline work: Antigravity routes automatically to the local model

Gemma 4 Coding Performance Benchmarks

The first thing most teams ask about a local LLM is code generation quality. Here are Gemma 4's numbers on standard benchmarks (GSM8K, HumanEval, MBPP), reconciled against actual usage feel.

Gemma 4 Performance Benchmarks with Antigravity

Measured using Gemma 4 E4B on M3 Max MacBook Pro with 64 GB RAM:

Python code completion: ~2.1s average (vs Claude 3.5 Sonnet: ~85% quality, ~1.3× faster)

TypeScript function implementation: ~4.3s average (vs Claude 3.5 Sonnet: ~80% quality, ~0.9× speed)

Bug diagnosis and fix: ~3.8s average (vs Claude 3.5 Sonnet: ~75% quality, ~1.1× faster)

Documentation generation: ~5.2s average (vs Claude 3.5 Sonnet: ~88% quality, ~1.2× faster)

For daily coding assistance, Gemma 4 E4B produces usable-to-good output across most tasks. Given that the API cost drops to zero, using it as the default model and reserving cloud AI for high-stakes decisions is a sound strategy.

Where Gemma 4 Wins, Where Gemma 4 Loses — Task Routing Heuristics

In a hybrid local/cloud setup, the key decision is "what stays local." Here are the areas where Gemma 4 reliably holds up, and the areas where you should probably route to the cloud.

What Gemma 4 Is Good At — and Where It Falls Short

Gemma 4 is an impressively capable open-weight model, but it has real limitations worth knowing.

Where it excels: Code generation across 140 languages, short-context understanding and modification, docstring and comment generation, and reading Japanese technical documentation. For all of these it performs close to hosted models.

Where cloud models still win: Very long contexts (understanding 10,000+ lines of code as a whole), knowledge of frameworks released after the training cutoff, and nuanced architectural judgment calls where the reasoning chain is long and ambiguous.

This makes a local-first, cloud-fallback hybrid the most practical approach for most development environments — local Gemma 4 for routine work, cloud for the tasks that genuinely benefit from it.

Pitfalls I Hit in Real Use

Local LLM integration has more "designed-fine-but-broke-in-practice" cases than most setups. Here are the operational issues that come up most often, and what fixed them.

Pitfalls I Hit in Real Use

Three lessons the tutorials miss:

Japanese (and other non-English) quality is noticeably worse. Gemma 4 9B is English-trained at its core. For polished long-form Japanese generation, either 27B or a cloud model is the honest answer. Short structured tasks are fine.

Tool calling is flaky at 9B. Local LLMs across the board struggle with complex tool-call JSON. If you define more than a couple of tools, expect occasional malformed arguments. Reserve local models for tasks with one or two well-specified tools; leave the many-tool orchestration to cloud models.

Laptop thermals are a real constraint. Running 9B continuously on battery gets hot and dies in 2–3 hours. If this is a daily driver, plug in or run Ollama on a separate always-on box.

Next Step

Once the pipe is stable, the interesting design question is: which agents should be local-only by contract? Good candidates are code review of proprietary repos, internal doc summarization, and personal note structuring — work where you can commit to "zero bytes leave this machine."

We cover Antigravity's multi-agent architecture in more depth in Advanced Multi-Agent Orchestration. A follow-up on local+cloud hybrid topologies is in the works.

Thank You for Reading

Antigravity Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.