If you have ever burned through your Gemini free tier before lunch, you know the feeling: you want to keep coding, but every request comes back with Quota exceeded. The first time it happened to me, I lost thirty minutes just rewiring my IDE to a local Gemma 4 instance. I have since done that dance more times than I would like to admit, swapping API keys and base URLs every time a provider misbehaves.
LiteLLM exists to absorb that switching cost into a single proxy. Because Antigravity supports custom OpenAI-compatible endpoints, slipping LiteLLM in between lets you say "Claude today," "fall back to Gemma when things get noisy," or "Gemini for code reviews only" without changing a single setting in the IDE. This guide walks through the setup I actually run, the routing strategies that proved worth the YAML, and the production gotchas I wish I had known about earlier.
Why put LiteLLM in front of Antigravity?
LiteLLM is more than a multi-provider client. Three properties make it especially useful next to Antigravity.
- Everything looks like OpenAI: Antigravity's custom provider expects an OpenAI-compatible API. LiteLLM lets you call Gemini, Claude, or Ollama through that same shape.
- Declarative fallback chains: A few lines under
model_listandfallbacksgive you automatic failover on 429s and 5xx responses, with no retry logic on the IDE side. - Cost and latency you can actually see: LiteLLM exposes Prometheus-friendly metrics and request logs, so you can correlate "what did Antigravity do today" with "how much did it cost."
LibreChat covers similar ground but bundles a chat UI, which feels heavy when all you really want is a router. If you only need the proxy, LiteLLM is the more honest fit. (Our self-hosted LibreChat guide covers the alternative if you want both the chat surface and the routing layer.)
The architecture I actually run
Sketched out, the setup looks like this:
- Antigravity (the IDE) speaks HTTPS to LiteLLM Proxy
- LiteLLM Proxy fans out to Gemini, Claude, OpenAI, and Ollama
- The proxy itself runs on Cloud Run or, in my case, a Mac mini that lives under my desk
- Local Gemma 4 sits behind Ollama and is registered in the same LiteLLM
model_list
The important detail is that Antigravity sees only one endpoint. Once you point the IDE at something like http://localhost:4000/v1, it never has to know which provider is responding underneath.
Setting up the LiteLLM proxy
The minimum viable setup is a Docker Compose stack and a small config.yaml.
# config.yaml — minimal LiteLLM proxy with a fallback chain
# "Gemini first, Claude on 429, local Gemma 4 as last resort"
model_list:
- model_name: smart # logical name Antigravity will call
litellm_params:
model: gemini/gemini-2.5-pro
api_key: os.environ/GEMINI_API_KEY
- model_name: smart-backup
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: smart-local
litellm_params:
model: ollama/gemma3:27b
api_base: http://host.docker.internal:11434
router_settings:
fallbacks:
- { smart: ["smart-backup", "smart-local"] }
num_retries: 2
timeout: 30
litellm_settings:
drop_params: true # silently drop params Antigravity sends that the model does not accept
set_verbose: falsedrop_params: true is small but mighty. Antigravity occasionally forwards OpenAI-style fields like frequency_penalty, and Gemini will refuse the entire request when it sees a parameter it does not understand. Dropping them silently is much friendlier than a hard error mid-edit.
# docker-compose.yml — copy-paste ready
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
volumes:
- ./config.yaml:/app/config.yaml
environment:
GEMINI_API_KEY: ${GEMINI_API_KEY}
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
command: ["--config", "/app/config.yaml", "--port", "4000"]docker compose up -d starts the proxy. A quick smoke test with curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/models should return a data array containing smart, smart-backup, and smart-local.
Wiring Antigravity through LiteLLM
In Antigravity's settings, add a Custom OpenAI-compatible Provider with the following values:
- Base URL:
http://localhost:4000/v1 - API Key: whatever you set as
LITELLM_MASTER_KEY - Model ID:
smart(the logical name frommodel_list)
That is the entire integration. Type a message in Antigravity's chat panel and LiteLLM handles fallback for you. The first time Gemini gives you a 429 mid-session and your conversation simply continues on Claude, the value of this setup becomes obvious.
If you want local Gemma 4 to be a serious part of your day, pair this with the Antigravity local LLM setup guide. Local inference becomes the safety net for cloud outages instead of a one-off experiment.
Routing strategies — fallback, cost, latency
Once everything works, the next question is "what should I send where?" The honest answer is that no single profile fits every task. Coding sessions, code review, and overnight refactors all have different tolerance for latency, cost, and stylistic quirks. LiteLLM lets you encode that judgment in the proxy itself, so the routing decision is made once and stays consistent across the team.
These are the three profiles I keep coming back to.
- Code review:
claude-sonnet-4-6primary,gemini-2.5-profallback. Long diffs survive Claude's reading better in my experience, and the way it phrases concerns reads as suggestions rather than commands. The fallback to Gemini matters most on Mondays, when Claude tends to be slower under load. - Bulk refactors:
gemini-2.5-proprimary,smart-local(Gemma 4) fallback. The token economics line up with mass edits, and a network blip does not stop the work — Antigravity simply continues against the local model. I have lived through one cross-region outage with this profile and barely noticed. - Personal experiments / nightly batches:
smart-localprimary,gemini-2.0-flashfallback. Effectively free except for electricity. I run weekend prototypes here, and any task that would be embarrassing to put on a corporate bill ends up routed through this profile.
A common mistake is to treat the proxy as a place to also route between fundamentally different model families for the same prompt. In practice the prompt that makes Claude shine often makes Gemma 4 stumble, and vice versa. Profiles should match how you write prompts, not just which models you happen to have keys for. If you find yourself needing radically different prompts per provider, separate them into different logical model names instead of cramming them into one fallback chain.
LiteLLM also supports routing_strategy: latency-based-routing, which picks whichever model is fastest at the moment. I prefer predictable cost, so I do not use it, but if you build agents with strict latency budgets it is worth a look. (Our Antigravity x Ollama integration guide covers latency tuning on the local side in more depth.)
What actually bit me
Some of these are not in the official docs.
- Context length mismatches fail quietly: I sent a Gemini-sized 800K-token prompt and watched it fall over to Claude, where the 200K window quietly truncated the input. The reply was just shorter than expected — no error. Setting
max_input_tokensper model inmodel_listmakes the proxy fail loudly instead. - Streaming +
num_retriesis awkward: When a provider drops mid-stream, retries kick in but Antigravity has already received partial tokens. Keepingnum_retriesat 2 and adding a completion check on the agent side worked better than pushing it higher. - Don't commit
LITELLM_MASTER_KEY: GitHub Secret Scanning will find it. Addinggitleaksas a pre-commit hook onconfig.yamland.envended that class of mistake for me. - Cloud Run cold starts hurt: Roughly ten seconds before the first response when the container has been idle. For team use, set
min-instances=1; for personal setups, keeping a Mac mini awake 24/7 is cheaper than the time you would lose. - Per-key spend limits go further than rate limits: LiteLLM supports
max_budgetper virtual key, which I now treat as a hard ceiling per environment. A junior who accidentally loops overgemini-2.5-procannot blow past the configured monthly budget, because the proxy refuses the request. This single setting has saved me more anxiety than any alerting rule.
Where to go from here
LiteLLM is lighter to set up than it looks, and Antigravity stays out of its way. If you have two or more LLM API keys you actively use, the setup pays for itself within a week. To start today, write the config.yaml above, run docker compose up -d, and add a single Custom Provider in Antigravity. Two extra lines of YAML for a fallback chain are usually enough to remove "quota exhausted" from your daily vocabulary.
If you want to go deeper into observability, our OpenTelemetry pipeline guide for Antigravity shows how to ship LiteLLM metrics into a single dashboard, so you can see latency and error rates per model over time.