Antigravity x LiteLLM: Routing Multiple LLM Providers Through a Single Proxy

If you have ever burned through your Gemini free tier before lunch, you know the feeling: you want to keep coding, but every request comes back with "Quota exceeded." The first time it happened to me, I lost thirty minutes just rewiring my IDE to a local Gemma 4 instance. I have since done that dance more times than I would like to admit, swapping API keys and base URLs every time a provider misbehaves.

LiteLLM exists to absorb that switching cost into a single proxy. Because Antigravity supports custom OpenAI-compatible endpoints, slipping LiteLLM in between lets you say "Claude today," "fall back to Gemma when things get noisy," or "Gemini for code reviews only" without changing a single setting in the IDE. This guide walks through the setup I actually run, the routing strategies that proved worth the YAML, and the production gotchas I wish I had known about earlier.

Why put LiteLLM in front of Antigravity?

LiteLLM is more than a multi-provider client. Three properties make it especially useful next to Antigravity.

Everything looks like OpenAI: Antigravity's custom provider expects an OpenAI-compatible API. LiteLLM lets you call Gemini, Claude, or Ollama through that same shape.
Declarative fallback chains: A few lines under model_list and fallbacks give you automatic failover on 429s and 5xx responses, with no retry logic on the IDE side.
Cost and latency you can actually see: LiteLLM exposes Prometheus-friendly metrics and request logs, so you can correlate "what did Antigravity do today" with "how much did it cost."

LibreChat covers similar ground but bundles a chat UI, which feels heavy when all you really want is a router. If you only need the proxy, LiteLLM is the leaner fit. (Our self-hosted LibreChat guide covers the alternative if you want both the chat surface and the routing layer.)

The architecture I actually run

Sketched out, the setup looks like this:

Antigravity (the IDE) speaks the OpenAI-compatible API over HTTP(S) to LiteLLM Proxy
LiteLLM Proxy fans out to Gemini, Claude, OpenAI, and Ollama
The proxy itself runs on Cloud Run or, in my case, a Mac mini that lives under my desk
Local Gemma 4 sits behind Ollama and is registered in the same LiteLLM model_list

The important detail is that Antigravity sees only one endpoint. Once you point the IDE at something like http://localhost:4000/v1, it never has to know which provider is responding underneath.

Setting up the LiteLLM proxy

The minimum viable setup is a Docker Compose stack and a small config.yaml.

# config.yaml — minimal LiteLLM proxy with a fallback chain
# "Gemini first, Claude on 429, local Gemma 4 as last resort"
model_list:
  - model_name: smart  # logical name Antigravity will call
    litellm_params:
      model: gemini/gemini-2.5-pro
      api_key: os.environ/GEMINI_API_KEY
  - model_name: smart-backup
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-local
    litellm_params:
      model: ollama/gemma3:27b  # swap in whichever Gemma tag you have pulled in Ollama
      api_base: http://host.docker.internal:11434

router_settings:
  fallbacks:
    - { smart: ["smart-backup", "smart-local"] }
  num_retries: 2
  timeout: 30

litellm_settings:
  drop_params: true  # silently drop params Antigravity sends that the model does not accept
  set_verbose: false

drop_params: true is small but mighty. Antigravity occasionally forwards OpenAI-style fields like frequency_penalty, and without this setting a parameter the target model does not accept fails the entire request. Dropping it silently is much friendlier than a hard error mid-edit.

# docker-compose.yml — copy-paste ready
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      GEMINI_API_KEY: ${GEMINI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
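    extra_hosts:
      - "host.docker.internal:host-gateway"  # needed on Linux so the container can reach Ollama on the host
    command: ["--config", "/app/config.yaml", "--port", "4000"]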

docker compose up -d starts the proxy. A quick smoke test with curl -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/models should return a data array containing smart, smart-backup, and smart-local.

Wiring Antigravity through LiteLLM

In Antigravity's settings, add a Custom OpenAI-compatible Provider with the following values:

Base URL: http://localhost:4000/v1
API Key: whatever you set as LITELLM_MASTER_KEY
Model ID: smart (the logical name from model_list)

That is the entire integration. Type a message in Antigravity's chat panel and LiteLLM handles fallback for you. The first time Gemini gives you a 429 mid-session and your conversation simply continues on Claude, the value of this setup becomes obvious.

If you want local Gemma 4 to be a serious part of your day, pair this with the Antigravity local LLM setup guide. Local inference becomes the safety net for cloud outages instead of a one-off experiment.

Routing strategies — fallback, cost, latency

Once everything works, the next question is "what should I send where?" The honest answer is that no single profile fits every task. Coding sessions, code review, and overnight refactors all have different tolerance for latency, cost, and stylistic quirks. LiteLLM lets you encode that judgment in the proxy itself, so the routing decision is made once and stays consistent across the team.

These are the three profiles I keep coming back to.

Code review: claude-sonnet-4-6 primary, gemini-2.5-pro fallback. In my experience Claude handles long diffs better, and it phrases concerns as suggestions rather than commands. The fallback to Gemini matters most on Mondays, when I tend to see Claude respond more slowly under load.
Bulk refactors: gemini-2.5-pro primary, smart-local (Gemma 4) fallback. The per-token cost suits mass edits, and a network blip does not stop the work: Antigravity simply continues against the local model. I have lived through one cross-region outage with this profile and barely noticed.
Personal experiments / nightly batches: smart-local primary, gemini-2.0-flash fallback. Effectively free except for electricity. I run weekend prototypes here, and any task that would be embarrassing to put on a corporate bill ends up routed through this profile.

A common mistake is to use one fallback chain to route the same prompt across fundamentally different model families. In practice the prompt that makes Claude shine often makes Gemma 4 stumble, and vice versa. Profiles should match how you write prompts, not just which models you happen to have keys for. If you find yourself needing radically different prompts per provider, separate them into different logical model names instead of cramming them into one fallback chain.
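The three profiles could be written as separate logical names along these lines. This is a minimal sketch that reuses the keys from the config.yaml above. The names review, refactor, nightly, and flash are my own illustrations, not anything LiteLLM or Antigravity defines, and the Ollama tag is the same placeholder as before.

```yaml
# config.yaml fragment: one logical name per profile.
# review, refactor, nightly, and flash are illustrative names.
model_list:
  - model_name: review            # code review: Claude first
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: refactor          # bulk refactors: Gemini first
    litellm_params:
      model: gemini/gemini-2.5-pro
      api_key: os.environ/GEMINI_API_KEY
  - model_name: nightly           # experiments and batches: local first
    litellm_params:
      model: ollama/gemma3:27b
      api_base: http://host.docker.internal:11434
  - model_name: flash             # used only as a fallback target
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY

router_settings:
  fallbacks:
    - { review: ["refactor"] }    # Claude, then Gemini
    - { refactor: ["nightly"] }   # Gemini, then local Gemma
    - { nightly: ["flash"] }      # local Gemma, then Gemini Flash
```

With a layout like this, the Model ID you pick in Antigravity (review, refactor, or nightly) selects the profile, and the proxy decides what happens when the primary fails.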

LiteLLM also supports routing_strategy: latency-based-routing, which picks whichever deployment behind a logical model name is responding fastest at the moment. I prefer predictable cost, so I do not use it, but if you build agents with strict latency budgets it is worth a look. (Our Antigravity x Ollama integration guide covers latency tuning on the local side in more depth.)
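For reference, a latency-routed group might look like the sketch below. It assumes the strategy chooses among deployments that share one model_name, which is how I understand the router to work. The name fast is illustrative, and I have not run this exact profile in production.

```yaml
# config.yaml fragment: two deployments share one logical name,
# and the router chooses between them by observed latency.
model_list:
  - model_name: fast              # illustrative name
    litellm_params:
      model: gemini/gemini-2.5-pro
      api_key: os.environ/GEMINI_API_KEY
  - model_name: fast              # same name, different provider
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing
```

The trade-off is the one mentioned above: you give up knowing in advance which provider, and therefore which price, serves a given request.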

What actually bit me

Some of these are not in the official docs.

Context length mismatches fail quietly: I sent a Gemini-sized 800K-token prompt and watched it fall over to Claude, where the 200K window quietly truncated the input. The reply was just shorter than expected — no error. Setting max_input_tokens per model in model_list makes the proxy fail loudly instead.
Streaming + num_retries is awkward: When a provider drops mid-stream, retries kick in, but Antigravity has already received partial tokens. Keeping num_retries at 2 and adding a completion check on the agent side worked better than raising the retry count.
Don't commit LITELLM_MASTER_KEY: GitHub Secret Scanning will find it. Adding gitleaks as a pre-commit hook on config.yaml and .env ended that class of mistake for me.
Cloud Run cold starts hurt: Expect roughly ten seconds before the first response when the container has been idle. For team use, set --min-instances=1; for personal setups, keeping a Mac mini awake 24/7 is cheaper than the time you would lose.
Per-key spend limits go further than rate limits: LiteLLM supports max_budget per virtual key, which I now treat as a hard ceiling per environment. A junior who accidentally loops over gemini-2.5-pro cannot blow past the configured monthly budget, because the proxy refuses the request. This single setting has saved me more anxiety than any alerting rule.
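For the context-length item, the fix might look something like the fragment below. Treat it as a sketch under two assumptions that are my reading of LiteLLM's router options and worth verifying against your version: that the limit is declared under model_info inside the model_list entry, and that enable_pre_call_checks has to be on for the proxy to enforce it before the request goes out.

```yaml
# config.yaml fragment: declare the context window so oversized prompts
# are rejected up front instead of reaching a smaller model.
model_list:
  - model_name: smart-backup
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      max_input_tokens: 200000    # the 200K window from the incident above

router_settings:
  enable_pre_call_checks: true    # assumption: required for the limit to be checked before the call
  num_retries: 2                  # kept low because of the streaming issue above
```

The point of the design is that an 800K-token prompt should produce an explicit error rather than a suspiciously short answer.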

Where to go from here

LiteLLM is lighter to set up than it looks, and Antigravity stays out of its way. If you have two or more LLM API keys you actively use, the setup pays for itself within a week. To start today, write the config.yaml above, run docker compose up -d, and add a single Custom Provider in Antigravity. Two extra lines of YAML for a fallback chain are usually enough to remove "quota exceeded" from your daily vocabulary.

If you want to go deeper into observability, our OpenTelemetry pipeline guide for Antigravity shows how to ship LiteLLM metrics into a single dashboard, so you can see latency and error rates per model over time.