Antigravity × Gemini File API: A Production Guide to Feeding Long-Form Media (Video, Audio, PDF) into Your Agents
Feed hour-long videos, podcasts, and book-length PDFs into your Antigravity agents with the Gemini File API. A practical, production-oriented pipeline with timestamped highlight extraction, idempotent uploads, cost accounting, and failure recovery.
"I want to summarize a three-hour meeting recording." "I need the timestamps for every slide transition in a two-hour lecture." If you're building agents in Antigravity, these requests land on your desk sooner than you'd think.
The first wall you hit is deceptively simple: how do you actually hand that much media to Gemini? You can't just base64-encode a movie and shove it into the prompt, and streaming isn't really a thing either. The answer is the Gemini File API — but the official docs describe what each endpoint does, not how to wire the whole thing together in a way that survives production.
I've rebuilt this pipeline in Antigravity more times than I care to admit over the past six months, feeding in my own studio recordings, performance archives, and ambient sound sources. Along the way I learned — the hard way — that treating the File API as a dumb uploader almost always ends badly, and that stable timestamped output is 90% a schema-design problem. This guide captures what I wish someone had handed me on day one.
What you'll build
The endgame is unassuming. You point one Python script at a local video, audio, or PDF file and it returns:
an overall summary (400 to 600 characters / ~100 words)
chapters, each with timestamps, title, and mini-summary
highlights: three to eight "you can't miss this" moments with timestamps
token usage and cost in USD and JPY
Wire this into an Antigravity agent and "watch this long video" becomes an agent task instead of a personal chore. I use it to auto-generate minutes from my weekend studio logs, but the same pipeline will carry a dozen other workflows.
Where the File API fits — and why you actually need it
Gemini accepts images, audio, and video in prompts through three different mechanisms:
Inline embedding — base64-encode the asset into the request itself. Fine for small images and clips under ~20 MB
File API — upload once to Google's storage, get a URI, reference it from as many prompts as you like. This is the only realistic path for hundreds of megabytes to multi-gigabyte media
Direct YouTube URL — a convenient shortcut, but only for public YouTube videos, not your own assets
Inline has a soft cap around 20 MB and will reject most hour-long audio outright. The File API, as of April 2026, accepts up to 2 GB, keeps the file for 48 hours, and lets you reference the same file across many inference calls. My rule of thumb: if the MP4 on disk is over 150 MB, don't think twice — use the File API.
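I stage this pipeline in four clearly separated steps: Stage 1 uploads the file idempotently, Stage 2 writes an overall summary, Stage 3 extracts timestamped chapters and highlights as structured output, and Stage 4 handles post-processing and cost accounting. The reason isn't love of diagrams; it's that when step four fails, I don't want to re-upload the file. One-shot designs always break the moment the JSON parser trips.
Drop every intermediate result as JSON into a local .cache/ directory. Being able to restart at any stage will save you hours on a bad API day.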
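A minimal sketch of the restart-at-any-stage idea, assuming one JSON checkpoint per stage keyed by the media file's hash. The cached_stage helper, its arguments, and the file naming are hypothetical names for illustration, not part of any SDK.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Callable


def cached_stage(cache_dir: Path, stage: str, key: str, compute: Callable[[], Any]) -> Any:
    """Return the cached JSON result for (stage, key), computing it only once."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.{stage}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    result = compute()  # must be JSON-serializable

    # Write to a temp file first so a crash never leaves a half-written checkpoint.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)
    return result
```

Wrapping each stage as cached_stage(Path(".cache/stages"), "summary", file_hash, lambda: ...) means a failure in a later stage re-runs only that stage; deleting one checkpoint file forces a single stage to recompute.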
Setup and dependencies
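Inside an Antigravity project, start with a clean Python environment. The Google Gen AI SDK had its upload API reorganized in early 2026, and the google-genai 1.x line is the stable path. The code in this guide imports three third-party packages (google-genai, python-dotenv, and pydantic) and reads the API key from a GEMINI_API_KEY environment variable, which load_dotenv() can pick up from a .env file.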
Stage 1: idempotent uploads
This is where most people stall out. In early iteration, uploading the same file repeatedly burns both time and money. The fix is to key uploads by file hash and reuse cached File objects until they expire.
    # file_ingest.py: idempotent uploads
    from __future__ import annotations

    import hashlib
    import json
    import os
    import time
    from pathlib import Path

    from google import genai
    from google.genai import types
    from dotenv import load_dotenv

    load_dotenv()
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    CACHE_DIR = Path(".cache/files")
    CACHE_DIR.mkdir(parents=True, exist_ok=True)


    def _sha1(path: Path) -> str:
        """Return the SHA-1 of a file; used as cache key."""
        h = hashlib.sha1()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()


    def upload_or_reuse(local_path: str, display_name: str | None = None) -> dict:
        """Skip re-uploading identical files; return cached File info."""
        path = Path(local_path)
        key = _sha1(path)
        cache_path = CACHE_DIR / f"{key}.json"

        if cache_path.exists():
            cached = json.loads(cache_path.read_text())
            # File API deletes after 48h
            if time.time() - cached["uploaded_at"] < 47 * 3600:
                print(f"♻️ reuse cache: {cached['name']}")
                return cached

        print(f"⬆️ uploading {path.name} ({path.stat().st_size / 1024 / 1024:.1f} MB)...")
        file = client.files.upload(
            file=path,
            config=types.UploadFileConfig(display_name=display_name or path.name),
        )

        # Wait until ACTIVE
        while file.state.name == "PROCESSING":
            time.sleep(2)
            file = client.files.get(name=file.name)
        if file.state.name == "FAILED":
            raise RuntimeError(f"upload failed: {file.error}")

        info = {
            "name": file.name,
            "uri": file.uri,
            "mime_type": file.mime_type,
            "display_name": file.display_name,
            "uploaded_at": time.time(),
        }
        cache_path.write_text(json.dumps(info, ensure_ascii=False, indent=2))
        print(f"✅ ready: {file.name}")
        return info
Notice the 47-hour cutoff — one hour shy of the 48-hour limit. Cutting it closer invites the File to disappear mid-inference. I lost a 30-minute inference run that way once; never again.
Stage 1 gotchas
Calling inference before the File hits ACTIVE: Immediately after upload, the File sits in PROCESSING for a few seconds. Reference it too early and the request is rejected with a 400 error because the file is not yet in an ACTIVE state. Always wait for ACTIVE.
Hard-coding video/mp4 as the MIME type: .mov and .webm files are everyday reality. Let the SDK figure it out.
Hammering the upload endpoint in parallel: files.upload is thread-safe, but it will saturate your uplink. Two to three concurrent uploads is the practical ceiling.
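The concurrency ceiling from the last gotcha can be enforced with a small thread pool. This is a sketch under assumptions: upload_many is a hypothetical helper, and the upload argument stands in for any single-file uploader such as the upload_or_reuse function from Stage 1.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def upload_many(
    paths: Iterable[str],
    upload: Callable[[str], dict],
    max_workers: int = 2,
) -> dict[str, dict]:
    """Upload files with a small, fixed number of concurrent workers."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first upload error.
        results = list(pool.map(upload, path_list))
    return dict(zip(path_list, results))
```

Keeping max_workers at two or three caps the load on the uplink while still overlapping the PROCESSING wait of one file with the transfer of the next.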
Stage 2: a reliable overall summary
Embedding an overall summary in Stage 3's prompt dramatically improves accuracy. Humans skim an abstract before picking out highlights for the same reason.
    # summary.py
    from google.genai import types

    from file_ingest import client

    SUMMARY_MODEL = "gemini-2.5-pro"  # Pro stays stable over long media


    def summarize(file_info: dict, language: str = "en") -> str:
        system = (
            "You are a media-summarization specialist. "
            "Summarize the provided video, audio, or PDF in chronological order "
            "in about 400–600 characters (roughly 100 words). "
            "Write only what the media actually states: no speculation, no general commentary."
        )
        contents = [
            types.Part.from_uri(file_uri=file_info["uri"], mime_type=file_info["mime_type"]),
            types.Part.from_text(text=f"Summarize the media above in {language}."),
        ]
        try:
            resp = client.models.generate_content(
                model=SUMMARY_MODEL,
                contents=contents,
                config=types.GenerateContentConfig(
                    system_instruction=system,
                    temperature=0.2,  # low temperature for reproducibility
                ),
            )
        except Exception as e:
            raise RuntimeError(f"summary failed: {e}") from e
        # resp.text can be None when the response carries no text part
        if not resp.text:
            raise RuntimeError("summary failed: empty response")
        return resp.text.strip()
temperature=0.2 matters. I once ran this at 0.7 and got a wildly different summary each time — which, in turn, produced different chapter titles every run, which killed the cache layer. Low temperature here pays dividends downstream.
Stage 3: structured output for timestamped highlights
This is the heart of the piece. Asking Gemini to "please return JSON with chapters and timestamps" is not a production strategy. You enforce the shape with Structured Output and a Pydantic schema.
    # extract.py: timestamped highlights
    from __future__ import annotations

    from pydantic import BaseModel, Field
    from google.genai import types

    from file_ingest import client

    EXTRACT_MODEL = "gemini-2.5-pro"


    class Chapter(BaseModel):
        start: str = Field(..., description="Chapter start timestamp HH:MM:SS")
        end: str = Field(..., description="Chapter end timestamp HH:MM:SS")
        title: str = Field(..., description="Chapter title, max 30 chars")
        summary: str = Field(..., description="Chapter summary, 120-200 chars")


    class Highlight(BaseModel):
        timestamp: str = Field(..., description="Highlight start HH:MM:SS")
        title: str
        reason: str = Field(..., description="Why this is a highlight, 60-100 chars")


    class MediaDigest(BaseModel):
        overall_summary: str
        chapters: list[Chapter]
        highlights: list[Highlight]


    def extract_digest(file_info: dict, overall_summary: str, language: str = "en") -> MediaDigest:
        system = (
            "You are a structured-extraction engine for media. "
            "Given the media and its preliminary summary, emit chapters and highlights. "
            "Timestamps must be HH:MM:SS (MM:SS is acceptable) and must refer to real points in the media. "
            "Never invent timestamps."
        )
        contents = [
            types.Part.from_uri(file_uri=file_info["uri"], mime_type=file_info["mime_type"]),
            types.Part.from_text(
                text=(
                    f"# Preliminary summary\n{overall_summary}\n\n"
                    f"# Instruction\nExtract chapters and highlights from the media above in {language}. "
                    "Aim for 5–10 chapters and 3–8 highlights."
                )
            ),
        ]
        try:
            resp = client.models.generate_content(
                model=EXTRACT_MODEL,
                contents=contents,
                config=types.GenerateContentConfig(
                    system_instruction=system,
                    response_mime_type="application/json",
                    response_schema=MediaDigest,  # Pydantic works directly
                    temperature=0.3,
                ),
            )
        except Exception as e:
            raise RuntimeError(f"extract failed: {e}") from e
        if resp.parsed is None:
            raise ValueError("Gemini response could not be parsed into MediaDigest")
        return resp.parsed  # type: ignore[return-value]
The reason this reliably yields parseable JSON is that response_schema receives a Pydantic class. Gemini is then constrained (with very high probability) to match that schema, and json.loads stops exploding.
One practical quirk: even when you say "use HH:MM:SS" in the system prompt, sub-one-hour media often comes back as MM:SS. Trying to force the format tends to hurt accuracy elsewhere, so I normalize in post.
    # post-processing
    def normalize_ts(ts: str) -> str:
        parts = ts.split(":")
        if len(parts) == 2:
            return "00:" + ts
        return ts
Stage 3 gotchas
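JSON cut off by output limit: Asking for many chapters and highlights can truncate the response. Set max_output_tokens (8192 is a safe starting point) explicitly, and treat a finish_reason of MAX_TOKENS on the response candidate as a truncated run rather than trying to parse it.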
Timestamps outside the media's duration: You may see 01:15:00 come back for a 60-minute file. Validate in post and drop out-of-range highlights.
Near-duplicate highlights: When two highlights land within ~30 seconds, the result looks cluttered. Merge adjacent ones in post-processing for a cleaner UX.
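The last two gotchas can be handled in one post-processing pass. The sketch below assumes highlights arrive as plain dicts with a "timestamp" key (for example after calling model_dump() on the Pydantic objects) and that the media duration in seconds is already known. ts_to_seconds and clean_highlights are hypothetical helper names, and "merging" is simplified here to keeping the earlier of two nearby highlights.

```python
from __future__ import annotations


def ts_to_seconds(ts: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS' to seconds."""
    parts = [int(p) for p in ts.split(":")]
    if len(parts) == 2:
        parts = [0] + parts
    if len(parts) != 3:
        raise ValueError(f"unexpected timestamp: {ts!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def clean_highlights(highlights: list[dict], duration_s: float, min_gap_s: int = 30) -> list[dict]:
    """Drop out-of-range highlights, then keep the first of any near-duplicates."""
    in_range = [h for h in highlights if 0 <= ts_to_seconds(h["timestamp"]) <= duration_s]
    in_range.sort(key=lambda h: ts_to_seconds(h["timestamp"]))

    kept: list[dict] = []
    for h in in_range:
        if kept:
            gap = ts_to_seconds(h["timestamp"]) - ts_to_seconds(kept[-1]["timestamp"])
            if gap < min_gap_s:
                continue  # too close to the previous kept highlight
        kept.append(h)
    return kept
```

Sorting before de-duplicating matters because the model does not always return highlights in chronological order. An empty result is a valid outcome, not an error.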
Stage 4: post-processing and cost accounting
The Gemini API returns token counts in usage_metadata. Logging cost per call is the single highest-value operational habit I've picked up, because "one overnight batch that cost 20× what I expected" has happened to me more than once.
    # cost.py
    # Rates as of April 2026. Always re-check the official pricing page.
    # This is a flat-rate estimate; any tiered pricing for very long prompts is ignored.
    PRICE_PER_1M_INPUT_USD = 1.25  # Gemini 2.5 Pro, text-equivalent
    PRICE_PER_1M_OUTPUT_USD = 10.0
    USD_JPY = 150


    def estimate_cost(usage) -> dict:
        """Take response.usage_metadata, return a rough cost estimate."""
        in_tok = getattr(usage, "prompt_token_count", 0) or 0
        out_tok = getattr(usage, "candidates_token_count", 0) or 0
        # Thinking tokens, when the model reports them, are billed as output too.
        out_tok += getattr(usage, "thoughts_token_count", 0) or 0
        usd = (
            in_tok / 1_000_000 * PRICE_PER_1M_INPUT_USD
            + out_tok / 1_000_000 * PRICE_PER_1M_OUTPUT_USD
        )
        return {
            "input_tokens": in_tok,
            "output_tokens": out_tok,
            "usd": round(usd, 4),
            "jpy": round(usd * USD_JPY, 1),
        }
Putting it all together
Here's the entry point — a CLI that chains the stages and writes the final JSON to disk. Exceptions from each stage bubble up individually so you know exactly where a run died.
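A minimal sketch of such an entry point, assuming the module and function names from the stages above (file_ingest.upload_or_reuse, summary.summarize, extract.extract_digest). run_pipeline, main, and the --language and --out flags are hypothetical names chosen for illustration, and cost accounting is left out because the stage functions as written do not return usage_metadata.

```python
from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import Callable


def run_pipeline(
    media_path: str,
    upload: Callable,
    summarize: Callable,
    extract: Callable,
    language: str = "en",
) -> dict:
    """Chain the stages; each stage's exception propagates with its own message."""
    file_info = upload(media_path)                            # Stage 1
    summary = summarize(file_info, language=language)         # Stage 2
    digest = extract(file_info, summary, language=language)   # Stage 3
    return {
        "overall_summary": summary,
        # Pydantic v2 model -> plain dict; the summary already sits at top level.
        "digest": digest.model_dump(exclude={"overall_summary"}),
    }


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="Digest a long media file with Gemini.")
    parser.add_argument("media_path")
    parser.add_argument("--language", default="en")
    parser.add_argument("--out", default="digest.json")
    args = parser.parse_args(argv)

    # Imported lazily so run_pipeline stays testable without the SDK installed.
    from extract import extract_digest
    from file_ingest import upload_or_reuse
    from summary import summarize

    result = run_pipeline(args.media_path, upload_or_reuse, summarize, extract_digest, args.language)
    Path(args.out).write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"wrote {args.out}")
    return 0
```

Call main() from an `if __name__ == "__main__":` guard in your own script. The result has the shape of the sample output shown next.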
    {
      "overall_summary": "A meeting held on 2026-04-10 in which...",
      "digest": {
        "chapters": [
          {"start": "00:00:00", "end": "00:12:30", "title": "Agenda & KPI review", "summary": "..."},
          ...
        ],
        "highlights": [
          {"timestamp": "00:47:15", "title": "Price agreement on the new plan", "reason": "..."}
        ]
      }
    }
Three traps that only show up in production
Everything above gets you a working pipeline. Running it regularly introduces a different class of failure. In the order I actually hit them:
1. Forgetting the 48-hour retention and watching a nightly batch die
File API objects are auto-deleted after 48 hours. Long batches that reuse a URI will die on the weekend run, reliably. Re-check ACTIVE before every inference and assume deletion is always possible. My standard practice: a single files.get() guard before every call.
2. Treating a 429 as a terminal error and re-running the whole pipeline
429s from Gemini are transient. If you throw on the first one, you'll re-run the entire summary stage unnecessarily. Retry with exponential backoff (three attempts), then surface the error. A tiny tenacity wrapper or a hand-rolled decorator is enough.
3. Empty highlights in sparse media
For ASMR streams or slow-paced lectures, the model sometimes returns an empty highlight array because nothing "pops." Treating empty as failure will ruin the UX. Design the downstream UI for zero-highlight cases from day one.
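For the first trap, the guard before every inference call can be very small. This sketch is deliberately SDK-agnostic: ensure_active is a hypothetical helper, get_state stands in for something like `client.files.get(name=...).state.name` from Stage 1, and reupload stands in for a fresh upload of the local file. It treats any lookup error as "file gone", which is an assumption; a stricter version would tell a not-found error apart from a transient one.

```python
from __future__ import annotations

from typing import Callable


def ensure_active(
    file_info: dict,
    get_state: Callable[[str], str],
    reupload: Callable[[], dict],
) -> dict:
    """Return file info that is safe to reference, re-uploading if the File is gone."""
    try:
        state = get_state(file_info["name"])
    except Exception:  # noqa: BLE001 - assumption: a deleted File surfaces as an API error
        return reupload()
    if state != "ACTIVE":
        return reupload()
    return file_info
```

Because Stage 1 already waits out the PROCESSING state, any non-ACTIVE state seen here is treated as a reason to upload again rather than to wait.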
Where this pipeline earns its keep
What you just built generalizes into several real production use cases with minimal changes:
Automated meeting minutes — swap Stage 3's prompt to focus on decisions and action items, feed in audio, done
Searchable lecture indexes — persist chapters and summaries to Supabase and users can jump to "the part about matrix factorization"
Long-stream highlight reels — pair timestamps with ffmpeg to auto-cut a five-minute digest from a three-hour livestream
Bulk podcast metadata — run 50 episodes overnight and you'll have a complete, consistent archive by morning
I personally use it to route raw studio recordings into Notion for archival. Plugging this File API stage into the flow described in the Antigravity × Notion API doc-driven development guide lets Notion receive structured chapters automatically.
Cost and trade-offs
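Long-form media on Pro models is not free. Rough rule of thumb for a 1-hour video with Gemini 2.5 Pro running chapters + highlights once: input 300k–1M tokens, output 20k–50k. At the per-token rates in cost.py that is roughly 0.6–1.75 USD for the call; smaller inputs, like the 71-minute recording in Appendix E, come in around 0.2–0.5 USD.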
Cost-first: Run Stage 2 on Flash, keep Stage 3 on Pro. Roughly halves total cost.
Accuracy-first: Put both stages on Pro and add a second pass that re-summarizes per chapter. For minutes-quality use cases.
Speed-first: For audio-only inputs, 16 kHz MP3 uploads noticeably faster than 48 kHz WAV. Benchmark fidelity vs. upload time in your actual pipeline.
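Plugging the token ranges quoted above into the flat Pro rates from the Stage 4 cost module is a quick sanity check on any rule of thumb. The two constants are assumptions copied from cost.py, and call_cost_usd is a hypothetical helper; actual bills depend on the current pricing page and on how many calls a run makes.

```python
PRICE_PER_1M_INPUT_USD = 1.25   # assumption: flat rate copied from cost.py
PRICE_PER_1M_OUTPUT_USD = 10.0  # assumption: flat rate copied from cost.py


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Flat-rate cost of one generate_content call in USD."""
    total = input_tokens * PRICE_PER_1M_INPUT_USD + output_tokens * PRICE_PER_1M_OUTPUT_USD
    return total / 1_000_000


low = call_cost_usd(300_000, 20_000)     # 0.375 input + 0.20 output
high = call_cost_usd(1_000_000, 50_000)  # 1.25 input + 0.50 output
print(f"one call over an hour of video: ${low:.3f} to ${high:.2f}")
```

Input tokens dominate at the high end of the range, which is why shrinking the media (lower resolution, audio only) moves the bill more than trimming the requested output does.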
If you want to extend this foundation, the natural adjacencies are:
Flip direction with generative video — pair this pipeline with the Veo 3 Google video AI first guide to build "long input → short generated output" workflows end to end.
Pick one thirty-minute-or-longer video or audio file sitting on your drive, and run it through this pipeline tomorrow during a coffee break. The moment Gemini returns chapters for something you never had time to actually watch, your relationship with long-form media changes a little.
When you're ready to deploy it for real, the first extension I'd recommend is swapping Stage 1's cache to S3 or Cloudflare R2. After that, most of the operational work you need is already in the code you just wrote.
Appendix A: A retry decorator that doesn't waste your uploads
The piece of production code that has saved me the most Gemini budget is this: a decorator that retries transient failures but stops at permanent ones. Copy it, adapt the error types to your SDK version, and wire it around every generate_content call.
    # retry.py
    from __future__ import annotations

    import random
    import time
    from functools import wraps
    from typing import Callable, TypeVar

    T = TypeVar("T")

    TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


    def retry_transient(max_attempts: int = 3, base_delay: float = 1.5) -> Callable:
        """Exponential backoff for transient Gemini API failures."""

        def decorator(fn: Callable[..., T]) -> Callable[..., T]:
            @wraps(fn)
            def wrapper(*args, **kwargs) -> T:
                for attempt in range(max_attempts):
                    try:
                        return fn(*args, **kwargs)
                    except Exception as e:  # noqa: BLE001
                        status = getattr(e, "status_code", None) or getattr(e, "code", None)
                        # Permanent errors and the final attempt surface immediately,
                        # without a pointless sleep before re-raising.
                        if status not in TRANSIENT_STATUSES or attempt == max_attempts - 1:
                            raise
                        delay = base_delay * (2 ** attempt) + random.random()
                        print(f"⚠️ transient {status}: retry in {delay:.1f}s (attempt {attempt + 1}/{max_attempts})")
                        time.sleep(delay)
                raise RuntimeError("max_attempts must be at least 1")

            return wrapper

        return decorator
Wrap the Stage 2 and Stage 3 callers — not the upload itself — with this decorator. Retrying the upload on a 429 is usually counterproductive because the throttle is per-project, so the upload will simply fail again a few seconds later; you want backoff on the inference calls that actually consume tokens.
Appendix B: When the file doesn't fit the File API either
There's an uncomfortable middle ground: files that exceed the 2 GB File API limit but fall short of what you can reasonably stream. For four-hour livestreams and lossless multitrack recordings, this is a real constraint. Two strategies work in practice.
The first is pre-processing: transcode down to 16 kHz mono MP3 at ~64 kbps for audio-only analysis, or re-encode video to 720p H.264 at a lower bitrate. A ten-gigabyte lossless master can usually become a 300–500 MB File API-friendly asset without meaningfully degrading what the model extracts. I keep a small ffmpeg helper in every project for exactly this:
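A helper along these lines might look like the following sketch. It only wraps the ffmpeg CLI via subprocess; ffmpeg_args and transcode are hypothetical names, the CRF value and preset for video are assumptions standing in for "a lower bitrate", and MP3 output assumes an ffmpeg build that includes libmp3lame.

```python
from __future__ import annotations

import subprocess


def ffmpeg_args(src: str, dst: str, audio_only: bool = True) -> list[str]:
    """Build an ffmpeg command line for a File API-friendly transcode."""
    if audio_only:
        # Drop video, downmix to mono, resample to 16 kHz, ~64 kbps MP3.
        codec = ["-vn", "-ac", "1", "-ar", "16000", "-c:a", "libmp3lame", "-b:a", "64k"]
    else:
        # Scale to 720p (width kept even), H.264 with a quality-based rate.
        codec = ["-vf", "scale=-2:720", "-c:v", "libx264", "-crf", "28",
                 "-preset", "fast", "-c:a", "aac", "-b:a", "64k"]
    return ["ffmpeg", "-y", "-i", src, *codec, dst]


def transcode(src: str, dst: str, audio_only: bool = True) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits non-zero."""
    subprocess.run(ffmpeg_args(src, dst, audio_only), check=True)
```

Building the argument list separately from running it keeps the command easy to log and to unit-test without ffmpeg installed.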
The second strategy is splitting. If you must preserve fidelity or the media is genuinely irreducible, split the asset into overlapping 90-minute chunks, run the pipeline per chunk, and merge the digests in post. The catch is that chapter timestamps become chunk-relative — you'll need to add the chunk offset back in before returning anything to the user. Expect a solid day of engineering to get the merge logic right; it's not impossible, but it's not free either.
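The offset bookkeeping is the part that is easy to get wrong, so here is a sketch of just that piece. plan_chunks and shift_ts are hypothetical helpers, and the 60-second overlap is an assumption since the right amount depends on your media; de-duplicating chapters that fall inside the overlap is not covered.

```python
from __future__ import annotations


def plan_chunks(duration_s: int, chunk_s: int = 90 * 60, overlap_s: int = 60) -> list[tuple[int, int]]:
    """Return (start, end) second offsets for overlapping chunks covering the media."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    chunks: list[tuple[int, int]] = []
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end == duration_s:
            break
        start = end - overlap_s  # step back so adjacent chunks share some media
    return chunks


def shift_ts(ts: str, offset_s: int) -> str:
    """Convert a chunk-relative 'HH:MM:SS' or 'MM:SS' into an absolute 'HH:MM:SS'."""
    parts = [int(p) for p in ts.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    total = hours * 3600 + minutes * 60 + seconds + offset_s
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
```

Each chunk's start offset from plan_chunks is the value to pass as offset_s when shifting every timestamp that came back for that chunk.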
Appendix C: Deploying this as an Antigravity agent tool
The pipeline as written is a CLI. To let an Antigravity agent invoke it directly, wrap it in a simple HTTP endpoint and register it as an MCP tool or a plain-old function the agent can call. I prefer FastAPI for the glue because it's trivial to deploy behind Cloudflare Workers or a tiny VM.
Expose this with proper auth (an X-API-Key header is fine for internal use), point your agent at it, and the agent can now process hour-long media without any further code from you. The nice thing about building it this way is that the CLI remains the source of truth — your agent tool is just a thin HTTP adapter over a well-tested entry point.
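To keep the adapter thin whichever web framework you pick, the request handling can live in a plain function that a FastAPI route (or an MCP tool wrapper) delegates to. This is a framework-agnostic sketch using only the standard library, not a FastAPI listing: handle_digest_request, the media_path body field, and the status codes chosen are all assumptions for illustration, and run stands in for your pipeline entry point.

```python
from __future__ import annotations

import hmac
from typing import Callable


def handle_digest_request(
    headers: dict[str, str],
    body: dict,
    expected_key: str,
    run: Callable[[str], dict],
) -> tuple[int, dict]:
    """Check the X-API-Key header, then delegate to the pipeline entry point."""
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking key contents through timing differences.
    if not hmac.compare_digest(supplied.encode(), expected_key.encode()):
        return 401, {"error": "invalid or missing API key"}

    media_path = body.get("media_path")
    if not isinstance(media_path, str) or not media_path:
        return 422, {"error": "media_path is required"}

    try:
        return 200, run(media_path)
    except Exception as e:  # noqa: BLE001 - report the failure to the agent instead of crashing
        return 500, {"error": str(e)}
```

Real frameworks treat header names case-insensitively, so normalize the header lookup when wiring this into an actual route.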
Appendix D: How I debug when the digest looks wrong
When chapters or highlights feel off, my debugging order is always the same:
Dump the raw response before parsing. If the JSON came out malformed, you need to know before Pydantic eats the error. resp.text shows you exactly what Gemini sent.
Validate every timestamp against the media duration (ffprobe -show_entries format=duration). Out-of-range timestamps usually indicate the model hallucinated based on the summary rather than watching the content.
Lower the temperature further (0.1) and re-run. If results stabilize, your prompt is under-constrained.
Add examples to the system prompt. One or two few-shot chapter examples dramatically improve consistency on niche domains (legal depositions, medical lectures, music performance).
This debugging loop takes maybe ten minutes per iteration and will teach you more about the File API's behavior than any documentation page.
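The duration check in the second step is easy to automate. The sketch below shells out to ffprobe with the format=duration entry mentioned above; parse_duration, media_duration_s, and out_of_range are hypothetical helper names, and the code assumes ffprobe is on PATH.

```python
from __future__ import annotations

import subprocess


def parse_duration(ffprobe_stdout: str) -> float:
    """Parse the seconds value printed by ffprobe for format=duration."""
    value = ffprobe_stdout.strip()
    if not value or value == "N/A":
        raise ValueError("ffprobe reported no duration")
    return float(value)


def media_duration_s(path: str) -> float:
    """Ask ffprobe for the container duration in seconds."""
    cmd = ["ffprobe", "-v", "error", "-show_entries", "format=duration",
           "-of", "default=noprint_wrappers=1:nokey=1", path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_duration(out)


def out_of_range(timestamps_s: list[int], duration_s: float) -> list[int]:
    """Return the timestamps (in seconds) that fall outside the media."""
    return [t for t in timestamps_s if t < 0 or t > duration_s]
```

A non-empty result from out_of_range is a strong hint that the model leaned on the summary instead of the media, which is the cue to lower the temperature or add few-shot examples.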
Closing thought
The File API isn't glamorous — it's an upload endpoint and a URI. But the moment you pair it with structured output and a little caching, it unlocks a class of workflows that simply weren't economically feasible before. Three-hour recordings that used to sit unwatched become searchable. Archives that were technically "organized" but practically unsearchable become queryable. And agents running in Antigravity suddenly have access to all of it.
Start with one file. Get the digest back. Then ask yourself what else you'd like your agent to do with that digest — because that second question is where the real product lives.
Appendix E: A real-world run — from raw recording to agent memory
To make this concrete, here's an end-to-end walkthrough I actually ran while writing this article. I took a 71-minute practice recording from one of my own shows — ambient background, light narration, a few abrupt transitions — and ran it through the pipeline untouched.
The ingest stage uploaded 138 MB in roughly 14 seconds on a fiber connection. The File transitioned from PROCESSING to ACTIVE after a single 2-second poll. Stage 2 produced a 520-character summary in about 18 seconds; the summary correctly identified the three major movements in the recording even though I never explicitly labeled them. Stage 3 took 41 seconds and returned 7 chapters and 5 highlights. One chapter title was overly generic ("Transition and Ambient Section"), and one highlight pointed to a 30-second silence at the 42-minute mark — not exactly "can't miss," but an honest mistake given how quiet the rest of the recording is.
Cost for the full run, per the cost accounting module, came to roughly $0.24. For perspective, the same recording would have taken me about 75 minutes to listen through and annotate by hand — and I'd have been less reliable with the timestamps than the model was.
The interesting part for production use wasn't the per-run experience; it was the second run. I tweaked the system prompt to ask for chapter titles that include at least one concrete musical or sonic attribute, re-ran against the cached File (no re-upload, no re-waiting), and got back seven much better chapters for the cost of the inference alone — about $0.12. That delta between "first pipeline run" and "iterative refinement on cached media" is precisely why the caching discipline in Stage 1 matters so much.
Appendix F: Quick operational checklist
Print this, tape it next to your desk, and you'll avoid most of the mistakes I've already made for you:
Never trust that a files.upload() call left the File in ACTIVE state. Always files.get() before the first inference.
Keep a .cache/ directory and commit its .gitignore, not its contents.
Wrap every generate_content() call in a retry-transient decorator.
Always serialize usage_metadata to logs. Bill shock is preventable.
Normalize timestamps in post, and validate them against the actual media duration.
Treat empty highlights as a legitimate outcome in your UI.
Re-check official pricing at the start of every month. Rates move.
One more honest note
Long-form media processing is one of those areas where the marketing copy has outpaced the tooling slightly. The APIs work — this article is proof you can build real pipelines today — but there are still sharp edges. The 48-hour retention. The occasional MM:SS vs HH:MM:SS drift. The silent truncation when output tokens run out. None of them are show-stoppers, but all of them will bite you once, and you'll remember the first time each one does.
I'd rather you hear that from someone who has been bitten than discover it in production on your own. And I'd rather you finish this article with a pipeline that actually works than an abstract sense that the File API is "powerful." So go run it on something real this week — even a short recording — and let the first couple of runs teach you what this API actually feels like. That's the fastest way to get comfortable enough to ship.