How to Fix AI API 429 Rate Limit, Authentication, and Timeout Errors: A Complete Troubleshooting Guide

The Three Most Common AI API Errors

When building with AI APIs, few things are more frustrating than having your requests suddenly fail. Whether it's a rate limit slamming the brakes on your batch processing, an authentication error locking you out, or a timeout leaving your users staring at a spinner — these issues demand fast, reliable fixes.

This guide walks you through the three most frequent error categories across the OpenAI, Gemini, and Claude APIs:

429 Too Many Requests (Rate Limiting): You've exceeded the request or token quota
401 / 403 Authentication & Authorization Errors: Invalid API key, expired credentials, or missing permissions
Timeout / Connection Errors: Network issues, oversized requests, or server-side delays

Each section provides the symptoms, root causes, step-by-step solutions, and verification procedures so you can diagnose and resolve issues quickly.

Diagnosing and Fixing 429 Rate Limit Errors

Symptoms

Your API calls return a 429 status code with messages like these:

// OpenAI response
{
  "error": {
    "message": "Rate limit reached for gpt-4o on requests per min (RPM)",
    "type": "requests",
    "code": "rate_limit_exceeded"
  }
}
 
// Gemini API response
{
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

Root Causes

Several scenarios can trigger rate limiting:

RPM (Requests Per Minute) exceeded: Too many requests within the time window
TPM (Tokens Per Minute) exceeded: Total token consumption surpassed the quota
Daily quota exhaustion: Free-tier or plan-specific daily limits reached
Burst traffic from batch jobs: Loops or parallel workers flooding the endpoint simultaneously
New account with low default limits: API providers assign conservative rate limits to new accounts that increase gradually over time

Step-by-Step Fix

Step 1: Check your current rate limits

# Read rate limit info from OpenAI response headers
import openai
 
client = openai.OpenAI(api_key="YOUR_API_KEY")
 
response = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
 
# Extract limit details from headers
print(f"Limit: {response.headers.get('x-ratelimit-limit-requests')}")
print(f"Remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Reset: {response.headers.get('x-ratelimit-reset-requests')}")
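The reset headers read above are returned as duration strings rather than plain numbers, for example "1s" or "6m0s". The following is a minimal sketch of converting such a value to seconds so it can be passed to a sleep call. The function name `parse_reset_duration` is hypothetical, and the set of units handled (h, m, s, ms) is an assumption about the formats you may see.

```python
import re

# One or more number+unit pairs, e.g. "6m0s", "1s", "120ms", "1.5s".
_WHOLE_DURATION = re.compile(r"(?:\d+(?:\.\d+)?(?:ms|h|m|s))+")
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_reset_duration(value: str) -> float:
    """Convert a duration string such as '6m0s' into seconds."""
    text = value.strip()
    if not _WHOLE_DURATION.fullmatch(text):
        raise ValueError(f"Unrecognized duration: {value!r}")
    # Sum every number+unit pair found in the string.
    return sum(
        float(number) * _UNIT_SECONDS[unit]
        for number, unit in _DURATION_PART.findall(text)
    )


# Usage (with the raw response from Step 1):
# wait = parse_reset_duration(response.headers.get("x-ratelimit-reset-requests"))
```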

Step 2: Implement exponential backoff with jitter

import time
import random
 
def call_api_with_retry(func, max_retries=5):
    """Retry API calls with exponential backoff and jitter"""
    for attempt in range(max_retries):
        try:
            return func()
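        except Exception as e:
            # String matching is fragile; with the OpenAI SDK you can catch
            # openai.RateLimitError instead.
            is_rate_limit = "429" in str(e) or "rate_limit" in str(e)
            # Re-raise other errors, and don't sleep after the final attempt
            if not is_rate_limit or attempt == max_retries - 1:
                raise
            # Exponential backoff + random jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Retrying in {wait_time:.1f}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")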
 
# Usage:
# result = call_api_with_retry(lambda: client.chat.completions.create(...))
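Some providers include a Retry-After header on 429 responses, and when it is present it is usually a better guide than a computed delay. Here is a small sketch of a delay calculation that prefers that header and otherwise falls back to capped exponential backoff. Note that it uses the "full jitter" variant (a random value between zero and the exponential ceiling), which differs slightly from the formula above. The function name, the 60-second cap, and the injectable `rng` parameter are illustrative assumptions.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return how many seconds to wait before retry number `attempt`."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            # It may also be an HTTP date; this sketch ignores that form
            # and falls back to computed backoff.
            pass
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random delay in [0, ceiling).
    return rng() * ceiling
```

Spreading delays across the whole interval helps keep parallel workers from retrying in lockstep after a shared rate limit.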

Step 3: Add a client-side rate limiter

import asyncio
from collections import deque
from time import monotonic as now  # monotonic clock is unaffected by system clock changes
 
class RateLimiter:
    """Sliding-window rate limiter (tracks timestamps of recent requests)"""
    def __init__(self, max_requests: int, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque = deque()
 
    async def acquire(self):
        while True:
            current = now()
            # Remove timestamps outside the window
            while self.timestamps and self.timestamps[0] < current - self.window:
                self.timestamps.popleft()
 
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(current)
                return
            # Wait until the oldest request exits the window
            sleep_time = self.timestamps[0] - (current - self.window) + 0.1
            await asyncio.sleep(sleep_time)
 
# Usage: limit to 50 requests per minute
# limiter = RateLimiter(max_requests=50, window_seconds=60)
# await limiter.acquire()
# response = await call_api(...)
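A request-per-minute limiter does not by itself stop parallel workers from firing at the same instant, which is the burst scenario listed under Root Causes. One complementary option is to cap the number of in-flight requests with `asyncio.Semaphore`. This is a minimal sketch; `run_bounded` is a hypothetical helper name, and each item passed in is assumed to be a zero-argument function that returns a coroutine.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def run_bounded(
    factories: Sequence[Callable[[], Awaitable[Any]]],
    max_concurrency: int = 5,
) -> List[Any]:
    """Run all coroutines, allowing at most `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(factory: Callable[[], Awaitable[Any]]) -> Any:
        # The coroutine is only created once a slot is free.
        async with semaphore:
            return await factory()

    # gather preserves the order of the inputs in its results.
    return await asyncio.gather(*(guarded(f) for f in factories))


# Usage (call_api is the same placeholder used above):
# results = await run_bounded([lambda p=p: call_api(p) for p in prompts], 5)
```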

Resolving 401 / 403 Authentication Errors

Symptoms

Despite having an API key configured, you receive errors like:

// OpenAI 401
{
  "error": {
    "message": "Incorrect API key provided: sk-xxxx...xxxx.",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
 
// Gemini 403
{
  "error": {
    "code": 403,
    "message": "The caller does not have permission",
    "status": "PERMISSION_DENIED"
  }
}
 
// Claude 401
{
  "error": {
    "type": "authentication_error",
    "message": "Invalid API Key"
  }
}

Root Causes

Revoked or rotated API key: The key was invalidated or replaced
Missing or misconfigured environment variables: The .env file isn't loaded or the variable name is wrong
No payment method on file: You've exceeded the free tier but haven't added billing info
IAM or organization permissions: Google Cloud IAM roles or OpenAI org-level restrictions
Whitespace or newline in the key: Extra characters copied along with the key
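Several of these causes, especially stray whitespace and a missing variable, can be caught before any request is sent. The following sketch inspects a key string for the most common copy-and-paste problems. The function name `inspect_api_key` and the exact set of checks are illustrative assumptions, not a provider-defined validation.

```python
from typing import List, Optional


def inspect_api_key(raw: Optional[str]) -> List[str]:
    """Return a list of likely problems with a key string (empty if none)."""
    if not raw:
        return ["key is missing or empty"]
    problems = []
    if raw != raw.strip():
        problems.append("leading or trailing whitespace")
    if "\r" in raw or "\n" in raw:
        problems.append("embedded newline or carriage return")
    stripped = raw.strip()
    if len(stripped) >= 2 and stripped[0] == stripped[-1] and stripped[0] in "\"'":
        problems.append("wrapped in quotes")
    if not raw.isascii():
        problems.append("non-ASCII character (for example a zero-width space)")
    return problems


# Usage:
# import os
# print(inspect_api_key(os.environ.get("OPENAI_API_KEY")))
```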

Step-by-Step Fix

Step 1: Verify your API key works

# Test OpenAI key
curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.openai.com/v1/models
 
# Expected: 200 (valid) / 401 (invalid) / 429 (rate limited)
 
# Test Gemini key
curl -s -o /dev/null -w "%{http_code}" \
  "https://generativelanguage.googleapis.com/v1beta/models?key=YOUR_GEMINI_API_KEY"
 
# Test Claude key
curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: YOUR_CLAUDE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models
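The same check can be scripted in Python using only the standard library, which is convenient in CI or a startup probe. In this sketch, `fetch_status` and `diagnose_status` are hypothetical helper names, and the suggested next steps simply restate the root causes listed earlier.

```python
import urllib.error
import urllib.request
from typing import Dict


def fetch_status(url: str, headers: Dict[str, str], timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return err.code


def diagnose_status(code: int) -> str:
    """Map a status code to a likely cause from this guide."""
    if code == 200:
        return "OK: the key was accepted"
    if code == 401:
        return "Invalid key: check for rotation, typos, or stray whitespace"
    if code == 403:
        return "Permission denied: check IAM roles, org restrictions, or billing"
    if code == 429:
        return "Rate limited: the key is valid, but a quota was exceeded"
    return f"Unexpected status {code}: check the provider's status page"


# Usage:
# code = fetch_status("https://api.openai.com/v1/models",
#                     {"Authorization": "Bearer YOUR_API_KEY"})
# print(diagnose_status(code))
```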

Step 2: Validate your environment variables

# Check that env vars are loaded correctly
# Only show first and last characters to avoid leaking the full key
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:8}...${OPENAI_API_KEY: -4}"
echo "GEMINI_API_KEY: ${GEMINI_API_KEY:0:8}...${GEMINI_API_KEY: -4}"
 
# Check for hidden characters in .env file
cat -A .env | head -5
# If lines end with ^M$ instead of just $, you have Windows-style line endings
# (cat -A is a GNU option; on macOS/BSD, use cat -et instead)

Step 3: Confirm billing status

Visit each provider's dashboard to verify:

OpenAI: Settings → Billing — check payment method and credit balance
Google AI Studio: Google Cloud Console → APIs & Services → Credentials — verify key status
Anthropic: Console → Plans & Billing — check credit balance

Fixing Timeout and Connection Errors

Symptoms

Requests hang and eventually fail with timeout errors:

# Common error messages
# openai.APITimeoutError: Request timed out.
# requests.exceptions.ReadTimeout: Read timed out. (read timeout=60)
# httpx.ReadTimeout: Read timed out
# google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded

Root Causes

Input too large: Long prompts increase model processing time significantly
High max_tokens setting: Generating thousands of tokens takes proportionally longer
Network issues: Proxies, firewalls, or VPNs interfering with the connection
Server-side load: High demand on popular models causing slower responses
Default SDK timeout too short: Especially when not using streaming mode

Step-by-Step Fix

Step 1: Adjust timeout settings

import openai
import google.generativeai as genai
 
# OpenAI — set explicit timeout
client = openai.OpenAI(
    api_key="YOUR_API_KEY",
    timeout=120.0,  # Default is 600s; adjust for your use case
    max_retries=3   # Built-in retry count
)
 
# Gemini — set timeout via request options
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "Your prompt here",
    request_options={"timeout": 120}  # in seconds
)

Step 2: Switch to streaming responses

# OpenAI streaming — first token arrives much faster
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)
 
for chunk in stream:
    # Some chunks (for example a trailing usage chunk) can arrive with no choices
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
 
# Gemini streaming
response = model.generate_content("Your prompt", stream=True)
for chunk in response:
    print(chunk.text, end="", flush=True)
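To confirm that streaming actually improves responsiveness, it can help to measure the time until the first piece of text arrives. Below is a provider-agnostic sketch that joins streamed text pieces and records that delay. The name `collect_stream` is hypothetical, and it assumes you pass it an iterable of text fragments already extracted from the SDK's chunk objects.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def collect_stream(
    pieces: Iterable[Optional[str]],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, Optional[float]]:
    """Join streamed text and report seconds until the first non-empty piece."""
    start = clock()
    first_piece_after = None
    parts = []
    for piece in pieces:
        if not piece:
            continue  # skip None or empty fragments
        if first_piece_after is None:
            first_piece_after = clock() - start
        parts.append(piece)
    return "".join(parts), first_piece_after


# Usage with the OpenAI stream above:
# text, ttft = collect_stream(
#     c.choices[0].delta.content for c in stream if c.choices
# )
```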

Step 3: Optimize input size

import tiktoken
 
def check_token_count(text: str, model: str = "gpt-4o") -> int:
    """Check token count before sending to the API"""
    encoding = tiktoken.encoding_for_model(model)
    token_count = len(encoding.encode(text))
    print(f"Token count: {token_count}")
    return token_count
 
# Split oversized text into manageable chunks
def split_text_by_tokens(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into chunks based on token count"""
    encoding = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks
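Splitting at fixed token boundaries can cut a sentence in half, so each chunk loses some context from its neighbor. One common mitigation is to let consecutive chunks overlap by a small number of tokens. The sketch below works on a plain list of token IDs (such as the output of `encoding.encode`), so it has no dependency on a specific tokenizer. The function name and the default overlap of 200 tokens are illustrative assumptions.

```python
from typing import List


def chunk_token_ids(
    tokens: List[int], max_tokens: int = 4000, overlap: int = 200
) -> List[List[int]]:
    """Split token IDs into chunks of at most `max_tokens`, sharing
    `overlap` tokens between consecutive chunks."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this chunk already reaches the end
    return chunks


# Usage:
# ids = encoding.encode(text)
# texts = [encoding.decode(c) for c in chunk_token_ids(ids, 4000, 200)]
```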

How to Verify the Fix Worked

After applying your fix, run this health check to confirm everything is back to normal:

import time
 
def verify_api_health(client, model="gpt-4o"):
    """Run a quick health check on your API connection"""
    tests = {
        "basic_request": False,
        "rate_limit_ok": False,
        "latency_ok": False,
    }
 
    # Test 1: Does a basic request succeed?
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say OK"}],
            max_tokens=5
        )
        tests["basic_request"] = response.choices[0].message.content is not None
    except Exception as e:
        print(f"Basic request failed: {e}")
 
    # Test 2: Can we send multiple requests without hitting rate limits?
    try:
        for i in range(3):
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": f"Test {i}"}],
                max_tokens=5
            )
            time.sleep(1)
        tests["rate_limit_ok"] = True
    except Exception as e:
        print(f"Rate limit test failed: {e}")
 
    # Test 3: Is latency within acceptable range?
    try:
        start = time.time()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=10
        )
        latency = time.time() - start
        tests["latency_ok"] = latency < 30  # Under 30 seconds
        print(f"Latency: {latency:.1f}s")
    except Exception as e:
        print(f"Latency test failed: {e}")
 
    # Print results
    for test, passed in tests.items():
        status = "✅" if passed else "❌"
        print(f"  {status} {test}")
 
    return all(tests.values())
 
# verify_api_health(client)

Prevention Best Practices

Rather than just reacting to errors, design your system to be resilient from the start.

Automate API Key Management

# Use a secrets manager instead of hardcoding env vars
# AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id prod/openai-api-key \
  --query SecretString --output text
 
# Google Secret Manager
gcloud secrets versions access latest --secret="gemini-api-key"

Set Up Usage Monitoring and Alerts

Configure alerts that fire when your API usage reaches 80% of the quota. OpenAI's "Usage limits" dashboard and Google Cloud's "Budgets & Alerts" feature make this straightforward to implement.
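If you also track usage in your own code, the same 80% rule is easy to apply locally. This is a minimal sketch; `should_alert` is a hypothetical helper, and how you obtain the `used` and `quota` numbers (a dashboard export, your own counters, or rate limit headers) is left as an assumption.

```python
def should_alert(used: float, quota: float, threshold: float = 0.8) -> bool:
    """Return True when usage has reached `threshold` of the quota."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return used / quota >= threshold


# Usage:
# if should_alert(used=tokens_today, quota=daily_token_quota):
#     notify_team("API usage above 80% of quota")  # notify_team is hypothetical
```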

Implement Multi-Provider Fallback

If one API goes down, automatically route traffic to an alternative provider. The guide "Antigravity × Cloudflare AI Gateway: Caching Strategy to Reduce LLM Costs by Up to 70%" covers how to set up response caching and provider fallback in a single configuration.
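At the application level, the core of a fallback is simply trying providers in order until one succeeds. The following is a minimal sketch of that idea, independent of any gateway product. `call_with_fallback` is a hypothetical helper, and each provider is assumed to be wrapped in a zero-argument function.

```python
from typing import Any, Callable, Sequence, Tuple


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[], Any]]]
) -> Tuple[str, Any]:
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Usage (ask_openai and ask_gemini are hypothetical wrappers):
# name, answer = call_with_fallback([
#     ("openai", lambda: ask_openai(prompt)),
#     ("gemini", lambda: ask_gemini(prompt)),
# ])
```

In practice you would likely fall back only on rate limit, timeout, and server errors, since an authentication failure on one provider usually calls for a fix rather than a reroute.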

Looking back

AI API errors — 429 rate limits, authentication failures, and timeouts — all follow predictable patterns once you know what to look for. Here's the quick reference:

429 Errors: Implement exponential backoff + a client-side rate limiter
Auth Errors: Verify key validity → check environment variables → confirm billing status
Timeouts: Enable streaming → optimize input size → increase timeout values

For long-term resilience, invest in secrets management, usage monitoring with alerts, and multi-provider fallback architecture. Once these patterns are baked into your codebase, you'll spend far less time firefighting and more time building.
