The Three Most Common AI API Errors
When building with AI APIs, few things are more frustrating than having your requests suddenly fail. Whether it's a rate limit slamming the brakes on your batch processing, an authentication error locking you out, or a timeout leaving your users staring at a spinner — these issues demand fast, reliable fixes.
This guide walks you through the three most frequent error categories across OpenAI, Gemini, and Claude APIs:
- 429 Too Many Requests (Rate Limiting): You've exceeded the request or token quota
- 401 / 403 Authentication & Authorization Errors: Invalid API key, expired credentials, or missing permissions
- Timeout / Connection Errors: Network issues, oversized requests, or server-side delays
Each section provides the symptoms, root causes, step-by-step solutions, and verification procedures so you can diagnose and resolve issues quickly.
Diagnosing and Fixing 429 Rate Limit Errors
Symptoms
Your API calls return a 429 status code with messages like these:
// OpenAI response
{
  "error": {
    "message": "Rate limit reached for gpt-4o on requests per min (RPM)",
    "type": "requests",
    "code": "rate_limit_exceeded"
  }
}
// Gemini API response
{
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}
Root Causes
Several scenarios can trigger rate limiting:
- RPM (Requests Per Minute) exceeded: Too many requests within the time window
- TPM (Tokens Per Minute) exceeded: Total token consumption surpassed the quota
- Daily quota exhaustion: Free-tier or plan-specific daily limits reached
- Burst traffic from batch jobs: Loops or parallel workers flooding the endpoint simultaneously
- New account with low default limits: API providers assign conservative rate limits to new accounts that increase gradually over time
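To see how the RPM and TPM quotas listed above interact, the sketch below estimates the shortest time a batch can take without tripping either limit. The function name `min_batch_minutes` and the quota values are hypothetical; substitute the limits shown in your own provider dashboard.

```python
import math

def min_batch_minutes(num_requests: int, tokens_per_request: int,
                      rpm_limit: int, tpm_limit: int) -> int:
    """Return the minimum whole minutes needed to send a batch under both quotas."""
    # Minutes needed if only the request-count quota applied
    by_requests = math.ceil(num_requests / rpm_limit)
    # Minutes needed if only the token quota applied
    by_tokens = math.ceil(num_requests * tokens_per_request / tpm_limit)
    # The tighter of the two quotas sets the real lower bound
    return max(by_requests, by_tokens)

# Hypothetical quotas: 500 RPM and 30,000 TPM
print(min_batch_minutes(1000, 200, rpm_limit=500, tpm_limit=30_000))  # prints 7
```

In this example the request quota alone would allow the batch in 2 minutes, but the token quota stretches it to 7, which is why a job can hit 429s while staying well under its RPM limit.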
Step-by-Step Fix
Step 1: Check your current rate limits
# Read rate limit info from OpenAI response headers
import openai
client = openai.OpenAI(api_key="YOUR_API_KEY")
response = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
# Extract limit details from headers
print(f"Limit: {response.headers.get('x-ratelimit-limit-requests')}")
print(f"Remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Reset: {response.headers.get('x-ratelimit-reset-requests')}")
Step 2: Implement exponential backoff with jitter
import time
import random

def call_api_with_retry(func, max_retries=5):
    """Retry API calls with exponential backoff and jitter"""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Re-raise anything that is not a rate limit error
            if "429" not in str(e) and "rate_limit" not in str(e):
                raise
            # Don't sleep after the final attempt
            if attempt == max_retries - 1:
                break
            # Exponential backoff + random jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Retrying in {wait_time:.1f}s...")
            time.sleep(wait_time)
    raise RuntimeError("Max retries exceeded")
# Usage:
# result = call_api_with_retry(lambda: client.chat.completions.create(...))
Step 3: Add a client-side rate limiter
import asyncio
from collections import deque
from time import time as now

class RateLimiter:
    """Sliding-window rate limiter (tracks timestamps of recent requests)"""
    def __init__(self, max_requests: int, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque = deque()

    async def acquire(self):
        while True:
            current = now()
            # Remove timestamps outside the window
            while self.timestamps and self.timestamps[0] < current - self.window:
                self.timestamps.popleft()
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(current)
                return
            # Wait until the oldest request exits the window
            sleep_time = self.timestamps[0] - (current - self.window) + 0.1
            await asyncio.sleep(sleep_time)
# Usage: limit to 50 requests per minute
# limiter = RateLimiter(max_requests=50, window_seconds=60)
# await limiter.acquire()
# response = await call_api(...)
Resolving 401 / 403 Authentication Errors
Symptoms
Despite having an API key configured, you receive errors like:
// OpenAI 401
{
  "error": {
    "message": "Incorrect API key provided: sk-xxxx...xxxx.",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
// Gemini 403
{
  "error": {
    "code": 403,
    "message": "The caller does not have permission",
    "status": "PERMISSION_DENIED"
  }
}
// Claude 401
{
  "error": {
    "type": "authentication_error",
    "message": "Invalid API Key"
  }
}
Root Causes
- Revoked or rotated API key: The key was invalidated or replaced
- Missing or misconfigured environment variables: The .env file isn't loaded or the variable name is wrong
- No payment method on file: You've exceeded the free tier but haven't added billing info
- IAM or organization permissions: Google Cloud IAM roles or OpenAI org-level restrictions
- Whitespace or newline in the key: Extra characters copied along with the key
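Two of the causes above, a missing environment variable and stray whitespace in the key, can be caught before any request is sent. The following is a minimal sketch; `load_api_key` is a hypothetical helper name, not part of any provider SDK.

```python
import os

def load_api_key(var_name: str) -> str:
    """Read an API key from the environment and reject common copy-paste damage."""
    raw = os.environ.get(var_name)
    if raw is None:
        raise KeyError(f"{var_name} is not set; check that your .env file is loaded")
    # strip() removes spaces, tabs, \r and \n from both ends
    key = raw.strip()
    if not key:
        raise ValueError(f"{var_name} is set but empty")
    # Whitespace in the middle of a key usually means a broken paste
    if any(ch.isspace() for ch in key):
        raise ValueError(f"{var_name} contains whitespace inside the key")
    if key != raw:
        print(f"Warning: stripped {len(raw) - len(key)} stray character(s) from {var_name}")
    return key

# Usage (assumes the variable is already exported):
# client = openai.OpenAI(api_key=load_api_key("OPENAI_API_KEY"))
```

Failing fast here turns an opaque 401 from the provider into a local error that names the variable at fault.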
Step-by-Step Fix
Step 1: Verify your API key works
# Test OpenAI key
curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.openai.com/v1/models
# Expected: 200 (valid) / 401 (invalid) / 429 (rate limited)
# Test Gemini key
curl -s -o /dev/null -w "%{http_code}" \
  "https://generativelanguage.googleapis.com/v1beta/models?key=YOUR_GEMINI_API_KEY"
# Test Claude key
curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: YOUR_CLAUDE_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models
Step 2: Validate your environment variables
# Check that env vars are loaded correctly
# Only show first and last characters to avoid leaking the full key
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:8}...${OPENAI_API_KEY: -4}"
echo "GEMINI_API_KEY: ${GEMINI_API_KEY:0:8}...${GEMINI_API_KEY: -4}"
# Check for hidden characters in .env file
cat -A .env | head -5
# If lines end with ^M$ instead of just $, you have Windows-style line endings
Step 3: Confirm billing status
Visit each provider's dashboard to verify:
- OpenAI: Settings → Billing — check payment method and credit balance
- Google AI Studio: Google Cloud Console → APIs & Services → Credentials — verify key status
- Anthropic: Console → Plans & Billing — check credit balance
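The status codes printed by the curl checks in Step 1 can be mapped to the next thing to inspect. This is an illustrative sketch only: `triage_status` and its advice strings are hypothetical and summarize the causes listed in this section rather than any provider's official guidance.

```python
def triage_status(status_code: int) -> str:
    """Map the HTTP status from a key check to a suggested next step."""
    advice = {
        200: "Key is valid; look for the problem elsewhere",
        401: "Key rejected; check for typos, whitespace, or a revoked key",
        403: "Key recognized but not permitted; check IAM roles, org restrictions, and billing",
        429: "Rate limited; the key itself is probably fine, so wait and retry",
    }
    # Anything else is outside the scope of this auth checklist
    return advice.get(status_code, f"Unexpected status {status_code}; check the provider's status page")

for code in (200, 401, 403, 429):
    print(code, "->", triage_status(code))
```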
Fixing Timeout and Connection Errors
Symptoms
Requests hang and eventually fail with timeout errors:
# Common error messages
# openai.APITimeoutError: Request timed out.
# requests.exceptions.ReadTimeout: Read timed out. (read timeout=60)
# httpx.ReadTimeout: Read timed out
# google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded
Root Causes
- Input too large: Long prompts increase model processing time significantly
- High max_tokens setting: Generating thousands of tokens takes proportionally longer
- Network issues: Proxies, firewalls, or VPNs interfering with the connection
- Server-side load: High demand on popular models causing slower responses
- Default SDK timeout too short: Especially when not using streaming mode
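Because generation time grows with the number of output tokens, a timeout can be sized from `max_tokens` instead of guessed. The sketch below is a rough rule of thumb: the throughput, overhead, and safety factor defaults are assumptions for illustration, not measured or documented values, so calibrate them against your own latency logs.

```python
def suggest_timeout(max_tokens: int,
                    tokens_per_second: float = 30.0,   # Assumed generation speed
                    overhead_seconds: float = 5.0,     # Assumed network + queueing time
                    safety_factor: float = 2.0) -> float:
    """Estimate a request timeout in seconds for a non-streaming call."""
    # Time to generate the full response at the assumed speed
    generation_seconds = max_tokens / tokens_per_second
    # Pad the estimate so normal variance does not trigger a timeout
    return round((overhead_seconds + generation_seconds) * safety_factor, 1)

print(suggest_timeout(300))   # prints 30.0
print(suggest_timeout(4000))
```

The returned value can be passed as the `timeout` argument when constructing the client.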
Step-by-Step Fix
Step 1: Adjust timeout settings
import openai
import google.generativeai as genai

# OpenAI: set an explicit timeout
client = openai.OpenAI(
    api_key="YOUR_API_KEY",
    timeout=120.0,  # Default is 600s; adjust for your use case
    max_retries=3   # Built-in retry count
)
# Gemini: set timeout via request options
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "Your prompt here",
    request_options={"timeout": 120}  # in seconds
)
Step 2: Switch to streaming responses
# OpenAI streaming: first token arrives much faster
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)
for chunk in stream:
    # Some chunks carry no choices or no text, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Gemini streaming
response = model.generate_content("Your prompt", stream=True)
for chunk in response:
    print(chunk.text, end="", flush=True)
Step 3: Optimize input size
import tiktoken

def check_token_count(text: str, model: str = "gpt-4o") -> int:
    """Check token count before sending to the API"""
    encoding = tiktoken.encoding_for_model(model)
    token_count = len(encoding.encode(text))
    print(f"Token count: {token_count}")
    return token_count

# Split oversized text into manageable chunks
def split_text_by_tokens(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into chunks based on token count"""
    encoding = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks
How to Verify the Fix Worked
After applying your fix, run this health check to confirm everything is back to normal:
import time

def verify_api_health(client, model="gpt-4o"):
    """Run a quick health check on your API connection"""
    tests = {
        "basic_request": False,
        "rate_limit_ok": False,
        "latency_ok": False,
    }
    # Test 1: Does a basic request succeed?
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say OK"}],
            max_tokens=5
        )
        tests["basic_request"] = response.choices[0].message.content is not None
    except Exception as e:
        print(f"Basic request failed: {e}")
    # Test 2: Can we send multiple requests without hitting rate limits?
    try:
        for i in range(3):
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": f"Test {i}"}],
                max_tokens=5
            )
            time.sleep(1)
        tests["rate_limit_ok"] = True
    except Exception as e:
        print(f"Rate limit test failed: {e}")
    # Test 3: Is latency within acceptable range?
    try:
        start = time.time()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=10
        )
        latency = time.time() - start
        tests["latency_ok"] = latency < 30  # Under 30 seconds
        print(f"Latency: {latency:.1f}s")
    except Exception as e:
        print(f"Latency test failed: {e}")
    # Print results
    for test, passed in tests.items():
        status = "✅" if passed else "❌"
        print(f"  {status} {test}")
    return all(tests.values())

# verify_api_health(client)
Prevention Best Practices
Rather than just reacting to errors, design your system to be resilient from the start.
Automate API Key Management
# Use a secrets manager instead of hardcoding env vars
# AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id prod/openai-api-key \
  --query SecretString --output text
# Google Secret Manager
gcloud secrets versions access latest --secret="gemini-api-key"
Set Up Usage Monitoring and Alerts
Configure alerts that fire when your API usage reaches 80% of the quota. OpenAI's "Usage limits" dashboard and Google Cloud's "Budgets & Alerts" feature make this straightforward to implement.
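Provider dashboards handle the billing-level alert, but the same 80% rule can also be applied inside the application. The sketch below is one possible shape, not a provider feature: `QuotaTracker` and the quota figure are hypothetical, and the token counts would come from whatever usage data your responses report.

```python
class QuotaTracker:
    """Track token usage against a quota and flag when a threshold is crossed."""

    def __init__(self, quota_tokens: int, alert_ratio: float = 0.8):
        self.quota_tokens = quota_tokens
        self.alert_ratio = alert_ratio
        self.used_tokens = 0
        self.alerted = False

    def record(self, tokens: int) -> bool:
        """Add usage; return True only on the call that first crosses the threshold."""
        self.used_tokens += tokens
        crossed = self.used_tokens >= self.quota_tokens * self.alert_ratio
        if crossed and not self.alerted:
            self.alerted = True  # Fire once, not on every later request
            return True
        return False

tracker = QuotaTracker(quota_tokens=100_000)  # Hypothetical daily quota
for tokens in (50_000, 25_000, 10_000):
    if tracker.record(tokens):
        print(f"Alert: {tracker.used_tokens} of {tracker.quota_tokens} tokens used")
```

With the OpenAI Python SDK, `response.usage.total_tokens` is one source for the value passed to `record`.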
Implement Multi-Provider Fallback
If one API goes down, automatically route traffic to an alternative provider. The guide "Antigravity × Cloudflare AI Gateway: Caching Strategy to Reduce LLM Costs by Up to 70%" covers how to set up response caching and provider fallback in a single configuration.
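At the application level, fallback can be as simple as trying providers in priority order. The following is a minimal sketch under stated assumptions: `call_with_fallback` is a hypothetical helper, and the two provider functions are stand-ins that simulate an outage rather than real SDK calls.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]

def call_with_fallback(providers: Sequence[Provider], prompt: str) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response_text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Remember why this provider failed, then move on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))

# Stand-in provider functions for illustration only
def primary(prompt: str) -> str:
    raise TimeoutError("Request timed out")

def secondary(prompt: str) -> str:
    return f"echo: {prompt}"

name, text = call_with_fallback([("primary", primary), ("secondary", secondary)], "Hello")
print(name, text)  # prints: secondary echo: Hello
```

In practice each callable would wrap one provider's client, and you would likely narrow the caught exceptions to rate limit, timeout, and server errors so that bugs in your own code are not silently retried elsewhere.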
Looking back
AI API errors — 429 rate limits, authentication failures, and timeouts — all follow predictable patterns once you know what to look for. Here's the quick reference:
- 429 Errors: Implement exponential backoff + a client-side rate limiter
- Auth Errors: Verify key validity → check environment variables → confirm billing status
- Timeouts: Enable streaming → optimize input size → increase timeout values
For long-term resilience, invest in secrets management, usage monitoring with alerts, and multi-provider fallback architecture. Once these patterns are baked into your codebase, you'll spend far less time firefighting and more time building.