How a single OpenAI retry loop burned $72,000 overnight — and how to stop yours
Here's a story you'll find on the OpenAI developer forum, paraphrased only slightly:
"I went to bed with $200 in credits. I woke up at 7am to a $72,000 bill. A retry loop in our agent had been hammering GPT-4o all night."
This isn't an outlier. The OpenAI community forum is full of these threads — people losing hundreds or thousands of dollars to a single misconfigured loop, a forgotten `while True:`, or an agent that decides to recursively call itself when it doesn't know what to do.
And here's the kicker most teams don't realize: OpenAI removed hard budget caps in 2024. The "Monthly budget" field in your billing dashboard is now an alert, not a stop. It will email you. It will not cut you off.
Why this happens more often than people admit
According to Zylo's 2026 SaaS Management Index, 78% of IT leaders saw unexpected charges tied to AI consumption last year. The pattern is consistent across the incidents we've seen:
- Retry loops without exponential backoff. A 429 or a transient 500 triggers an immediate retry. Repeat at 1,000 RPM for eight hours.
- Recursive agents. An agent calls itself when uncertain. One ambiguous prompt and it loops until your card declines.
- Conversation history that never resets. A chatbot's context grows linearly. Token costs grow with it. A long-running session can 10x its own cost without anyone noticing.
- Output tokens are 2–4x more expensive than input tokens. A model that decides to "think out loud" instead of returning JSON can quietly multiply your bill.
- OpenAI's usage page lags. By the time the dashboard catches up, the damage is done.
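The first failure mode in that list — retrying immediately on a 429 or transient 500 — is also the easiest to fix in code. Here's a minimal sketch of capped exponential backoff with jitter; `request_fn` is a stand-in for your actual OpenAI call, and the retry limit is what keeps a loop from running all night:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a flaky call with capped exponential backoff and full jitter.

    `request_fn` stands in for your provider call and should raise on
    429s and transient 5xx errors.
    """
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # give up: a bounded loop can never hammer the API for eight hours
            # Wait base * 2^attempt seconds, capped, with full jitter so a
            # fleet of workers doesn't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The hard cap on retries matters as much as the backoff itself: without it, the delay just slows the burn rate instead of stopping it.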
Why "set a budget alert" isn't enough
OpenAI's billing UI lets you set:
- A soft alert at X dollars (you get an email).
- A hard limit that, in practice, is also delayed and not real-time.
If a runaway loop spends $30,000 in three hours — which has happened — the alert email might arrive at hour 1, but the loop runs unchecked because nobody's awake to kill the process. The "hard limit" enforcement happens on OpenAI's billing cycle, not in real time on the API path. Several developers have reported continuing to be charged well past their stated cap.
And on Anthropic? There's no equivalent of a hard cap at all. You set up usage alerts and that's it.
The proxy pattern that actually works
The fix is structural, not procedural. Instead of:
```http
# Your code
POST https://api.openai.com/v1/chat/completions
```
You route through a thin proxy that maintains a real-time spend counter per agent or per project, and returns a 429 Budget Exceeded before the request ever reaches OpenAI:
```http
# Same request, different host
POST https://api.promptcost.io/openai/v1/chat/completions
cg-key: sk-pc-••••3f9a      # identifies your workspace
cg-agent: "lead-scorer"     # tags this request
x-api-key: sk-••••••••••    # your provider key, untouched
```
What changes:
- Every request increments a per-agent counter in Redis (low milliseconds).
- If the agent's monthly budget is exceeded, the proxy returns 429 immediately, never forwarding to OpenAI. You don't pay for blocked calls.
- If under budget, the request is forwarded; the response (including usage) is logged asynchronously.
- Your provider key is never persisted — it's just passed through as a header.
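The decision path above is small enough to sketch in a few lines. This is an illustrative model, not PromptCost's actual internals: a plain dict stands in for Redis (in production you'd use `INCRBYFLOAT` on a key like `spend:{agent}:{month}`), and the budget values are made up:

```python
# Per-agent monthly caps in dollars (illustrative values).
BUDGETS = {"lead-scorer": 50.00}
spend = {}  # agent -> dollars spent this month; Redis in production

def handle_request(agent: str, request_cost: float):
    """Return (status, body). A 429 means the call never reached the provider."""
    if spend.get(agent, 0.0) >= BUDGETS.get(agent, float("inf")):
        # Blocked before forwarding: you pay nothing for this call.
        return 429, {"error": "Budget exceeded"}
    # In the real proxy: forward to OpenAI here, then record the actual
    # cost computed from the response's usage field.
    spend[agent] = spend.get(agent, 0.0) + request_cost
    return 200, {"ok": True}
```

In practice the increment happens after the response comes back, using real token counts from `usage` — which is exactly where streaming responses make the bookkeeping tricky.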
What "good" looks like
A safe LLM cost setup, in our opinion, has four properties:
- Per-agent attribution. If you run twelve agents on one provider key, you need to know which one is responsible for spend. "Total monthly bill" is too coarse to act on.
- Real-time enforcement on the request path. Polling the usage API every 15 minutes is not real time. Counters that increment on the request itself are.
- Hard caps that block. An alert is information; a 429 is a fix.
- Zero key storage. Whatever sits between you and OpenAI shouldn't persist your provider keys. Headers in, headers out.
What about retry loops specifically?
The proxy pattern protects you from yourself in three layers:
- Per-agent budget cap — your retry loop can hammer the proxy, but it stops getting forwarded the moment the cap is hit.
- Rate limiting per agent — you can set "max 60 requests/min for this agent" and the proxy returns 429 on the 61st.
- Visibility — even if a loop runs for 30 seconds before being caught, the dashboard shows the spike per agent so you find the bad code instead of digging through logs.
If you're rolling your own
You don't need a service to do this. The pattern is small:
- A reverse proxy (Cloudflare Workers, a Node service, whatever you like) that accepts the OpenAI/Anthropic API shape.
- Redis for the spend counter (`INCRBYFLOAT` per agent per month).
- A pricing table you keep up to date as providers change rates.
- Async logging to your DB of choice for the request log.
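The pricing-table piece looks like this: map the response's `usage` block to dollars. The prices below are placeholders per million tokens (keeping them current is the part-time job), with output priced at 4x input to echo the asymmetry noted earlier:

```python
# Illustrative $/1M-token prices -- verify against the provider's current
# pricing page before relying on these numbers.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def cost_usd(model: str, usage: dict) -> float:
    """Dollar cost of one request from its usage block."""
    p = PRICES[model]
    return (usage["prompt_tokens"] * p["input"]
            + usage["completion_tokens"] * p["output"]) / 1_000_000
```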
The reason most teams don't roll their own is that maintaining the pricing table alone is a part-time job, and getting the budget enforcement right (especially with streaming responses) has nasty edge cases.
Or just use PromptCost.
Drop-in proxy for OpenAI & Anthropic. Per-agent cost tracking. Hard budget caps that return 429 before forwarding. Free forever, no credit card.
Start free →