Rebuilding Levee with Claude, by Tahir Hashmi

March 8, 2026

I had been building Levee, a self-configuring circuit breaker and rate limiter, on and off for almost a year. Then I over-designed it to failure. Part of the blame was mine, for trusting faulty benchmarks – implemented wrongly by, yes, Claude. It was only when I started questioning the results that Claude owned up to the flaws in its own benchmark logic. After several rounds of re-architecting the benchmarks into a realistic simulation, the verdict was a gut-punch: Levee was the worst-performing circuit breaker of the lot, by a huge margin.

In the past I’d have gone digging through my assumptions to find where the model failed to match reality – or simulation, in this case. This time, something told me to step back and let AI take over instead.

AI had been great everywhere else I’d used it. loadgen was almost one-shot. Other stuff I build with it is frankly fantasy-landish, but the job gets done and the code isn’t bad. All my AI grief had been on Levee, and Levee alone. It had even botched the simulations and handed me an illusion of success. Why was this one project so different?

It reminded me of the first time I watched a robot vacuum. The early runs were bizarre – it ignored the trash right next to it and circled furniture legs like a clueless dummy. A few rounds in, I realised I was the clueless one. The robot wasn’t trying to pick up trash; it was trying to cover the whole floor. Trash removal was a side-effect, and it always got the job done, to the limit of its abilities.

Maybe I’d been too involved with Levee, too sure I knew best. Either way, armed with a much stronger battery of benchmarks, I handed the keys to Claude Opus 4.6. What follows is Claude’s account of how it went, edited down for brevity.

1. Rebuilding Levee

The brief was deliberately vague:

Look at the README.md. It describes a product that you need to build. The skeleton of that product is provided in levee.go. There are some benchmarks in the benchmark directory. Your goal is to make levee top the benchmarks. I understand that this is a very vague ask, so we will take some time chatting back and forth for you to determine the requirements concretely. Then when I tell you explicitly, you develop a plan.

After reading the README, the levee.go skeleton, and the two benchmark suites, the assistant proposed adaptive concurrency control with EWMA-based error detection and asked five clarifying questions. The user granted full design freedom with one hard constraint:

you must not try to fit the benchmarks. The solution has to be general. The benchmarks may be changed later and levee must still win. You must not modify benchmarks.

The user also warned against anchoring on that first instinct, which echoed relics of an earlier failed attempt, and reframed the optimisation target: maximise Delta = SuccessScore - FailureScore, not minimise concurrency or crashes. (Blocking everything, or allowing everything, are both pathological ways to win on a single metric.)

The most consequential message established the design’s three pillars:

Something important that is missing from that design is this: when the circuit opens due to overload, there’s 0 work getting done. That’s fine for when the backend is b0rked. But often the situation is simply that load > capacity. Levee should be able to do that matching between load and capacity, while also pushing the backend hard enough for it to scale up, while also maintaining a reasonable error profile. The SLO is also not an instantaneous metric. It can be assumed to have a window of evaluation… say 10-100s. Lastly, there is concurrency. It matters because – and this is something the benchmarks don’t model – the upstream concurrency runaway is a cascading failure.

In short: match load to capacity instead of just opening, treat the SLO as a window rather than an instant, and cap concurrency so requests stuck behind timeouts can’t choke the caller.

When the assistant reached for a conventional circuit-breaker state machine backed by an EWMA, the user pointed at an earlier attempt, smartcb, and was blunt about its author:

The user who built smartcb is not very bright, and certainly not very knowledgeable.

The lesson: binary open/close is insufficient, because when the circuit opens during overload, zero work gets done.

These inputs converged on the design – a dynamic inflight limiter using MIMD (Multiplicative Increase / Multiplicative Decrease) control, analogous to TCP congestion control. The key insight: overload moves the breaker from CLOSED to THROTTLED, not OPEN, shedding only the excess while keeping throughput flowing.
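To make "shedding only the excess" concrete, here is a minimal sketch of an inflight limiter in Go. It is not Levee's actual code: the `Limiter` type and its `Acquire`, `Release`, and `SetLimit` methods are hypothetical names chosen for illustration. The idea is only that requests above the current limit are rejected while everything under it keeps flowing.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter is a hypothetical inflight limiter (illustrative, not Levee's API).
// It admits a request only while the number of requests in flight is below limit.
type Limiter struct {
	mu       sync.Mutex
	inflight int
	limit    int
}

func NewLimiter(limit int) *Limiter { return &Limiter{limit: limit} }

// Acquire reports whether the request is admitted. A false result means the
// request is shed; requests already in flight are unaffected.
func (l *Limiter) Acquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inflight >= l.limit {
		return false
	}
	l.inflight++
	return true
}

// Release marks one admitted request as finished.
func (l *Limiter) Release() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inflight > 0 {
		l.inflight--
	}
}

// SetLimit lets a control loop raise or lower the limit at runtime.
func (l *Limiter) SetLimit(n int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if n < 1 {
		n = 1 // assumed floor so some traffic is always admitted while throttling
	}
	l.limit = n
}

func main() {
	l := NewLimiter(2)
	admitted, shed := 0, 0
	// Five requests arrive at once and none has completed yet.
	for i := 0; i < 5; i++ {
		if l.Acquire() {
			admitted++
		} else {
			shed++
		}
	}
	fmt.Println("admitted:", admitted, "shed:", shed) // admitted: 2 shed: 3
}
```

With a limit of 2 and five simultaneous arrivals, two requests go through and three are shed, where a binary breaker in the OPEN state would have rejected all five.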

2. The Benchmark Environment

Two benchmarks drove the work, both built from a 28-hour Cyber Monday traffic scenario (122 load specifications):

Open-loop (prescient): a breaker with perfect foreknowledge acts as an oracle, and each circuit breaker is scored on how often it blocks good traffic or allows bad traffic (lower penalty is better).
Distributed (closed-loop): the same workload drives a simulated backend with HPA autoscaling (1-8 replicas, 30s provisioning lag), load-dependent latency and error escalation, crash-and-restart, and queue shedding. Scoring runs on 200ms epochs with Delta = SuccessScore - FailureScore; the quadratic-inside-sqrt formula punishes concentrated failure bursts hard. One subtlety matters: shed requests report the full 1.5s SLO timeout as their latency, creating a feedback delay for any latency-based signal.

The competitors were two statically-tuned breakers (Static-Peak and Static-BAU) and No-CB, which passes everything through.

3. How It Turned Out

The full design, every experiment, and all the dead ends are documented in EVOLUTION.md. The short version:

The design. Levee became a four-state machine: CLOSED, THROTTLED, OPEN, HALF_OPEN. THROTTLED is the workhorse. Instead of slamming the circuit shut on overload, it runs an MIMD control law that adjusts an inflight limit – multiply by sqrt(2) when errors are within budget, halve when over – matching admitted load to backend capacity. OPEN is reserved for catastrophic failure, and HALF_OPEN handles cautious probing on the way back, with backoff so a fleet of instances doesn’t hammer a recovering backend.
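The control law described above fits in a few lines. This is a sketch under stated assumptions, not the code in levee.go: the function name `nextLimit`, the clamp bounds `minLimit` and `maxLimit`, and the 5% error budget used in `main` are all invented for illustration. Only the two multipliers (sqrt(2) up, one half down) come from the description.

```go
package main

import (
	"fmt"
	"math"
)

// nextLimit applies one MIMD step to the inflight limit.
// Within budget: multiply by sqrt(2). Over budget: halve.
// minLimit and maxLimit are hypothetical clamps, not taken from Levee.
func nextLimit(limit, errRate, budget, minLimit, maxLimit float64) float64 {
	if errRate > budget {
		limit /= 2
	} else {
		limit *= math.Sqrt2
	}
	return math.Max(minLimit, math.Min(maxLimit, limit))
}

func main() {
	limit := 64.0
	budget := 0.05 // assumed error budget, for illustration only
	// Two healthy intervals, one overloaded interval, one healthy interval.
	for _, errRate := range []float64{0.00, 0.00, 0.20, 0.00} {
		limit = nextLimit(limit, errRate, budget, 1, 1024)
		fmt.Printf("errRate=%.2f -> limit=%.1f\n", errRate, limit)
	}
}
```

One property of this pairing is that sqrt(2) times sqrt(2) is 2, so two clean intervals exactly undo one halving. Whether that symmetry is why these multipliers were chosen is not stated in Claude's account.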

The grind. The assistant tried roughly 20 variants of the control law – different multipliers, AIMD, a P-controller, binary search, dead zones – and plain multiplicative control beat all of them. Nearly every change that reduced oscillation-driven shedding also cut throughput by the same amount, because the fast response that causes shedding is what enables fast recovery.

The breakthrough. For a long time Levee only narrowly beat or trailed the best static breaker, and the gap looked structural. Then, during a code review, Google Gemini Pro found the real cause: the EWMA was weighted by wall-clock time, so the silent gap during an OPEN cooldown arrived as one huge-weight sample that wiped out the error-rate history – triggering premature recovery, a flood of traffic, and an instant re-trip. Capping the EWMA weight at one inter-observation interval fixed it, worth about 11,000 Delta. What had looked like a scoring disadvantage was a signal-processing bug.
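The bug is easier to see in code. The sketch below assumes a common form of time-weighted EWMA, where the weight of a new sample is 1 - exp(-dt/tau); the source does not show Levee's exact formula, so treat the `EWMA` type, its `tau` and `maxGap` fields, and the numbers in `main` as hypothetical. The fix shown is the one described: cap the elapsed time credited to a single sample.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// EWMA is a hypothetical time-weighted moving average of the error rate.
type EWMA struct {
	value  float64
	tau    time.Duration // smoothing time constant (assumed)
	maxGap time.Duration // cap on elapsed time credited to one sample; 0 disables the cap
}

// Observe folds in one sample that arrived dt after the previous one.
// Without the cap, a long silent gap pushes the weight toward 1 and the
// single new sample overwrites the accumulated history.
func (e *EWMA) Observe(sample float64, dt time.Duration) {
	if e.maxGap > 0 && dt > e.maxGap {
		dt = e.maxGap
	}
	alpha := 1 - math.Exp(-dt.Seconds()/e.tau.Seconds())
	e.value += alpha * (sample - e.value)
}

func main() {
	// Both averages start at a 90% error rate, as if the breaker had just tripped.
	uncapped := EWMA{value: 0.9, tau: 5 * time.Second}
	capped := EWMA{value: 0.9, tau: 5 * time.Second, maxGap: 200 * time.Millisecond}

	// One successful probe arrives after a 30 s cooldown with no observations.
	uncapped.Observe(0, 30*time.Second)
	capped.Observe(0, 30*time.Second)

	fmt.Printf("uncapped=%.3f capped=%.3f\n", uncapped.value, capped.value)
}
```

In this toy run the uncapped average drops to nearly zero after a single good probe, which is the "premature recovery" described above, while the capped average stays close to 0.9 and needs sustained good samples before it relaxes.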

The result. Levee now wins every benchmark configuration tested – 20 out of 20 – across isolated runs, a cooperative fleet of 100 instances, and the full SLO and queue-depth sweeps, beating Static-Peak everywhere. The rewrite also collapsed the previous 1,327 lines across four files into a single file of about 535 lines. Simpler, and better.

Conclusion

Rebuilding Levee taught me how to work with AI by showing me, twice, how not to. On the design, I kept insisting the floor be cleaned my way, when the robot had a perfectly good method of its own and all I did was trip it up. On the benchmarks, I made the reverse mistake: I said what I wanted but never checked that Claude understood it as I did, and the gaps in understanding became bugs whose results then led the design astray. Same issue both times – we were never quite working from the same picture. The win came not from nannying what code to write, but from clearly stating the design constraints and setting up an evaluation framework Claude could code against.

The best Levee yet is now on GitHub. Go take it for a spin.