The View From the Hot Seat
Circuit breakers are wonderful intermediaries between two high-traffic synchronous network services. They prevent the caller from overloading the callee when the latter is in trouble. In turn, this protects the caller from getting backed up with too many pending requests. All of this assumes, though, that the circuit breaker is well configured, along with the timeouts and rate limits around it.
Unfortunately, I’ve seen way too many incidents where the “well configured” assumption doesn’t hold. Too often, concurrency is allowed to bloat until it exhausts resources, and that happens because, equally often, the timeout is set as if it were a prayer instead of a protective limit.
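To make the tuning burden concrete, here’s a sketch of what “well configured” actually demands. None of these names come from a real library; they stand in for the knobs most breaker implementations expose in some form, every one a guess somebody has to get right:

type BreakerConfig struct {
    Timeout        time.Duration // how long before a pending call is declared dead
    ErrorThreshold float64       // error rate that trips the breaker open
    MinSamples     int           // observations required before tripping
    MaxConcurrency int           // ceiling on in-flight requests
    CooldownPeriod time.Duration // how long to stay open before probing again
}

cfg := BreakerConfig{
    Timeout:        200 * time.Millisecond, // a prayer, unless actually measured
    ErrorThreshold: 0.5,
    MinSamples:     20,
    MaxConcurrency: 100,
    CooldownPeriod: 5 * time.Second,
}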
Once I identified this failure pattern, I thought I should do something about it.
First Attempt: smartcb (2017)
My first serious attempt was smartcb, built in October 2017. The insight was that circuit breakers should learn from observed error rates rather than relying on static thresholds.
smartcb wrapped the popular rubyist/circuitbreaker library and added a learning phase. It observed the baseline error profile of protected tasks and adjusted tripping thresholds using exponential weighted moving averages. I implemented Adjusted Wald confidence intervals for making statistically valid decisions with limited samples.
st := smartcb.NewSmartTripper(taskQPS, smartcb.NewPolicies())
scb := smartcb.NewSmartCircuitBreaker(st)
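The learning mechanism itself was built on the textbook EWMA update. A minimal sketch of the idea, not smartcb’s actual code:

// ewma folds a new error-rate observation into the running baseline.
// A small alpha keeps the baseline stable against momentary noise.
func ewma(baseline, observed, alpha float64) float64 {
    return alpha*observed + (1-alpha)*baseline
}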
The approach worked better than static thresholds. But smartcb was a wrapper around an existing library, constrained by its design decisions. The configuration still required specifying expected QPS. The learning phase was a separate mode rather than continuous adaptation. Most importantly, it did no concurrency limiting.
Anyway, the project served its purpose and I moved on. But the problem stayed with me.
Levee (December 2024)
After a gap of seven years, the topic stirred in my mind again. I can’t recall, as of this writing, what prompted the thought, but I suspect it had something to do with service meshes. And concurrency.
Levee was designed around a single configuration: the SLO. Levee figures out the rest of the operational configurations on its own.
slo := levee.SLO{
    SuccessRate: 0.95,
    Timeout:     time.Millisecond * 500,
    Warmup:      time.Second * 300,
}
l := levee.NewLevee(slo)
What about Timeout and Warmup, you ask? Well, Timeout is inconsequential as of now and is not used anywhere. Warmup is a safety margin I added to disable stats gathering during initialisation, though I have some ideas for getting rid of it. Either way, the success-rate SLO is the only parameter that is consequential to Levee’s operation.
The initial implementation landed on December 29, 2024. Basic functionality came together quickly. Concurrent access support followed the next day. Then I drifted off to other things for almost a year.
The Measurement Problem
One reason I didn’t return to Levee was this fundamental question: how do you prove that a self-tuning circuit breaker actually works better than a well-tuned static one?
I needed to simulate realistic traffic patterns and measure decisions against ground truth. This meant building a load generator that could produce months of synthetic traffic in seconds.
Building loadgen
This is where I leaned on Claude Code, an AI assistant from Anthropic, to help build a proper synthetic load generator. I had been dabbling with AI coding assistants since ChatGPT and GitHub Copilot launched, but nothing clicked for me like Claude Code with Emacs, a setup I landed on last month.
The generator needed statistical realism: Poisson arrival processes, heavy-tailed latency distributions matching real p50/p99 targets, configurable error rates. It needed to model multiple load phases for daily traffic patterns. Most importantly, it needed to run on simulated time so even days’ worth of traffic could be simulated in seconds.
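The core of such a generator is small. Here’s a minimal sketch of the idea (not loadgen’s actual code): Poisson arrivals via exponential inter-arrival gaps, and log-normal latencies solved from the p50/p99 targets.

import (
    "math"
    "math/rand"
)

// Event is one synthetic request on the simulated clock.
type Event struct {
    At      float64 // arrival time, seconds from the start of the run
    Latency float64 // service latency, seconds
    Err     bool
}

// generate produces n events at a mean rate of rps. Exponential inter-arrival
// gaps give a Poisson process; log-normal latency gives the heavy tail needed
// to hit realistic p50/p99 targets.
func generate(n int, rps, p50, p99, errRate float64, rng *rand.Rand) []Event {
    // Solve the log-normal parameters from the two percentile targets:
    // ln(p50) is the median of the underlying normal, and p99 sits about
    // 2.326 standard deviations above it.
    mu := math.Log(p50)
    sigma := (math.Log(p99) - mu) / 2.326

    events := make([]Event, n)
    clock := 0.0
    for i := range events {
        clock += rng.ExpFloat64() / rps // mean gap of 1/rps seconds
        events[i] = Event{
            At:      clock,
            Latency: math.Exp(mu + sigma*rng.NormFloat64()),
            Err:     rng.Float64() < errRate,
        }
    }
    return events
}

Since the clock is just a number, simulating a day of traffic costs no more than the loop itself.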
Efficiency was critical, since a single benchmark run generates hundreds of millions of events. The final implementation produces 15+ million events per second, at 12 bytes and 0.5 allocations per event.
loadgen became its own project, built for testing levee but useful beyond that for any simulation-based testing.
Simulation-First Development
With loadgen providing realistic traffic, I needed Levee to work with simulated time rather than wall clocks. This drove a significant API change: adding explicit timestamp parameters to all methods.
The original API wrapped protected calls:
state, err := l.Call(func() error {
    return callBackend()
})
For simulation testing, I needed to control timing explicitly. This led to the out-of-band API:
stateChange, err := l.Start(timestamp)
// ... simulate the call ...
l.Success(endTimestamp, duration)
What started as a testing requirement turned out to have broader applications. Stream processors could use Levee with event-time semantics. Levee could be used as a shadow alongside an existing breaker to perform comparative testing. The API became more flexible by accident.
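As an illustration of the shadow idea, here’s a sketch of what that wiring might look like. The Start/Success calls follow the shapes shown above; the *levee.Levee type, the Failure counterpart, and the assumption that a non-nil error from Start signals a rejected request are all mine, not documented API:

import (
    "log"
    "time"

    "github.com/codemartial/levee"
)

// shadowCall routes traffic through the production path while feeding the
// same observations to a shadow Levee instance, whose verdicts are only logged.
func shadowCall(l *levee.Levee, call func() error) error {
    start := time.Now()
    if _, lerr := l.Start(start); lerr != nil {
        log.Printf("shadow: levee would have blocked this request")
    }
    err := call()
    elapsed := time.Since(start)
    if err != nil {
        l.Failure(time.Now(), elapsed) // assumed counterpart to Success
    } else {
        l.Success(time.Now(), elapsed)
    }
    return err
}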
The Prescient Breaker
Realistic traffic was only half the problem. I still needed ground truth.
The solution was the Prescient Breaker: an oracle that knows the future from load specs. It opens exactly when the backend becomes unhealthy and closes exactly when it recovers. Zero detection lag. Perfect decisions. Impossible in reality, but it provides the upper bound for evaluation.
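Conceptually the oracle is trivial, which is the point: its state is a pure lookup into the load spec’s health timeline. A sketch under assumed types (Interval is mine, not loadgen’s):

import "time"

// Interval marks a span of simulated time in which the load spec declares
// the backend unhealthy.
type Interval struct {
    From, To time.Duration
}

// prescientOpen reports whether the ideal breaker is OPEN at simulated time t.
func prescientOpen(t time.Duration, unhealthy []Interval) bool {
    for _, iv := range unhealthy {
        if t >= iv.From && t < iv.To {
            return true // backend is known to be down; block everything
        }
    }
    return false
}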
Every circuit breaker decision could now be scored against perfection:
- BadTraffic: Requests allowed when Prescient was OPEN (letting bad traffic through)
- LostBusiness: Requests blocked when Prescient was CLOSED (blocking good traffic)
This was a good start, but a raw count of requests was still missing something. When a system is under stress, you can push a thousand requests through it as long as the RPS is low enough. Try to send them as a short burst, however, and you risk breaking the camel’s back.
So I decided that if each bad request was worth 1 penalty point at 1x RPS, it should be worth 10 penalty points at 10x RPS. That led to RPS-squared scoring: score = RPS × Traffic = RPS × (RPS × t) = RPS² × t.
Arguably, this weighting applies to lost-business traffic as well. I think protecting a 10x flash-sale spike lasting a minute is more valuable than protecting 10 minutes of business-as-usual (BAU) traffic. This is how I score the benchmarks, but if you disagree, you’ll find that Levee is still better at preventing lost-business situations than a breaker tuned for regular traffic.
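In code, the score is just the root of a sum of squares. A sketch, assuming one RPS sample per second of mismatch between the candidate breaker and the ideal one:

import "math"

// penalty implements the √Σ(RPS²) scoring used in the benchmark table below.
// Each sample is the request rate during one second in which the candidate
// breaker disagreed with the ideal decision.
func penalty(mismatchedRPS []float64) float64 {
    sum := 0.0
    for _, rps := range mismatchedRPS {
        sum += rps * rps // every second's requests weighted by their own RPS
    }
    return math.Sqrt(sum)
}

Ten bad requests forced through in a single 10 RPS second contribute 100 to the sum; the same ten requests trickling through at 1 RPS over ten seconds contribute only 10.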
The Cyber Monday Benchmark
To test the adaptability of Levee against regular static circuit breakers, I needed a benchmark scenario that would flex quite a bit. It wasn’t too hard to come up with a Cyber Monday sale as the ideal scenario, because that’s my lived nightmare! So I conjured up a 28-hour test mimicking a Cyber Monday gone not-so-well.
It starts with 4 hours of Sunday night calm, then a 15x traffic spike at midnight that overwhelms an under-provisioned backend. Retry storms make things worse before they get better. The rest of the day brings hourly sales events with 3-8x traffic spikes, a massive 48x spike at 10 AM, gradual database degradation, and a second major incident at 6 PM.
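For a sense of its shape, here’s a hypothetical rendering of the scenario as load phases. This is not loadgen’s actual spec format, and the error rates are illustrative:

type Phase struct {
    Start, Dur time.Duration
    RPSFactor  float64 // multiple of baseline Sunday-night traffic
    ErrRate    float64
}

cyberMonday := []Phase{
    {Start: 0, Dur: 4 * time.Hour, RPSFactor: 1, ErrRate: 0.001},                 // Sunday night calm
    {Start: 4 * time.Hour, Dur: 1 * time.Hour, RPSFactor: 15, ErrRate: 0.30},     // midnight spike + retry storm
    {Start: 14 * time.Hour, Dur: 30 * time.Minute, RPSFactor: 48, ErrRate: 0.05}, // 10 AM mega-spike
    {Start: 22 * time.Hour, Dur: 2 * time.Hour, RPSFactor: 6, ErrRate: 0.40},     // second incident at 6 PM
    // ...plus the hourly 3-8x sales spikes and gradual database degradation
}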
I tested Levee against two carefully tuned static circuit breakers. Static-BAU was optimized for normal traffic patterns. Static-Peak was optimized for high-traffic conditions. Both were derived from probabilistic analysis of expected failure patterns, something which only the most disciplined teams ever manage to do in the first place.
Results
Levee achieved substantially better overall decision quality than either static configuration (1.5x better than Static-Peak, 2.4x better than Static-BAU):
- Up to 10x faster incident detection than Static-Peak
- Up to 1.7x less unnecessary blocking than Static-BAU
==============================================================================================================
COMPARATIVE SUMMARY - All Candidates vs Prescient (Ideal)
==============================================================================================================
Candidate   | Blocked |  Allowed | Flap | FalseAlarm | LateDetect | BadTraffic | LostBusiness | TotalPenalty
------------------------------------------------------------------------------------------------------------
Prescient   |  773707 | 55741451 |    0 |          0 |          0 |          0 |            0 |            0
Levee       | 4146182 | 52386678 |   74 |         11 |          0 |       2346 |        16699 |        19044
Static-Peak | 3850335 | 52664823 |    8 |          0 |          1 |      21585 |         7527 |        29111
Static-BAU  | 4351238 | 52176701 |   38 |         10 |          0 |      13387 |        31706 |        45092
BadTraffic = √Σ(RPS²) when Prescient OPEN but CB allowed (lower = better)
LostBusiness = √Σ(RPS²) when Prescient CLOSED but CB blocked (lower = better)
==============================================================================================================
The static configurations demonstrated the fundamental trade-off that self-tuning eliminates. Static-BAU was responsive but unstable, generating false alarms during traffic spikes. Static-Peak was stable but slow, letting lots of bad traffic through while waiting to confirm incidents.
Levee adapted continuously. It responded fast without triggering too many false alarms, and it recovered quickly but not prematurely, at the cost of some flapping along the way.
Full Circle
The final Levee implementation brought back ideas from smartcb: Adjusted Wald confidence intervals for statistically valid decisions with limited samples. But now they operated continuously rather than in a separate learning phase. The system also added latency spike detection, ring buffer metrics for predictable memory usage, and state persistence for surviving restarts.
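For the curious, the Adjusted Wald (a.k.a. Agresti-Coull) interval is a small amount of code. A sketch, not Levee’s actual implementation:

import "math"

// adjustedWald returns a confidence interval for a failure rate observed as
// failures out of n calls. z is the normal quantile for the desired
// confidence (1.96 for 95%). Unlike the plain Wald interval, it stays sane
// at the small sample sizes a freshly-closed breaker has to work with.
func adjustedWald(failures, n int, z float64) (lo, hi float64) {
    nAdj := float64(n) + z*z
    pAdj := (float64(failures) + z*z/2) / nAdj
    margin := z * math.Sqrt(pAdj*(1-pAdj)/nAdj)
    return pAdj - margin, pAdj + margin
}

Two failures in ten samples at 95% confidence gives an error-rate interval of roughly 5% to 52%: far too wide to confidently declare an SLO breach, which is exactly the restraint a breaker needs early in its observation window.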
Eight years after smartcb, the circuit breaker tuning problem finally has a solution that I can accept.
What’s Next
Levee is ready for real-world trials. The core is solid, the benchmarks are comprehensive, and the API is stable. I believe a network of Levee instances would operate like a decentralised service mesh, as far as resiliency is concerned, without the overhead of proxying, metrics aggregation, or a proliferation of YAMLs.
The code is at https://github.com/codemartial/levee. Take it for a test drive in your environment and let me know how it goes!