---
title: "Traffic Shaping with Digital Inductors"
description: "Distributed system burst suppression and rate limiting, with less configuration."
author: "Adam Fillion"
date: "2026-03-21"
categories: [distributed-systems, software-engineering]
draft: false
---
```{python}
#| label: setup
#| code-fold: true
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
np.random.seed(42)
# Bursty traffic signal: silence, a single burst, then silence
# (the same shape the interactive charts below generate)
duration = 30.0
dt_step = 0.05
t = np.arange(0, duration, dt_step)
def _traffic_rate(tv):
    """Piecewise rate: silence, a 3-second burst at 80, then silence."""
    if tv < 5:
        return 0
    elif tv < 8:
        return 80
    return 0
traffic = np.array([_traffic_rate(ti) for ti in t])
```
Traffic shaping is a core topic in distributed systems. If your system has users, or if it's built out of imperfect physical systems (spoiler: it is), then it's exposed to chaos that needs to be contained (traffic shaped).
Chaos is a regional outage from an asteroid, or a flood, or a war, or it's the Super Bowl halftime show, and everything in between. The invariant: you don't control it.
This chaos tends to bump into the practical limits of the system, and when it does, the system degrades: increased latency, reduced throughput, outright failures. Engineers scramble to draw boxes around the chaos to constrain it. In practice, this looks like rate limiters, throttlers, load shedders, quotas, and other statically configured components sprinkled across the distributed system.
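As a concrete example of such a component, a token bucket is only a few lines of code, but both of its knobs are static configuration someone has to choose and maintain. A minimal sketch (the class and parameter names are mine):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: tokens refill at a constant
    rate, each admitted request spends one, and stored tokens let
    short bursts through."""

    def __init__(self, capacity, fill_rate, now=None):
        self.capacity = capacity    # static config: max burst size
        self.fill_rate = fill_rate  # static config: tokens per second
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A burst of six requests against a bucket of five: the sixth is rejected.
bucket = TokenBucket(capacity=5, fill_rate=1.0, now=0.0)
decisions = [bucket.allow(now=0.0) for _ in range(6)]
```

Multiply `capacity` and `fill_rate` by every service, region, and dependency, and the maintenance burden described below follows.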
Years pass, engineers come and go, and the system ends up with a huge number of limits that need to be continuously maintained. An unfortunate, albeit worthy, tradeoff for resilience.
During my time at AWS, many of the availability issues I investigated were caused by millisecond-scale bursts. For a short period of time, the system would burst to 10x its usual capacity, exhaust any limits we (or our dependencies) had, and return failures or [worse](https://adamfillion.com/appendix/metastability/). At the very least, we had to provision around these bursts, often doubling or tripling capacity requirements to maintain multiple-9s availability targets.
Identifying each burst mode and limiting every one of them sounds appealing at first, until you multiply the service out to dozens of regions, each at a different scale, account for the dynamic nature of the service itself, and realize you need a sensible limit not just for today, but for all days in the future. Usually, this doesn't happen, and you end up with a codebase full of configuration that is stale and sub-optimal, but too risky to change without thorough analysis.
To add some realism to this discussion, let's look at how different rate limiters can shape an input rate:
```{ojs}
//| label: rate-limiter-controls
//| echo: false
viewof config = {
function slider(min, max, value, step, label) {
const id = `rl-${label.replace(/\W/g, '')}`;
const el = html`<div style="display:flex;align-items:center;gap:4px;margin:1px 0;">
<label for="${id}" style="font-size:10px;color:#666;min-width:55px;white-space:nowrap;">${label}</label>
<input id="${id}" type="range" min="${min}" max="${max}" value="${value}" step="${step}"
style="flex:1;height:12px;margin:0;cursor:pointer;">
<span style="font-size:10px;color:#888;min-width:24px;text-align:right;">${value}</span>
</div>`;
const inp = el.querySelector("input");
const out = el.querySelector("span");
inp.addEventListener("input", () => { out.textContent = inp.value; });
el._input = inp;
return el;
}
const tb1 = slider(5, 80, 30, 1, "Fill Rate");
const tb2 = slider(5, 100, 20, 1, "Capacity");
const sw1 = slider(0.5, 10, 2, 0.5, "Window");
const sw2 = slider(10, 200, 60, 5, "Max Reqs");
const lb1 = slider(5, 80, 30, 1, "Drain Rate");
const form = html`<div style="display:grid;grid-template-columns:1fr 1fr 1fr;gap:6px 12px;margin-bottom:4px;">
<div>
<div style="font-weight:600;color:#10b981;font-size:11px;margin-bottom:2px;">Token Bucket</div>
${tb1}${tb2}
</div>
<div>
<div style="font-weight:600;color:#f59e0b;font-size:11px;margin-bottom:2px;">Sliding Window</div>
${sw1}${sw2}
</div>
<div>
<div style="font-weight:600;color:#ef4444;font-size:11px;margin-bottom:2px;">Leaky Bucket</div>
${lb1}
</div>
</div>`;
function update() {
form.value = {
tbFillRate: +tb1._input.value,
tbCapacity: +tb2._input.value,
swWindow: +sw1._input.value,
swLimit: +sw2._input.value,
lbDrainRate: +lb1._input.value,
};
form.dispatchEvent(new Event("input", {bubbles: true}));
}
for (const s of [tb1, tb2, sw1, sw2, lb1])
s._input.addEventListener("input", update);
update();
return form;
}
```
```{ojs}
//| label: rate-limiter-chart
//| code-fold: true
//| code-summary: "Show rate limiter simulation code"
{
const dt = 0.05;
const duration = 30;
const steps = Math.floor(duration / dt);
// Generate traffic: silence, a single burst, then silence
let times = [], input = [];
for (let i = 0; i < steps; i++) {
let t = i * dt;
times.push(t);
let rate;
if (t < 5) rate = 0;
else if (t < 8) rate = 80;
else rate = 0;
input.push(rate);
}
// Token Bucket: refills tokens at constant rate, spends one per request.
// Stored tokens allow short bursts through; excess is queued and
// released as tokens refill — no traffic is dropped.
let tbTokens = config.tbCapacity;
let tbQueue = 0;
let tbOut = [];
for (let i = 0; i < steps; i++) {
let arriving = input[i] * dt;
tbTokens = Math.min(config.tbCapacity, tbTokens + config.tbFillRate * dt);
let fromQueue = Math.min(tbQueue, tbTokens);
tbQueue -= fromQueue;
tbTokens -= fromQueue;
let passed = Math.min(arriving, tbTokens);
tbTokens -= passed;
tbQueue += (arriving - passed);
tbOut.push((fromQueue + passed) / dt);
}
// Sliding Window: counts served requests in a rolling time window.
// Serves up to the limit per window; excess is queued and released
// as the window slides — no traffic is dropped.
let swServed = [];
let swQueue = 0;
let swOut = [];
for (let i = 0; i < steps; i++) {
let arriving = input[i] * dt;
let windowSteps = Math.floor(config.swWindow / dt);
let inWindow = 0;
for (let j = Math.max(0, i - windowSteps); j < i; j++) {
inWindow += swServed[j];
}
let canServe = Math.max(0, config.swLimit - inWindow);
let fromQueue = Math.min(swQueue, canServe);
swQueue -= fromQueue;
canServe -= fromQueue;
let passed = Math.min(arriving, canServe);
swQueue += (arriving - passed);
let totalServed = fromQueue + passed;
swServed.push(totalServed);
swOut.push(totalServed / dt);
}
// Leaky Bucket: all traffic enters an unbounded queue that drains at
// a constant rate — no traffic is dropped.
let lbQueue = 0;
let lbOut = [];
for (let i = 0; i < steps; i++) {
let arriving = input[i] * dt;
lbQueue += arriving;
let drained = Math.min(lbQueue, config.lbDrainRate * dt);
lbQueue -= drained;
lbOut.push(drained / dt);
}
// All series on a single chart
let data = [];
for (let i = 0; i < steps; i++) {
data.push({t: times[i], rate: input[i], type: "Input Rate"});
data.push({t: times[i], rate: tbOut[i], type: "Token Bucket"});
data.push({t: times[i], rate: swOut[i], type: "Sliding Window"});
data.push({t: times[i], rate: lbOut[i], type: "Leaky Bucket"});
}
return Plot.plot({
marks: [
Plot.areaY(data.filter(d => d.type === "Input Rate"), {
x: "t", y: "rate", fill: "#6366f1", fillOpacity: 0.08
}),
Plot.lineY(data.filter(d => d.type === "Input Rate"), {
x: "t", y: "rate", stroke: "#6366f1", strokeWidth: 2
}),
Plot.lineY(data.filter(d => d.type !== "Input Rate"), {
x: "t", y: "rate", stroke: "type", strokeWidth: 2,
strokeOpacity: 0.6, strokeDasharray: "6 3"
}),
Plot.ruleY([0])
],
color: {
domain: ["Input Rate", "Token Bucket", "Sliding Window", "Leaky Bucket"],
range: ["#6366f1", "#10b981", "#f59e0b", "#ef4444"],
legend: true
},
x: {label: "Time (arbitrary unit)"},
y: {label: "Rate (arbitrary unit)", domain: [0, 100]},
width: width,
height: 400,
marginLeft: 50
});
}
```
These configurations need to be maintained throughout the lifecycle of the system the component is deployed to. If all we wanted to do was soften bursts, we could deploy one of these components, but we would inherit the configuration overhead as well, since as the system changes, your definition of a "burst" will likely change too.
For this use case, it would be nice if there existed a software component whose definition of a "burst" was derived from an invariant (or nearly invariant) property of the system.
The best place to start is with an idealized notion of how this component should behave:
```{ojs}
//| label: idealized-controls
//| echo: false
viewof tau = {
const id = "rl-tau";
const el = html`<div style="display:flex;align-items:center;gap:4px;max-width:320px;margin:1px 0;">
<label for="${id}" style="font-size:10px;color:#666;min-width:55px;white-space:nowrap;">Reactivity</label>
<input id="${id}" type="range" min="0.5" max="10" value="3" step="0.25"
style="flex:1;height:12px;margin:0;cursor:pointer;">
<span style="font-size:10px;color:#888;min-width:24px;text-align:right;">3</span>
</div>`;
const inp = el.querySelector("input");
const out = el.querySelector("span");
inp.addEventListener("input", () => {
out.textContent = inp.value;
el.value = +inp.value;
el.dispatchEvent(new Event("input", {bubbles: true}));
});
el.value = +inp.value;
return el;
}
```
```{ojs}
//| label: idealized-behavior
//| code-fold: true
//| code-summary: "Show burst suppression simulation code"
{
const dt = 0.05;
const duration = 30;
const steps = Math.floor(duration / dt);
// Same input signal as the rate limiter chart
let times = [], input = [];
for (let i = 0; i < steps; i++) {
let t = i * dt;
times.push(t);
if (t < 5) input.push(0);
else if (t < 8) input.push(80);
else input.push(0);
}
// EWMA burst suppressor: smooths the rate estimate, queues events
// when the instantaneous rate exceeds the smoothed rate.
// tau comes from the slider above.
let smoothedRate = 0;
let queue = 0;
let bsOut = [];
for (let i = 0; i < steps; i++) {
let arriving = input[i] * dt;
let currentRate = input[i];
// Time-aware EWMA smoothing factor (same α from the article)
let alpha = 1 - Math.exp(-dt / tau);
smoothedRate = alpha * currentRate + (1 - alpha) * smoothedRate;
// Queue excess above the smoothed rate, drain queue at smoothed rate
queue += arriving;
let canServe = smoothedRate * dt;
let served = Math.min(queue, canServe);
queue -= served;
bsOut.push(served / dt);
}
let data = [];
for (let i = 0; i < steps; i++) {
data.push({t: times[i], rate: input[i], type: "Input Rate"});
data.push({t: times[i], rate: bsOut[i], type: "Allowed Output Rate"});
}
return Plot.plot({
marks: [
Plot.areaY(data.filter(d => d.type === "Input Rate"), {
x: "t", y: "rate", fill: "#6366f1", fillOpacity: 0.08
}),
Plot.lineY(data.filter(d => d.type === "Input Rate"), {
x: "t", y: "rate", stroke: "#6366f1", strokeWidth: 2
}),
Plot.lineY(data.filter(d => d.type === "Allowed Output Rate"), {
x: "t", y: "rate", stroke: "#10b981", strokeWidth: 2,
strokeOpacity: 0.7, strokeDasharray: "6 3"
}),
Plot.ruleY([0])
],
color: {
domain: ["Input Rate", "Allowed Output Rate"],
range: ["#6366f1", "#10b981"],
legend: true
},
x: {label: "Time (arbitrary unit)"},
y: {label: "Rate (arbitrary unit)", domain: [0, 100]},
width: width,
height: 350,
marginLeft: 50
});
}
```
As the input instantaneously jumps, the allowed rate ramps up slowly, reducing the peak load on the system by roughly 40% with the default reactivity setting.
The only configuration needed for a component like this is a single constant which represents the reactivity of the system. This will be closely related to the "time scale" of the system, and likely won't change throughout its lifecycle. Low latency systems will have higher reactivity to bursts, and higher latency systems will have lower reactivity.
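One way to build intuition for this constant is the EWMA's step response: with the time-aware smoothing factor used throughout this article, the estimate reaches about 63% (1 − 1/e) of a step input after τ seconds, no matter how finely time is discretized. A quick sketch (function name and numbers are mine):

```python
import math

def ewma_step_response(tau, dt, t_end):
    """Smooth a unit step with a time-aware EWMA and return the
    estimate after t_end seconds. alpha = 1 - exp(-dt / tau) makes
    the result depend on elapsed time, not on the tick size dt."""
    alpha = 1 - math.exp(-dt / tau)
    estimate = 0.0
    for _ in range(round(t_end / dt)):
        estimate = alpha * 1.0 + (1 - alpha) * estimate
    return estimate

# After tau seconds the estimate sits at ~63% of the step,
# regardless of how small dt is.
coarse = ewma_step_response(tau=3.0, dt=0.05, t_end=3.0)
fine = ewma_step_response(tau=3.0, dt=0.005, t_end=3.0)
```

So "reactivity" has a concrete meaning here: τ is the time the component takes to absorb roughly two-thirds of a sustained change in rate.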
This component would ideally have minimal memory and CPU footprint and handle any scale, so that it can be "sprinkled" throughout the codebase with minimal analysis.
My formal education is in electrical engineering, so naturally an inductor seems like the right solution here: a "memoryless" (quantum experts might disagree) component that suppresses bursts of electrons, while being invisible to consistent streams (DC).
In the digital realm, we can do the same thing: track a smoothed estimate of the current event rate, and when events arrive faster than the estimate, queue them and release at the smoothed rate. Events that arrive at or below the expected rate pass straight through — the component is invisible to steady-state traffic, just like an inductor is invisible to DC.
The core of the solution is an exponentially weighted moving average (EWMA) with a time-aware smoothing factor: `α = 1 - exp(-dt / τ)`, where `dt` is the time since the last update and `τ` is the reactivity time constant.
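Putting the estimator and the queue together, the behavior in the chart above can be sketched as a fixed-timestep loop (a toy model mirroring the simulation, not a production implementation):

```python
import math

def suppress_burst(input_rates, dt=0.05, tau=3.0):
    """Digital-inductor sketch: smooth the observed rate with a
    time-aware EWMA, queue all arriving traffic, and drain the
    queue at the smoothed rate. Returns the shaped output rates."""
    alpha = 1 - math.exp(-dt / tau)
    smoothed, queue, out = 0.0, 0.0, []
    for rate in input_rates:
        smoothed = alpha * rate + (1 - alpha) * smoothed
        queue += rate * dt                  # arrivals this tick
        served = min(queue, smoothed * dt)  # drain at the smoothed rate
        queue -= served
        out.append(served / dt)
    return out

# The single burst from the earlier charts: silence, 80 req/s for 3s, silence.
shaped = suppress_burst([0.0] * 100 + [80.0] * 60 + [0.0] * 440)
```

At steady state the smoothed rate converges to the input rate and the queue stays empty, so the component is effectively invisible, which is the DC-invisibility property of the inductor analogy.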
::: {.callout-tip collapse="true"}
## Full pseudocode
```
algorithm INDUCTOR_BURST_SUPPRESSION:
state:
smoothed_interval ← null # EWMA of inter-arrival time
last_arrival_time ← null
last_output_time ← null
queue ← [] # bounded FIFO
parameter:
τ # time constant (seconds)
on_arrival(event, now):
if last_arrival_time ≠ null:
dt ← now - last_arrival_time
if smoothed_interval = null:
smoothed_interval ← dt
else:
α ← 1 - exp(-dt / τ)
smoothed_interval ← α · dt + (1 - α) · smoothed_interval
last_arrival_time ← now
if can_forward(now):
forward(event, now)
else if queue is not full:
queue.push(event)
schedule_poll(now + smoothed_interval)
else:
drop(event)
can_forward(now):
if last_output_time = null:
return true
return (now - last_output_time) ≥ smoothed_interval
forward(event, now):
last_output_time ← now
send event downstream
on_poll(now):
    if queue is not empty and can_forward(now):
        forward(queue.pop_front(), now)
    if queue is not empty:
        schedule_poll(now + smoothed_interval)
```
:::
This is too academic for most, so let's do some discrete-event analysis using [happy simulator](https://github.com/adamfilli/happy-simulator/tree/main):
```{mermaid}
%%| fig-width: 6
flowchart LR
A["Input Events<br/>(variable rate)"] --> B["EWMA<br/>Estimator"]
B -->|"smoothed rate"| C{"rate ><br/>estimate?"}
C -->|yes| D["Queue<br/>(bounded FIFO)"]
C -->|no| F["Output<br/>(shaped rate)"]
D -->|"drain at<br/>smoothed rate"| F
```
```{python}
#| label: simulation
#| code-fold: true
from dataclasses import dataclass
from happysimulator import Simulation, Source, Event, Entity, Instant
from happysimulator.components.rate_limiter.inductor import Inductor
from happysimulator.instrumentation.collectors import ThroughputTracker
from happysimulator.instrumentation.data import Data
from happysimulator.instrumentation.probe import Probe
from happysimulator.load.profile import Profile
class InputCounter(Entity):
"""Counts arrivals then forwards downstream."""
def __init__(self, name: str, downstream: Entity) -> None:
super().__init__(name)
self.data = Data()
self._downstream = downstream
def handle_event(self, event: Event) -> list[Event]:
self.data.add_stat(1.0, event.time)
return [self.forward(event, self._downstream)]
@dataclass(frozen=True)
class InductorShowcaseProfile(Profile):
base_rate: float = 10.0
def get_rate(self, time: Instant) -> float:
t = time.to_seconds()
if t < 10.0:
return self.base_rate
if t < 28.0:
return 50.0
if t < 35.0:
return self.base_rate
if t < 60.0:
fraction = (t - 35.0) / 25.0
return self.base_rate + fraction * (50.0 - self.base_rate)
if t < 67.0:
return self.base_rate
if 70.0 <= t < 75.0 or 79.0 <= t < 84.0:
return 80.0
return self.base_rate
output_tracker = ThroughputTracker("OutputTracker")
inductor = Inductor("Inductor", downstream=output_tracker, time_constant=2.0)
input_counter = InputCounter("InputCounter", downstream=inductor)
source = Source.with_profile(
profile=InductorShowcaseProfile(),
target=input_counter,
event_type="Request",
poisson=False,
name="LoadGenerator",
)
rate_probe, rate_data = Probe.on(inductor, "estimated_rate", interval=0.1)
depth_probe, depth_data = Probe.on(inductor, "queue_depth", interval=0.1)
sim = Simulation(
sources=[source],
entities=[input_counter, inductor, output_tracker],
probes=[rate_probe, depth_probe],
duration=90.0,
)
summary = sim.run()
print(f"Simulated {summary.total_events_processed} events in {summary.wall_clock_seconds:.2f}s")
print(f"Inductor stats: {inductor.stats}")
```
```{python}
#| label: throughput-chart
#| code-fold: true
# Input vs Output throughput
input_rate = input_counter.data.rate(window_s=1.0)
output_rate = output_tracker.data.rate(window_s=1.0)
fig = go.Figure()
fig.add_trace(go.Scatter(
x=input_rate.times(), y=input_rate.raw_values(),
mode='lines', name='Input',
line=dict(color='#6366f1', width=1.5),
))
fig.add_trace(go.Scatter(
x=output_rate.times(), y=output_rate.raw_values(),
mode='lines', name='Output (after inductor)',
line=dict(color='#10b981', width=2),
))
# Phase annotations
fig.add_vrect(x0=0, x1=28, fillcolor="#f0f0ff", opacity=0.3, line_width=0,
annotation_text="Phase 1: Step-up", annotation_position="top left")
fig.add_vrect(x0=35, x1=60, fillcolor="#f0fff0", opacity=0.3, line_width=0,
annotation_text="Phase 2: Ramp", annotation_position="top left")
fig.add_vrect(x0=67, x1=90, fillcolor="#fff0f0", opacity=0.3, line_width=0,
annotation_text="Phase 3: Periodic bursts", annotation_position="top left")
fig.update_layout(
title="Input vs Output Throughput",
xaxis_title="Time (arbitrary unit)",
yaxis_title="Throughput (req/s)",
height=450,
legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99),
)
fig.show()
```
```{python}
#| label: ewma-chart
#| code-fold: true
# EWMA rate estimate — the inductor's internal state
ewma_bucketed = rate_data.bucket(window_s=0.5)
fig = go.Figure()
fig.add_trace(go.Scatter(
x=ewma_bucketed.times(), y=ewma_bucketed.means(),
mode='lines', name='EWMA Rate Estimate',
line=dict(color='#f59e0b', width=2),
))
fig.add_vrect(x0=0, x1=28, fillcolor="#f0f0ff", opacity=0.3, line_width=0)
fig.add_vrect(x0=35, x1=60, fillcolor="#f0fff0", opacity=0.3, line_width=0)
fig.add_vrect(x0=67, x1=90, fillcolor="#fff0f0", opacity=0.3, line_width=0)
fig.update_layout(
title="EWMA Rate Estimate (internal state)",
xaxis_title="Time (arbitrary unit)",
yaxis_title="Estimated rate (req/s)",
height=350,
showlegend=False,
)
fig.show()
```
In Phase 1, the inductor smoothly ramps up output throughput rather than instantly passing through the 5x spike. The queue absorbs the burst while the EWMA estimate gradually adapts.
In Phase 2, the linear ramp passes through almost unimpeded — the rate of change is slow enough to go mostly unnoticed by the inductor.
In Phase 3, the two bursts are close enough that the EWMA retains memory from the first burst when the second arrives. The inductor reacts faster to the second spike because it hasn't fully decayed back to baseline.
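To make this memory effect concrete, here is a toy trace of just the estimator with τ = 2 (matching the simulation's `time_constant`). After a burst ends, the estimate decays back toward baseline exponentially, so a second burst arriving within a few τ meets an already-elevated estimate:

```python
import math

def ewma_trace(rates, dt=0.05, tau=2.0):
    """Rate estimate at every tick for a fixed-timestep input."""
    alpha = 1 - math.exp(-dt / tau)
    est, trace = 0.0, []
    for r in rates:
        est = alpha * r + (1 - alpha) * est
        trace.append(est)
    return trace

# A 5s burst at 80 req/s, then 4s back at a 10 req/s baseline.
trace = ewma_trace([80.0] * 100 + [10.0] * 80)
```

Four seconds (2τ) after the burst ends, the estimate is still well above the 10 req/s baseline, so a second burst at that moment is admitted faster than the first was.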
Admittedly, I haven't implemented this at large scale anywhere in practice. I suspect it would be most useful in conjunction with distributed throttlers, which are often required to provide deterministic limits to external users but have convergence times that allow bursts to sneak through and wreak havoc.