distributed systems · simulation

Published January 21, 2026

This example uses discrete event simulation, powered by happy-simulator, to demonstrate metastability caused by client retries.

The demonstration is (probably) the simplest possible example of metastability: sustained degradation caused by retries, which are caused by timeouts, which are caused by queue wait times.

Metastability

    ┌─────────────────────────────────────────────────────────────────────┐
    │                    METASTABLE FEEDBACK LOOP                          │
    └─────────────────────────────────────────────────────────────────────┘

                         External Load
                                    │
                                    ▼
                         ┌──────────────────┐
                         │   Total Load     │◄─────────────────┐
                         │ (external+retry) │                  │
                         └────────┬─────────┘                  │
                                  │                            │
                                  ▼                            │
                         ┌──────────────────┐                  │
                         │   Queue Depth    │                  │
                         │   Increases      │                  │
                         └────────┬─────────┘                  │
                                  │                            │
                                  ▼                            │
                         ┌──────────────────┐                  │
                         │    Latency       │                  │
                         │   Increases      │                  │
                         └────────┬─────────┘                  │
                                  │                            │
                                  ▼                            │
                         ┌──────────────────┐         ┌───────┴───────┐
                         │    Timeouts      │────────►│    Retries    │
                         │    Increase      │         │   Add Load    │
                         └──────────────────┘         └───────────────┘
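The amplification this loop produces can be sketched with a geometric sum. If each attempt times out with probability p and the client retries up to R times, a request generates 1 + p + p² + … + pᴿ expected attempts (a back-of-envelope model assuming independent attempts; the names below are illustrative, not part of the simulation):

```python
def expected_attempts(p_timeout: float, max_retries: int) -> float:
    # One initial attempt, plus retry k, which fires only if the
    # previous k attempts all timed out (assumed independent).
    return sum(p_timeout ** k for k in range(max_retries + 1))

print(expected_attempts(0.0, 5))  # 1.0  -- healthy: no amplification
print(expected_attempts(0.9, 5))  # ~4.69 -- degraded: most attempts time out
```

The two values under the same retry policy are the two stable states of the loop: amplification stays near 1x until timeouts appear, then jumps several-fold.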

Load Profile

    Rate (req/s)
    20 │              ╭────────╮
       │              │        │  SEVERE SPIKE (200%)
       │              │        │
    10 │──────────────┤────────│────────────────────────────────
       │  Capacity    │        │
     9 │   ╭──────────┤        ├──────────────╮
       │   │          │        │              │
     7 │   │          │        │              ╰──────╮
       │   │  (90%)   │        │ Return to    │       ╰──────╮
     5 │   │          │        │ 90%          │ Step   ╰─────────
     3 │   │          │        │              │ down
     0 └───┴──────────┴────────┴──────────────┴─────────────────→ Time(s)
       0   5         20       30             60    75    90   100

    Phase 1 (0-20s):   Moderate load of 9 req/s (90% utilization)
    Phase 2 (20-30s):  SPIKE to 20 req/s (200% - severe overload)
    Phase 3 (30-60s):  Return to 9 req/s - BUT retries prevent recovery!
    Phase 4 (60-75s):  Step down to 7 req/s (70%) - may still be stuck
    Phase 5 (75-90s):  Step down to 5 req/s (50%) - should recover
    Phase 6 (90-100s): Step down to 3 req/s (30%) - definitely recovers
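A rough queueing check shows why 90% utilization already strains the 500 ms timeout used below. Approximating the server as M/M/1 (only indicative; the simulation uses constant inter-arrival times), time-in-system is exponentially distributed with rate μ − λ:

```python
import math

service_rate = 10.0  # req/s (1 / 100 ms mean service time)
arrival_rate = 9.0   # req/s, pre-spike external load
timeout_s = 0.5

# M/M/1: sojourn time ~ Exponential(mu - lambda)
mean_latency_s = 1.0 / (service_rate - arrival_rate)
p_exceed_timeout = math.exp(-(service_rate - arrival_rate) * timeout_s)

print(mean_latency_s)    # 1.0  -- mean latency is already 2x the timeout
print(p_exceed_timeout)  # ~0.61 -- most requests outlive 500 ms even before the spike
```

Under this approximation a majority of requests would time out even at steady pre-spike load, which is why the system sits so close to the metastable cliff.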

Running the Simulation

Simulation code
import random
from collections import defaultdict
from collections.abc import Generator
from dataclasses import dataclass

from happysimulator import (
    ConstantArrivalTimeProvider,
    Data,
    Duration,
    Entity,
    Event,
    EventProvider,
    FIFOQueue,
    Instant,
    Probe,
    Profile,
    QueuedResource,
    Simulation,
    Source,
)


@dataclass(frozen=True)
class MetastableLoadProfile(Profile):
    """Load profile designed to trigger metastable failure."""
    spike_start: float = 20.0
    spike_end: float = 30.0
    step_down_1_start: float = 60.0
    step_down_2_start: float = 75.0
    step_down_3_start: float = 90.0
    moderate_rate: float = 9.0
    spike_rate: float = 20.0
    step_down_1_rate: float = 7.0
    step_down_2_rate: float = 5.0
    step_down_3_rate: float = 3.0

    def get_rate(self, time: Instant) -> float:
        t = time.to_seconds()
        if t < self.spike_start:
            return self.moderate_rate
        if t < self.spike_end:
            return self.spike_rate
        if t < self.step_down_1_start:
            return self.moderate_rate
        if t < self.step_down_2_start:
            return self.step_down_1_rate
        if t < self.step_down_3_start:
            return self.step_down_2_rate
        return self.step_down_3_rate


class QueuedServer(QueuedResource):
    """A queued server with exponential service times."""

    def __init__(self, name: str, *, mean_service_time_s: float = 0.1, concurrency: int = 1):
        super().__init__(name, policy=FIFOQueue())
        self.mean_service_time_s = mean_service_time_s
        self.concurrency = concurrency
        self._in_flight: int = 0
        self.stats_processed: int = 0
        self.completion_times: list[Instant] = []
        self.service_times_s: list[float] = []

    def has_capacity(self) -> bool:
        return self._in_flight < self.concurrency

    def handle_queued_event(self, event: Event) -> Generator[tuple[float, None], None, list[Event]]:
        self._in_flight += 1
        service_time = random.expovariate(1.0 / self.mean_service_time_s)
        yield service_time, None
        self._in_flight -= 1
        self.stats_processed += 1
        self.completion_times.append(self.now)
        self.service_times_s.append(service_time)

        client: Entity | None = event.context.get("client")
        if client is None:
            return []

        completion = Event(
            time=self.now,
            event_type="Completion",
            target=client,
            context={
                "request_id": event.context.get("request_id"),
                "original_created_at": event.context.get("created_at"),
                "service_time_s": service_time,
            },
        )
        return [completion]


@dataclass
class InFlightRequest:
    request_id: int
    created_at: Instant
    attempt: int
    timeout_event_id: int


class RetryingClient(Entity):
    """Client that sends requests with timeout-based retries."""

    def __init__(self, name: str, *, server: Entity, timeout_s: float = 0.5, max_retries: int = 5):
        super().__init__(name)
        self.server = server
        self.timeout_s = timeout_s
        self.max_retries = max_retries
        self._in_flight: dict[int, InFlightRequest] = {}
        self._next_timeout_id: int = 0

        self.stats_requests_received: int = 0
        self.stats_attempts_sent: int = 0
        self.stats_completions: int = 0
        self.stats_timeouts: int = 0
        self.stats_retries: int = 0
        self.stats_gave_up: int = 0

        self.completion_times: list[Instant] = []
        self.latencies_s: list[float] = []
        self.attempts_per_request: list[int] = []
        self.timeout_times: list[Instant] = []
        self.retry_times: list[Instant] = []

    def latency_time_series_seconds(self) -> tuple[list[float], list[float]]:
        return [t.to_seconds() for t in self.completion_times], list(self.latencies_s)

    @staticmethod
    def _bucketed_counts(times: list[Instant], bucket_size_s: float) -> tuple[list[float], list[int]]:
        """Bucket timestamps into fixed-width windows and count events per window."""
        if not times:
            return [], []
        buckets: dict[int, int] = defaultdict(int)
        for t in times:
            buckets[int(t.to_seconds() / bucket_size_s)] += 1
        sorted_buckets = sorted(buckets)
        return [b * bucket_size_s for b in sorted_buckets], [buckets[b] for b in sorted_buckets]

    def goodput_time_series(self, bucket_size_s: float = 1.0) -> tuple[list[float], list[int]]:
        return self._bucketed_counts(self.completion_times, bucket_size_s)

    def timeout_time_series(self, bucket_size_s: float = 1.0) -> tuple[list[float], list[int]]:
        return self._bucketed_counts(self.timeout_times, bucket_size_s)

    def retry_time_series(self, bucket_size_s: float = 1.0) -> tuple[list[float], list[int]]:
        return self._bucketed_counts(self.retry_times, bucket_size_s)

    def handle_event(self, event: Event) -> list[Event]:
        event_type = event.event_type
        if event_type == "NewRequest":
            return self._handle_new_request(event)
        elif event_type == "Completion":
            return self._handle_completion(event)
        elif event_type == "Timeout":
            return self._handle_timeout(event)
        return []

    def _handle_new_request(self, event: Event) -> list[Event]:
        self.stats_requests_received += 1
        request_id = event.context.get("request_id", self.stats_requests_received)
        return self._send_request(request_id, event.time, attempt=1)

    def _send_request(self, request_id: int, created_at: Instant, attempt: int) -> list[Event]:
        self.stats_attempts_sent += 1
        self._next_timeout_id += 1
        timeout_id = self._next_timeout_id

        self._in_flight[request_id] = InFlightRequest(
            request_id=request_id,
            created_at=created_at,
            attempt=attempt,
            timeout_event_id=timeout_id,
        )

        server_request = Event(
            time=self.now,
            event_type="Request",
            target=self.server,
            context={
                "request_id": request_id,
                "created_at": created_at,
                "attempt": attempt,
                "client": self,
            },
        )

        timeout_event = Event(
            time=self.now + Duration.from_seconds(self.timeout_s),
            event_type="Timeout",
            target=self,
            context={"request_id": request_id, "timeout_id": timeout_id},
        )

        return [server_request, timeout_event]

    def _handle_completion(self, event: Event) -> list[Event]:
        request_id = event.context.get("request_id")
        if request_id not in self._in_flight:
            return []
        in_flight = self._in_flight.pop(request_id)
        self.stats_completions += 1
        original_created_at = event.context.get("original_created_at", in_flight.created_at)
        latency_s = (event.time - original_created_at).to_seconds()
        self.completion_times.append(event.time)
        self.latencies_s.append(latency_s)
        self.attempts_per_request.append(in_flight.attempt)
        return []

    def _handle_timeout(self, event: Event) -> list[Event]:
        request_id = event.context.get("request_id")
        timeout_id = event.context.get("timeout_id")
        if request_id not in self._in_flight:
            return []
        in_flight = self._in_flight[request_id]
        if in_flight.timeout_event_id != timeout_id:
            return []
        self.stats_timeouts += 1
        self.timeout_times.append(event.time)
        del self._in_flight[request_id]
        if in_flight.attempt >= self.max_retries:
            self.stats_gave_up += 1
            return []
        self.stats_retries += 1
        self.retry_times.append(event.time)
        return self._send_request(request_id, in_flight.created_at, attempt=in_flight.attempt + 1)


class ClientRequestProvider(EventProvider):
    def __init__(self, client: RetryingClient, *, stop_after: Instant | None = None):
        self._client = client
        self._stop_after = stop_after
        self._request_id: int = 0

    def get_events(self, time: Instant) -> list[Event]:
        if self._stop_after is not None and time > self._stop_after:
            return []
        self._request_id += 1
        return [
            Event(
                time=time,
                event_type="NewRequest",
                target=self._client,
                context={"request_id": self._request_id},
            )
        ]


@dataclass
class SimulationResult:
    client: RetryingClient
    server: QueuedServer
    queue_depth_data: Data
    requests_generated: int
    profile: MetastableLoadProfile


def run_metastable_simulation(
    *,
    duration_s: float = 100.0,
    drain_s: float = 10.0,
    mean_service_time_s: float = 0.1,
    timeout_s: float = 0.5,
    max_retries: int = 5,
    probe_interval_s: float = 0.1,
    seed: int | None = 42,
) -> SimulationResult:
    if seed is not None:
        random.seed(seed)

    server = QueuedServer(name="Server", mean_service_time_s=mean_service_time_s)
    client = RetryingClient(name="Client", server=server, timeout_s=timeout_s, max_retries=max_retries)

    queue_depth_data = Data()
    queue_probe = Probe(
        target=server,
        metric="depth",
        data=queue_depth_data,
        interval=probe_interval_s,
        start_time=Instant.Epoch,
    )

    profile = MetastableLoadProfile()
    stop_after = Instant.from_seconds(duration_s)
    provider = ClientRequestProvider(client, stop_after=stop_after)
    arrival = ConstantArrivalTimeProvider(profile, start_time=Instant.Epoch)
    source = Source(name="Source", event_provider=provider, arrival_time_provider=arrival)

    sim = Simulation(
        start_time=Instant.Epoch,
        end_time=Instant.from_seconds(duration_s + drain_s),
        sources=[source],
        entities=[client, server],
        probes=[queue_probe],
    )
    sim.run()

    return SimulationResult(
        client=client,
        server=server,
        queue_depth_data=queue_depth_data,
        requests_generated=provider._request_id,
        profile=profile,
    )


# Run the simulation
result = run_metastable_simulation()

Results

Load Profile and Queue Depth

Latency and Goodput

Timeouts and Retries (Feedback Loop)

Distributions

Summary Statistics

============================================================
METASTABLE FAILURE SIMULATION RESULTS
============================================================

Configuration:
  Server capacity: 10 req/s (mean service time = 100ms)
  Client timeout: 500ms
  Max retries: 5

Load Profile:
  Moderate load: 9.0 req/s (90% utilization)
  Spike load: 20.0 req/s (200% utilization)
  Step-down rates: 7.0, 5.0, 3.0 req/s

Request Flow:
  External requests generated: 860
  Total attempts sent to server: 3906
  Server processed: 1036
  Successful completions: 109
  Timeouts: 3797
  Retries: 3046
  Gave up (max retries): 751

Key Metrics:
  Retry amplification: 4.54x
  Overall timeout rate: 97.2%
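Both key metrics follow directly from the request-flow counts above:

```python
external_requests = 860
attempts_sent = 3906
timeouts = 3797

retry_amplification = attempts_sent / external_requests
timeout_rate = timeouts / attempts_sent

print(f"{retry_amplification:.2f}x")  # 4.54x
print(f"{timeout_rate:.1%}")          # 97.2%
```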

Queue Depth by Phase:
  Pre-spike (10-20s):              111.5
  During spike (20-30s):           684.2
  Post-spike at 9 req/s (35-60s):  1810.5
  Step-down 1 - 7 req/s (60-75s):  2437.3
  Step-down 2 - 5 req/s (75-90s):  2759.3
  Step-down 3 - 3 req/s (90-100s): 2920.2

Timeout Rate by Phase (timeouts/sec):
  Pre-spike (10-20s):              32.5
  During spike (20-30s):           91.9
  Post-spike at 9 req/s (35-60s):  45.0
  Step-down 1 - 7 req/s (60-75s):  36.0
  Step-down 2 - 5 req/s (75-90s):  26.0
  Step-down 3 - 3 req/s (90-100s): 16.5
============================================================

Interpretation

Metastable failure occurs when:

  1. The spike causes queue buildup and increased latency
  2. Increased latency causes timeouts
  3. Timeouts trigger retries, adding MORE load
  4. The retry load prevents the queue from draining
  5. Even after external load drops, the system stays degraded
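Steps 1-5 can be made concrete with a toy fixed-point calculation (illustrative numbers, not simulation output): once timeouts are near-certain, the offered load generated by 9 req/s of external traffic stays far above the 10 req/s capacity, so the queue never drains.

```python
def offered_load(external_rate: float, p_timeout: float, max_retries: int = 5) -> float:
    # Each external request generates up to 1 + max_retries attempts;
    # attempt k fires only if the previous k attempts all timed out.
    attempts = sum(p_timeout ** k for k in range(max_retries + 1))
    return external_rate * attempts

print(offered_load(9.0, 0.0))   # 9.0 -- healthy state, under capacity
print(offered_load(9.0, 0.97))  # ~50 -- degraded state, 5x over capacity
```

The same 9 req/s of external load admits two self-consistent operating points, which is exactly what makes the failure metastable rather than merely overloaded.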

Look for these signs in the results:

  • Queue depth remains high after spike ends
  • Timeout rate stays elevated after spike
  • Retry rate sustains additional load
  • System only recovers at significantly reduced external load
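The first sign can be checked programmatically against the probed queue-depth series (a sketch; the 2x threshold is an arbitrary assumption):

```python
def stuck_after_spike(times_s: list[float], depths: list[float],
                      spike_end_s: float, pre_spike_mean: float,
                      factor: float = 2.0) -> bool:
    """True if the mean queue depth after the spike stays more than
    `factor` times the pre-spike mean."""
    post = [d for t, d in zip(times_s, depths) if t > spike_end_s]
    return bool(post) and sum(post) / len(post) > factor * pre_spike_mean

# With the phase averages reported above (pre-spike 111.5, post-spike 1810.5):
print(stuck_after_spike([40.0, 50.0], [1810.5, 1810.5], 30.0, 111.5))  # True
```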
