Your agent just discovered it can call hangar_call in a loop. It's doing exactly that. Twelve hundred requests in ninety seconds, all legitimate from the model's perspective — it's just being thorough. Your MCP server is on its knees. The provider it's hammering has started returning garbage. The agent, undeterred, keeps calling.
This is not a hypothetical. It's what happens when you give a sufficiently motivated LLM tool access and no ceiling.
Rate limiting is the ceiling. Here's how it works in MCP Hangar — two independent systems, different threat models, one coherent defense.
The Core Idea
MCP Hangar has two rate limiters. They exist for different reasons and protect different things.
System A — token bucket at the command bus level. Protects the entire API from runaway volume. Every tool call, every management command, every request passes through it.
System B — exponential backoff on auth. Protects authentication from brute-force. Lives in the enterprise auth layer, tracks per-IP failure counts, escalates lockout duration on repeat offenses.
They don't share state. They don't need to. Different attack surfaces, different responses.
The Token Bucket
System A uses a token bucket algorithm. The intuition: a bucket fills with tokens at a constant rate. Each request costs one token. If the bucket is empty, the request is rejected and the caller is told when to retry.
```python
class TokenBucket:
    def consume(self, tokens: int = 1) -> tuple[bool, float]:
        with self._lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True, 0.0
            else:
                needed = tokens - self.tokens
                wait_time = needed / self.rate
                return False, wait_time
```
Two parameters control behavior:
- requests_per_second — steady-state throughput. Default: 10 rps.
- burst_size — maximum tokens the bucket can hold. Default: 20.
Burst allows a caller to fire twenty requests instantly after a quiet period, then get throttled to ten per second. This is intentional — legitimate clients have bursty patterns. A CI pipeline runs, calls a bunch of tools, goes quiet. Token bucket handles that gracefully. A runaway loop doesn't get the same courtesy.
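That burst-then-throttle behavior is easy to see with a minimal bucket. The sketch below is self-contained and illustrative — the class and parameter names mirror the article, not MCP Hangar's actual API — and fires 25 back-to-back calls against a bucket with burst size 20:

```python
import threading
import time

class TokenBucket:
    """Minimal token bucket sketch; names mirror the article, not the real API."""
    def __init__(self, rate: float, burst_size: int) -> None:
        self.rate = rate
        self.burst_size = burst_size
        self.tokens = float(burst_size)    # bucket starts full: burst is available
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def consume(self, n: int = 1) -> tuple[bool, float]:
        with self._lock:
            now = time.monotonic()
            # refill: tokens = min(burst, tokens + elapsed * rate)
            self.tokens = min(self.burst_size,
                              self.tokens + (now - self._last) * self.rate)
            self._last = now
            if self.tokens >= n:
                self.tokens -= n
                return True, 0.0
            return False, (n - self.tokens) / self.rate

bucket = TokenBucket(rate=10, burst_size=20)
results = [bucket.consume()[0] for _ in range(25)]   # 25 instant calls
print(results.count(True))   # ≈20: the burst drains, later calls must wait on refill
```

The first twenty calls ride the burst; the remaining five are rejected with a computed wait time instead of silently queuing.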
The refill math is straightforward:
```
tokens = min(burst_size, tokens + elapsed_seconds * rate)
```
One detail worth noting: TokenBucket uses time.monotonic() for refill calculations. Monotonic time is immune to system clock adjustments. The RateLimitResult.reset_at field uses time.time() — unix timestamp required for HTTP headers. Two different clocks for two different purposes. Don't conflate them.
Where It Sits in the Request Path
Rate limiting is the first check. Not the second. Not after validation. First.
```
MCP Client → CommandBus.send(command)
                  │
                  ▼
     RateLimitMiddleware.__call__()
                  │
        rate_limiter.consume(key)
                  │
         ┌────────┴────────┐
         │                 │
      allowed           rejected
         │                 │
   next handler         HTTP 429
                        + X-RateLimit headers
                        + Retry-After
```
The comment in the source is unambiguous: "Rate limit first (cheapest check) to reduce abuse surface." Validation is more expensive. Approval gate is more expensive. If a caller is going to be rejected, reject them before spending cycles on anything else.
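That ordering can be sketched as middleware composition. Everything here is illustrative — RateLimitExceeded, the handler signature, and the stub limiter are assumptions, not MCP Hangar's actual classes:

```python
class RateLimitExceeded(Exception):
    """Surfaces as HTTP 429 + Retry-After at the transport layer (assumed name)."""
    def __init__(self, retry_after: float):
        self.retry_after = retry_after

def rate_limited(next_handler, limiter, key_fn):
    """Wrap a handler so rejection happens before any downstream work runs."""
    def handler(command):
        allowed, wait = limiter.consume(key_fn(command))
        if not allowed:
            raise RateLimitExceeded(retry_after=wait)   # validation never runs
        return next_handler(command)
    return handler

class DenyAll:
    """Stub limiter that rejects everything (demo only)."""
    def consume(self, key):
        return False, 1.5

calls = []
wrapped = rate_limited(lambda cmd: calls.append(cmd), DenyAll(), lambda cmd: "global")
caught = None
try:
    wrapped({"tool": "hangar_call"})
except RateLimitExceeded as exc:
    caught = exc
print(caught.retry_after, calls)   # 1.5 [] — the downstream handler never ran
```

The point of the demo: a rejected command costs one dictionary lookup and one arithmetic check, never a validation pass.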
The middleware generates standard headers on every response:
```python
def to_headers(self) -> dict[str, str]:
    headers = {
        "X-RateLimit-Limit": str(self.limit),
        "X-RateLimit-Remaining": str(max(0, self.remaining)),
        "X-RateLimit-Reset": str(int(self.reset_at)),
    }
    if self.retry_after is not None and self.retry_after > 0:
        headers["Retry-After"] = str(int(self.retry_after) + 1)
    return headers
```
Retry-After is int(wait_time) + 1 — the extra second is a conservative buffer against clock skew between client and server. Small, but intentional.
Per-Tool Buckets
Global rate limiting catches volume abuse. Per-tool limiting catches targeted abuse — hammering one expensive operation while staying under the global threshold.
mcp_tool_wrapper takes a rate_limit_key callable that determines which bucket a tool call hits. Three helper functions ship with the library:
```python
def key_global(*_: Any, **__: Any) -> str:
    """Rate limit key for globally-scoped tools."""
    return "global"

def key_per_provider(provider: str, *_: Any, **__: Any) -> str:
    """Rate limit key scoped per provider."""
    return f"provider:{provider}"

def key_hangar_call(provider: str, tool: str, *_: Any, **__: Any) -> str:
    """Rate limit key specialized for tool invocation (per provider)."""
    return f"hangar_call:{provider}"
```
key_global puts all calls into one shared bucket — appropriate for tools that don't vary by target. key_per_provider gives each provider its own bucket — a slow provider can be throttled without penalizing others. key_hangar_call is the same shape but semantically distinct for tool invocations specifically.
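A per-key bucket map ties this together. The dictionary-of-counters below is an illustration of how distinct keys isolate state, not the library's internals:

```python
from collections import defaultdict

def key_per_provider(provider, *_, **__):
    return f"provider:{provider}"

def key_hangar_call(provider, tool, *_, **__):
    return f"hangar_call:{provider}"

# each distinct key gets its own independent counter (stand-in for a token bucket)
buckets: dict[str, int] = defaultdict(int)

for provider, tool in [("github", "search"), ("github", "clone"), ("jira", "query")]:
    buckets[key_hangar_call(provider, tool)] += 1

print(dict(buckets))  # {'hangar_call:github': 2, 'hangar_call:jira': 1}
```

Both github calls share one bucket regardless of which tool they hit; jira's bucket is untouched by github's traffic.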
The execution order inside mcp_tool_wrapper matters: rate limit → validate → approval gate → execute. Rate limit is still first.
Composite Limiting
CompositeRateLimiter chains multiple limiters with AND logic — all must allow the request. The most restrictive limit wins.
The use case: apply a global limit of 10 rps across all traffic, and a per-provider limit of 2 rps for an expensive provider. A caller gets 10 rps until they target that provider, at which point they get 2 rps. The global bucket and the provider bucket are separate — one caller's per-provider throttle doesn't bleed into other callers' global budget.
Zero cross-tenant leakage. Separate buckets, separate state.
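A minimal AND-composition looks like this. Note the sketch naively charges every child even when a sibling rejects — a production limiter would check before consuming, or refund — and the stub child limiter exists purely for the demo:

```python
class CompositeRateLimiter:
    """AND-composition sketch: all children must allow the request.
    Naive: charges every child even on rejection (a real one wouldn't)."""
    def __init__(self, *limiters):
        self.limiters = limiters

    def consume(self, key):
        results = [lim.consume(key) for lim in self.limiters]
        allowed = all(ok for ok, _ in results)
        wait = max((w for ok, w in results if not ok), default=0.0)
        return allowed, wait

class FixedWindowStub:
    """Stand-in child limiter: allows a fixed number of calls total (demo only)."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def consume(self, key):
        self.count += 1
        return (self.count <= self.limit), 0.0

# a generous global budget AND a tight budget for an expensive provider
composite = CompositeRateLimiter(FixedWindowStub(10), FixedWindowStub(2))
results = [composite.consume("provider:slow")[0] for _ in range(5)]
print(results)  # [True, True, False, False, False] — the tighter child wins
```

The most restrictive limit dominates, exactly as the AND semantics dictate.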
The Brute-Force Problem
System B exists because authentication is a different attack surface. The concern isn't volume — it's repeated failure. Someone trying to guess credentials doesn't necessarily generate high request volume. They can be methodical, slow, and patient.
Exponential backoff makes patience expensive:
```python
@dataclass
class AuthRateLimitConfig:
    max_attempts: int = 10              # attempts before first lockout
    window_seconds: int = 60            # sliding window
    lockout_seconds: int = 300          # initial lockout: 5 minutes
    lockout_escalation_factor: float = 2.0
    max_lockout_seconds: int = 3600     # cap: 1 hour
```
The escalation math:
```
effective_lockout = min(base * factor^(lockout_count - 1), max)
```
In practice:
| Lockout | Duration |
|---|---|
| 1st | 5 min |
| 2nd | 10 min |
| 3rd | 20 min |
| 4th | 40 min |
| 5th+ | 1 hour (cap) |
After the fifth offense, every subsequent lockout costs an hour. The attacker can continue. The math stops caring.
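The ladder follows directly from the formula. A quick check with the defaults from AuthRateLimitConfig:

```python
def effective_lockout(lockout_count: int, base: float = 300.0,
                      factor: float = 2.0, cap: float = 3600.0) -> float:
    """Escalating lockout duration in seconds, capped at one hour."""
    return min(base * factor ** (lockout_count - 1), cap)

minutes = [effective_lockout(n) / 60 for n in range(1, 7)]
print(minutes)  # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0] — the cap holds from the 5th on
```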
One notable behavior: successful authentication calls record_success(ip), which does del self._trackers[ip]. The tracker is erased entirely. A legitimate user who fat-fingered their password ten times, got locked out, waited, then authenticated correctly — starts clean. No residual penalty for recovery.
Note: RateLimitUnlock with unlock_reason="success" is emitted only if the IP was actively locked at the time of successful auth. If a user accumulated failed attempts below the lockout threshold and then succeeded, the tracker is still cleared, but no event is emitted. The audit trail reflects lockout state transitions, not every tracker lifecycle event.
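The clean-slate behavior can be sketched in a few lines — this is an illustrative stand-in, not the enterprise layer's actual tracker class:

```python
class AuthTracker:
    """Per-IP failure tracking with full reset on success (sketch only)."""
    def __init__(self):
        self._trackers: dict[str, int] = {}   # ip -> failed attempt count

    def record_failure(self, ip: str) -> None:
        self._trackers[ip] = self._trackers.get(ip, 0) + 1

    def record_success(self, ip: str) -> None:
        # erase the tracker entirely: no residual penalty after recovery
        self._trackers.pop(ip, None)

tracker = AuthTracker()
for _ in range(9):
    tracker.record_failure("203.0.113.7")
tracker.record_success("203.0.113.7")
print("203.0.113.7" in tracker._trackers)  # False — the user starts clean
```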
Domain Events
System B emits domain events on state transitions. These aren't logs — they're first-class events in the event stream, available to handlers downstream.
```python
@dataclass
class RateLimitLockout(DomainEvent):
    source_ip: str
    lockout_duration_seconds: float
    lockout_count: int
    failed_attempts: int

@dataclass
class RateLimitUnlock(DomainEvent):
    source_ip: str
    lockout_count: int
    unlock_reason: str  # "expired" | "success" | "manual_clear" | "cleanup"
```
unlock_reason tells you why the lockout ended. "success" means legitimate auth while locked. "expired" means the lockout timer ran out. "manual_clear" means an operator intervened. "cleanup" means garbage collection swept an expired lockout.
If an IP hits lockout repeatedly and you're seeing "expired" as the unlock reason rather than "success", the attacker is waiting out each lockout and retrying. The escalation ladder is working as designed — each wait costs more. The event stream makes this pattern visible.
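Spotting the wait-it-out pattern can be as simple as counting unlock reasons per IP. The tuples below are stand-ins for real RateLimitUnlock events; a real consumer would read from the event stream:

```python
from collections import Counter

# (source_ip, unlock_reason) pairs, as they might arrive from the event stream
unlocks = [
    ("203.0.113.7", "expired"),
    ("203.0.113.7", "expired"),
    ("203.0.113.7", "expired"),
    ("198.51.100.2", "success"),
]

by_ip = Counter((ip, reason) for ip, reason in unlocks)
suspicious = [ip for (ip, reason), n in by_ip.items()
              if reason == "expired" and n >= 3]
print(suspicious)  # ['203.0.113.7'] — repeatedly waiting out lockouts, never authing
```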
Observability
Two Prometheus metrics:
```python
RATE_LIMIT_HITS_TOTAL = Counter(
    name="mcp_hangar_rate_limit_hits",
    description="Total number of rate limit decisions by result",
    labels=["result"],  # "allowed" | "rejected"
)

RATE_LIMIT_ACTIVE_BUCKETS = Gauge(
    name="mcp_hangar_rate_limit_active_buckets",
    description="Number of active rate limit token buckets",
)
```
mcp_hangar_rate_limit_hits_total{result="rejected"} rising sharply is the signal. Either a legitimate caller has a bug, or something is probing the surface. Either way, you want to know before it becomes a support ticket.
The metrics update is wrapped in try/except. If Prometheus is unavailable, rate limiting continues. The comment: "fault-barrier: metrics failure must not block rate limit enforcement." The defense mechanism is not contingent on the monitoring layer.
Memory Management
InMemoryRateLimiter runs cleanup every 60 seconds. Buckets unused for longer than cleanup_interval are removed. Without this, every unique rate limit key that ever existed accumulates in memory indefinitely.
One implementation detail: _maybe_cleanup is called inside _get_bucket, which itself runs under self._lock. The cleanup and the normal bucket operation share the same lock. This is safe — InMemoryRateLimiter uses threading.Lock (non-reentrant), and cleanup is never called recursively from within itself. AuthRateLimiter (System B) uses threading.RLock — reentrant, appropriate because its cleanup path is more complex and can be triggered from within the lock context.
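The sweep-on-access pattern can be sketched with an injectable clock so the 60-second idle window is testable without waiting. Class and method names here are assumptions, not the library's internals:

```python
import time

class BucketStore:
    """Sketch of idle-bucket cleanup; clock is injectable for deterministic tests."""
    def __init__(self, cleanup_interval: float = 60.0, clock=time.monotonic):
        self.cleanup_interval = cleanup_interval
        self.clock = clock
        self._buckets: dict[str, float] = {}   # key -> last-used timestamp
        self._last_cleanup = clock()

    def touch(self, key: str) -> None:
        now = self.clock()
        if now - self._last_cleanup >= self.cleanup_interval:
            # drop buckets idle longer than the cleanup interval
            self._buckets = {k: t for k, t in self._buckets.items()
                             if now - t < self.cleanup_interval}
            self._last_cleanup = now
        self._buckets[key] = now

# fake clock: key "old" goes idle for 61 "seconds" and gets swept
t = [0.0]
store = BucketStore(cleanup_interval=60.0, clock=lambda: t[0])
store.touch("old")
t[0] = 61.0
store.touch("fresh")
print("old" in store._buckets)  # False — swept after 60s idle
```

Without the sweep, every unique key ever seen would pin an entry forever — exactly the unbounded growth the cleanup exists to prevent.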
No Redis required for single-instance deployments. The in-memory implementation is the default. Redis appears in integration tests as proof-of-concept for distributed deployments. If you're running multiple Hangar instances behind a load balancer, you'll need a shared backing store — but that's not the common case, and the in-memory default doesn't pretend otherwise.
Configuration
```bash
# Environment variables
MCP_RATE_LIMIT_RPS=10      # steady-state requests/second
MCP_RATE_LIMIT_BURST=20    # burst capacity
```

```yaml
# Auth rate limiting (config.yaml)
auth:
  rate_limit:
    enabled: true
    max_attempts: 10
    window_seconds: 60
    lockout_seconds: 300
```
Defaults are conservative. A single legitimate client shouldn't approach 10 rps under normal usage. If you're seeing legitimate traffic rejected, the first question is whether you have a client bug, not whether the defaults are too tight.
What This Actually Solves
Rate limiting doesn't stop a determined attacker. Nothing does, entirely. What it does is make attacks expensive, make abuse visible, and make recovery predictable.
The token bucket bounds damage from runaway agents. An LLM in a loop hits the ceiling, gets 429s, and — if the client is implemented correctly — backs off. If it isn't, you have a client bug to fix rather than a server to recover.
The exponential backoff makes credential stuffing economically unviable. Ten attempts per minute, escalating lockouts, an hour cap. At that rate, a meaningful credential space takes longer than anyone's operational patience.
The domain events make both attack patterns observable after the fact. Not just "something was rejected" — but what, from where, how many times, and whether it stopped.
That's the real value. Not prevention as a guarantee, but prevention as a measurable, auditable property of the system.
MCP Hangar is open source. The MIT core is at github.com/mcp-hangar/mcp-hangar. System B (auth rate limiting) lives in the BSL-licensed enterprise layer — source-available, auditable.