Token Bucket Rate Limiter

Watch requests flow through the gate as tokens are consumed and refilled.

[Interactive demo: a client sends requests through a token bucket to a server. Controls adjust refill rate, bucket capacity, and request rate; counters track accepted, rejected, remaining tokens, and throughput. Keyboard: Space play/pause · R reset · B burst.]

What is Rate Limiting?

Rate limiting controls how many requests a client can make to a service. It protects servers from overload, ensures fair resource sharing, prevents abuse, and manages costs.

The Token Bucket Algorithm

Token bucket allows controlled bursts while enforcing an average rate. This fits real traffic patterns well since requests often arrive in bursts.

How it works:

  1. Bucket starts with capacity tokens (e.g., 10)
  2. Tokens refill at rate per second (e.g., 5/sec)
  3. Each request consumes 1 token (or N for weighted requests)
  4. Request accepted if tokens available, otherwise rejected
  5. Bucket never exceeds capacity (excess tokens discarded)

Capacity defines burst tolerance; refill rate defines sustained throughput. A bucket with capacity=100 and rate=10/sec absorbs a burst of 100 requests, then sustains 10/sec.
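A minimal single-process sketch of that refill-on-read loop, assuming the parameters above (the TokenBucket class and allow method are illustrative names, not a particular library's API):

```python
import time

class TokenBucket:
    """Single-process token bucket that refills lazily on each check."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity       # burst tolerance: max tokens held
        self.rate = rate               # sustained throughput: tokens per second
        self.tokens = capacity         # the bucket starts full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost        # consume 1 token (or N for weighted requests)
            return True
        return False                   # no token available: reject


bucket = TokenBucket(capacity=10, rate=5)
print(bucket.allow())  # True: the bucket starts with 10 tokens
```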

Why Token Bucket?

Advantages

  • Handles bursty traffic gracefully
  • O(1) time and space per limiter
  • Simple to implement and reason about
  • Easy to tune with two parameters
  • No request queuing needed

Considerations

  • Bursts can overwhelm downstream services
  • Doesn't smooth traffic like a leaky bucket
  • Requires atomic operations in distributed systems
  • Needs memory per unique client in multi-tenant setups

Algorithm Comparison

Algorithm | Mechanism | Pros | Cons
Token Bucket | Tokens refill over time, consumed per request | Allows bursts; O(1); memory efficient | Bursts may overwhelm downstream
Leaky Bucket | Queue requests, process at fixed rate | Smooth output; prevents bursts | Adds latency; O(n) queue memory
Fixed Window | Counter resets each time window | Simple; O(1); easy to implement | 2x burst at window edges
Sliding Window Log | Store timestamp of each request | Precise; no boundary issues | O(n) time and space
Sliding Window Counter | Weighted average of adjacent windows | O(1); smoother than fixed window | Approximate, not an exact count
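For contrast with the token bucket, here is a hedged sketch of the sliding window counter row: it keeps only two counters and weights the previous fixed window by how much of it still overlaps the sliding window (class and field names are illustrative):

```python
import time

class SlidingWindowCounter:
    """Approximate limiter: weight the previous fixed window by its
    overlap with the sliding window ending now (illustrative sketch)."""

    def __init__(self, limit: int, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.index = -1      # integer index of the current fixed window
        self.current = 0     # requests seen in the current fixed window
        self.previous = 0    # requests seen in the window before it

    def allow(self) -> bool:
        now = time.monotonic()
        index = int(now // self.window)
        if index != self.index:
            # Roll forward; a gap of 2+ windows means the old count is stale.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        # Fraction of the previous window still covered by the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        if self.current + self.previous * overlap < self.limit:
            self.current += 1
            return True
        return False
```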

Distributed Systems Considerations

In production, rate limiters typically run across multiple nodes. This introduces challenges:

Synchronization Strategy

Use a centralized store (Redis) with Lua scripts for atomic read-modify-write. Accept roughly 1-5 ms of added latency per request in exchange for accurate global counts.
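A sketch of that pattern with redis-py (the key layout and script body are assumptions, not a standard library): the Lua script runs the whole read-refill-consume sequence atomically on the Redis server.

```python
import time
import redis

# Redis executes the entire script without interleaving other commands,
# so the read-refill-consume sequence cannot race across nodes.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity   -- new clients start full
local ts     = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + (now - ts) * rate)
local allowed = tokens >= cost
if allowed then tokens = tokens - cost end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- TTL evicts state for idle clients (twice the time to refill fully).
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) * 2)
return allowed and 1 or 0
"""

r = redis.Redis()
script = r.register_script(TOKEN_BUCKET_LUA)

def allow(client_id: str, capacity: int = 10, rate: int = 5) -> bool:
    # Note: the timestamp comes from the caller's clock; skew between
    # app servers slightly distorts refill.
    return script(keys=[f"rl:{client_id}"],
                  args=[capacity, rate, time.time(), 1]) == 1
```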

Failure Handling

Fail open (allow requests) when Redis is unavailable, and fall back to a local in-memory limiter with smaller budgets so failing open doesn't turn the outage into an opening for abuse.
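One way that fallback might be wired up, reusing the TokenBucket and allow sketches from above (the fallback budgets are arbitrary examples, deliberately stricter than the global limit):

```python
import redis

# Reuses TokenBucket and allow() from the sketches above (assumptions).
local_buckets: dict[str, TokenBucket] = {}

def check_rate_limit(client_id: str) -> bool:
    try:
        return allow(client_id)                   # authoritative Redis check
    except redis.exceptions.ConnectionError:
        # Fail open via a stricter per-node bucket: requests still flow,
        # but the outage doesn't leave the origin completely unprotected.
        bucket = local_buckets.setdefault(client_id, TokenBucket(capacity=5, rate=2))
        return bucket.allow()
```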

Race Conditions

Lua scripts execute atomically in Redis. For other stores, use optimistic locking with CAS (Compare-And-Swap) operations.
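For stores without server-side scripting, the CAS retry loop might look like this sketch (the versioned store.get / store.compare_and_swap API is hypothetical):

```python
import time

def consume_with_cas(store, key: str, capacity: float, rate: float) -> bool:
    """Optimistic concurrency: retry when another node wins the write race.
    `store.get` and `store.compare_and_swap` are hypothetical versioned APIs."""
    while True:
        state, version = store.get(key)              # returns (value, version)
        tokens, ts = state if state else (capacity, time.time())
        now = time.time()
        tokens = min(capacity, tokens + (now - ts) * rate)
        allowed = tokens >= 1
        new_state = (tokens - 1 if allowed else tokens, now)
        # The write succeeds only if no one changed the key since our read.
        if store.compare_and_swap(key, new_state, expected_version=version):
            return allowed
```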

Production Best Practices

  • Response: Return 429 Too Many Requests with a Retry-After header (see the sketch after this list)
  • Headers: Include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset
  • Granularity: Layer limits: global → per-tenant → per-user → per-endpoint
  • Monitoring: Track rejection rate, p99 latency impact, and top rejected clients
  • Placement: Apply limits at the edge (API Gateway, CDN) to reject early and protect the origin
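A hedged Flask sketch of the response shape: the headers follow the conventional X-RateLimit-* scheme, and the inline check() is a toy sliding-window-log stand-in for any of the limiters above.

```python
import time
from flask import Flask, jsonify, request

app = Flask(__name__)
LIMIT, WINDOW = 10, 1.0
counts: dict[str, list[float]] = {}   # toy per-client limiter state

def check(client: str) -> tuple[bool, int, float]:
    """Toy stand-in limiter: returns (allowed, remaining, reset_epoch)."""
    now = time.time()
    stamps = [t for t in counts.get(client, []) if t > now - WINDOW]
    allowed = len(stamps) < LIMIT
    if allowed:
        stamps.append(now)
    counts[client] = stamps
    reset_at = (stamps[0] + WINDOW) if stamps else now
    return allowed, max(0, LIMIT - len(stamps)), reset_at

@app.route("/api/resource")
def resource():
    allowed, remaining, reset_at = check(request.remote_addr)
    headers = {
        "X-RateLimit-Limit": str(LIMIT),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if not allowed:
        # Tell well-behaved clients how long to back off.
        headers["Retry-After"] = str(max(1, int(reset_at - time.time())))
        return jsonify(error="rate limit exceeded"), 429, headers
    return jsonify(ok=True), 200, headers
```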

Real-World Examples

  • AWS API Gateway: Token Bucket
  • Stripe: Token Bucket
  • Shopify: Leaky Bucket