System Design: Building a Rate Limiter That Survives Scale

A practical architecture for distributed rate limiting with fairness, burst control, and multi-tenant isolation.

Dec 5, 2023 · 5 min read

Why rate limiting is a system design concern

Rate limiting is not just API protection. It shapes fairness, cost control, and reliability under traffic spikes. A weak limiter either blocks valid users or allows abusive traffic to consume shared resources. A good limiter answers three things:

  • Who is being limited?
  • What window and quota apply?
  • How do we enforce this consistently across instances?

Algorithm choices

Common strategies have clear trade-offs.

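  • Fixed window: simple, but a client can send up to twice the quota in a short burst that straddles a window boundary
  • Sliding log: accurate, but it stores a timestamp per request, which gets expensive at high volume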
  • Sliding window counter: good balance for most APIs
  • Token bucket: best for controlled bursts

In practice, a token bucket combined with per-route policies works well for product APIs.
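To make the token bucket concrete, here is a minimal in-memory sketch. The class and field names are illustrative, and the clock is passed in explicitly so the behavior is easy to follow; a real service would read the time itself and keep this state in a shared store.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float      # maximum burst size, in tokens
    refill_rate: float   # tokens added per second (the sustained rate)
    tokens: float        # tokens currently available
    last_refill: float   # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A bucket that permits bursts of 3 and a sustained 1 request per second.
bucket = TokenBucket(capacity=3, refill_rate=1.0, tokens=3, last_refill=0.0)
burst = [bucket.allow(now=0.0) for _ in range(4)]  # the fourth call exceeds the burst
later = bucket.allow(now=1.0)                       # one token has refilled by then
```

The capacity bounds how large a burst can be, while the refill rate bounds sustained throughput, which is why the two can be tuned independently.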

Distributed architecture

A single-instance in-memory limiter fails once traffic is load-balanced. Use a shared fast store (commonly Redis) with atomic operations.

  • Key format: tenant:user:route
  • Value: tokens + last refill timestamp
  • Enforcement: atomic Lua script to avoid race conditions

Keep script latency low, and keep key cardinality bounded, for example by expiring the keys of idle clients.
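The sketch below shows why the read-refill-decrement step has to be atomic. It is an in-process stand-in, not Redis code: a dict plays the shared store and a lock plays the role that a Lua script's atomic execution would play. The function names are hypothetical.

```python
import threading

_store = {}               # key -> (tokens, last_refill_timestamp)
_lock = threading.Lock()  # stand-in for the atomicity a Lua script provides


def bucket_key(tenant, user, route):
    """Build a key in the tenant:user:route format."""
    return f"{tenant}:{user}:{route}"


def take(key, now, capacity, refill_rate):
    """Read, refill, and decrement inside one critical section."""
    with _lock:
        tokens, last = _store.get(key, (float(capacity), now))
        tokens = min(float(capacity), tokens + max(0.0, now - last) * refill_rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        _store[key] = (tokens, now)
        return allowed


# Twenty concurrent requests against a bucket with capacity 5.
key = bucket_key("acme", "u42", "search")
results = []
threads = [
    threading.Thread(target=lambda: results.append(take(key, 100.0, 5, 1.0)))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the critical section, two requests could both read the same token count and both be allowed, which is the race the atomic script is there to prevent.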

Multi-tenant fairness

If all tenants share one global quota, a single noisy neighbor can consume it and starve everyone else. Define limits at multiple levels.

  • Global service limit
  • Tenant-level limit
  • User-level or token-level limit
  • Route-level overrides for expensive endpoints

This layered model keeps premium and free traffic predictable.
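One way to evaluate layered limits is to check every level first and charge them only if all of them have room. This is a simplified sketch using plain per-window counters; the `Quota` type and the limit values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Quota:
    name: str      # e.g. "global", "tenant:acme", "user:u42"
    limit: int     # requests allowed in the current window
    used: int = 0

    def has_room(self) -> bool:
        return self.used < self.limit


def check_layers(layers: List[Quota]) -> Tuple[bool, Optional[str]]:
    """Allow only if every layer has room; charge all layers or none."""
    for quota in layers:
        if not quota.has_room():
            return False, quota.name  # report which layer blocked the request
    for quota in layers:
        quota.used += 1
    return True, None


global_q = Quota("global", limit=1000)
tenant_q = Quota("tenant:acme", limit=2)
user_q = Quota("user:u42", limit=5)
decisions = [check_layers([global_q, tenant_q, user_q]) for _ in range(3)]
```

Charging all layers or none means a request rejected at the tenant level does not also eat into the global or user budget.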

Handling partial outages

A failure in one of the limiter's dependencies should not automatically take your API down.

  • Fail-open for non-critical endpoints
  • Fail-closed for sensitive endpoints
  • Local emergency budget cache for short Redis disruptions

Document these behaviors clearly so on-call decisions are consistent.
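The three behaviors above could be combined roughly as follows. This is a sketch under assumptions: `StoreUnavailable` and `check_remote` are hypothetical names for whatever your store client raises and exposes, and bounding fail-open by the local budget is one possible interpretation, not the only one.

```python
class StoreUnavailable(Exception):
    """Raised by the (hypothetical) shared-store client on timeout or error."""


class FallbackLimiter:
    def __init__(self, check_remote, emergency_budget: int):
        self.check_remote = check_remote       # callable(key) -> bool
        self._initial_budget = emergency_budget
        self.emergency_budget = emergency_budget

    def allow(self, key: str, sensitive: bool) -> bool:
        try:
            allowed = self.check_remote(key)
        except StoreUnavailable:
            if sensitive:
                return False                   # fail-closed for sensitive endpoints
            if self.emergency_budget > 0:      # fail-open, bounded by a local budget
                self.emergency_budget -= 1
                return True
            return False
        self.emergency_budget = self._initial_budget  # store is healthy again
        return allowed


def remote_down(key):
    raise StoreUnavailable("timeout")


limiter = FallbackLimiter(remote_down, emergency_budget=2)
sensitive = limiter.allow("acme:u42:login", sensitive=True)
normal = [limiter.allow("acme:u42:feed", sensitive=False) for _ in range(3)]
```

The budget keeps a short outage from turning the limiter into an unlimited pass-through on each instance.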

Metrics that matter

  • Allowed vs blocked request ratio
  • p95 limiter decision latency
  • Top keys by block count
  • Redis script error and timeout rate

Without these metrics, tuning limits becomes guesswork.
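As an illustration, the first three metrics can be derived from a stream of limiter decisions. The record shape `(key, allowed, latency_ms)` is an assumption, and the p95 uses the nearest-rank method; store error and timeout rates would come from the store client instead.

```python
from collections import Counter


def summarize(decisions):
    """Summarize an iterable of (key, allowed, latency_ms) records."""
    decisions = list(decisions)
    blocked = [d for d in decisions if not d[1]]
    latencies = sorted(d[2] for d in decisions)
    n = len(latencies)
    # Nearest-rank p95: the value at 1-indexed position ceil(0.95 * n).
    p95 = latencies[(95 * n + 99) // 100 - 1] if n else None
    return {
        "blocked_ratio": len(blocked) / n if n else 0.0,
        "p95_latency_ms": p95,
        "top_blocked_keys": Counter(d[0] for d in blocked).most_common(3),
    }


# Synthetic records: every fourth request is blocked, latencies are 1..20 ms.
records = [
    ("tenant-a" if i <= 12 else "tenant-b", i % 4 != 0, float(i))
    for i in range(1, 21)
]
summary = summarize(records)
```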

Reference architecture

A production setup usually includes policy evaluation near the API edge.

  • API gateway receives request and extracts identity context
  • Limiter service evaluates quota policy and key hierarchy
  • Redis executes atomic token-bucket logic via Lua
  • Decision headers (X-RateLimit-*) are returned to clients
  • Central policy store controls per-plan and per-route limits

Keeping policy and enforcement separated makes tuning safer.
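A small helper for the decision headers might look like this. The X-RateLimit-* names follow a common convention rather than a finalized standard, and services differ on whether the reset value is a delay or a timestamp; this sketch assumes a delay in seconds. Retry-After is a standard HTTP header, typically sent with a 429 response.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_after_s: float, allowed: bool) -> dict:
    """Build response headers describing the limiter decision."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(math.ceil(reset_after_s)),  # assumed: seconds from now
    }
    if not allowed:
        # Tell well-behaved clients how long to wait before retrying.
        headers["Retry-After"] = str(math.ceil(reset_after_s))
    return headers


ok_headers = rate_limit_headers(limit=100, remaining=42, reset_after_s=12.0, allowed=True)
blocked_headers = rate_limit_headers(limit=100, remaining=0, reset_after_s=2.3, allowed=False)
```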

Choosing limits without harming UX

Start with product behavior, not random numeric limits.

  • Interactive endpoints: lower burst and tighter sustained rate
  • Background ingestion APIs: larger burst but strict sustained budget
  • Authentication endpoints: strict limits with stronger abuse controls
  • Internal service calls: service-account specific limits

Run limits in observe mode first and inspect affected traffic cohorts.
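The endpoint classes above can be captured as a small policy table that keeps burst and sustained rate as separate knobs. Every number, route, and name below is a placeholder to show the shape, not a recommendation; real values should come from observed traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    burst: int              # bucket capacity
    sustained_per_min: int  # long-run allowed rate

    @property
    def refill_per_second(self) -> float:
        return self.sustained_per_min / 60


# Hypothetical values, one entry per endpoint class.
POLICIES = {
    "interactive": Policy(burst=10, sustained_per_min=60),
    "ingestion": Policy(burst=500, sustained_per_min=600),
    "auth": Policy(burst=3, sustained_per_min=6),
    "internal": Policy(burst=100, sustained_per_min=3000),
}

# Hypothetical mapping from route to endpoint class.
ROUTE_CLASS = {
    "POST /login": "auth",
    "POST /events/bulk": "ingestion",
}


def policy_for(route: str) -> Policy:
    """Look up the policy for a route, defaulting to the interactive class."""
    return POLICIES[ROUTE_CLASS.get(route, "interactive")]


login_policy = policy_for("POST /login")
default_policy = policy_for("GET /profile")
```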

Hardening patterns

As traffic grows, these controls keep a single hot identity or a slow dependency from turning into an outage.

  • Hot-key mitigation by sharding high-volume identities
  • Per-region local limiting plus global safety budget
  • Circuit-breaker behavior when Redis latency spikes
  • Separate quotas for expensive operations (search, exports, AI calls)

Rate limiting should evolve as product usage evolves.
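Hot-key sharding, the first pattern above, can be sketched as splitting one identity's quota across several keys so that no single key absorbs all of its traffic. The shard suffix format and helper names are assumptions, and the enforcement becomes approximate: a client can be blocked on one shard while another still has room.

```python
import random
from collections import Counter


def sharded_key(base_key: str, shards: int, rng: random.Random) -> str:
    """Pick one of `shards` sub-keys at random for this request."""
    return f"{base_key}:s{rng.randrange(shards)}"


def per_shard_limit(total_limit: int, shards: int) -> int:
    """Each shard enforces an equal slice of the total quota."""
    return max(1, total_limit // shards)


# Simulate 2000 requests from one hot identity with a total quota of 1000.
rng = random.Random(7)
counts = Counter()  # stand-in for per-key counters in the shared store
allowed = 0
for _ in range(2000):
    key = sharded_key("acme:bulk-bot:search", 4, rng)
    if counts[key] < per_shard_limit(1000, 4):
        counts[key] += 1
        allowed += 1
```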

Common product mistakes

Many problems come from policy design, not the limiter algorithm itself.

  • One limit for every endpoint
  • No difference between free and paid users
  • Limits chosen without observing normal user behavior
  • No useful response headers for clients

These mistakes create frustration even when the limiter is technically correct.

A good rollout strategy

Introduce new limits in observe mode first whenever possible.

  • Measure who would be blocked
  • Review noisy tenants and client bugs
  • Adjust burst and sustained limits separately
  • Turn on enforcement only after traffic looks healthy

This makes rate limiting feel predictable instead of random to users and teams.
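Observe mode can be as simple as a wrapper that records would-be blocks without acting on them. This is a minimal sketch; `check` stands for whatever function makes the real limit decision, and the class name is illustrative.

```python
from collections import Counter


class ObservedLimiter:
    def __init__(self, check, enforce: bool = False):
        self.check = check            # callable(key) -> bool, True if within limits
        self.enforce = enforce        # False means observe mode
        self.would_block = Counter()  # who would be blocked, and how often

    def allow(self, key: str) -> bool:
        if self.check(key):
            return True
        self.would_block[key] += 1
        return not self.enforce       # observe mode lets the request through


# Pretend one tenant is over its limit.
limiter = ObservedLimiter(check=lambda key: key != "tenant:noisy")
observed = [limiter.allow("tenant:noisy"), limiter.allow("tenant:quiet")]

limiter.enforce = True                # flip once the traffic looks healthy
enforced = limiter.allow("tenant:noisy")
```

The `would_block` counter is the data to review before enabling enforcement: it shows which tenants and clients the new limit would affect.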

Final takeaway

Rate limiting is a first-class architecture component. Treat it like core infrastructure: explicit policies, atomic distributed enforcement, and clear outage behavior.