Case Study · API Infrastructure

API Gateway Rate Limiter: Per-Tenant Limiting at 30K+ Req/Day with Sub-5ms Overhead

How I built a token-bucket rate limiting layer backed by atomic Redis Lua scripts, with circuit breaking and distributed tracing — zero incidents, sub-5ms P99 overhead across all tenants.

Nginx · Redis · Go · OpenTelemetry · Lua · Kubernetes
30K+ req/day throughput
<5ms P99 limiter overhead
0 production incidents

Problem & Constraints

A multi-tenant API platform needs per-tenant rate limiting that is consistent across all API gateway replicas, adds negligible latency overhead, and fails safely when the rate limiting backend is unavailable. The naive approach — a counter per tenant in a single Redis key — works but doesn't account for burst tolerance, clock drift across pods, or the thundering-herd problem when limits reset.

Requirements: per-tenant token bucket with configurable capacity and refill rate, sub-5ms P99 overhead on the limiter check itself, graceful degradation if Redis is unavailable (fail open with alerting), and full observability via OpenTelemetry traces.

Architecture Overview

The rate limiter runs as a Go middleware in the API gateway. Each incoming request carries a tenant ID (from the JWT or API key). The middleware executes a Redis Lua script that atomically reads the current token count, calculates the refill since the last request, clamps to bucket capacity, and either consumes a token or returns a rejection — all in a single Redis round-trip.

Tenant configurations (bucket capacity, refill rate) are stored in PostgreSQL and cached in-memory with a 60-second TTL. Cache misses hit PostgreSQL; a stale cache is tolerable because configuration changes are rare and take effect on next cache expiry.

Key Design Decisions

Token bucket over sliding window

Why: Token bucket allows short bursts up to bucket capacity while enforcing a steady average rate. This matches real API usage patterns — a client may fire 5 requests in 500ms and then idle for 10 seconds. A sliding window tuned to the same steady average rate would reject most of that burst.

Alternative: Sliding window: more precise rate enforcement but penalizes legitimate burst usage.

Lua script for atomic token check-and-consume

Why: A Lua script runs atomically on the Redis server — the read-modify-write cycle has no race condition. A non-atomic approach (GET then SET) allows two concurrent requests to both see a non-empty bucket and both consume a token.

Alternative: Redis transactions (MULTI/EXEC): achievable, but it costs extra round-trips and more complex error handling.

Fail-open on Redis unavailability

Why: The cost of a legitimate request being incorrectly rejected (lost revenue, broken user experience) is higher than the cost of a rate-limited tenant briefly exceeding their quota during a Redis outage. The outage itself triggers alerts and is resolved quickly.

Alternative: Fail-closed: safer against abuse but causes service disruption for all tenants during Redis downtime.

In-memory config cache with 60s TTL

Why: Rate limit configuration changes are rare operator actions. A 60-second delay is acceptable. Avoiding a PostgreSQL query per request reduces latency by ~3ms at median.

Alternative: Redis config cache: shares Redis's availability profile; PostgreSQL fallback is then lost if Redis is down.

Token Bucket Lua Script

token_bucket.lua
-- Token bucket rate limiter
-- KEYS[1]: rate limit key (e.g., "ratelimit:tenant:t123")
-- ARGV[1]: bucket capacity (max tokens)
-- ARGV[2]: refill rate (tokens per second)
-- ARGV[3]: current timestamp (milliseconds)
local key          = KEYS[1]
local capacity     = tonumber(ARGV[1])
local refill_rate  = tonumber(ARGV[2])
local now          = tonumber(ARGV[3])

local data         = redis.call("HMGET", key, "tokens", "last_refill")
local tokens       = tonumber(data[1]) or capacity
local last_refill  = tonumber(data[2]) or now

-- Refill tokens based on elapsed time
local elapsed      = math.max(0, (now - last_refill) / 1000.0)
local new_tokens   = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens < 1.0 then
  redis.call("HMSET", key, "tokens", new_tokens, "last_refill", now)
  redis.call("PEXPIRE", key, math.ceil(capacity / refill_rate) * 1000)
  return {0, math.floor(new_tokens * 1000)}  -- {rejected, tokens_milli}
end

new_tokens = new_tokens - 1.0
redis.call("HMSET", key, "tokens", new_tokens, "last_refill", now)
redis.call("PEXPIRE", key, math.ceil(capacity / refill_rate) * 1000)
return {1, math.floor(new_tokens * 1000)}  -- {allowed, tokens_milli}

Tradeoffs

The token bucket implementation stores tokens as a float. Floating-point arithmetic in Lua is deterministic on a given Redis version, but upgrades require regression testing the bucket behavior. An integer-based token representation (tokens * 1000 as an integer) would eliminate this risk at the cost of slightly reduced precision.
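That integer variant can be sketched in Go, mirroring the refill-and-consume math of the Lua script with tokens held as milli-token integers. `takeToken` and its parameter names are illustrative, not the production code.

```go
package main

// takeToken performs the token-bucket refill/consume step with tokens stored
// as integer milli-tokens (tokens * 1000), avoiding floating-point state.
//   tokensMilli:       current tokens * 1000
//   capacityMilli:     bucket capacity * 1000
//   refillMilliPerSec: refill rate * 1000 (milli-tokens per second)
//   elapsedMs:         milliseconds since the last refill
func takeToken(tokensMilli, capacityMilli, refillMilliPerSec, elapsedMs int64) (allowed bool, remainingMilli int64) {
	// Refill, clamped to capacity. Integer division truncates sub-milli
	// precision, which is the "slightly reduced precision" tradeoff.
	refilled := tokensMilli + elapsedMs*refillMilliPerSec/1000
	if refilled > capacityMilli {
		refilled = capacityMilli
	}
	if refilled < 1000 { // less than one whole token: reject
		return false, refilled
	}
	return true, refilled - 1000 // consume exactly one token
}
```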

Per-tenant Redis keys would grow unbounded as tenants are added. We set a key expiry (PEXPIRE) equal to the time to fully refill the bucket — a tenant that hasn't made a request in that window has their key evicted. On the next request, the bucket is initialized at full capacity, which is the correct behavior.

Observability

Every rate limit check emits an OpenTelemetry span with attributes: tenant_id, decision (allowed/rejected), tokens_remaining, and the Redis round-trip latency. A Prometheus counter tracks rejections per tenant per minute. An alert fires if any tenant's rejection rate exceeds 30% over a 5-minute window.

Production Insight

The rejection rate alert caught a client-side retry bug in a tenant's SDK: on 429 responses, the client was immediately retrying instead of backing off, creating a feedback loop that amplified the rate limiting. The alert + trace data made this diagnosable within 5 minutes.

Lessons & What I'd Improve

The 60-second config cache TTL was set arbitrarily. In practice, configuration changes need to propagate faster during incidents (e.g., emergency rate limit increase for a specific tenant). I'd add a cache invalidation signal via Redis Pub/Sub — when an operator updates a tenant's config, a message is published and all gateway replicas flush that tenant's config entry immediately.
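That invalidation path could look roughly like this. A buffered Go channel stands in for the Redis Pub/Sub subscription (e.g. a SUBSCRIBE on a hypothetical invalidation channel), and `InvalidatingCache` is a stripped-down stand-in for the real config cache.

```go
package main

import "sync"

// InvalidatingCache is a minimal config cache with explicit eviction.
type InvalidatingCache struct {
	mu   sync.Mutex
	data map[string]string // tenantID -> serialized config (placeholder)
}

func NewInvalidatingCache() *InvalidatingCache {
	return &InvalidatingCache{data: make(map[string]string)}
}

func (c *InvalidatingCache) Put(tenant, cfg string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[tenant] = cfg
}

func (c *InvalidatingCache) Has(tenant string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.data[tenant]
	return ok
}

func (c *InvalidatingCache) Evict(tenant string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.data, tenant)
}

// ListenInvalidations drains tenant IDs from msgs and evicts their cached
// config entry. In production, msgs would be fed by a Redis Pub/Sub
// subscription so every gateway replica flushes the entry immediately;
// here a channel stands in. done is closed when msgs is closed.
func ListenInvalidations(c *InvalidatingCache, msgs <-chan string, done chan<- struct{}) {
	go func() {
		for tenant := range msgs {
			c.Evict(tenant)
		}
		close(done)
	}()
}
```

After eviction, the next request for that tenant misses the cache and reloads the fresh config from PostgreSQL, so the TTL becomes an upper bound rather than the propagation delay.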

I'd also add a per-tenant rate limit dry-run mode: a flag that runs the token bucket check and records the decision without enforcing it. This allows us to test new configurations against live traffic before committing, reducing the risk of misconfigured limits disrupting tenants.