Problem & Constraints
Multi-tenant SaaS platforms need background job processing that is reliable, observable, and tenant-isolated. Off-the-shelf queues either sacrifice exactly-once semantics or require heavyweight coordination. KaryaFlow needed to process 40K+ daily jobs with sub-200ms P99 latency, zero cross-tenant interference, and automatic retry with exponential backoff — without a distributed coordinator.
Architecture Overview
The system uses a Redis-backed lease model: each job is atomically acquired with a Lua script that checks the lease key, sets a TTL, and returns the job in a single round-trip. Workers heartbeat every 5 seconds to refresh their lease TTLs; if a worker dies, the heartbeats stop, the lease expires, and another worker reclaims the job. A PostgreSQL jobs table is the source of truth; Redis is the scheduling layer only.
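The worker-side loop is not shown elsewhere in this write-up, so here is a minimal sketch of the acquire/heartbeat/release cycle, assuming Python workers with the redis-py client. The worker ID scheme, TTL value, script file name, and handler are illustrative, not taken from the actual system.

    # Minimal worker sketch: acquire a lease, heartbeat while working, release.
    import threading
    import uuid

    import redis

    # The Lua script from the "Lua Script: Atomic Lease Acquisition" section;
    # the file name is illustrative.
    ACQUIRE_LUA = open("acquire_lease.lua").read()

    r = redis.Redis()
    acquire = r.register_script(ACQUIRE_LUA)
    WORKER_ID = f"worker-{uuid.uuid4()}"
    LEASE_TTL = 15          # seconds; must comfortably exceed the heartbeat interval
    HEARTBEAT_INTERVAL = 5  # seconds, as described above

    def heartbeat(job_id: str, stop: threading.Event) -> None:
        # Refresh the lease TTL while the job is still running; if the worker
        # dies, the refreshes stop and the lease expires on its own.
        while not stop.wait(HEARTBEAT_INTERVAL):
            r.expire(f"lease:{job_id}", LEASE_TTL)

    def run_job(job_id: str, handler) -> None:
        if acquire(keys=[job_id], args=[WORKER_ID, LEASE_TTL]) != 1:
            return  # another worker holds the lease
        stop = threading.Event()
        threading.Thread(target=heartbeat, args=(job_id, stop), daemon=True).start()
        try:
            handler(job_id)  # actual work; completion is recorded in PostgreSQL
        finally:
            stop.set()
            r.delete(f"lease:{job_id}")  # release early instead of waiting for TTL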
Key Design Decisions
Redis Lua for lease acquisition
Why: A Lua script runs atomically server-side — no race condition between CHECK and SET. Eliminates the need for distributed locking.
Alternative: Pessimistic DB row locking: works but creates contention under high concurrency.
Lease expiry over heartbeat deletion
Why: If a worker crashes before deleting its lease, TTL-based expiry automatically makes the job available for retry — no cleanup process needed.
Alternative: Explicit lease deletion when heartbeats stop: more complex, same result, more moving parts.
PostgreSQL as source of truth
Why: Redis is ephemeral. If Redis data is lost, jobs must be recoverable. Postgres provides durability and allows audit queries (a minimal illustrative table layout follows this list).
Alternative: Redis-only: faster, but loses durability guarantees.
Tenant-hash sharding for Redis
Why: Consistent hashing by tenant ID gives isolation as a side effect — one tenant's job burst can't starve another's queue.
Alternative: Single Redis cluster: simpler, but creates noisy-neighbor risk.
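The write-up does not spell out the jobs-table layout, so here is a minimal illustrative schema consistent with the states and retry bookkeeping described elsewhere in this document, sketched in Python with psycopg2. Every column name and the connection string are assumptions, not the production schema.

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS jobs (
        id              UUID PRIMARY KEY,
        tenant_id       TEXT NOT NULL,
        payload         JSONB NOT NULL,
        status          TEXT NOT NULL DEFAULT 'pending',  -- pending | running | done | dead_letter
        attempts        INT  NOT NULL DEFAULT 0,
        idempotency_key TEXT,
        created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
        updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
    );
    """

    with psycopg2.connect("dbname=karyaflow") as conn:  # connection string illustrative
        with conn.cursor() as cur:
            cur.execute(DDL)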
Lua Script: Atomic Lease Acquisition
    -- Atomically acquire a job if it has no active lease
    local job_id = KEYS[1]
    local lease_key = "lease:" .. job_id
    local worker_id = ARGV[1]
    local lease_ttl = tonumber(ARGV[2]) -- seconds

    -- If lease exists, another worker owns it
    if redis.call("EXISTS", lease_key) == 1 then
      return 0 -- could not acquire
    end

    -- Acquire the lease atomically
    redis.call("SET", lease_key, worker_id, "EX", lease_ttl)
    return 1 -- acquired successfully

Tradeoffs
Redis as a scheduling layer means its view can diverge from the source of truth: if Redis goes down or loses data, workers see no jobs but PostgreSQL still has them. The failsafe is a periodic reconciliation query that re-enqueues jobs whose leases expired without completion. This adds ~30 seconds of delay in the worst case — acceptable for background processing but not for latency-sensitive tasks. For those, a direct database queue approach would be more reliable, at the cost of higher DB load.
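As a concrete illustration of that reconciliation pass, here is a sketch assuming Python with psycopg2 and redis-py. The column names and status values follow the illustrative schema above and are assumptions, not the production query.

    import psycopg2
    import redis

    r = redis.Redis()

    def reconcile(conn) -> None:
        # Re-enqueue jobs that PostgreSQL still shows as running but whose Redis
        # lease has already expired (the worker died without completing them).
        with conn.cursor() as cur:
            cur.execute("SELECT id::text FROM jobs WHERE status = 'running'")
            in_flight = [row[0] for row in cur.fetchall()]
            stale = [jid for jid in in_flight if not r.exists(f"lease:{jid}")]
            if stale:
                cur.execute(
                    "UPDATE jobs SET status = 'pending', updated_at = now() "
                    "WHERE id = ANY(%s::uuid[])",
                    (stale,),
                )
        conn.commit()

    # Running this on a timer is what produces the ~30 s worst-case delay noted above.
    # reconcile(psycopg2.connect("dbname=karyaflow"))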
Failure Handling
Three failure modes require explicit handling: (1) Worker crash mid-execution — lease expiry + reconciliation reclaims the job. (2) Idempotency violation — jobs carry an idempotency key stored in a Redis set with a 24h TTL; duplicate executions are no-ops. (3) Poison jobs — after 5 retries with exponential backoff (1s, 2s, 4s, 8s, 16s), jobs move to a dead-letter queue for manual inspection.
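A compressed sketch of paths (2) and (3), assuming Python and redis-py. The key and queue names, and the use of a sorted set for retry scheduling, are illustrative; the 24-hour idempotency window and the 1s to 16s backoff schedule come from the text above.

    import time

    import redis

    r = redis.Redis()
    MAX_RETRIES = 5
    IDEMPOTENCY_TTL = 24 * 3600  # seconds

    def already_executed(idempotency_key: str) -> bool:
        # SADD returns 0 when the member is already present, i.e. a duplicate
        # execution; the caller turns the job into a no-op in that case.
        added = r.sadd("idempotency:executed", idempotency_key)
        r.expire("idempotency:executed", IDEMPOTENCY_TTL)
        return added == 0

    def handle_failure(job_id: str, retries_used: int) -> None:
        if retries_used >= MAX_RETRIES:
            # Poison job: park it for manual inspection.
            r.lpush("dead_letter_queue", job_id)
        else:
            delay = 2 ** retries_used  # 1s, 2s, 4s, 8s, 16s
            r.zadd("retry_schedule", {job_id: time.time() + delay})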
Scaling Strategy
Workers are stateless Kubernetes pods — horizontal scaling adds throughput linearly. The bottleneck is Redis throughput (100K+ ops/sec on a single instance). For higher scale, the job namespace is hash-partitioned across multiple Redis instances by tenant ID; each tenant's jobs go to a deterministic shard. This gives tenant isolation as a side effect.
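A minimal sketch of the deterministic tenant-to-shard routing, assuming Python and redis-py. The shard hostnames are placeholders, and plain modulo routing is shown for brevity; the consistent-hashing scheme described above would use a hash ring to limit re-mapping when shards are added.

    import hashlib

    import redis

    SHARDS = [
        redis.Redis(host="redis-shard-0"),
        redis.Redis(host="redis-shard-1"),
        redis.Redis(host="redis-shard-2"),
    ]

    def shard_for_tenant(tenant_id: str) -> redis.Redis:
        # A stable hash (not Python's per-process hash()) keeps routing
        # deterministic across workers and restarts, so one tenant's burst
        # stays on its own shard.
        digest = hashlib.sha1(tenant_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # All of a tenant's keys (queues, leases, retry schedules) go through its shard:
    # shard_for_tenant("acme").lpush("queue:acme", job_id)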
Lessons & What I'd Improve
The hardest part wasn't the queue itself — it was the observability layer. Without per-job traces, debugging stuck queues meant correlating Redis MONITOR output with application logs manually. If I rebuilt this, I'd add OpenTelemetry spans from job enqueue to completion as a first-class concern, not an afterthought. I'd also replace the periodic reconciliation loop with a PostgreSQL LISTEN/NOTIFY trigger to reduce the 30-second failure window to under 1 second.
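For reference, a sketch of that LISTEN/NOTIFY approach, assuming Python and psycopg2. The channel name is hypothetical, and it presumes a trigger on the jobs table that issues NOTIFY with the job ID whenever a job's state changes.

    import select

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=karyaflow")  # connection string illustrative
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN job_state_changed;")

    while True:
        # Block until PostgreSQL signals a change instead of polling every ~30 s.
        if select.select([conn], [], [], 60) == ([], [], []):
            continue  # timeout: keep a periodic safety sweep as a backstop
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print("re-check job:", note.payload)  # would trigger a lease re-check / re-enqueue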