Problem & Constraints
Multi-tenant SaaS platforms need background job processing that is reliable, observable, and tenant-isolated. Off-the-shelf queues either sacrifice exactly-once semantics or require heavyweight coordination. KaryaFlow needed to process 40K+ daily jobs with sub-200ms P99 latency, zero cross-tenant interference, and automatic retry with exponential backoff — without a distributed coordinator.
Architecture Overview
The system uses a Redis-backed lease model: each job is atomically acquired with a Lua script that checks the lease key, sets a TTL, and returns the job in a single round-trip. Workers heartbeat every 5 seconds to refresh their lease TTLs; if a worker dies, the heartbeats stop, the lease expires, and another worker reclaims the job. A PostgreSQL jobs table is the source of truth; Redis is the scheduling layer only.
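The worker-side loop is not shown elsewhere in this write-up, so here is a minimal sketch of the acquire/heartbeat/release cycle, assuming Python workers with the redis-py client. The worker ID scheme, TTL value, script file name, and handler are illustrative, not taken from the actual system.

    # Minimal worker sketch: acquire a lease, heartbeat while working, release.
    import threading
    import uuid

    import redis

    # The Lua script from the "Lua Script: Atomic Lease Acquisition" section;
    # the file name is illustrative.
    ACQUIRE_LUA = open("acquire_lease.lua").read()

    r = redis.Redis()
    acquire = r.register_script(ACQUIRE_LUA)
    WORKER_ID = f"worker-{uuid.uuid4()}"
    LEASE_TTL = 15          # seconds; must comfortably exceed the heartbeat interval
    HEARTBEAT_INTERVAL = 5  # seconds, as described above

    def heartbeat(job_id: str, stop: threading.Event) -> None:
        # Refresh the lease TTL while the job is still running; if the worker
        # dies, the refreshes stop and the lease expires on its own.
        while not stop.wait(HEARTBEAT_INTERVAL):
            r.expire(f"lease:{job_id}", LEASE_TTL)

    def run_job(job_id: str, handler) -> None:
        if acquire(keys=[job_id], args=[WORKER_ID, LEASE_TTL]) != 1:
            return  # another worker holds the lease
        stop = threading.Event()
        threading.Thread(target=heartbeat, args=(job_id, stop), daemon=True).start()
        try:
            handler(job_id)  # actual work; completion is recorded in PostgreSQL
        finally:
            stop.set()
            r.delete(f"lease:{job_id}")  # release early instead of waiting for TTL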
Key Design Decisions
Redis Lua for lease acquisition
Why: A Lua script runs atomically server-side — no race condition between CHECK and SET. Eliminates the need for distributed locking.
Alternative: Pessimistic DB row locking: works but creates contention under high concurrency.
Lease expiry over heartbeat deletion
Why: If a worker crashes before deleting its lease, TTL-based expiry automatically makes the job available for retry — no cleanup process needed.
Alternative: Explicit lease deletion when heartbeats stop: more complex, same result, more moving parts.
PostgreSQL as source of truth
Why: Redis is ephemeral. If Redis data is lost, jobs must be recoverable. Postgres provides durability and allows audit queries (a minimal illustrative table layout follows this list).
Alternative: Redis-only: faster, but loses durability guarantees.
Tenant-hash sharding for Redis
Why: Consistent hashing by tenant ID gives isolation as a side effect — one tenant's job burst can't starve another's queue.
Alternative: Single Redis cluster: simpler, but creates noisy-neighbor risk.
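The write-up does not spell out the jobs-table layout, so here is a minimal illustrative schema consistent with the states and retry bookkeeping described elsewhere in this document, sketched in Python with psycopg2. Every column name and the connection string are assumptions, not the production schema.

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS jobs (
        id              UUID PRIMARY KEY,
        tenant_id       TEXT NOT NULL,
        payload         JSONB NOT NULL,
        status          TEXT NOT NULL DEFAULT 'pending',  -- pending | running | done | dead_letter
        attempts        INT  NOT NULL DEFAULT 0,
        idempotency_key TEXT,
        created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
        updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
    );
    """

    with psycopg2.connect("dbname=karyaflow") as conn:  # connection string illustrative
        with conn.cursor() as cur:
            cur.execute(DDL)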
Lua Script: Atomic Lease Acquisition
    -- Atomically acquire a job if it has no active lease
    local job_id = KEYS[1]
    local lease_key = "lease:" .. job_id
    local worker_id = ARGV[1]
    local lease_ttl = tonumber(ARGV[2]) -- seconds

    -- If lease exists, another worker owns it
    if redis.call("EXISTS", lease_key) == 1 then
      return 0 -- could not acquire
    end

    -- Acquire the lease atomically
    redis.call("SET", lease_key, worker_id, "EX", lease_ttl)
    return 1 -- acquired successfully

Tradeoffs
Redis as a scheduling layer means its view can diverge from the source of truth: if Redis goes down or loses data, workers see no jobs but PostgreSQL still has them. The failsafe is a periodic reconciliation query that re-enqueues jobs whose leases expired without completion. This adds ~30 seconds of delay in the worst case — acceptable for background processing but not for latency-sensitive tasks. For those, a direct database queue approach would be more reliable, at the cost of higher DB load.
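As a concrete illustration of that reconciliation pass, here is a sketch assuming Python with psycopg2 and redis-py. The column names and status values follow the illustrative schema above and are assumptions, not the production query.

    import psycopg2
    import redis

    r = redis.Redis()

    def reconcile(conn) -> None:
        # Re-enqueue jobs that PostgreSQL still shows as running but whose Redis
        # lease has already expired (the worker died without completing them).
        with conn.cursor() as cur:
            cur.execute("SELECT id::text FROM jobs WHERE status = 'running'")
            in_flight = [row[0] for row in cur.fetchall()]
            stale = [jid for jid in in_flight if not r.exists(f"lease:{jid}")]
            if stale:
                cur.execute(
                    "UPDATE jobs SET status = 'pending', updated_at = now() "
                    "WHERE id = ANY(%s::uuid[])",
                    (stale,),
                )
        conn.commit()

    # Running this on a timer is what produces the ~30 s worst-case delay noted above.
    # reconcile(psycopg2.connect("dbname=karyaflow"))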
Failure Handling
Three failure modes require explicit handling: (1) Worker crash mid-execution — lease expiry + reconciliation reclaims the job. (2) Idempotency violation — jobs carry an idempotency key stored in a Redis set with a 24h TTL; duplicate executions are no-ops. (3) Poison jobs — after 5 retries with exponential backoff (1s, 2s, 4s, 8s, 16s), jobs move to a dead-letter queue for manual inspection.
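A compressed sketch of paths (2) and (3), assuming Python and redis-py. The key and queue names, and the use of a sorted set for retry scheduling, are illustrative; the 24-hour idempotency window and the 1s to 16s backoff schedule come from the text above.

    import time

    import redis

    r = redis.Redis()
    MAX_RETRIES = 5
    IDEMPOTENCY_TTL = 24 * 3600  # seconds

    def already_executed(idempotency_key: str) -> bool:
        # SADD returns 0 when the member is already present, i.e. a duplicate
        # execution; the caller turns the job into a no-op in that case.
        added = r.sadd("idempotency:executed", idempotency_key)
        r.expire("idempotency:executed", IDEMPOTENCY_TTL)
        return added == 0

    def handle_failure(job_id: str, retries_used: int) -> None:
        if retries_used >= MAX_RETRIES:
            # Poison job: park it for manual inspection.
            r.lpush("dead_letter_queue", job_id)
        else:
            delay = 2 ** retries_used  # 1s, 2s, 4s, 8s, 16s
            r.zadd("retry_schedule", {job_id: time.time() + delay})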
Scaling Strategy
Workers are stateless Kubernetes pods — horizontal scaling adds throughput linearly. The bottleneck is Redis throughput (100K+ ops/sec on a single instance). For higher scale, the job namespace is hash-partitioned across multiple Redis instances by tenant ID; each tenant's jobs go to a deterministic shard. This gives tenant isolation as a side effect.
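A minimal sketch of the deterministic tenant-to-shard routing, assuming Python and redis-py. The shard hostnames are placeholders, and plain modulo routing is shown for brevity; the consistent-hashing scheme described above would use a hash ring to limit re-mapping when shards are added.

    import hashlib

    import redis

    SHARDS = [
        redis.Redis(host="redis-shard-0"),
        redis.Redis(host="redis-shard-1"),
        redis.Redis(host="redis-shard-2"),
    ]

    def shard_for_tenant(tenant_id: str) -> redis.Redis:
        # A stable hash (not Python's per-process hash()) keeps routing
        # deterministic across workers and restarts, so one tenant's burst
        # stays on its own shard.
        digest = hashlib.sha1(tenant_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # All of a tenant's keys (queues, leases, retry schedules) go through its shard:
    # shard_for_tenant("acme").lpush("queue:acme", job_id)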
Lessons & What I'd Improve
The hardest part wasn't the queue itself — it was the observability layer. Without per-job traces, debugging stuck queues meant correlating Redis MONITOR output with application logs manually. If I rebuilt this, I'd add OpenTelemetry spans from job enqueue to completion as a first-class concern, not an afterthought. I'd also replace the periodic reconciliation loop with a PostgreSQL LISTEN/NOTIFY trigger to reduce the 30-second failure window to under 1 second.
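For reference, a sketch of that LISTEN/NOTIFY approach, assuming Python and psycopg2. The channel name is hypothetical, and it presumes a trigger on the jobs table that issues NOTIFY with the job ID whenever a job's state changes.

    import select

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=karyaflow")  # connection string illustrative
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute("LISTEN job_state_changed;")

    while True:
        # Block until PostgreSQL signals a change instead of polling every ~30 s.
        if select.select([conn], [], [], 60) == ([], [], []):
            continue  # timeout: keep a periodic safety sweep as a backstop
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print("re-check job:", note.payload)  # would trigger a lease re-check / re-enqueue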