Back to Case StudyCase Study · Distributed Task Orchestration

KaryaFlow: A Distributed Autonomous Task Orchestration System

An agentic control plane with deterministic scheduling and optional retrieval-augmented decision intelligence — designed so no task remains idle without system intervention, and so correctness never depends on LLM availability.

GoKafkaPostgreSQLRedispgvectorOpenTelemetry
10K+
Active users target
100K+
Concurrent tasks
5 min
Cron sweep safety net

Problem & Constraints

Traditional task managers rely on manual updates and human discipline; tasks routinely sit idle without intervention. KaryaFlow replaces this with continuous monitoring, deterministic operational enforcement, mandatory interaction loops, and optional context-aware reasoning. The Discord layer is only an interface adapter — the core is an interface-agnostic backend designed to also integrate with Slack, email, internal chat, web dashboards, or custom enterprise software.

Constraints: scheduling correctness, escalation, and tenant isolation must not depend on LLM availability or retrieval quality. The system must remain operationally safe under broker duplication, Redis outage, AI provider timeout, and uneven per-tenant traffic.

Task Ownership & Lifecycle

Tasks flow through Pending → Processing → Review → Closed. Only active states (Pending, Processing, Review) are monitored by the agent. Operational conditions are modeled separately from the primary workflow state as flags/timestamps rather than peer states: is_blocked, escalated_at, and needs_manual_review. This keeps the lifecycle linear and the operational flags composable.

  • blocked — task cannot currently progress.
  • escalated — accountability escalation has occurred.
  • needs_manual_review — the agent could not safely classify or resolve the latest interaction.

Target-State vs Adoption Path

  • v1 — deterministic control loop: hybrid scheduler, rules-based state transitions, explicit escalation policy, no AI dependency in the hot path.
  • v2 — richer operational policy: config-driven thresholds, stronger reporting and operator controls, more nuanced rule engine.
  • v3 — intelligent augmentation: LLM-assisted response classification, retrieval-informed prompting, behavioral context for adaptive follow-up.

Architecture Overview

Initial deployment is a distributed monolith — all services live in one binary and communicate in-process, but are structured as independent modules. Kafka/NATS is wired in from day one so extracting a module later requires only a deployment change.

architecture.txt
Discord Adapter (Bot / Discord SDK)        API Gateway (Go  JWT-authenticated)        ------------------------------------------Task Service    Auth ServiceAgent Service   Retrieval ServiceNotification    AI ServiceService        ------------------------------------------        Message Broker (Kafka / NATS)        Worker Pool (Go routines)        Data Layer (Postgres + Redis + pgvector / Weaviate)        Observability (Prometheus + OpenTelemetry + Zap)

Extraction trigger

Agent Service extraction is considered when concurrent active tasks for a single tenant exceed 50,000, or service CPU stays above 70% for 10 minutes. Extraction is an operator decision after confirming stable event contracts, idempotent consumers, observability coverage, and a persistent hotspot — not a transient spike.

Agent Scheduler — Hybrid Model

The Agent Service is intentionally deterministic in the hot path. Scheduling, escalation, and state progression are correct without AI or retrieval; intelligent services improve interpretation and prompt quality but are not required for operational safety.

  • Primary path — event-triggered: every task state change or user response emits an event. The Agent consumes it, evaluates the task, and writes next_agent_check_at to the task record.
  • Safety net — cron sweep: a background goroutine runs every 5 minutes querying for tasks where next_agent_check_at ≤ now AND workflow_state IN ('pending','processing','review'). Catches any task whose event was missed due to broker failure or restart.
  • Combined, this guarantees at-least-once evaluation without relying solely on event delivery.

Agent Decision Tree

agent_decision.txt
Task enters evaluation                Has the user responded to the last prompt?    YES  Parse response intent (rules first, AI optional)                "start" / "progress"   update state, schedule next check                                           based on commitment                "blocked"               set blocked flag, notify manager                "delay: N hours"        next_agent_check_at = now + N hours                unclassified            mark needs_manual_review,                                             fall back to default follow-up       NO  How many consecutive missed responses?                1 miss   send reminder                2 misses  send re-check with higher urgency                3+ misses  trigger escalation                                notify manager (RBAC: manager role)                                attach context summary if available                                set escalated_at

Deterministic classification path: explicit rules and keyword extraction cover start, progress, blocked, done, and simple time commitments. Ambiguous text optionally calls the AI Service; if AI is unavailable or times out, the task is marked needs_manual_review and falls back to a default reminder path. Retrieval is consulted before sending a prompt — if a repeat-delay pattern is detected for the user, commitment windows are shortened and follow-up frequency is increased. If retrieval is unavailable, the Agent continues with default enforcement policy.

commitment_cap.txt
if user.delay_count_last_30_days >= 3:    max_commitment = min(user_requested_hours, 2)else:    max_commitment = user_requested_hours

In production these values are config-driven rather than hardcoded constants. Review ownership is policy-driven but assignment is explicit: when a task enters review, the system writes a concrete reviewer_id and reminders/escalations target that stored assignment rather than re-deriving ownership on each evaluation.

Key Design Decisions

Hybrid event-triggered + cron-sweep scheduler

Why: Events deliver low-latency reactions; the cron sweep is the correctness safety net for missed evaluations. Together they give at-least-once evaluation without depending on perfect broker delivery.

Alternative: Pure event-driven: simpler but a single missed event can leave a task stuck silently.

Deterministic hot path; AI and retrieval are augmentations

Why: Scheduling and escalation are correct without AI. If the model is unavailable, the system degrades to rule-based classification or marks needs_manual_review rather than halting the scheduler.

Alternative: LLM-in-the-loop control plane: better natural-language understanding, but every model outage is now a production outage.

Workflow state + operational flags (not peer states)

Why: blocked / escalated / needs_manual_review are composable conditions on top of a linear lifecycle. Keeps the state machine small and operational signals expressive.

Alternative: Add blocked / escalated as primary workflow states: more states, exploding transition matrix, ambiguous semantics.

Storage-backed idempotency via processed_events

Why: Every consumer records (consumer_name, event_id) before applying irreversible side effects. Duplicate redelivery becomes a no-op rather than a repeated escalation or notification.

Alternative: Trust broker exactly-once semantics: pragmatically unreliable; redelivery happens in real systems.

Commitment-based scheduling instead of fixed polling

Why: Avoids reminder spam and tracks commitment vs actual behavior. The Agent re-engages precisely at the user's promised time.

Alternative: Fixed-interval reminders: simpler but noisy and ignored quickly.

Embed at close-time, not creation-time

Why: Vectors represent completed behavioral patterns rather than intentions. Nightly batch refreshes user behavior logs.

Alternative: Embed on creation: cheap context but represents intent, not behavior.

Data Model

  • All core tables carry tenant_id (and division_id where relevant) for row-level isolation.
  • Soft deletes via deleted_at — no hard deletes on user or task records.
  • Append-only audit_logs for all state-changing operations.
  • Migrations managed with Ent + Atlas versioned migrations.
tasks.sql
CREATE TABLE tasks (    id                    UUID PRIMARY KEY DEFAULT gen_random_uuid(),    tenant_id             UUID NOT NULL REFERENCES tenants(id),    division_id           UUID NOT NULL REFERENCES divisions(id),    assignee_id           UUID NOT NULL REFERENCES users(id),    creator_id            UUID NOT NULL REFERENCES users(id),    reviewer_id           UUID REFERENCES users(id),    title                 TEXT NOT NULL,    description           TEXT,    workflow_state        TEXT NOT NULL DEFAULT 'pending',    -- pending | processing | review | closed    priority              TEXT NOT NULL DEFAULT 'medium',    due_at                TIMESTAMPTZ,    committed_until       TIMESTAMPTZ,    next_agent_check_at   TIMESTAMPTZ,    miss_count            INT NOT NULL DEFAULT 0,    is_blocked            BOOLEAN NOT NULL DEFAULT false,    blocked_at            TIMESTAMPTZ,    needs_manual_review   BOOLEAN NOT NULL DEFAULT false,    manual_review_at      TIMESTAMPTZ,    escalated_at          TIMESTAMPTZ,    embedding_id          UUID,    created_at            TIMESTAMPTZ NOT NULL DEFAULT now(),    updated_at            TIMESTAMPTZ NOT NULL DEFAULT now(),    deleted_at            TIMESTAMPTZ);-- Primary lookup path for the cron sweep — partial to stay fast at scaleCREATE INDEX idx_tasks_next_check ON tasks(next_agent_check_at)    WHERE workflow_state IN ('pending','processing','review')      AND deleted_at IS NULL;
processed_events.sql
CREATE TABLE processed_events (    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),    tenant_id     UUID NOT NULL REFERENCES tenants(id),    consumer_name TEXT NOT NULL,    event_id      UUID NOT NULL,    event_type    TEXT,    aggregate_id  UUID,    processed_at  TIMESTAMPTZ NOT NULL DEFAULT now());CREATE UNIQUE INDEX uq_processed_events_consumer_event    ON processed_events(consumer_name, event_id);

Auth, RBAC & Tenant Boundary

KaryaFlow does not own user credentials. External identity providers (Discord, Slack) are the source of truth for authentication. On first interaction the Auth Service auto-provisions an internal user record, then issues a short-lived JWT (15 min, RS256) containing user_id, tenant_id, division_id, and role. Refresh tokens (7d) are stored in Redis as refresh:{user_id} for server-side revocation.

RoleCreateAssign othersView division tasksEscalateReportsManage users
memberown tasks only
manager
admin✓ (all)
  • Token layer: JWT carries tenant_id; no cross-tenant token can be issued.
  • API layer: every handler injects tenant_id from the token, never from request parameters.
  • Database layer: all core tables carry tenant_id; PostgreSQL row-level security enforces tenant_id = current_setting('app.tenant_id') as defense-in-depth.

Embedding & Retrieval Pipeline

This subsystem is a later-phase augmentation layer. It improves classification quality, prompt personalization, and behavioral policy tuning, but the system remains operationally correct when retrieval is disabled.

Document typeContentEmbedded when
Task summarytitle + description + final workflow state + durationTask reaches Closed or long-blocked
Interaction summaryCondensed agent ↔ user thread per taskTask closed
User behavior logAggregated delay / block / miss counts per user/monthNightly batch job
retrieval_query.json
{  "query": "<task title + description>",  "filters": {    "tenant_id": "...",    "user_id": "...",    "doc_type": ["task_summary", "user_behavior_log"]  },  "top_k": 5,  "min_score": 0.75}

min_score = 0.75 cosine similarity. Below threshold the Agent proceeds with no retrieved context rather than low-quality context. Retrieved documents are injected as structured context blocks, not raw text, so the model can clearly distinguish historical patterns from the current task. If retrieval latency breaches its budget or the vector store is unavailable, the Agent continues with default policy — retrieval is allowed to fail closed into deterministic behavior.

Reliability & Failure Modes

  • Broker failure: the cron sweep remains the correctness safety net for missed evaluations; idempotent consumers absorb replay safely.
  • Redis failure: an optimization layer for caches and hot scheduling data; correctness-critical flows degrade to PostgreSQL-backed reads rather than stopping task progression.
  • AI timeout or provider failure: classification is bypassed; ambiguous responses fall back to rule-based parsing or needs_manual_review. Core scheduler timing and escalation continue unaffected.
  • Retrieval / vector-store failure: the Agent omits behavioral context and uses default commitment and escalation policy.
  • Bounded exponential-backoff retries; after the threshold, events move to a DLQ with depth and age as operational signals.

Operational Scenarios

Time-based commitment

Task pending for an extended period. Agent prompts: 'How much time do you need?' User says 'Need 2 hours'. Deterministic parsing extracts delay = 2h; AI is not required. next_agent_check_at and committed_until are set to now + 2h. After 2 hours the cron sweep picks up the task → Agent re-evaluates → Task moves to Processing. No unnecessary reminders; commitment enforced with precision.

Silent failure & escalation

Task in processing for 48h with no response. Agent prompt → no response in 4h → miss_count = 1 → reminder. No response in 2h → miss_count = 2 → urgent re-check. Still no response → miss_count = 3 → escalation. Agent sets escalated_at, publishes AGENT_ESCALATED, and Notification Service pings the manager with task context and optional behavior summary.

Repeated delay pattern

User with delay_count_30d = 4 requests 'I need 8 hours'. The configured cap applies: max_commitment = min(8, 2) = 2 hours. next_agent_check_at is set to now + 2h, overriding the request. Retrieval context confirms the pattern → Agent increases follow-up frequency. Nightly batch regenerates the user behavior embedding with updated stats.

AI timeout during classification

User responds ambiguously: 'trying, maybe later if blockers clear'. Rules cannot confidently classify. Agent calls AI Service; the request times out. Circuit breaker records the failure; the Agent marks needs_manual_review = true. next_agent_check_at is still scheduled using default reminder policy — no escalation or transition is lost. AI failure degrades decision quality, not system correctness.

Tradeoffs Summary

  • Distributed monolith vs microservices: faster development, with a defined operator-driven extraction path when per-tenant load or service CPU thresholds are sustained.
  • Event-driven vs synchronous calls: event-driven for cross-module communication; synchronous gRPC for real-time reads. Synchronous paths are bounded and explicit.
  • RAG vs traditional querying: complementary — SQL for structured queries, vector retrieval for behavioral patterns. Retrieval remains optional and the control loop runs without it.
  • LLM usage: only for response classification and prompt generation; scheduling and escalation remain deterministic. Graceful degradation under model outage.
  • Ent + Atlas vs raw SQL migrations: type-safe queries, automatic migration diffing, and schema drift detection in CI.
  • Reliability complexity vs operational safety: explicit idempotency, retries, DLQs, circuit breakers, and tenant-aware rate limiting from the beginning.
Share