Problem & Constraints
Compliance requirements mandate that every user action — authentication, data modification, permission change — be recorded durably and queryable within seconds. The system had 12 microservices producing events in inconsistent formats, no schema enforcement, and a direct-to-database audit write path that was creating lock contention under peak load. One missed write was a compliance violation.
Requirements: zero event loss, sub-100ms ingestion latency, retroactive schema enforcement across existing producers without a rewrite, and OLAP-queryable storage for compliance reports spanning months of data.
Architecture Overview
Each microservice publishes events to a Kafka topic using an idiomatic Go client wrapper that enforces Avro schema validation against Confluent Schema Registry before the produce call. The broker has replication factor 3 and min.insync.replicas=2, ensuring durability against single-node failure. A Go consumer group reads events, applies transformations, and bulk-inserts into ClickHouse via the native protocol.
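A minimal sketch of what that producer wrapper can look like, assuming confluent-kafka-go for the produce call, goavro for Avro encoding, and the Schema Registry REST API for fetching the latest schema. The AuditProducer and Publish names are illustrative, and the Confluent wire-format prefix (magic byte plus schema ID) is omitted for brevity:

package audit

import (
    "encoding/json"
    "fmt"
    "net/http"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
    "github.com/linkedin/goavro/v2"
)

// AuditProducer wraps a Kafka producer and refuses to produce any event
// that does not validate against the registered Avro schema.
type AuditProducer struct {
    producer *kafka.Producer
    codec    *goavro.Codec
    topic    string
}

// NewAuditProducer fetches the latest schema for the topic's value subject
// from Schema Registry and builds an Avro codec from it.
func NewAuditProducer(brokers, registryURL, topic string) (*AuditProducer, error) {
    resp, err := http.Get(fmt.Sprintf("%s/subjects/%s-value/versions/latest", registryURL, topic))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var latest struct {
        Schema string `json:"schema"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&latest); err != nil {
        return nil, err
    }
    codec, err := goavro.NewCodec(latest.Schema)
    if err != nil {
        return nil, err
    }
    p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": brokers, "acks": "all"})
    if err != nil {
        return nil, err
    }
    return &AuditProducer{producer: p, codec: codec, topic: topic}, nil
}

// Publish validates and Avro-encodes the event before the produce call.
// Events that fail validation never reach the broker; the caller must fix its payload.
func (a *AuditProducer) Publish(event map[string]interface{}) error {
    payload, err := a.codec.BinaryFromNative(nil, event)
    if err != nil {
        return fmt.Errorf("schema validation failed: %w", err)
    }
    return a.producer.Produce(&kafka.Message{
        TopicPartition: kafka.TopicPartition{Topic: &a.topic, Partition: kafka.PartitionAny},
        Key:            []byte(fmt.Sprint(event["tenant_id"])),
        Value:          payload,
    }, nil)
}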
ClickHouse stores events in a MergeTree table partitioned by date and sorted by (tenant_id, event_type, timestamp). This allows compliance queries like "all auth events for tenant X in March" to touch only a narrow partition range without a full scan.
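For reference, a sketch of that table definition issued through clickhouse-go's native-protocol client; the exact column types (LowCardinality, Map, DateTime64) are illustrative choices rather than the production DDL:

package audit

import (
    "context"

    "github.com/ClickHouse/clickhouse-go/v2"
)

// Audit events table: partitioned by day so compliance queries over a date
// range prune partitions, and ordered so (tenant, event type, time) filters
// scan a narrow slice of each partition.
const createAuditTable = `
CREATE TABLE IF NOT EXISTS audit_events (
    event_id   String,
    tenant_id  String,
    actor_id   String,
    event_type LowCardinality(String),
    resource   String,
    action     String,
    timestamp  DateTime64(3),
    metadata   Map(String, String)
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (tenant_id, event_type, timestamp)
`

func ensureAuditTable(ctx context.Context, addr string) error {
    conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{addr}})
    if err != nil {
        return err
    }
    return conn.Exec(ctx, createAuditTable)
}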
Key Design Decisions
Kafka with exactly-once producers
Why: enable.idempotence=true plus a transactional.id ensures each event is written to the broker exactly once, even on producer retry. Combined with consumer group offsets committed only after the corresponding batch has been durably inserted into ClickHouse, the pipeline is effectively exactly-once end to end; a producer configuration sketch follows this decision.
Alternative: At-least-once + deduplication at read time: simpler setup but requires dedup logic in every query.
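A minimal sketch of the transactional settings and the begin/commit cycle with confluent-kafka-go; the transactional.id shown is illustrative and must be stable and unique per producer instance:

package audit

import (
    "context"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// produceBatchTransactionally shows the idempotent + transactional settings
// and the begin/commit cycle; the actual Produce calls are passed in.
func produceBatchTransactionally(ctx context.Context, produce func(*kafka.Producer) error) error {
    p, err := kafka.NewProducer(&kafka.ConfigMap{
        "bootstrap.servers":  "kafka-1:9092,kafka-2:9092,kafka-3:9092",
        "enable.idempotence": true,                     // broker de-duplicates producer retries
        "transactional.id":   "audit-producer-example", // illustrative; must be stable and unique per instance
        "acks":               "all",                    // wait for min.insync.replicas acknowledgements
    })
    if err != nil {
        return err
    }
    defer p.Close()
    if err := p.InitTransactions(ctx); err != nil { // also fences zombie instances sharing the ID
        return err
    }
    if err := p.BeginTransaction(); err != nil {
        return err
    }
    if err := produce(p); err != nil { // the wrapper's validated Produce calls go here
        _ = p.AbortTransaction(ctx) // abort so consumers never see a partial batch
        return err
    }
    return p.CommitTransaction(ctx)
}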
Schema Registry enforcement at produce time
Why: Rejecting malformed events before they enter the pipeline is far cheaper than discovering schema drift at query time. The producer wrapper returns an error immediately if the event fails Avro validation — the producing service must fix its payload.
Alternative: Schema validation at consumer: allows bad events into the pipeline; consumer errors create lag and potential event loss.
ClickHouse over Elasticsearch for audit storage
Why: Compliance queries are aggregations over time ranges with filters — exactly what a columnar OLAP engine handles efficiently. ClickHouse's bulk insert throughput (millions of rows/sec) handles our peak load with headroom.
Alternative: Elasticsearch: strong full-text search but poor aggregate performance and expensive at this scale.
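To illustrate the query shape these reports take, here is a sketch of the "auth events for tenant X in March" report against the audit_events table sketched earlier, again assuming clickhouse-go; the report columns are illustrative:

package audit

import (
    "context"
    "fmt"
    "time"

    "github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// Count auth events per day and action for one tenant over a date range.
// The PARTITION BY / ORDER BY keys keep this to a narrow slice of the table.
const authReportQuery = `
SELECT toDate(timestamp) AS day, action, count() AS events
FROM audit_events
WHERE tenant_id = ?
  AND event_type = 'auth'
  AND timestamp >= ? AND timestamp < ?
GROUP BY day, action
ORDER BY day
`

func authReport(ctx context.Context, conn driver.Conn, tenantID string, from, to time.Time) error {
    rows, err := conn.Query(ctx, authReportQuery, tenantID, from, to)
    if err != nil {
        return err
    }
    defer rows.Close()
    for rows.Next() {
        var (
            day    time.Time
            action string
            events uint64
        )
        if err := rows.Scan(&day, &action, &events); err != nil {
            return err
        }
        fmt.Printf("%s  %-10s %d\n", day.Format("2006-01-02"), action, events)
    }
    return rows.Err()
}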
Consumer group with manual offset commit
Why: Auto-commit can commit offsets for events that have not yet been durably written to ClickHouse, creating a gap if the consumer crashes. Manual commit after a successful insert closes this gap; a consumer-loop sketch follows this decision.
Alternative: Auto-commit: simpler code, but allows data loss in crash scenarios.
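A sketch of that consumer loop, assuming confluent-kafka-go; the group ID and batch size are illustrative, and the ClickHouse write is abstracted behind an insertBatch callback:

package audit

import (
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// consumeAuditEvents reads events in batches, writes each batch to ClickHouse
// via insertBatch, and only then commits offsets. A crash between insert and
// commit re-delivers the batch instead of losing it.
func consumeAuditEvents(brokers, topic string, insertBatch func([]*kafka.Message) error) error {
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers":  brokers,
        "group.id":           "audit-clickhouse-writer", // illustrative group ID
        "enable.auto.commit": false,                     // commit only after a durable insert
        "auto.offset.reset":  "earliest",
    })
    if err != nil {
        return err
    }
    defer c.Close()
    if err := c.SubscribeTopics([]string{topic}, nil); err != nil {
        return err
    }
    batch := make([]*kafka.Message, 0, 1000) // illustrative batch size
    for {
        msg, err := c.ReadMessage(500 * time.Millisecond)
        if err != nil {
            if kerr, ok := err.(kafka.Error); !ok || kerr.Code() != kafka.ErrTimedOut {
                return err
            }
            // timeout on a quiet topic: flush whatever has accumulated
        } else {
            batch = append(batch, msg)
            if len(batch) < cap(batch) {
                continue // keep filling the batch
            }
        }
        if len(batch) == 0 {
            continue
        }
        if err := insertBatch(batch); err != nil {
            return err // retry/dead-letter handling is covered under Failure Modes
        }
        if _, err := c.Commit(); err != nil { // commits stored offsets for every partition in the batch
            return err
        }
        batch = batch[:0]
    }
}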
Event Schema & Avro Example
{ "type": "record", "name": "AuditEvent", "namespace": "com.worldtech.audit", "fields": [ { "name": "event_id", "type": "string" }, { "name": "tenant_id", "type": "string" }, { "name": "actor_id", "type": "string" }, { "name": "event_type", "type": "string" }, { "name": "resource", "type": "string" }, { "name": "action", "type": "string" }, { "name": "timestamp", "type": "long", "logicalType": "timestamp-millis" }, { "name": "metadata", "type": { "type": "map", "values": "string" }, "default": {} } ]}Tradeoffs
Exactly-once semantics come with a performance cost: transactional produces have ~20% lower throughput than non-transactional. At 5M events/day this is acceptable, but at 10x scale we would need to evaluate whether the compliance requirement could be met with at-least-once + idempotent consumers instead.
ClickHouse excels at batch inserts but has high latency for single-row lookups. For real-time audit trail display in the UI, we maintain a separate Redis sorted set per user that holds the last 100 events. ClickHouse handles the compliance reports; Redis handles the real-time UI.
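A sketch of that Redis path, assuming go-redis v9 and an illustrative recent_audit:<actor_id> key layout:

package audit

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// pushRecentEvent appends an event to the per-user sorted set, scored by
// timestamp, and trims it to the most recent 100 entries.
func pushRecentEvent(ctx context.Context, rdb *redis.Client, actorID, eventJSON string, tsMillis int64) error {
    key := "recent_audit:" + actorID // illustrative key layout
    pipe := rdb.TxPipeline()
    pipe.ZAdd(ctx, key, redis.Z{Score: float64(tsMillis), Member: eventJSON})
    pipe.ZRemRangeByRank(ctx, key, 0, -101) // keep only the 100 highest-scored (newest) entries
    _, err := pipe.Exec(ctx)
    return err
}

// recentEvents returns the newest-first payloads for the real-time UI view.
func recentEvents(ctx context.Context, rdb *redis.Client, actorID string) ([]string, error) {
    return rdb.ZRevRange(ctx, "recent_audit:"+actorID, 0, 99).Result()
}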
Failure Modes & Recovery
Three scenarios require explicit handling: (1) Kafka broker partition leader failure — Kafka automatically re-elects a leader from in-sync replicas; with replication factor 3, one broker failure is transparent. (2) ClickHouse insert failure — the consumer retries with exponential backoff up to 5 times before writing the failed batch to a dead-letter topic for manual replay. (3) Schema Registry unavailability — producers fail closed (reject the event) rather than fail open.
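A sketch of the retry-then-dead-letter handling for scenario (2), assuming confluent-kafka-go for the dead-letter produce; the topic name, attempt count, and backoff schedule are illustrative:

package audit

import (
    "context"
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// insertWithRetry retries the ClickHouse insert with exponential backoff and,
// after maxAttempts failures, parks the raw messages on a dead-letter topic
// for manual replay.
func insertWithRetry(ctx context.Context, dlq *kafka.Producer, batch []*kafka.Message,
    insert func(context.Context, []*kafka.Message) error) error {

    const maxAttempts = 5
    backoff := 200 * time.Millisecond
    var lastErr error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if lastErr = insert(ctx, batch); lastErr == nil {
            return nil
        }
        if attempt < maxAttempts {
            time.Sleep(backoff)
            backoff *= 2 // 200ms, 400ms, 800ms, 1.6s
        }
    }
    dlqTopic := "audit-events-dlq" // illustrative dead-letter topic name
    for _, msg := range batch {
        if err := dlq.Produce(&kafka.Message{
            TopicPartition: kafka.TopicPartition{Topic: &dlqTopic, Partition: kafka.PartitionAny},
            Key:            msg.Key,
            Value:          msg.Value, // original payload preserved for replay
        }, nil); err != nil {
            return err
        }
    }
    dlq.Flush(5000)
    return nil // batch is parked on the DLQ, so offsets can be committed and the consumer moves on
}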
Lessons & What I'd Improve
Schema evolution was harder than anticipated. Avro's backward/forward compatibility rules are correct but not obvious in practice — adding a field with no default breaks forward compatibility, which we discovered only when deploying a schema change to staging. I'd add a CI check that runs avro-tools compatibility against the registry on every schema change PR.
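One way to script that check is against the Schema Registry's REST compatibility endpoint instead of avro-tools; a sketch, with an illustrative subject name and registry URL:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

// checkCompatibility asks Schema Registry whether the candidate schema (the
// .avsc from the PR) is compatible with the latest registered version of the
// subject. Intended to run in CI and fail the build on incompatibility.
func checkCompatibility(registryURL, subject, candidateSchema string) (bool, error) {
    body, err := json.Marshal(map[string]string{"schema": candidateSchema})
    if err != nil {
        return false, err
    }
    url := fmt.Sprintf("%s/compatibility/subjects/%s/versions/latest", registryURL, subject)
    resp, err := http.Post(url, "application/vnd.schemaregistry.v1+json", bytes.NewReader(body))
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    var result struct {
        IsCompatible bool `json:"is_compatible"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return false, err
    }
    return result.IsCompatible, nil
}

func main() {
    schema, err := os.ReadFile(os.Args[1]) // path to the changed .avsc file
    if err != nil {
        panic(err)
    }
    ok, err := checkCompatibility("http://schema-registry:8081", "audit-events-value", string(schema))
    if err != nil {
        panic(err)
    }
    if !ok {
        fmt.Println("schema change is not compatible with the registered version")
        os.Exit(1)
    }
}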
Consumer lag monitoring deserves its own dashboard. We used Prometheus JMX metrics from the Kafka broker, but the lag alert threshold was tuned too conservatively — it fired on normal peak traffic. A moving average alert (lag > 10s average over 5 minutes) would have eliminated false positives while still catching real issues.