Case Study · Multi-Tenant Observability

SanketLens: A Multi-Tenant gRPC Observability Engine

A token-scoped ingestion gateway with gateway-enforced tenant labels, multi-level load shedding, and app-scoped opinionated reports — designed to remove repeated Prometheus / Grafana / Collector / Tempo / Loki setup from every backend codebase.

Go · gRPC · ClickHouse · PostgreSQL · Kafka · Redis · OpenTelemetry
  • 100K+ spans/sec target
  • 10K+ apps across tenants
  • <1s report P95 (30d window)

Problem & Constraints

Traditional backend observability requires every project to operate its own Prometheus, Grafana, OpenTelemetry Collector, Tempo, Loki, dashboards, scrape config, and alert rules. SanketLens replaces this with one gRPC endpoint, one ingest token and one report token per app, SDK presets per framework, and app-scoped reports with no PromQL knowledge required — while enforcing tenant isolation at the platform, not at the app.

Constraints: language-agnostic on the producer side, storage-agnostic on the consumer side, deterministic in the hot path, and operationally safe under SDK misbehavior, broker duplication, cache outages, and uneven per-tenant traffic.

Positioning

Strong positioning

A multi-tenant gRPC observability gateway that removes repeated observability setup from every backend codebase, with SDKs, CLI, scoped report APIs, load shedding, and tenant isolation built into the platform.

  • Target users: backend devs with multiple services, startups and agencies running many client apps, solo SaaS builders, OSS maintainers, and internal platform teams.
  • Not the initial target: large enterprises on Datadog / New Relic / Honeycomb / Grafana Cloud, strictly regulated environments needing day-one physical tenant isolation, or teams that require fully custom pipelines.

Target-State vs Adoption Path

This describes the target-state architecture, not day-one scope. The adoption path is layered so correctness never depends on the most complex pieces.

  • v1 — deterministic ingestion core: gRPC ingest gateway, token-based tenant resolution, label enforcement, Go SDK, MVP reports.
  • v2 — richer operational policy: per-tenant quotas, multi-level load shedding, Node.js / Python SDKs, opinionated report engine.
  • v3 — production hardening: SLO engine, deployment-comparison reports, embeddable admin-panel reports, Java/PHP/Rust SDKs, OTLP HTTP compatibility.

Architecture Overview

Initial deployment is a modular monolith — control plane, ingest gateway, telemetry pipeline, and report engine live in one binary and communicate in-process, but are structured as independent modules with clean boundaries. Kafka/NATS is wired in from day one as the internal event bus, so extracting any module later requires only a deployment change.

architecture.txt
SDK (Go / Node / Python / Java / PHP / Rust)
        |  gRPC + ingest token
        v
API Gateway / TLS Termination
----------------------------------------------
Ingest Gateway        Control Plane
Telemetry Pipeline    Report Engine
                      Auth & Token Service
----------------------------------------------
Internal Event Bus (Kafka / NATS)
Worker Pool (Go routines)

Data Layer
    PostgreSQL    tenants, projects, apps, tokens, quotas
    ClickHouse    metrics, spans, logs (tenant-labeled)
    Object store  long-term archive
    Redis         token cache, quota counters, hot scheduling

Observability (Prometheus + OpenTelemetry + Zap)

Extraction trigger

Ingest Gateway extraction is considered when sustained ingestion across any single tenant exceeds 50,000 spans/sec for 10 minutes, or service CPU stays above 70% for 10 minutes. Report Engine extraction follows the same pattern with a read-side P99 latency trigger above 2 seconds. Extraction is an operator decision after confirming stable event contracts, idempotent consumers, and observability coverage.

Key Design Decisions

Token decides tenant identity (gateway overwrites tenant labels)

Why: The single invariant on which multi-tenant safety rests. SDK-provided values for workspace/project/app/env are never trusted; the Gateway overwrites them with token-resolved values before forwarding.

Alternative: Trust SDK-supplied tenant labels: simpler but every misconfigured client becomes a tenant-isolation vulnerability.
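The overwrite step can be sketched as a pure function over the incoming batch. The struct and field names below are illustrative stand-ins, not SanketLens's actual types:

```go
package main

import "fmt"

// TokenScope is the identity resolved from the ingest token.
// Field names here are assumptions for illustration.
type TokenScope struct {
	WorkspaceID, ProjectID, AppID, EnvironmentID string
}

// Record is a simplified telemetry record carrying tenant labels.
type Record struct {
	WorkspaceID, ProjectID, AppID, EnvironmentID string
	Name                                         string
}

// enforceTenantLabels overwrites whatever the SDK sent with the
// token-resolved values -- client-supplied tenant labels are never trusted.
func enforceTenantLabels(scope TokenScope, recs []Record) []Record {
	for i := range recs {
		recs[i].WorkspaceID = scope.WorkspaceID
		recs[i].ProjectID = scope.ProjectID
		recs[i].AppID = scope.AppID
		recs[i].EnvironmentID = scope.EnvironmentID
	}
	return recs
}

func main() {
	scope := TokenScope{WorkspaceID: "ws-1", ProjectID: "p-1", AppID: "a-1", EnvironmentID: "prod"}
	// The SDK claims a different workspace; the gateway overwrites it.
	out := enforceTenantLabels(scope, []Record{{WorkspaceID: "ws-other", Name: "GET /users"}})
	fmt.Println(out[0].WorkspaceID) // ws-1
}
```

Because the overwrite happens before forwarding, a misconfigured or malicious client can at worst mislabel its own data, never another tenant's.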

Modular monolith with Kafka/NATS wired in from day one

Why: Faster development and simpler ops early, but no architectural rewrite is needed to extract modules later — only a deployment change.

Alternative: Microservices from day one: more moving parts, more ops surface, no business value until scale demands it.

Separate token kinds (ingest / report / admin / cli)

Why: A leaked ingest token cannot read reports; a leaked report token cannot ingest telemetry. Enforced at the interceptor, before any handler logic runs.

Alternative: Single all-powerful token: one leak exposes everything.
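The interceptor-level gate can be sketched as a kind-to-method-prefix check. The service names and the admin/cli policy below are assumptions for illustration (cli tokens are role-scoped and omitted here):

```go
package main

import (
	"fmt"
	"strings"
)

// kindAllowed maps a token kind to the gRPC method prefixes it may call.
// Service names are illustrative; "cli" is role-scoped and not modeled here.
var kindAllowed = map[string][]string{
	"ingest": {"/sanketlens.IngestService/"},
	"report": {"/sanketlens.ReportService/"},
	"admin":  {"/sanketlens."}, // admin spans all services in this sketch
}

// checkTokenKind runs before any handler logic and rejects calls whose
// token kind does not match the target method.
func checkTokenKind(kind, fullMethod string) error {
	for _, prefix := range kindAllowed[kind] {
		if strings.HasPrefix(fullMethod, prefix) {
			return nil
		}
	}
	return fmt.Errorf("PermissionDenied: token kind %q cannot call %s", kind, fullMethod)
}

func main() {
	// A leaked ingest token trying to read reports is rejected here,
	// before any report handler executes.
	err := checkTokenKind("ingest", "/sanketlens.ReportService/GetHealth")
	fmt.Println(err != nil) // true
}
```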

gRPC-only ingestion in v1; OTLP-HTTP as later compatibility mode

Why: Stronger typed contracts, streaming support, language-neutral SDK foundation. Browser UI + CLI mask friction for common access patterns.

Alternative: OTLP-HTTP as a first-class parallel path: broader compatibility but doubled surface area and weaker contracts.

Shared ClickHouse partitioned by (workspace_id, day)

Why: Tenant isolation enforced physically on disk via partition key, with token / gateway / query enforcement as additional layers. Lower cost and simpler ops than per-tenant clusters.

Alternative: Per-tenant DB by default: high isolation, prohibitive cost; reserved for enterprise compliance tiers.

Opinionated named reports, no public PromQL surface

Why: Reports answer concrete questions (is my app healthy, which endpoint is slow, are workers failing). Lower cognitive cost for backend developers who do not want to learn PromQL.

Alternative: Expose Grafana dashboards: more flexibility, but defeats the entire 'no observability setup' premise.

WAL-buffered acks; ack on WAL write, not ClickHouse confirmation

Why: SDK-visible latency stays low even when ClickHouse is in compaction-induced slowdown. Pipeline drains the WAL with rate limiting on recovery.

Alternative: Ack on storage confirmation: simpler invariant, but every storage hiccup propagates back to producers.
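A minimal in-memory sketch of the ack boundary, assuming a bounded WAL (the real implementation persists to disk):

```go
package main

import (
	"fmt"
	"sync"
)

// wal is an in-memory stand-in for the gateway's local write-ahead log.
type wal struct {
	mu      sync.Mutex
	entries [][]byte
	cap     int
}

// Append buffers the batch and returns immediately -- this is the point
// at which the SDK sees its ack, regardless of ClickHouse latency.
func (w *wal) Append(batch []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.entries) >= w.cap {
		return fmt.Errorf("wal full")
	}
	w.entries = append(w.entries, batch)
	return nil
}

// Utilization is what drives the 70% / 90% shedding thresholds.
func (w *wal) Utilization() float64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	return float64(len(w.entries)) / float64(w.cap)
}

func main() {
	w := &wal{cap: 10}
	if err := w.Append([]byte("span batch")); err == nil {
		fmt.Println("acked") // ack gated on WAL write, not storage confirmation
	}
	fmt.Println(w.Utilization()) // 0.1
}
```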

Ingest Gateway — Request Lifecycle

ingest_lifecycle.txt
gRPC ingest request arrives with token in metadata
    |
Token cache lookup (Redis → fall through to PostgreSQL)
    Not found / expired / revoked → reject Unauthenticated
    Found → load tenant scope + allowed services + limits
    |
Service identity check
    Service not in allowlist → reject PermissionDenied
    Allowed → continue
    |
Per-token rate limit (token bucket in Redis)
    Exceeded → ResourceExhausted, emit INGEST_THROTTLED
    OK → continue
    |
Quota check (workspace + app + signal type)
    Exceeded → ResourceExhausted, emit QUOTA_EXCEEDED
    OK → continue
    |
Overwrite tenant-labeled fields on every record:
    workspace_id, project_id, app_id, environment_id,
    service_id, instance_id
    |
Forward to Telemetry Pipeline → WAL ack
    |
Emit INGEST_ACCEPTED with byte / record counters

Under pressure the Gateway sheds in priority order:

  • keep counter-style metrics
  • downsample histograms
  • sample traces by configured rate (never drop root spans before children)
  • drop debug logs first, then info
  • reject records over the per-tenant cardinality cap
  • throttle the noisiest tokens before throttling shared infrastructure

Shed records are counted into dropped_spans_total, dropped_logs_total, rate_limited_requests_total, quota_exceeded_total, and high_cardinality_rejections, all surfaced through the Quota Report.
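The shedding decision can be sketched as a small policy function keyed on signal class and WAL utilization. The signal taxonomy and exact threshold table below are assumptions for illustration; only the 70% / 90% thresholds and the "never shed errors" rule come from the design:

```go
package main

import "fmt"

// shedAction decides what happens to a record class at a given WAL
// utilization, following the priority order described above.
func shedAction(signal string, walUtil float64) string {
	switch {
	case signal == "error_log" || signal == "counter":
		return "keep" // error signals and counter metrics are never shed
	case signal == "debug_log" && walUtil > 0.70:
		return "drop" // debug logs go first
	case signal == "info_log" && walUtil > 0.90:
		return "drop" // info logs go next
	case signal == "trace" && walUtil > 0.90:
		return "sample_harder" // tighten trace sampling, keep root spans
	default:
		return "keep"
	}
}

func main() {
	fmt.Println(shedAction("debug_log", 0.75)) // drop
	fmt.Println(shedAction("error_log", 0.95)) // keep
}
```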

Event Envelope & Delivery Contract

Event                 Producer            Consumer(s)
INGEST_ACCEPTED       Ingest Gateway      Quota Tracker, Telemetry Pipeline
INGEST_REJECTED       Ingest Gateway      Quota Tracker, Audit Log, Notification
INGEST_THROTTLED      Ingest Gateway      Quota Tracker, Report Engine, Notification
TELEMETRY_WRITTEN     Telemetry Pipeline  Report Engine, Usage Tracker
CARDINALITY_REJECTED  Telemetry Pipeline  Audit Log, Notification
TOKEN_REVOKED         Auth Service        Ingest Gateway cache invalidator
QUOTA_EXCEEDED        Quota Tracker       Notification, Report Engine
REPORT_FETCHED        Report Engine       Audit Log, Usage Tracker
  • Bus is at-least-once, never exactly-once.
  • All consumers are idempotent, keyed by event_id via the processed_events table.
  • Duplicate delivery is expected and handled, not treated as exceptional.
  • Every envelope carries workspace_id / project_id / app_id / environment_id to enforce scope isolation at the bus level.
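The at-least-once contract can be sketched with an idempotent consumer keyed by event_id. The in-memory set below is a stand-in for the processed_events table; a real consumer would check-and-insert inside one transaction:

```go
package main

import "fmt"

// consumer models an idempotent event handler. The processed map stands
// in for the processed_events table keyed by event_id.
type consumer struct {
	processed map[string]bool
	applied   int
}

// Handle applies an event's side effect at most once per event_id, so
// at-least-once delivery from the bus never produces duplicate effects.
func (c *consumer) Handle(eventID string) {
	if c.processed[eventID] {
		return // duplicate delivery: expected, not exceptional
	}
	c.processed[eventID] = true
	c.applied++ // side effect happens exactly once per event_id
}

func main() {
	c := &consumer{processed: map[string]bool{}}
	c.Handle("evt-1")
	c.Handle("evt-1") // redelivered by the broker
	fmt.Println(c.applied) // 1
}
```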

Control-Plane Schema (PostgreSQL)

Telemetry data lives in ClickHouse, not PostgreSQL. The relational model below covers control-plane state only — tenants, projects, apps, tokens, usage counters, and idempotency state. Migrations are managed with Ent + Atlas; ClickHouse migrations are managed separately via golang-migrate.

tokens.sql
CREATE TABLE tokens (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id     UUID NOT NULL REFERENCES workspaces(id),
    project_id       UUID REFERENCES projects(id),
    application_id   UUID REFERENCES applications(id),
    environment_id   UUID REFERENCES environments(id),
    kind             TEXT NOT NULL,            -- ingest | report | admin | cli
    token_hash       TEXT NOT NULL UNIQUE,     -- argon2/sha256 of bearer
    allowed_services JSONB,                    -- services this token may write
    rate_limit_rps   INT,
    quota_bytes_day  BIGINT,
    expires_at       TIMESTAMPTZ,
    revoked_at       TIMESTAMPTZ,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_tokens_workspace ON tokens(workspace_id);
CREATE INDEX idx_tokens_kind ON tokens(kind) WHERE revoked_at IS NULL;
spans_clickhouse.sql
CREATE TABLE spans (
    workspace_id    UUID,
    project_id      UUID,
    application_id  UUID,
    environment_id  UUID,
    service_id      UUID,
    service_name    LowCardinality(String),
    trace_id        UUID,
    span_id         UInt64,
    parent_span_id  UInt64,
    name            LowCardinality(String),
    kind            LowCardinality(String),
    status_code     LowCardinality(String),
    started_at      DateTime64(9),
    duration_ns     UInt64,
    attributes      Map(LowCardinality(String), String)
)
ENGINE = MergeTree
PARTITION BY (workspace_id, toYYYYMMDD(started_at))
ORDER BY (workspace_id, application_id, started_at, trace_id)
TTL started_at + INTERVAL 30 DAY;

Auth & Tenant Boundary

Workspace admins authenticate through an external OIDC provider; the platform's bearer tokens are the source of truth for machine identity (SDKs, CLIs, embedded admin panels). token_hash is stored, never the bearer secret. The Gateway holds resolved scopes in Redis keyed by hash with a 60s TTL; revocation publishes TOKEN_REVOKED so the cache invalidator drops the entry within seconds.

Token kind   Ingest           Reports          Settings
ingest       ✓                ✗                ✗
report       ✗                ✓                ✗
admin        ✗                ✓                ✓
cli          scoped by role   scoped by role   scoped by role
  • Token layer: every bearer resolves to exactly one workspace; no cross-workspace token can be issued.
  • Gateway layer: tenant-labeled fields in the payload are overwritten with values resolved from the token.
  • Query layer: every Report Engine query injects workspace_id (and application_id where relevant) from the token, never from request parameters. ClickHouse partitioning by workspace_id enforces this physically on disk.
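The query-layer injection can be sketched as a builder that only ever takes tenant identifiers from the token scope. The query text and placeholder style are illustrative; real code would bind parameters rather than interpolate:

```go
package main

import "fmt"

// Scope is resolved from the report token; request parameters never
// supply tenant identifiers.
type Scope struct {
	WorkspaceID   string
	ApplicationID string
}

// scopedSpanQuery injects workspace_id (and application_id when set)
// from the token scope into every ClickHouse query.
func scopedSpanQuery(s Scope) string {
	q := "SELECT name, quantile(0.95)(duration_ns) FROM spans WHERE workspace_id = {ws:UUID}"
	if s.ApplicationID != "" {
		q += " AND application_id = {app:UUID}"
	}
	return q + " GROUP BY name"
}

func main() {
	// The workspace filter is present no matter what the request asked for.
	fmt.Println(scopedSpanQuery(Scope{WorkspaceID: "ws-1", ApplicationID: "a-1"}))
}
```

Since the partition key leads with workspace_id, this filter also prunes partitions, so scope enforcement and query efficiency come from the same clause.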

Reliability & Failure Modes

The system is designed under the assumption that SDKs send malformed telemetry, brokers duplicate messages, caches fail, storage backends time out, and traffic is uneven across tenants.

  • ClickHouse failure: Gateway buffers to a local WAL; on recovery the pipeline drains with rate limiting. WAL >70% sheds low-priority logs; >90% increases trace sampling; error signals are never shed.
  • Redis failure: Gateway falls through to PostgreSQL for token validation; quota enforcement degrades to coarse-grained per-process counters until Redis returns.
  • Broker failure: telemetry path does not depend on the bus; control-plane consumers may replay, and idempotent handling prevents duplicate effects.
  • SDK misbehavior: per-tenant cardinality runaway is rejected with CARDINALITY_REJECTED; a noisy service is throttled at the token level before pressure reaches shared infrastructure.
  • Circuit breakers wrap external dependencies (object store, optional embedding APIs) so upstream slowness does not cascade.

Operational Scenarios

First Go monolith onboarding

Developer creates Workspace, Project, Application, Environment, and Services in the UI. Control plane issues an ingest token and a report token, each returned exactly once. Developer installs the Go SDK, adds the init block, the HTTP middleware, the worker wrapper, and SANKETLENS_ENDPOINT + SANKETLENS_TOKEN env vars. On startup the SDK performs a setup-validation handshake; reports appear within 60 seconds. End-to-end setup under 10 minutes with no local Prometheus / Grafana / Collector / Tempo / Loki.

Leaked ingest token used to read reports

Attacker calls ReportService.GetHealth with a leaked ingest token. The Report Service interceptor validates token kind and rejects with PermissionDenied. Audit log captures token id, source IP, and timestamp; operator rotates the token and the Gateway cache invalidator drops the entry within seconds. Token-type separation contains the leak.

ClickHouse degraded latency

ClickHouse enters a compaction-induced slowdown; insert latency rises from 50ms to 4s. The Ingest Gateway continues to accept telemetry and writes to the local WAL — acks stay fast because they are gated on WAL write, not ClickHouse confirmation. When WAL utilization crosses 70%, low-priority logs begin shedding; at 90%, traces sample more aggressively; error-level signals are never shed. SDK-visible behavior is unaffected for the first several minutes.

Tradeoffs Summary

  • Modular monolith vs microservices: faster early development, with an operator-driven extraction path when ingestion or report latency thresholds are sustained.
  • gRPC-only ingestion vs broad adoption: stronger contracts and streaming, with OTLP-HTTP reserved as a v3 compatibility mode rather than a parallel first-class path.
  • Shared multi-tenant backend vs per-tenant storage: lower cost and easier ops, compensated by three layers of isolation (token, gateway, query) plus physical partitioning in ClickHouse.
  • Opinionated reports vs raw dashboards: less flexibility for power users, dramatically lower cognitive cost for the target developer; internal Grafana exists only for operator debugging.
  • SDK depth first, language breadth later: each official SDK must reach 'boring setup' quality (install, init, middleware, restart) before being claimed as supported.
  • Reliability complexity vs operational safety: explicit idempotency, retries, DLQs, circuit breakers, WAL buffering, and multi-level rate limiting from the beginning — predictable behavior under failure instead of ad hoc operator intervention.