Problem & Constraints
Traditional backend observability requires every project to operate its own Prometheus, Grafana, OpenTelemetry Collector, Tempo, Loki, dashboards, scrape configs, and alert rules. SanketLens replaces all of this with one gRPC endpoint, one ingest token and one report token per app, SDK presets per framework, and app-scoped reports that require no PromQL knowledge — while enforcing tenant isolation at the platform layer rather than in each app.
Constraints: language-agnostic on the producer side, storage-agnostic on the consumer side, deterministic in the hot path, and operationally safe under SDK misbehavior, broker duplication, cache outages, and uneven per-tenant traffic.
Positioning
Strong positioning
A multi-tenant gRPC observability gateway that removes repeated observability setup from every backend codebase, with SDKs, CLI, scoped report APIs, load shedding, and tenant isolation built into the platform.
- Target users: backend devs with multiple services, startups and agencies running many client apps, solo SaaS builders, OSS maintainers, and internal platform teams.
- Not the initial target: large enterprises on Datadog / New Relic / Honeycomb / Grafana Cloud, strictly regulated environments needing day-one physical tenant isolation, or teams that require fully custom pipelines.
Target-State vs Adoption Path
This describes the target-state architecture, not day-one scope. The adoption path is layered so correctness never depends on the most complex pieces.
- v1 — deterministic ingestion core: gRPC ingest gateway, token-based tenant resolution, label enforcement, Go SDK, MVP reports.
- v2 — richer operational policy: per-tenant quotas, multi-level load shedding, Node.js / Python SDKs, opinionated report engine.
- v3 — production hardening: SLO engine, deployment-comparison reports, embeddable admin-panel reports, Java/PHP/Rust SDKs, OTLP HTTP compatibility.
Architecture Overview
Initial deployment is a modular monolith — control plane, ingest gateway, telemetry pipeline, and report engine live in one binary and communicate in-process, but are structured as independent modules with clean boundaries. Kafka/NATS is wired in from day one as the internal event bus, so extracting any module later requires only a deployment change.
```
SDK (Go / Node / Python / Java / PHP / Rust)
        ↓ gRPC + ingest token
API Gateway / TLS Termination
        ↓
----------------------------------------------
Ingest Gateway      │ Control Plane
Telemetry Pipeline  │ Report Engine
                    │ Auth & Token Service
----------------------------------------------
        ↓
Internal Event Bus (Kafka / NATS)
        ↓
Worker Pool (Go routines)
        ↓
Data Layer
 ├── PostgreSQL — tenants, projects, apps, tokens, quotas
 ├── ClickHouse — metrics, spans, logs (tenant-labeled)
 ├── Object store — long-term archive
 └── Redis — token cache, quota counters, hot scheduling
        ↓
Observability (Prometheus + OpenTelemetry + Zap)
```
Extraction trigger
Ingest Gateway extraction is considered when sustained ingestion across any single tenant exceeds 50,000 spans/sec for 10 minutes, or service CPU stays above 70% for 10 minutes. Report Engine extraction follows the same pattern with a read-side P99 latency trigger above 2 seconds. Extraction is an operator decision after confirming stable event contracts, idempotent consumers, and observability coverage.
Key Design Decisions
Token decides tenant identity (gateway overwrites tenant labels)
Why: The single invariant on which multi-tenant safety rests. SDK-provided values for workspace/project/app/env are never trusted; the Gateway overwrites them with token-resolved values before forwarding.
Alternative: Trust SDK-supplied tenant labels: simpler but every misconfigured client becomes a tenant-isolation vulnerability.
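A minimal sketch of the overwrite step; `TokenScope`, `Record`, and `OverwriteTenantLabels` are illustrative names, not the real SanketLens types:

```go
package main

import "fmt"

// TokenScope is the tenant identity resolved from the ingest token.
// Names are illustrative, not the actual SanketLens structs.
type TokenScope struct {
	WorkspaceID, ProjectID, AppID, EnvironmentID string
}

// Record is a telemetry record as received from the SDK.
type Record struct {
	WorkspaceID, ProjectID, AppID, EnvironmentID string
	Name                                         string
}

// OverwriteTenantLabels discards whatever tenant labels the SDK sent and
// stamps the token-resolved values; the SDK payload is never trusted.
func OverwriteTenantLabels(r Record, scope TokenScope) Record {
	r.WorkspaceID = scope.WorkspaceID
	r.ProjectID = scope.ProjectID
	r.AppID = scope.AppID
	r.EnvironmentID = scope.EnvironmentID
	return r
}

func main() {
	scope := TokenScope{WorkspaceID: "ws-real", ProjectID: "pr-real", AppID: "app-real", EnvironmentID: "env-real"}
	// A misconfigured (or malicious) SDK claims to belong to another workspace.
	spoofed := Record{WorkspaceID: "ws-victim", AppID: "app-victim", Name: "http.request"}
	fmt.Println(OverwriteTenantLabels(spoofed, scope).WorkspaceID) // prints "ws-real"
}
```

Because the overwrite happens before forwarding, a spoofed label never reaches storage.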
Modular monolith with Kafka/NATS wired in from day one
Why: Faster development and simpler ops early, but no architectural rewrite is needed to extract modules later — only a deployment change.
Alternative: Microservices from day one: more moving parts, more ops surface, no business value until scale demands it.
Separate token kinds (ingest / report / admin / cli)
Why: A leaked ingest token cannot read reports; a leaked report token cannot ingest telemetry. Enforced at the interceptor, before any handler logic runs.
Alternative: Single all-powerful token: one leak exposes everything.
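The interceptor check can be sketched without the gRPC plumbing; the service names and the `Authorize` helper are illustrative, standing in for what a unary interceptor would do with token metadata:

```go
package main

import (
	"fmt"
	"strings"
)

// requiredKind maps a gRPC service prefix to the token kind it demands.
// Service names are illustrative.
var requiredKind = map[string]string{
	"/sanketlens.IngestService/": "ingest",
	"/sanketlens.ReportService/": "report",
	"/sanketlens.AdminService/":  "admin",
}

// Authorize rejects mismatched token kinds before any handler logic runs,
// mirroring what the interceptor does with metadata-resolved token scope.
func Authorize(tokenKind, fullMethod string) error {
	for prefix, kind := range requiredKind {
		if strings.HasPrefix(fullMethod, prefix) {
			if tokenKind != kind {
				return fmt.Errorf("PermissionDenied: %s token cannot call %s", tokenKind, fullMethod)
			}
			return nil
		}
	}
	return fmt.Errorf("Unimplemented: %s", fullMethod)
}

func main() {
	fmt.Println(Authorize("ingest", "/sanketlens.ReportService/GetHealth")) // leaked ingest token → denied
	fmt.Println(Authorize("report", "/sanketlens.ReportService/GetHealth")) // nil: allowed
}
```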
gRPC-only ingestion in v1; OTLP-HTTP as later compatibility mode
Why: Stronger typed contracts, streaming support, language-neutral SDK foundation. Browser UI + CLI mask friction for common access patterns.
Alternative: OTLP-HTTP as a first-class parallel path: broader compatibility but doubled surface area and weaker contracts.
Shared ClickHouse partitioned by (workspace_id, day)
Why: Tenant isolation enforced physically on disk via partition key, with token / gateway / query enforcement as additional layers. Lower cost and simpler ops than per-tenant clusters.
Alternative: Per-tenant DB by default: high isolation, prohibitive cost; reserved for enterprise compliance tiers.
Opinionated named reports, no public PromQL surface
Why: Reports answer concrete questions (is my app healthy, which endpoint is slow, are workers failing). Lower cognitive cost for backend developers who do not want to learn PromQL.
Alternative: Expose Grafana dashboards: more flexibility, but defeats the entire 'no observability setup' premise.
WAL-buffered acks; ack on WAL write, not ClickHouse confirmation
Why: SDK-visible latency stays low even when ClickHouse is in compaction-induced slowdown. Pipeline drains the WAL with rate limiting on recovery.
Alternative: Ack on storage confirmation: simpler invariant, but every storage hiccup propagates back to producers.
Ingest Gateway — Request Lifecycle
```
gRPC ingest request arrives with token in metadata
        ▼
Token cache lookup (Redis → fall through to PostgreSQL)
 ├── Not found / expired / revoked → reject Unauthenticated
 └── Found → load tenant scope + allowed services + limits
        ▼
Service identity check
 ├── Service not in allowlist → reject PermissionDenied
 └── Allowed → continue
        ▼
Per-token rate limit (token bucket in Redis)
 ├── Exceeded → ResourceExhausted, emit INGEST_THROTTLED
 └── OK → continue
        ▼
Quota check (workspace + app + signal type)
 ├── Exceeded → ResourceExhausted, emit QUOTA_EXCEEDED
 └── OK → continue
        ▼
Overwrite tenant-labeled fields on every record:
  workspace_id, project_id, app_id, environment_id, service_id, instance_id
        ▼
Forward to Telemetry Pipeline → WAL ack
        ▼
Emit INGEST_ACCEPTED with byte / record counters
```
Under pressure the Gateway sheds in priority order: keep counter-style metrics, downsample histograms, sample traces by configured rate (never drop root spans before children), drop debug logs first and then info, reject records over the per-tenant cardinality cap, and throttle the noisiest tokens before throttling shared infrastructure. Shed records are counted into dropped_spans_total, dropped_logs_total, rate_limited_requests_total, quota_exceeded_total, and high_cardinality_rejections — all surfaced through the Quota Report.
Event Envelope & Delivery Contract
| Event | Producer | Consumer(s) |
|---|---|---|
| INGEST_ACCEPTED | Ingest Gateway | Quota Tracker, Telemetry Pipeline |
| INGEST_REJECTED | Ingest Gateway | Quota Tracker, Audit Log, Notification |
| INGEST_THROTTLED | Ingest Gateway | Quota Tracker, Report Engine, Notification |
| TELEMETRY_WRITTEN | Telemetry Pipeline | Report Engine, Usage Tracker |
| CARDINALITY_REJECTED | Telemetry Pipeline | Audit Log, Notification |
| TOKEN_REVOKED | Auth Service | Ingest Gateway cache invalidator |
| QUOTA_EXCEEDED | Quota Tracker | Notification, Report Engine |
| REPORT_FETCHED | Report Engine | Audit Log, Usage Tracker |
- Bus is at-least-once, never exactly-once.
- All consumers are idempotent, keyed by event_id via the processed_events table.
- Duplicate delivery is expected and handled, not treated as exceptional.
- Every envelope carries workspace_id / project_id / app_id / environment_id to enforce scope isolation at the bus level.
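The idempotency rule above can be sketched with an in-memory stand-in for the processed_events table; a real consumer would more likely dedupe with an INSERT ... ON CONFLICT (event_id) DO NOTHING in the same transaction as its side effect:

```go
package main

import "fmt"

// Consumer applies each event exactly once despite at-least-once delivery.
// The processed map stands in for the processed_events table.
type Consumer struct {
	processed map[string]bool
	applied   int
}

func NewConsumer() *Consumer { return &Consumer{processed: map[string]bool{}} }

// Handle returns true only when the event is applied for the first time.
func (c *Consumer) Handle(eventID string) bool {
	if c.processed[eventID] {
		return false // duplicate delivery: expected, not exceptional
	}
	c.processed[eventID] = true
	c.applied++
	return true
}

func main() {
	c := NewConsumer()
	for _, id := range []string{"e1", "e2", "e1", "e1", "e3"} { // broker redelivers e1
		c.Handle(id)
	}
	fmt.Println(c.applied) // 3
}
```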
Control-Plane Schema (PostgreSQL)
Telemetry data lives in ClickHouse, not PostgreSQL. The relational model below covers control-plane state only — tenants, projects, apps, tokens, usage counters, and idempotency state. Migrations are managed with Ent + Atlas; ClickHouse migrations are managed separately via golang-migrate.
```sql
CREATE TABLE tokens (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id     UUID NOT NULL REFERENCES workspaces(id),
    project_id       UUID REFERENCES projects(id),
    application_id   UUID REFERENCES applications(id),
    environment_id   UUID REFERENCES environments(id),
    kind             TEXT NOT NULL,           -- ingest | report | admin | cli
    token_hash       TEXT NOT NULL UNIQUE,    -- argon2/sha256 of bearer
    allowed_services JSONB,                   -- services this token may write
    rate_limit_rps   INT,
    quota_bytes_day  BIGINT,
    expires_at       TIMESTAMPTZ,
    revoked_at       TIMESTAMPTZ,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_tokens_workspace ON tokens(workspace_id);
CREATE INDEX idx_tokens_kind ON tokens(kind) WHERE revoked_at IS NULL;
```
For contrast, the tenant-partitioned spans table on the ClickHouse side:
```sql
CREATE TABLE spans (
    workspace_id    UUID,
    project_id      UUID,
    application_id  UUID,
    environment_id  UUID,
    service_id      UUID,
    service_name    LowCardinality(String),
    trace_id        UUID,
    span_id         UInt64,
    parent_span_id  UInt64,
    name            LowCardinality(String),
    kind            LowCardinality(String),
    status_code     LowCardinality(String),
    started_at      DateTime64(9),
    duration_ns     UInt64,
    attributes      Map(LowCardinality(String), String)
)
ENGINE = MergeTree
PARTITION BY (workspace_id, toYYYYMMDD(started_at))
ORDER BY (workspace_id, application_id, started_at, trace_id)
TTL started_at + INTERVAL 30 DAY;
```
Auth & Tenant Boundary
Workspace admins authenticate through an external OIDC provider; the platform's bearer tokens are the source of truth for machine identity (SDKs, CLIs, embedded admin panels). token_hash is stored, never the bearer secret. The Gateway holds resolved scopes in Redis keyed by hash with a 60s TTL; revocation publishes TOKEN_REVOKED so the cache invalidator drops the entry within seconds.
| Token kind | Ingest | Reports | Settings |
|---|---|---|---|
| ingest | ✓ | ✗ | ✗ |
| report | ✗ | ✓ | ✗ |
| admin | ✗ | ✓ | ✓ |
| cli | scoped by role | scoped by role | scoped by role |
- Token layer: every bearer resolves to exactly one workspace; no cross-workspace token can be issued.
- Gateway layer: tenant-labeled fields in the payload are overwritten with values resolved from the token.
- Query layer: every Report Engine query injects workspace_id (and application_id where relevant) from the token, never from request parameters. ClickHouse partitioning by workspace_id enforces this physically on disk.
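A sketch of query-layer injection: the tenant filter comes only from the token-resolved scope, so request parameters have no way to widen it. `HealthQuery` and its SQL are illustrative, shaped after the spans schema:

```go
package main

import "fmt"

// ReportScope is resolved from the report token, never from request
// parameters. Names are illustrative.
type ReportScope struct {
	WorkspaceID, ApplicationID string
}

// HealthQuery builds a ClickHouse query whose tenant filter is bound
// exclusively from the token scope; the caller supplies no tenant IDs.
func HealthQuery(scope ReportScope) (string, []any) {
	q := `SELECT status_code, count() AS n
FROM spans
WHERE workspace_id = ? AND application_id = ?
  AND started_at > now() - INTERVAL 1 HOUR
GROUP BY status_code`
	return q, []any{scope.WorkspaceID, scope.ApplicationID}
}

func main() {
	_, args := HealthQuery(ReportScope{WorkspaceID: "ws-1", ApplicationID: "app-1"})
	fmt.Println(args) // [ws-1 app-1]
}
```

Because the workspace_id predicate matches the partition key, ClickHouse also prunes other tenants' partitions at the storage layer.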
Reliability & Failure Modes
The system is designed under the assumption that SDKs send malformed telemetry, brokers duplicate messages, caches fail, storage backends time out, and traffic is uneven across tenants.
- ClickHouse failure: Gateway buffers to a local WAL; on recovery the pipeline drains with rate limiting. WAL >70% sheds low-priority logs; >90% increases trace sampling; error signals are never shed.
- Redis failure: Gateway falls through to PostgreSQL for token validation; quota enforcement degrades to coarse-grained per-process counters until Redis returns.
- Broker failure: telemetry path does not depend on the bus; control-plane consumers may replay, and idempotent handling prevents duplicate effects.
- SDK misbehavior: per-tenant cardinality runaway is rejected with CARDINALITY_REJECTED; a noisy service is throttled at the token level before pressure reaches shared infrastructure.
- Circuit breakers wrap external dependencies (object store, optional embedding APIs) so upstream slowness does not cascade.
Operational Scenarios
First Go monolith onboarding
Developer creates Workspace, Project, Application, Environment, and Services in the UI. Control plane issues an ingest token and a report token, each returned exactly once. Developer installs the Go SDK, adds the init block, the HTTP middleware, the worker wrapper, and SANKETLENS_ENDPOINT + SANKETLENS_TOKEN env vars. On startup the SDK performs a setup-validation handshake; reports appear within 60 seconds. End-to-end setup under 10 minutes with no local Prometheus / Grafana / Collector / Tempo / Loki.
Leaked ingest token used to read reports
Attacker calls ReportService.GetHealth with a leaked ingest token. The Report Service interceptor validates token kind and rejects with PermissionDenied. Audit log captures token id, source IP, and timestamp; operator rotates the token and the Gateway cache invalidator drops the entry within seconds. Token-type separation contains the leak.
ClickHouse degraded latency
ClickHouse enters a compaction-induced slowdown; insert latency rises from 50ms to 4s. The Ingest Gateway continues to accept telemetry and writes to the local WAL — acks stay fast because they are gated on WAL write, not ClickHouse confirmation. When WAL utilization crosses 70%, low-priority logs begin shedding; at 90%, traces sample more aggressively; error-level signals are never shed. SDK-visible behavior is unaffected for the first several minutes.
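The WAL-pressure thresholds can be expressed as a pure decision function. This is a sketch of one plausible policy; the `Signal` taxonomy and `ShouldShed` are illustrative, not the actual implementation:

```go
package main

import "fmt"

type Signal int

const (
	ErrorSignal Signal = iota // never shed
	Trace
	DebugLog
	InfoLog
)

// ShouldShed encodes the policy from the text: above 70% WAL utilization
// drop debug logs, above 90% also drop info logs and sample traces harder
// (keep only the configured fraction); error signals are never shed.
// sampleHash is a per-trace hash in [0,1).
func ShouldShed(util float64, s Signal, sampleHash, traceRate float64) bool {
	switch {
	case s == ErrorSignal:
		return false
	case util > 0.90:
		if s == Trace {
			return sampleHash >= traceRate // keep only the sampled fraction
		}
		return s == DebugLog || s == InfoLog
	case util > 0.70:
		return s == DebugLog
	default:
		return false
	}
}

func main() {
	fmt.Println(ShouldShed(0.75, DebugLog, 0, 1))    // true: low-priority logs go first
	fmt.Println(ShouldShed(0.95, ErrorSignal, 0, 1)) // false: errors never shed
}
```

Keeping the decision pure (inputs in, bool out) is what makes the hot path deterministic and unit-testable.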
Tradeoffs Summary
- Modular monolith vs microservices: faster early development, with an operator-driven extraction path when ingestion or report latency thresholds are sustained.
- gRPC-only ingestion vs broad adoption: stronger contracts and streaming, with OTLP-HTTP reserved as a v3 compatibility mode rather than a parallel first-class path.
- Shared multi-tenant backend vs per-tenant storage: lower cost and easier ops, compensated by three layers of isolation (token, gateway, query) plus physical partitioning in ClickHouse.
- Opinionated reports vs raw dashboards: less flexibility for power users, dramatically lower cognitive cost for the target developer; internal Grafana exists only for operator debugging.
- SDK depth first, language breadth later: each official SDK must reach 'boring setup' quality (install, init, middleware, restart) before being claimed as supported.
- Reliability complexity vs operational safety: explicit idempotency, retries, DLQs, circuit breakers, WAL buffering, and multi-level rate limiting from the beginning — predictable behavior under failure instead of ad hoc operator intervention.