Lighthouse Analytics
Tech lead · 9 months · team of 3
The Problem
The existing analytics service was choking. Dashboards polled Postgres every 30 seconds; queries took 8–12s; engineers had limited the product to 50 dashboards per workspace just to keep things upright. Customers wanted realtime; what they had was hourly batch dressed up as live.
Approach
- 1
Migrated the hot path off Postgres onto ClickHouse with materialised views per common cardinality.
- 2
Wrote a small query planner that recognises the top 80% of query shapes and pre-aggregates them at write time.
- 3
Replaced polling with SSE streams, with smart cache invalidation driven by NOTIFY channels.
- 4
Built a synthetic load test rig so we could measure regressions in CI, not in prod.
Outcomes
p95 dashboard load
3.2s→480ms
p99 query latency
11s→1.4s
Events ingested / day
800k→12M
Dashboards per workspace
50→500+
Context
Lighthouse Analytics started as an internal observability tool for a fintech client. The dashboard team had patched together a Postgres-backed reporting layer over two years, and it worked — until it didn't. The product grew faster than the schema could accommodate: new event types, higher cardinality dimensions, and pressure from sales to surface realtime data that customers could act on, not just review.
When I joined as tech lead, the system was ingesting around 800,000 events per day and struggling. Queries spanning more than a week of data could stall the Postgres primary. Engineers had imposed a hard cap of 50 dashboards per workspace to protect database CPU. The product team wanted to remove that cap and eventually offer per-second refresh rates.
The migration to ClickHouse was the structural change, but the real engineering was in the query planner. Analytics workloads follow a Pareto distribution: 20% of query shapes account for 80% of the volume. By identifying those shapes at write time and materialising their aggregations — count, sum, p95, and a handful of custom rollups — we could turn what used to be expensive full-table scans into simple lookups against pre-aggregated tables. The tail queries that didn't match a known shape still hit ClickHouse directly, but ClickHouse handles arbitrary aggregation orders of magnitude faster than Postgres on columnar data.
The SSE streaming layer replaced a polling architecture that was hammering the database every 30 seconds per open dashboard. With NATS as a lightweight pub-sub bus and Postgres NOTIFY triggers feeding it, we could push invalidation events to the front end the moment new data landed — eliminating unnecessary round-trips and making the dashboard feel alive without the engineering cost of a full WebSocket infrastructure.
The outcome numbers — p95 load time dropping from 3.2 seconds to 480 milliseconds, daily ingestion scaling from 800k to 12 million events, the workspace dashboard cap lifted from 50 to 500+ — were validating, but the quieter win was confidence. The engineering team could ship new query types without fear that one bad dashboard would take down the primary database. That calm is worth more than any single latency figure.
Built with Next.js 15, tRPC for the API layer, ClickHouse 24 as the primary analytical store, Postgres 16 for transactional data, Redis for hot-path caching, NATS for event fanout, and deployed on Vercel with edge-cached static assets. The team was three engineers; I designed the architecture, led the ClickHouse migration, and built the query planner and SSE transport layer.
Technology Stack
ClickHouse 24
Columnar store purpose-built for analytical queries. Handles GROUP BY and aggregation on hundreds of millions of rows without breaking a sweat. Materialised views make pre-aggregation automatic.
tRPC
End-to-end type safety between the Next.js API layer and the React frontend. Eliminates an entire class of runtime errors in the query builder, where parameter mismatches would have been silent.
Server-Sent Events
One-way streaming from server to client — simpler to operate than WebSockets for a dashboard use case where the client mostly listens. HTTP/2 multiplexing means no connection-per-tab overhead.
NATS
Lightweight pub-sub bus for internal event fanout. Postgres NOTIFY has hard limits on payload size and subscriber count; NATS handles both gracefully at our event volume.
Redis
Hot-path cache for the most frequently accessed aggregations. Shaves another 50ms off the median response for popular dashboard configurations that haven't changed since the last cache write.
Interested in working together?
Open to full-time remote roles worldwide · Senior IC and tech lead positions · International relocation considered · Full-time or contract · Any time zone. The fastest path is email — I usually reply within a day.