Done
Phase 0 — Project bootstrap
Turn the single-file prototype into a structured project.
Monorepo
Done
pnpm workspaces with apps/web as the main application.
Clean Architecture
Done
domain / data / infrastructure / presentation layers, with typed entities, use cases and repository pattern.
Design system
Done
design-tokens.ts and the Tailwind config are kept in sync by a test. Dark mode by default.
Generator tests
Done
Vitest covering GenerateDynamicSimulationUseCase (BFS, fan-out, leaf detection, scenarios).
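As an illustration of what such tests exercise, here is a minimal sketch of a breadth-first traversal that also reports fan-out nodes and leaves. The `FlowEdge` shape and the `traverseFlow` function are hypothetical; the actual GenerateDynamicSimulationUseCase may be structured differently.

```typescript
// Hypothetical graph shape; the real entities in the project may differ.
type FlowEdge = { from: string; to: string };

interface TraversalResult {
  order: string[];  // nodes in BFS visit order
  fanOut: string[]; // nodes with more than one outgoing edge
  leaves: string[]; // nodes with no outgoing edge
}

function traverseFlow(rootId: string, edges: FlowEdge[]): TraversalResult {
  // Build an adjacency list once so each lookup is cheap.
  const children = new Map<string, string[]>();
  for (const edge of edges) {
    const list = children.get(edge.from) ?? [];
    list.push(edge.to);
    children.set(edge.from, list);
  }

  const result: TraversalResult = { order: [], fanOut: [], leaves: [] };
  const visited = new Set<string>([rootId]);
  const queue: string[] = [rootId];

  while (queue.length > 0) {
    const current = queue.shift() as string;
    result.order.push(current);

    const next = children.get(current) ?? [];
    if (next.length === 0) result.leaves.push(current);
    if (next.length > 1) result.fanOut.push(current);

    for (const child of next) {
      if (!visited.has(child)) {
        visited.add(child);
        queue.push(child);
      }
    }
  }
  return result;
}

// Example graph modeled on a Kafka fan-out with three consumers.
const kafkaFlow: FlowEdge[] = [
  { from: "order-api", to: "kafka" },
  { from: "kafka", to: "payment" },
  { from: "kafka", to: "inventory" },
  { from: "kafka", to: "email" },
];
const traversal = traverseFlow("order-api", kafkaFlow);
```

In this example `kafka` is the only fan-out node and the three consumers are the leaves.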
Current
Phase 1 — Educational experience
Client-side distributed tracing educational app. Interactive canvas, 43 built-in scenarios and local persistence.
Interactive canvas
Done
Drag & drop components, connect nodes, delete by key, reset.
- Render nodes and edges from the store
- RAF animation of edge progress
- Drag to position nodes on the canvas
- Connection mode (click to connect)
- Delete by Backspace/Delete key
Local persistence
In Progress
Custom templates survive refresh via LocalStorage. Built-ins are read-only.
- LocalStorageFlowRepository implemented
- Create custom templates via dialog
- Auto-save when the canvas is modified
- Export / Import templates as .json
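The persistence rules above (custom templates survive refresh, built-ins are read-only) can be sketched as follows. This is an assumption-laden illustration, not the project's LocalStorageFlowRepository: the `FlowTemplate` shape, the `KeyValueStore` interface, the `StoredFlowRepository` class and the `custom-templates` key are all hypothetical.

```typescript
// Hypothetical entity; the real template entity may carry more fields.
interface FlowTemplate {
  id: string;
  title: string;
  builtIn: boolean;
}

// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

class InMemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

class StoredFlowRepository {
  constructor(
    private readonly store: KeyValueStore,
    private readonly builtIns: FlowTemplate[],
    private readonly storageKey = "custom-templates", // assumed key name
  ) {}

  private readCustom(): FlowTemplate[] {
    const raw = this.store.getItem(this.storageKey);
    return raw ? (JSON.parse(raw) as FlowTemplate[]) : [];
  }

  list(): FlowTemplate[] {
    return [...this.builtIns, ...this.readCustom()];
  }

  save(template: FlowTemplate): void {
    // Built-ins are read-only: reject writes that target them.
    if (template.builtIn || this.builtIns.some((t) => t.id === template.id)) {
      throw new Error(`Template ${template.id} is read-only`);
    }
    const others = this.readCustom().filter((t) => t.id !== template.id);
    this.store.setItem(this.storageKey, JSON.stringify([...others, template]));
  }
}
```

In a browser, `window.localStorage` satisfies `KeyValueStore` structurally, so it could be passed in place of `InMemoryStore`.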
Analysis panels
Done
Waterfall, Spans and AI Analysis with severity, tips and fix suggestions.
- Waterfall panel with per-node latency
- Jaeger-style Spans panel
- AI Analysis panel (hardcoded)
Light / Dark mode
To Do
Theme toggle persisted in localStorage. Dark mode styles already in place via Tailwind dark: classes — only the toggle button is missing.
Responsive
To Do
Currently desktop-only (fixed SVG viewBox). Needs canvas adaptation for small screens.
Real AI Analysis
To Do
Hook into Groq or Gemini (free tier) to generate dynamic analysis instead of the hardcoded text.
Current
Components Catalogue (40)
Draggable component library in the sidebar. Grouped by category for reference — the UI renders as a single list.
Clients · Edge · Services · Data · AI · Messaging · External · Observability · Infra · AWS
Security · Containers
📱 Clients (2)
Frontend (React), Mobile (iOS/Android)
🚪 Edge / entry (4)
CDN, Load Balancer, API Gateway, Auth
⚙️ Services (5)
Backend, gRPC, GraphQL, WebSocket, Worker
🗄️ Data stores (5)
PostgreSQL, MongoDB, Redis, Elasticsearch, Vector DB
🤖 AI / LLM (1)
LLM API (OpenAI, Groq, Anthropic)
📡 Messaging (3)
Kafka, RabbitMQ, Pub/Sub
💳 External (4)
Stripe, Email API, Twilio, Analytics
📈 Observability (3)
Prometheus (metrics), Jaeger (tracing), Loki (logs)
🔒 Infra support (2)
Vault (secrets), Feature Flags (LaunchDarkly)
🚀 Serverless / AWS (7)
Lambda, DynamoDB, Step Functions, EventBridge, SNS, SQS, S3
🛡️ Security (1)
WAF (Web Application Firewall)
🐳 Containers (3)
Docker, K8s Service, Envoy Sidecar
Current
Templates Catalogue (43)
Each template is a pre-built educational scenario with 3 variants (success / slow / error) and analysis. Built-ins are read-only. To edit one, create a custom template from the + button in the sidebar.
Simple
Done
Frontend → Backend → PostgreSQL. The "hello world" of distributed tracing — synchronous request, one query, timeout in the error scenario.
- Uses: Frontend, Backend, PostgreSQL
- Success: login in ~140ms, JWT returned
- Slow: slow database query (~2.8s)
- Error: database timeout → 500
Redis Cache
Done
Cache-aside pattern: HIT fetches from Redis (2ms), MISS goes to the database (45ms), DOWN exposes the risk without a circuit breaker.
- Uses: Frontend, Backend, Redis, PostgreSQL
- Success: cache HIT in 8ms, database untouched
- Slow: cache MISS → fallback to DB (~85ms)
- Error: Redis DOWN → all traffic hits the database
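A minimal sketch of the cache-aside read path described above, using in-memory maps as hypothetical stand-ins for Redis and PostgreSQL. All names (`loadUser`, `cacheGet`, `UserRow`) are illustrative assumptions.

```typescript
// Hypothetical in-memory stand-ins for Redis and PostgreSQL.
type UserRow = { id: string; email: string };

const cacheEntries = new Map<string, UserRow>();
const databaseRows = new Map<string, UserRow>([
  ["42", { id: "42", email: "ana@example.com" }],
]);
let databaseReads = 0;     // counts how often the database is hit
let cacheAvailable = true; // flip to false to simulate Redis DOWN

function cacheGet(key: string): UserRow | undefined {
  if (!cacheAvailable) throw new Error("cache unavailable");
  return cacheEntries.get(key);
}

function loadUser(id: string): UserRow | undefined {
  try {
    const cached = cacheGet(id);
    if (cached) return cached; // HIT: database untouched
  } catch {
    // Cache DOWN: fall through to the database. With no circuit breaker or
    // limit here, every request lands on the database.
  }
  databaseReads += 1;
  const row = databaseRows.get(id); // MISS (or cache DOWN)
  if (row && cacheAvailable) cacheEntries.set(id, row); // populate for next time
  return row;
}
```

Calling `loadUser("42")` twice hits the database only once. With `cacheAvailable` set to `false`, every call reaches the database, which is the risk the error scenario exposes.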
Stripe API
Done
Synchronous external API: Stripe approves the charge, then the database saves it. The slow scenario introduces a timeout and a circuit breaker. The error scenario shows a DECLINED card with no database write, so the database stays consistent.
- Uses: Frontend, Order API, Stripe, PostgreSQL
- Success: charge approved + database updated
- Slow: Stripe latency → circuit breaker trips
- Error: card DECLINED, no DB write (correct)
Kafka Fan-out
Done
Event-driven with 3 parallel consumers (Payment, Inventory, Email). Error scenario is the classic "paid without stock" that motivates saga + compensation.
- Uses: Order API, Kafka, 3 Workers (Payment, Inventory, Email)
- Success: 202 fast, 3 workers process in parallel
- Slow: Payment worker lagging, email sent before charge
- Error: email sent but payment failed — classic saga motivator
SQS Async
Done
API returns 202 fast and a worker processes it later. Teaches that "202 ✓" does not mean "it worked" — the worker may silently fail.
- Uses: Frontend, API, SQS, Worker, Database
- Success: 202 fast, worker processes within 2s
- Slow: worker backlog, processing delayed
- Error: worker crashes silently, client thinks it worked
Upload S3
Done
Partial fan-out: S3 stores + PostgreSQL records metadata. Error scenario shows orphaned data (S3 OK + DB fail) and suggests rollback/cleanup.
- Uses: Frontend, API, S3, PostgreSQL
- Success: file stored + metadata saved
- Slow: slow S3 upload, metadata delayed
- Error: S3 OK but DB fails → orphan file needs cleanup
API Gateway + Microservices
Done
Mobile → API Gateway → 3 microservices (Users, Orders, Products) → 3 databases. Teaches synchronous fan-out, cascading partial failures, per-service timeouts.
- Uses: API Gateway, Mobile, Backend ×3, Database ×3
- Success: all respond in parallel ~200ms
- Slow: Orders service with slow query
- Error: Products service DOWN + cascade timeout
RAG Pipeline
Done
Frontend → Backend → Vector DB (retrieve) + LLM API (generate). Teaches the Retrieval-Augmented Generation pattern and where each step can fail.
- Uses: Backend, Vector DB, LLM API
- Success: retrieval + completion in ~2s
- Slow: LLM API at rate limit (5-10s)
- Error: LLM 429 rate limit / empty Vector DB
Real-time Chat
Done
Mobile → WebSocket → Redis Pub/Sub → MongoDB. Teaches bidirectional communication, pub/sub pattern, async message persistence.
- Uses: Mobile, WebSocket, Redis, MongoDB
- Success: message delivered in < 100ms
- Slow: slow MongoDB write
- Error: Redis down → lost messages
Load Balanced API
Done
Frontend → Load Balancer → Backend ×3 → PostgreSQL. Teaches horizontal scaling, round-robin, and what happens when an instance dies.
- Uses: Load Balancer, Backend ×3, Database
- Success: request routed to healthy instance
- Slow: one instance with high latency
- Error: one instance dead → LB reroutes
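Round-robin selection that skips dead instances can be sketched like this. The `BackendInstance` shape and the `RoundRobinBalancer` class are hypothetical; a real load balancer would derive `healthy` from health checks.

```typescript
// Hypothetical backend descriptor; health would normally come from health checks.
interface BackendInstance {
  id: string;
  healthy: boolean;
}

class RoundRobinBalancer {
  private cursor = 0;
  constructor(private readonly instances: BackendInstance[]) {}

  // Returns the next healthy instance, or undefined when all are down.
  pick(): BackendInstance | undefined {
    for (let attempt = 0; attempt < this.instances.length; attempt++) {
      const candidate = this.instances[this.cursor];
      this.cursor = (this.cursor + 1) % this.instances.length;
      if (candidate && candidate.healthy) return candidate;
    }
    return undefined;
  }
}

const backendPool: BackendInstance[] = [
  { id: "backend-1", healthy: true },
  { id: "backend-2", healthy: true },
  { id: "backend-3", healthy: true },
];
const balancer = new RoundRobinBalancer(backendPool);
```

With all three instances healthy, successive `pick()` calls cycle through them in order. Marking one as unhealthy makes the balancer skip it and move on to the next instance.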
Search Sync (Elasticsearch)
Done
Backend writes to PostgreSQL + Elasticsearch (dual-write). Teaches eventual consistency between write model and search index.
- Uses: Backend, PostgreSQL, Elasticsearch
- Success: parallel writes, both consistent
- Slow: Elasticsearch with index lag
- Error: Elasticsearch rejects → divergent state
GraphQL Federation
Done
Frontend → GraphQL Gateway → [Users, Products, Orders] services → Database. Teaches schema federation and resolving queries across multiple services.
- Uses: GraphQL, Backend ×3, Database
- Success: query resolved in ~150ms
- Slow: one slow resolver affects the whole query
- Error: service down → partial data or full failure
gRPC Inter-service
Done
External REST API → gRPC Service A → gRPC Service B → Database. Teaches typed internal communication vs external REST.
- Uses: Backend, gRPC ×2, Database
- Success: fast chain via gRPC
- Slow: deadline propagation
- Error: gRPC status codes (UNAVAILABLE, DEADLINE_EXCEEDED)
AI Agent with Tools
Done
Frontend → Backend → LLM API (agentic loop) with tools: Vector DB search + Web search + external API. Teaches the modern agent tool-use pattern.
- Uses: Backend, LLM API, Vector DB, External API
- Success: 2-3 rounds of tool calls, final answer
- Slow: many iterations or slow tool
- Error: LLM falls into infinite loop or tool returns garbage
Saga Pattern
Done
Order → Kafka → [Payment, Inventory, Shipping] with compensation on failure. Teaches distributed transactions via choreography saga — the classic fix for the Kafka fan-out "paid without stock" problem.
- Uses: Mobile, Order API, Kafka, Payment, Inventory, Shipping
- Success: all 3 services confirm in parallel
- Slow: Payment gateway delayed, saga waits
- Error: Inventory OUT_OF_STOCK → Payment.refund compensation
CQRS + Event Sourcing
Done
Command API → Event Store → Projector → Read DB. Teaches write/read separation, append-only event log, and the classic read-lag gotcha.
- Uses: Command API, Event Store, Projector, Read DB
- Success: write + projection in ~850ms
- Slow: projector lagging → stale reads
- Error: projection fails → replay events to rebuild
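The "replay events to rebuild" recovery can be illustrated with a small sketch. The account-balance domain, `AccountEvent`, and `rebuildProjection` are assumptions chosen for the example, not the template's actual data.

```typescript
// Hypothetical event and read-model shapes for an account balance.
type AccountEvent =
  | { kind: "deposited"; accountId: string; amount: number }
  | { kind: "withdrawn"; accountId: string; amount: number };

type BalanceView = Map<string, number>;

// Apply a single event to the read model.
function applyEvent(view: BalanceView, ev: AccountEvent): void {
  const current = view.get(ev.accountId) ?? 0;
  const delta = ev.kind === "deposited" ? ev.amount : -ev.amount;
  view.set(ev.accountId, current + delta);
}

// Rebuild the read model from scratch by replaying the append-only log.
function rebuildProjection(eventLog: readonly AccountEvent[]): BalanceView {
  const view: BalanceView = new Map();
  for (const ev of eventLog) applyEvent(view, ev);
  return view;
}

const accountLog: AccountEvent[] = [
  { kind: "deposited", accountId: "acc-1", amount: 100 },
  { kind: "withdrawn", accountId: "acc-1", amount: 30 },
  { kind: "deposited", accountId: "acc-2", amount: 5 },
];
const rebuiltView = rebuildProjection(accountLog);
```

Because the event log is the source of truth, a broken read model can be discarded and recomputed at any time. The cost is that reads stay stale until the projector catches up.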
OAuth 2.0 Login
Done
Browser → Google Auth → Callback → Token exchange → Session. Teaches the Authorization Code flow — the basis of "Sign in with Google/GitHub".
- Uses: Browser, Auth Provider, Backend, Session DB
- Success: code → token → session in ~1.4s
- Slow: /token endpoint rate limited
- Error: invalid_grant (code expired or reused)
Webhook Receiver
Done
Stripe → Verify signature → Idempotency check (Redis) → Process → DB. Teaches HMAC verification, idempotency keys, and handling provider retries.
- Uses: External provider, Webhook API, Redis, Database
- Success: signature valid, new event, processed
- Slow: duplicate event → idempotency no-op (actually good)
- Error: invalid signature → 401 before any processing
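The idempotency check in this flow can be sketched as below, with a `Set` as a hypothetical stand-in for Redis. Signature verification is deliberately left out of the sketch. With real Redis, the check-and-mark step is typically a single atomic `SET key value NX EX seconds` command.

```typescript
// Hypothetical stand-in for Redis: IDs of events that were already handled.
const seenEventIds = new Set<string>();
let processedCount = 0;

type WebhookOutcome = "processed" | "duplicate";

function handleWebhookEvent(eventId: string): WebhookOutcome {
  if (seenEventIds.has(eventId)) return "duplicate"; // provider retry: no-op
  seenEventIds.add(eventId);
  processedCount += 1; // the real side effect (for example a DB write) would go here
  return "processed";
}
```

A provider retry with the same event ID returns `"duplicate"` without repeating the side effect, which is why the slow scenario counts as a good outcome.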
Background Jobs
Done
API → Redis Queue (BullMQ) → Worker → DB. Teaches async job processing, queue backpressure, retries and dead-letter queue.
- Uses: API, Redis (BullMQ), Worker, Database
- Success: 202 fast, worker completes in background
- Slow: queue backpressure — 842 jobs pending
- Error: worker crashes → job moved to DLQ after retries
CDC Pipeline (Debezium)
Done
Backend → Postgres → Debezium → Kafka → [Elasticsearch, Redis]. Teaches Change Data Capture — the correct alternative to dual-write for keeping search/cache in sync.
- Uses: Backend, Postgres, Debezium, Kafka, ES, Redis
- Success: single write propagates via CDC
- Slow: replication lag — stale search/cache
- Error: Debezium disconnects → WAL retention saves the day
Observability Stack
Done
Backend → [Prometheus, Jaeger, Loki]. Teaches the 3 signals of observability — metrics, traces, logs — emitted in parallel without blocking the response.
- Uses: Backend, Prometheus, Jaeger, Loki
- Success: all 3 signals stored in parallel
- Slow: Loki batch flush lagging
- Error: Prometheus scrape DOWN, metrics lost
Secrets Management
Done
Backend → Vault (dynamic credentials) → PostgreSQL. Teaches short-lived credentials with automatic rotation via Vault leases.
- Uses: Backend, Vault, PostgreSQL
- Success: dynamic credential issued, 1h TTL
- Slow: Vault under load, 2s to issue
- Error: Vault sealed after restart — backend cannot connect
Feature Flag Rollout
Done
App → API → LaunchDarkly → Backend v1 or v2. Teaches gradual rollout with instant kill switch, separating deploy from release.
- Uses: Frontend, API, LaunchDarkly, Backend v1, Backend v2
- Success: user in 5% rollout, routed to v2
- Slow: flag evaluation adds 1.5s per request
- Error: LaunchDarkly DOWN → fallback to v1 (safe default)
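A percentage rollout with a safe default might look like the sketch below. The hash-based bucketing is a common general technique and is not claimed to be LaunchDarkly's algorithm; `rolloutBucket`, `chooseBackend` and `fetchRolloutPercent` are hypothetical names.

```typescript
// Deterministic 0-99 bucket from a user ID (simple string hash, not cryptographic).
function rolloutBucket(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash % 100;
}

type BackendVersion = "v1" | "v2";

// `fetchRolloutPercent` is a hypothetical lookup that throws when the flag service is down.
function chooseBackend(userId: string, fetchRolloutPercent: () => number): BackendVersion {
  try {
    return rolloutBucket(userId) < fetchRolloutPercent() ? "v2" : "v1";
  } catch {
    return "v1"; // safe default when the flag provider is unreachable
  }
}
```

Because the bucket depends only on the user ID, a given user gets the same version on every request. Raising the percentage only ever moves users from v1 to v2.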
RabbitMQ Work Queue
Done
API → RabbitMQ → 3 competing consumers. Different from Kafka fan-out: each message goes to ONE worker, so adding workers increases throughput roughly linearly, until the broker or a downstream dependency becomes the bottleneck.
- Uses: API, RabbitMQ, 3 Workers
- Success: 3 jobs distributed, each to one worker
- Slow: one worker lagging, others compensate
- Error: worker crashes → broker requeues the message
Pub/Sub Multi-region
Done
API → Google Pub/Sub → [US, EU, Asia] subscribers. Teaches geographic fan-out where every subscriber receives every event, isolated per region.
- Uses: API, Pub/Sub, 3 regional subscribers
- Success: event delivered to all 3 regions
- Slow: Asia region with high latency
- Error: Asia subscriber unavailable → Pub/Sub retries
SMS Notification (Twilio)
Done
API → SQS → Worker → Twilio → DB. Teaches async SMS delivery with status tracking and handling provider rate limits.
- Uses: API, SQS, Worker, Twilio, Database
- Success: SMS sent via Twilio, SID returned, status saved
- Slow: Twilio account-wide rate limit kicks in
- Error: invalid phone number → 400, status failed
Product Analytics
Done
Frontend + Backend → Mixpanel (dual emission). Teaches why you need both client-side and server-side tracking to understand user behavior end-to-end.
- Uses: Frontend, Backend, Mixpanel
- Success: click (client) + purchase (server) both tracked
- Slow: Mixpanel ingest slow → backend blocked
- Error: Mixpanel DOWN → events lost without buffer
Lambda + API Gateway
Done
Mobile → API Gateway → Lambda → DynamoDB. Teaches the canonical serverless REST stack: scale-to-zero, billing per ms, cold starts and concurrency limits.
- Uses: Mobile, API Gateway, Lambda, DynamoDB
- Success: warm Lambda, ~300ms total
- Slow: cold start adds ~2s of init time
- Error: account concurrency limit hit → 429 Throttled
S3 Event → Lambda
Done
S3 (PUT) → Lambda trigger → DynamoDB. Teaches event-driven file processing: client gets 200 immediately while Lambda processes in background.
- Uses: Client, S3, Lambda, DynamoDB
- Success: upload + async processing + metadata saved
- Slow: large file (50MB) takes 4.5s to process
- Error: malformed file → Lambda fails 3× → DLQ
Step Functions Saga
Done
API → Step Functions → Lambda chain (Payment, Inventory, Shipping). Teaches orchestrated saga (vs choreography), built-in retries and declarative compensation.
- Uses: API, Step Functions, 3 Lambdas
- Success: 3 steps execute sequentially via state machine
- Slow: Payment timeout → built-in retry succeeds
- Error: Inventory fails → Catch state triggers Payment.refund
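The orchestrated flow above (sequential steps, compensation on failure) can be sketched in plain TypeScript. This is a simplified, synchronous illustration and not Step Functions itself; `SagaStep` and `runSaga` are hypothetical names.

```typescript
// Hypothetical saga step: an action plus the compensation that undoes it.
interface SagaStep {
  label: string;
  run(): void;        // throws on failure
  compensate(): void; // undoes a previously successful run()
}

function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`${step.label}: ok`);
      completed.push(step);
    } catch {
      log.push(`${step.label}: failed`);
      // Undo already-completed steps in reverse order.
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`${done.label}: compensated`);
      }
      return false;
    }
  }
  return true;
}

const sagaLog: string[] = [];
const sagaSucceeded = runSaga(
  [
    { label: "payment", run: () => {}, compensate: () => {} },
    {
      label: "inventory",
      run: () => { throw new Error("OUT_OF_STOCK"); },
      compensate: () => {},
    },
    { label: "shipping", run: () => {}, compensate: () => {} },
  ],
  sagaLog,
);
```

Here the inventory step fails, so the payment step is compensated (the refund) and shipping never runs.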
EventBridge Routing
Done
App → EventBridge → [Lambda, SQS, SNS] via routing rules. Teaches declarative event bus where producers don't know consumers, with per-target failure isolation.
- Uses: App, EventBridge, Lambda, SQS, SNS
- Success: 1 event routed to 3 targets in parallel
- Slow: Lambda target slow, others isolated
- Error: Lambda fails → DLQ; SQS and SNS still succeed
DynamoDB Single-Table
Done
Mobile → Lambda → DynamoDB single-table. Teaches PK + SK design pattern: query user and orders in one round-trip without JOINs.
- Uses: Mobile, Lambda, DynamoDB
- Success: Query PK=USER#42 SK begins_with ORDER# → 12 items in 60ms
- Slow: hot partition (one PK gets 10× traffic)
- Error: ProvisionedThroughputExceededException → 500
Rate Limiting / DDoS
Done
Client → API → Redis (token bucket) → DB. Teaches per-IP/user rate limiting with Redis, protecting the database from burst and DDoS traffic.
- Uses: Client, API, Redis, PostgreSQL
- Success: 5/100 requests, under limit
- Slow: near limit burst (95/100)
- Error: 429 Too Many Requests — DB untouched
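A minimal in-process sketch of the token bucket algorithm, with an injected clock so the behavior is deterministic. In the template the bucket state lives in Redis; here a `Map` stands in for it, and the capacity (100) and refill rate (10 per second) are illustrative assumptions.

```typescript
// Token bucket with an injected clock so it can be exercised deterministically.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
    nowMs: number,
  ) {
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is allowed, false if it should get a 429.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per client key (IP or user ID).
const clientBuckets = new Map<string, TokenBucket>();

function allowRequest(clientKey: string, nowMs: number): boolean {
  let bucket = clientBuckets.get(clientKey);
  if (!bucket) {
    bucket = new TokenBucket(100, 10, nowMs); // illustrative limits
    clientBuckets.set(clientKey, bucket);
  }
  return bucket.tryConsume(nowMs);
}
```

Rejected requests return before any database call, which is how the limiter keeps burst traffic away from the database.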
JWT Auth Bypass (IDOR)
Done
Client → API → Auth → Resource. Teaches the difference between Authentication (who you are) and Authorization (what you can do). Valid JWT ≠ allowed action.
- Uses: Client, API, Auth, PostgreSQL
- Success: own resource, authorized
- Slow: JWKS key rotation adds 1.8s
- Error: valid JWT but wrong user → 403 Forbidden
mTLS Service-to-Service
Done
Gateway → Service A → Service B (mTLS) → DB. Teaches mutual TLS: both sides verify certificates, preventing impersonation in internal communication.
- Uses: API Gateway, Service A, Service B, PostgreSQL
- Success: mutual verification, encrypted channel
- Slow: CRL/OCSP check slow during cert rotation
- Error: expired certificate → TLS handshake failed
WAF + OWASP Top 10
Done
Attacker → WAF → API → DB. Teaches perimeter defense: WAF detects SQL injection, XSS and other OWASP attacks, blocking before reaching the backend.
- Uses: Attacker, WAF, API, PostgreSQL
- Success: legitimate request passes WAF rules
- Slow: complex payload → 1.5s rule evaluation
- Error: SQL injection detected → 403 Blocked
Kubernetes Pod + Service
Done
Client → Ingress → K8s Service → Pod ×3. Teaches service discovery, iptables routing, readiness probes and PodDisruptionBudget.
- Uses: Ingress, K8s Service, 3 Pods
- Success: request routed to healthy pod
- Slow: pod in CrashLoopBackOff still receiving traffic
- Error: all pods down → 503 Service Unavailable
Service Mesh (Istio)
Done
Service A → Envoy sidecar → Envoy sidecar → Service B. Teaches transparent mTLS, tracing header injection and circuit breaking via sidecar proxy.
- Uses: Service A, Envoy ×2, Service B, DB
- Success: transparent mTLS + tracing
- Slow: sidecar overhead during config reload
- Error: Envoy circuit breaker opens after 5 DB timeouts
Read Replicas
Done
App → Primary (write) + Replica (read). Teaches replication lag, read-your-writes inconsistency and replica failover.
- Uses: App, Primary DB, Replica DB
- Success: write + read consistent (low lag)
- Slow: replication lag → stale read right after a write (read-your-writes violated)
- Error: replica down → failover to primary
Zero-Downtime DB Migration
Done
App → Old Table + New Table (shadow write). Teaches the dual-write migration pattern used by Stripe and GitHub for zero-downtime schema changes.
- Uses: App, Old Table, New Table
- Success: dual-write active, both tables consistent
- Slow: new table missing index, shadow write takes 2.5s
- Error: schema mismatch → shadow write constraint violation
Database Sharding
Done
API → Shard Router → [Shard 1, 2, 3]. Teaches hash partitioning, cross-shard scatter/gather and partial failure when one shard is down.
- Uses: API, Shard Router, 3 Shards
- Success: routed to correct shard in 80ms
- Slow: cross-shard query (scatter/gather)
- Error: one shard down → 33% of users affected
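Hash partitioning and scatter/gather with partial failure can be sketched as follows. The hash function is a simple illustrative one, and `shardFor` and `scatterGather` are hypothetical helpers rather than the template's implementation.

```typescript
// Simple deterministic string hash (illustrative; not a production-grade hash).
function hashKey(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash;
}

// Single-key lookups go to exactly one shard.
function shardFor(userId: string, shardCount: number): number {
  return hashKey(userId) % shardCount;
}

// Scatter/gather: ask every shard and tolerate shards that are down.
function scatterGather<T>(
  shardCount: number,
  queryShard: (shardIndex: number) => T[],
): { rows: T[]; failedShards: number[] } {
  const rows: T[] = [];
  const failedShards: number[] = [];
  for (let shard = 0; shard < shardCount; shard++) {
    try {
      rows.push(...queryShard(shard));
    } catch {
      failedShards.push(shard);
    }
  }
  return { rows, failedShards };
}
```

One caveat of plain modulo hashing: changing `shardCount` remaps most keys, which is why resharding usually needs a different scheme such as consistent hashing.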
Circuit Breaker
Done
Client → Backend → External API. Teaches the three states (Closed → Open → Half-Open) and how fail-fast protects both your service and the downstream.
- Uses: Client, Backend, External API
- Success: CLOSED, requests pass normally
- Slow: HALF-OPEN, testing with one probe
- Error: OPEN, fail fast — external not called
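The three states can be sketched as a small state machine with an injected clock. The `CircuitBreaker` class, the failure threshold and the cooldown below are illustrative assumptions, not the template's actual parameters.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAtMs = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly cooldownMs: number,
  ) {}

  currentState(nowMs: number): BreakerState {
    // After the cooldown, move to HALF_OPEN so a probe can go through.
    if (this.state === "OPEN" && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  call<T>(nowMs: number, action: () => T): T {
    if (this.currentState(nowMs) === "OPEN") {
      throw new Error("circuit open: failing fast"); // downstream is not called
    }
    try {
      const value = action();
      this.state = "CLOSED"; // success (including a successful probe) closes the circuit
      this.failures = 0;
      return value;
    } catch (err) {
      this.failures += 1;
      // A failed probe, or too many consecutive failures, opens the circuit.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAtMs = nowMs;
      }
      throw err;
    }
  }
}
```

For example, with `new CircuitBreaker(2, 1000)`, two consecutive failures open the circuit. Calls during the next second fail fast without invoking the action, and the first call after the cooldown acts as the probe.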
Bulkhead Isolation
Done
Frontend → API → [Orders pool, Search pool]. Teaches thread pool isolation: one feature saturated does NOT bring down the others.
- Uses: Frontend, API, Orders Pool, Search Pool
- Success: both pools healthy, independent
- Slow: Search pool saturated (50/50), Orders fine
- Error: Search circuit breaks → orders unaffected