๐Ÿ—œ๏ธ Squeezed Signals

The Evolution of Observability Data Storage

A deep dive into compressing metrics, logs, and traces

Press Space to navigate โ†’

๐Ÿ“‹ Agenda (45 minutes)

  1. Introduction - The observability data explosion (5 min)
  2. Metrics - Time-series compression (12 min)
  3. Logs - Structured text compression (12 min)
  4. Traces - Distributed execution compression (10 min)
  5. Reality Check - Trade-offs & practical considerations (6 min)

Part 1: Introduction

The Observability Data Problem

🔭 What is Observability?

Observability = Understanding what's happening inside your systems

📊 Metrics

Numerical measurements

  • CPU usage: 72.3%
  • Requests/sec: 1,247
  • Memory: 4.2 GB

📝 Logs

Text event records

  • "User logged in"
  • "Error: Connection timeout"
  • "Request processed in 42ms"

🔍 Traces

Distributed request flows across microservices

api-gateway → user-service → auth-service → database

🎯 Project Goals

Primary Goal

Demonstrate progressive compression techniques from naive approaches to production-grade algorithms

✅ What We'll Cover

  • Lossless compression
  • Domain-specific techniques
  • Real-world algorithms
  • Measurable improvements

⚠️ Important Note

  • Focus: Storage size
  • Not covered deeply: Query speed, Ingest CPU
  • These matter in production!

📊 Overview

Signal Type  Starting Size       Key Challenge
📊 Metrics   80.8 MB NDJSON      Temporal patterns in numbers
📝 Logs      72.7 MB Plain Text  Repeated structure with variables
🔍 Traces    71.1 MB NDJSON      Parent-child relationships

Each uses different strategies tailored to data characteristics

Part 2: Metrics

📊 Time-Series Compression Journey

What Are Metrics?

Time-series numerical data

Regular intervals, labeled dimensions

{
  "metric_name": "system.cpu.user",
  "labels": {
    "host": "server-01",
    "datacenter": "us-west-1"
  },
  "timestamp": 1698000000,
  "value": 72.34567
}

Key characteristics:

  • Timestamps are regular (e.g., every 60 seconds)
  • Values change slowly (temporal locality)
  • Labels repeat across many series

📈 Metrics: Progressive Journey

We'll explore 6 compression phases

Each phase builds on the previous, demonstrating how domain knowledge compounds with general techniques

Starting point: 80.8 MB NDJSON

Each phase will reveal new techniques and improvements

Phase 1: NDJSON Baseline (80.8 MB, 1.0x)

Newline-Delimited JSON

{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000000,"value":72.34567}
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000060,"value":72.34589}
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000120,"value":72.34612}

Observations:

  • Text-based, human-readable format
  • Repeated keys: "metric_name", "labels", "timestamp", "value"
  • Same metric name and labels repeated thousands of times
  • Timestamps are sequential (60 second intervals)
  • Values change slowly (temporal locality)

Phase 2: CBOR Binary Encoding (63.9 MB, 1.26x, Δ1.26x)

Switch to binary format

# JSON (text):
{"timestamp": 1698000000}
# 25 bytes

# CBOR (binary):
A1                        # map(1 item)
  69 74696D657374616D70   # "timestamp" (1+9 bytes)
  1A 65356C80             # unsigned(1698000000) (1+4 bytes)
# 16 bytes (36% smaller!)

# Benefits:
# - Integers stored as binary, not strings
# - Type information embedded efficiently
# - No whitespace or quotes needed
# - Still preserves all structure

✅ 1.26x compression through binary encoding
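The byte layout above can be checked in a few lines of Python. This is a hand-rolled sketch for this single record, not a general CBOR encoder; a real pipeline would likely use a library such as cbor2.

```python
import json
import struct

record = {"timestamp": 1698000000}

# JSON: plain text, digits stored as characters
json_bytes = json.dumps(record).encode("utf-8")

# CBOR: hand-encoded per RFC 8949 for this one record
cbor = bytearray()
cbor.append(0xA1)                      # map(1 item)
cbor.append(0x60 | len("timestamp"))   # 0x69: text string of length 9
cbor += b"timestamp"
cbor.append(0x1A)                      # a 4-byte unsigned integer follows
cbor += struct.pack(">I", 1698000000)  # big-endian uint32

print(len(json_bytes), len(cbor))  # 25 vs 16 bytes
```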

Phase 3: CBOR + zstd Compression (3.8 MB, 21.3x, Δ16.8x)

Add general-purpose compression

# zstd dictionary learning discovers patterns:

Pattern 1: "system.cpu.user" (appears 10k times)
  → Dictionary entry #1 (2 bytes to reference)

Pattern 2: "server-01" (appears 10k times)
  → Dictionary entry #2

Pattern 3: Sequential timestamps
  → Run-length encoding finds patterns

Pattern 4: Label structure repetition
  → Reference previous occurrences

# Result: 63.9 MB → 3.8 MB (16.8x additional compression!)
# Combined: 80.8 MB → 3.8 MB (21.3x total)

✅ Biggest single jump: 16.8x improvement from zstd!
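The effect is easy to reproduce. zstd is not in the Python standard library, so this sketch substitutes stdlib zlib (DEFLATE) as a stand-in; the mechanism is the same, and zstd at high levels compresses further.

```python
import json
import zlib

# Synthetic NDJSON resembling the metric stream above
lines = [
    json.dumps({
        "metric_name": "system.cpu.user",
        "labels": {"host": "server-01", "datacenter": "us-west-1"},
        "timestamp": 1698000000 + 60 * i,
        "value": round(72.34 + (i % 7) * 0.0001, 5),
    })
    for i in range(10_000)
]
ndjson = "\n".join(lines).encode("utf-8")

compressed = zlib.compress(ndjson, level=9)
print(f"{len(ndjson) / len(compressed):.1f}x")  # repeated keys and labels vanish
```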

Phase 4: Binary Table with String Deduplication (2.8 MB, 28.9x, Δ1.36x)

Key Insight: String Deduplication

# Before: Each row stores full strings
{name: "system.cpu.user", labels: {host: "server-01", dc: "us-west-1"}}
{name: "system.cpu.user", labels: {host: "server-01", dc: "us-west-1"}}
# "system.cpu.user" repeated 1000 times = 17,000 bytes

# After: Store strings once, use IDs
strings: ["system.cpu.user", "server-01", "us-west-1"]
rows: [{name_id: 0, label_ids: [1, 2]}, {name_id: 0, label_ids: [1, 2]}]
# 17 bytes + (1000 × 3 bytes) = 3,017 bytes

# Savings: 5.6x just from deduplication!

✅ 28.9x total - deduplication enables better zstd compression
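A minimal sketch of the interning step (illustrative names, not the repo's actual code):

```python
def deduplicate(rows):
    """Replace repeated strings with indices into a shared string table."""
    strings, index = [], {}

    def intern(s):
        # First sighting appends to the table; later sightings reuse the id
        if s not in index:
            index[s] = len(strings)
            strings.append(s)
        return index[s]

    encoded = [
        {"name_id": intern(r["name"]),
         "label_ids": [intern(v) for v in r["labels"].values()]}
        for r in rows
    ]
    return strings, encoded

rows = [{"name": "system.cpu.user",
         "labels": {"host": "server-01", "dc": "us-west-1"}}] * 1000
strings, encoded = deduplicate(rows)
print(len(strings), encoded[0])  # 3 unique strings cover all 1000 rows
```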

Phase 5: Columnar Storage (2.0 MB, 40.4x, Δ1.40x)

Key Insight: Group Similar Data

Row-oriented (Before)

row1: {ts: 1000, val: 72.3}
row2: {ts: 1060, val: 72.4}
row3: {ts: 1120, val: 72.5}

Mixed types, poor compression

Column-oriented (After)

timestamps: [1000, 1060, 1120]
values:     [72.3, 72.4, 72.5]

Similar values together = better compression

Why it works:

  • All timestamps together → delta encoding applies better
  • All values together → XOR/floating-point compression more effective
  • All string IDs together → run-length encoding finds patterns

✅ 40.4x compression - columnar layout unlocks specialized algorithms
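The transpose from rows to columns is one comprehension per field; a sketch:

```python
rows = [
    {"ts": 1000, "val": 72.3},
    {"ts": 1060, "val": 72.4},
    {"ts": 1120, "val": 72.5},
]

# Row-oriented -> column-oriented: one homogeneous array per field
columns = {
    "timestamps": [r["ts"] for r in rows],
    "values": [r["val"] for r in rows],
}

# The timestamp column is now directly amenable to delta encoding
deltas = [b - a for a, b in zip(columns["timestamps"], columns["timestamps"][1:])]
print(columns["timestamps"], deltas)  # [1000, 1060, 1120] [60, 60]
```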

Phase 6: Pattern-Aware Algorithms (1.0 MB, 79.7x, Δ1.97x)

The Breakthrough: Understand the Data!

Detect patterns and apply specialized compression:

  • Constant values: Store once + count (∞ compression!)
  • Near-constant: Base value + tiny deltas
  • Power-of-2: Store exponents instead of values
  • Mostly integers: Split integer/fractional parts
  • Periodic patterns: Template + deviations
  • Sparse data: Store only non-zero indices + values

Result: 3.19 bytes per data point (vs 8 bytes original) - 79.7x compression!

Deep Dive: Near-Constant Encoding

Exploit Values That Rarely Change

Many metrics are nearly constant with tiny deviations

# CPU usage values (varies slightly around 72%):
values = [72.34567, 72.34589, 72.34612, 72.34599, 72.34601, ...]
# Raw: 1000 values ร— 8 bytes = 8,000 bytes

# Near-constant encoding:
base_value = 72.34567
deltas = [0.00000, 0.00022, 0.00045, 0.00032, 0.00034, ...]

# Deltas are tiny! Scale and store as small integers:
scaled_deltas = [0, 22, 45, 32, 34, ...]  # × 0.00001
# Most deltas fit in 1-2 bytes instead of 8!

# Result:
# - Base value: 8 bytes
# - 1000 deltas × ~1.5 bytes = 1,500 bytes
# - Total: 1,508 bytes vs 8,000 bytes original
# - Compression: 5.3x!

# Even better with zstd: deltas have patterns too!
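The scale-and-round step can be sketched as follows (with a fixed 1e-5 scale for illustration; a real implementation would pick the scale from the data, and rounding makes this variant lossy up to half a step):

```python
def encode_near_constant(values, scale=1e-5):
    """Store one base value plus tiny scaled integer deltas."""
    base = values[0]
    deltas = [round((v - base) / scale) for v in values]
    return base, deltas

def decode_near_constant(base, deltas, scale=1e-5):
    return [base + d * scale for d in deltas]

values = [72.34567, 72.34589, 72.34612, 72.34599, 72.34601]
base, deltas = encode_near_constant(values)
print(deltas)  # [0, 22, 45, 32, 34] - each fits in a single byte
```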

Deep Dive: Delta Encoding

Store Differences, Not Absolute Values

# Regular timestamps:
timestamps = [1000, 1060, 1120, 1180, 1240]
# Each: 8 bytes = 40 bytes total

# Delta encoding:
first = 1000
deltas = [60, 60, 60, 60]
# first (8 bytes) + deltas (4×2 bytes) = 16 bytes

# Double-delta (deltas of deltas):
deltas = [60, 60, 60, 60]
delta_deltas = [0, 0, 0]  # All zeros!
# Run-length encode: "60 repeated 4 times"
# Result: 12 bytes (3x compression!)
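The double-delta step as a sketch:

```python
def double_delta(timestamps):
    """Encode regular timestamps as first value + first delta + deltas-of-deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    delta_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], delta_deltas

first, first_delta, dods = double_delta([1000, 1060, 1120, 1180, 1240])
print(first, first_delta, dods)  # 1000 60 [0, 0, 0] -> run-length encodes to almost nothing
```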

Metrics: Key Takeaways

✅ What Worked

  • Binary encoding - Simple, 1.3x gain
  • Generic compression - zstd gives 21x without domain knowledge
  • Columnar storage - Group similar data types
  • Pattern detection - Specialized algorithms per pattern type

🎯 The Progression

1x (JSON) → 21x (+ zstd) → 40x (+ columnar) → 80x (+ patterns)

Each technique builds on previous work!

Part 3: Logs

๐Ÿ“ Structured Text Compression

What Are Logs?

Semi-structured text event records

[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP LDAP SDK
[Thu Jun 09 06:07:05 2005] [error] env.createBean2(): Factory error creating vm
[Thu Jun 09 06:07:19 2005] [notice] jk2_init() Found child 2330 in scoreboard slot 0
[Thu Jun 09 06:07:20 2005] [error] [client 10.0.0.153] File does not exist: /var/www/html
[Thu Jun 09 06:07:21 2005] [error] [client 10.0.0.153] Directory index forbidden

Key characteristics:

  • Repeated templates with variable values
  • Variable types: timestamps, IPs, IDs, numbers, paths
  • Massive redundancy in structure

📈 Logs: Progressive Journey

We'll explore 6 compression phases

From plain text through advanced log-specific techniques

Dataset: OpenSSH server logs (655,147 lines)

Starting point: 72.7 MB plain text

Each phase will demonstrate new optimizations

Phase 1: Plain Text Baseline (72.7 MB, 1.0x)

Raw OpenSSH log files

Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user webmaster from 173.234.31.186
Dec 10 06:55:46 LabSZ sshd[24200]: input_userauth_request: invalid user webmaster [preauth]
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): check pass; user unknown
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=173.234.31.186
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for invalid user webmaster from 173.234.31.186 port 38926 ssh2
Dec 10 06:55:48 LabSZ sshd[24200]: Connection closed by 173.234.31.186 [preauth]
Dec 10 07:02:47 LabSZ sshd[24203]: Connection closed by 212.47.254.145 [preauth]

Observations:

  • 655,147 lines, each ~110 bytes average
  • Repeated structure: "timestamp hostname daemon[pid]: message"
  • Same message patterns with different variables
  • "Failed password for root" appears 139,818 times!

Phase 2: Plain Text + zstd (3.0 MB, 24.5x, Δ24.5x)

General compression finds patterns automatically

# zstd dictionary learning discovers repeated patterns:

Pattern 1: "Dec 10 " (appears 655k times)
  → Store once in dictionary, reference with 2 bytes

Pattern 2: " LabSZ sshd[" (appears 655k times)
  → Dictionary entry

Pattern 3: "Failed password for root from " (appears 140k times)
  → Dictionary entry

Pattern 4: "authentication failure" (appears 153k times)
  → Dictionary entry

# Compression result:
Original: 72,746,715 bytes
Compressed: 2,968,978 bytes
Ratio: 24.5x

✅ 24.5x with zero domain knowledge - impressive!

Phase 3: Template Extraction (2.0 MB, 36.3x, Δ1.48x)

The CLP Breakthrough: Separate Structure from Data!

# Original log lines (3 examples):
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for root from 173.234.31.186 port 38926 ssh2
Dec 10 07:14:32 LabSZ sshd[24205]: Failed password for root from 218.65.30.43 port 54913 ssh2
Dec 10 08:03:17 LabSZ sshd[24220]: Failed password for root from 61.174.51.214 port 58389 ssh2

# CLP separates into template + variables:
Template #16: " LabSZ sshd[]: Failed password for root from  port  ssh2"

Variables (columnar storage):
  TIMESTAMP: ["Dec 10 06:55:48", "Dec 10 07:14:32", ...]
  NUM (pid): [24200, 24205, 24220, ...]
  IP: ["173.234.31.186", "218.65.30.43", "61.174.51.214", ...]
  NUM (port): [38926, 54913, 58389, ...]

Line-to-template mapping: [16, 16, 16, 16, 16, ...]  # Just template IDs!

# This template used 139,818 times!
Template storage: 81 bytes × 1 = 81 bytes
Variables: 139,818 × (time + pid + IP + port) ≈ much smaller with type-aware compression
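A toy version of the template-extraction step, using two regexes as stand-ins for CLP's full variable grammar (real CLP also recognizes timestamps, hex IDs, paths, and more):

```python
import re

# Variable patterns, most specific first so IPs aren't eaten by <NUM>
PATTERNS = [
    ("IP", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")),
    ("NUM", re.compile(r"\b\d+\b")),
]

def extract_template(line):
    """Replace variable tokens with placeholders and collect their values."""
    variables = []
    template = line
    for name, pattern in PATTERNS:
        variables += [(name, m) for m in pattern.findall(template)]
        template = pattern.sub(f"<{name}>", template)
    return template, variables

line = "Failed password for root from 173.234.31.186 port 38926 ssh2"
template, variables = extract_template(line)
print(template)
# Failed password for root from <IP> port <NUM> ssh2
```

Lines sharing a template then store only the template ID plus their variable values.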

Phase 3: Template Statistics

Only 5,669 unique templates for 655,147 lines!

Template                                                 Count    %      Reuse
Failed password for root from <IP> port <NUM> ssh2       139,818  21.3%  139,818x
authentication failure; ... rhost=<IP> user=root         139,572  21.3%  139,572x
Connection closed by <IP> [preauth]                      68,958   10.5%  68,958x
Received disconnect from <IP>: <NUM>: Bye Bye [preauth]  46,593   7.1%   46,593x
PAM service(sshd) ignoring max retries; <NUM> > <NUM>    37,963   5.8%   37,963x
...and 5,664 more templates

Average template reuse: 655,147 lines ÷ 5,669 templates ≈ 116x per template!

Phase 4: Type-Aware Variable Encoding (1.9 MB, 38.7x, Δ1.07x)

Compress each variable type optimally

# TIMESTAMP variables (655,147 occurrences):
Raw: ["Dec 10 06:55:46", "Dec 10 06:55:46", "Dec 10 06:55:48", ...]
Delta encoded:
  Base: "Dec 10 06:55:46"
  Deltas: [0, 0, 2, 1, 58, 0, ...]  # Seconds difference
  Compression: 12,447,793 bytes → efficient storage (using numpy arrays)

# IP variables (549,454 occurrences):
Raw: ["173.234.31.186", "173.234.31.186", "212.47.254.145", ...]
Integer encoding + Dictionary:
  Convert to uint32: [2917801914, 2917801914, 3559915153, ...]
  Dictionary + indices for repeated IPs
  Compression: ~6.8 MB → 2.2 MB (3.1x)

# NUM (port/pid) variables (1,458,375 occurrences):
Raw: [24200, 38926, 24203, 54913, 24220, 58389, ...]
Numpy array (efficient integer storage):
  Compression: Uses compact numpy dtype
  Size: 11.7 MB (efficient integer packing)

# PATH variables (11 occurrences):
Raw: ["/var/log/secure", "/etc/ssh/sshd_config", ...]
Dictionary with prefix compression: 232 bytes
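The IP step can be sketched with the stdlib ipaddress module (illustrative function names, not the repo's exact code):

```python
import ipaddress
import struct

def encode_ips(ips):
    """Dictionary-encode IPs: unique addresses as packed uint32 + per-row indices."""
    unique = sorted(set(ips))
    packed = struct.pack(
        f">{len(unique)}I",
        *(int(ipaddress.IPv4Address(ip)) for ip in unique),  # 4 bytes per address
    )
    ids = {ip: i for i, ip in enumerate(unique)}
    return packed, unique, [ids[ip] for ip in ips]

ips = ["173.234.31.186", "173.234.31.186", "212.47.254.145"]
packed, unique, indices = encode_ips(ips)
print(len(packed), indices)  # 8 bytes of addresses, rows as small indices [0, 0, 1]
```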

Phase 5: Smart Row Ordering (1.9 MB, 38.7x, Δ1.00x)

# Strategy: Group by template, sort within groups by time
# Before (chronological, interleaved):
Line 1000: Template 16 "Failed password for root"
Line 1001: Template 6  "Connection closed"
Line 1002: Template 16 "Failed password for root"
Line 1003: Template 21 "authentication failure"
Line 1004: Template 16 "Failed password for root"
# Poor compression - similar patterns scattered

# After (grouped by template):
Template 16: "Failed password for root"  × 139,818 lines
  Dec 10 06:55:48 LabSZ sshd[24200]: ... from 173.234.31.186 port 38926 ssh2
  Dec 10 07:14:32 LabSZ sshd[24205]: ... from 218.65.30.43 port 54913 ssh2
  Dec 10 08:03:17 LabSZ sshd[24220]: ... from 61.174.51.214 port 58389 ssh2
  ... (all similar logs grouped)

Template 21: "authentication failure" × 139,572 lines
Template 6: "Connection closed" × 68,958 lines

# Benefits:
# - Similar variables adjacent → better delta encoding
# - Template IDs grouped → better RLE compression
# - zstd finds longer matching patterns

Phase 5: Why No Improvement?

โš ๏ธ Preserving original order is expensive!

How Order Mapping Works:

  • Raw data: 655,147 uint32 indices (2.6 MB)
  • Delta encoding: Small deltas between reordered positions
  • Varint encoding: Small numbers use fewer bytes
  • Zstd L22: Compresses patterns in deltas
  • Result: 2.6 MB โ†’ 130 KB (20x compression)

Final outcome: 1.895 MB โ†’ 1.897 MB (0.999x) - essentially no change because the ordering overhead (130 KB) is offset by better template grouping compression.

Solution: Phase 6 drops order preservation to get the full benefit!
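The delta + varint steps above can be sketched as follows (assuming positions that only increase; real reordering also produces negative deltas, which would need zigzag encoding first):

```python
def encode_varints(nums):
    """LEB128-style varints: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    for n in nums:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

# Original line positions advance in small steps once rows are grouped
positions = [0, 1, 2, 5, 6, 7, 300, 301]
deltas = [b - a for a, b in zip([0] + positions, positions)]
encoded = encode_varints(deltas)
print(len(encoded), "vs", 4 * len(positions))  # 9 vs 32 bytes as raw uint32
```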

Phase 6: Drop Order Preservation (1.77 MB, 41.2x, Δ1.07x)

Remove line ordering for maximum compression

# Before (Phase 5): Preserve line numbers
{
  "templates": [...],
  "line_to_template": [16, 16, 16, 6, 21, 16, ...],  # 655,147 entries
  "variables_per_line": {
    0: {"time": "...", "pid": ..., "ip": "...", "port": ...},
    1: {"time": "...", "pid": ..., "ip": "...", "port": ...},
    ...
  }
}

# After (Phase 6): No line numbers, just templates + variables
{
  "templates": [...],
  "template_data": {
    16: {  # Template 16: 139,818 occurrences
      "times": ["Dec 10 06:55:48", ...],         # 139,818 values
      "pids": [24200, 24205, ...],               # 139,818 values
      "ips": ["173.234.31.186", ...],            # 139,818 values
      "ports": [38926, 54913, ...]               # 139,818 values
    },
    21: {  # Template 21: 139,572 occurrences
      "times": [...],  # 139,572 values
      "pids": [...],
      "ips": [...],
    },
    ...
  }
}

# Benefits:
# - No need to store 655,147 line-to-template mappings
# - Variables perfectly grouped by template
# - Maximum compression locality
# - Additional 1.07x improvement

Logs: Compression Comparison

Phase  Format            Size     Ratio  Key Technique
1      Plain Text        72.7 MB  1.0x   None (baseline)
2      Plain + zstd      3.0 MB   24.5x  Pattern dictionary
3      Templates + zstd  2.0 MB   36.3x  CLP template extraction
4      Type-aware vars   1.9 MB   38.7x  Delta/dict per type
5      Smart ordering    1.9 MB   38.7x  Group by template
6      No ordering       1.77 MB  41.2x  Drop line numbers

Logs: Dataset Comparison

Dataset         Lines    Size     Templates  Reuse   Final Ratio
Apache (small)  56,482   5.1 MB   38         1,486x  50.7x
HDFS (big)      74,859   10.5 MB  18         4,159x  24.2x
OpenSSH (huge)  655,147  72.7 MB  5,669      116x    41.2x

Why is HDFS worse despite 4,159x template reuse?

  • High-cardinality variables: 74,859 unique block identifiers (296 KB encoded)
  • Lots of numbers: 141,631 numeric values (566 KB)
  • Many unique paths: 58,747 file paths (1.22 MB)
  • Even with timestamp delta encoding: 74,859 timestamps → 145 KB (6.7x, but still significant)
  • Total variable overhead: 3.9 MB uncompressed (despite only 18 templates!)

Key insight: Template reuse alone isn't enough - variable types, cardinality, and uniqueness dominate!

Logs: Key Takeaways

✅ CLP Algorithm is the Star

  • Template extraction: Separating structure from data enables massive compression
  • Template reuse: Key metric - OpenSSH achieves 116x average reuse (5,669 templates for 655k lines)
  • Type-aware encoding: Each variable type compressed optimally (3-6x per type)
  • Ordering optimizations: Grouping similar data improves compression locality

🎯 The Progression

1x (plain) → 24.5x (+ zstd) → 36.3x (+ templates) → 38.7x (+ type-aware) → 41.2x (+ optimizations)

Part 4: Traces

๐Ÿ” Distributed Execution Compression

What Are Traces?

Distributed request execution flows

Track a request across multiple microservices

User Request → api-gateway (50ms)
                  ↓
               user-service (30ms)
                  ↓
               auth-service (20ms)
                  ↓
               auth-db (10ms)

Each "span" contains:

  • trace_id (links all spans in one request)
  • span_id + parent_span_id (tree structure)
  • service_name, operation_name
  • start_time, end_time (nanoseconds)
  • tags, logs, status

📈 Traces: Progressive Journey

We'll explore 5 compression phases

From JSON through trace structure exploitation

Dataset: 50,000 traces, 141,531 spans, 12 services

Starting point: 71.1 MB NDJSON

Each phase will reveal new structural optimizations

Phase 0: Original JSON (71.1 MB)

OpenTelemetry-style distributed trace data

{
  "trace_id": "trace-00000001-df734723",
  "spans": [{
    "trace_id": "trace-00000001-df734723",
    "span_id": "666ffb28a8ca45d7",
    "parent_span_id": null,
    "operation_name": "authenticate",
    "service_name": "api-gateway",
    "start_time": 1761416099523944960,
    "end_time": 1761416099585944960,
    "duration": 62000000,
    "tags": {
      "service.name": "api-gateway",
      "http.method": "DELETE",
      "http.status_code": 201,
      "user.id": "user-2256"
    },
    "logs": [],
    "status_code": 0
  }]
}

โŒ Verbose: Every field spelled out, repeated service/operation names

Phase 1: NDJSON Baseline (71.1 MB, 1.0x)

One span per line (newline-delimited JSON)

{"trace_id":"trace-00000001-df734723","span_id":"666ffb28a8ca45d7",...}
{"trace_id":"trace-00000001-df734723","span_id":"8a89dc3a1a204434",...}
{"trace_id":"trace-00000001-df734723","span_id":"b2deb5edd9a74b4f",...}
{"trace_id":"trace-00000002-a9549627","span_id":"868b3925a9374271",...}

Characteristics:

  • 141,531 spans = 141,531 lines
  • Text-based, human-readable
  • Easy to process line-by-line
  • Repeated keys: "trace_id", "span_id", "service_name", etc.

✅ Baseline established: 71.1 MB

Phase 2: CBOR Binary (39.6 MB, 1.79x, Δ1.79x)

Binary encoding with type tags

# JSON (text):
{"trace_id": "trace-00000001-df734723", "span_id": "666ffb28a8ca45d7"}
# 71 bytes

# CBOR (binary):
A2                        # map(2 items)
  68 74726163655F6964     # "trace_id" (1+8 bytes)
  77 74726163652D...      # "trace-00000001-df734723" (1+23 bytes)
  67 7370616E5F6964       # "span_id" (1+7 bytes)
  70 363636666662323...   # "666ffb28a8ca45d7" (1+16 bytes)
# 59 bytes (17% smaller!)
# (full spans shrink more: numeric fields like timestamps become compact binary)

# Still verbose: All keys/values in every span
# But binary integers, efficient string encoding

✅ 1.79x compression through binary encoding

Phase 3: CBOR + zstd (5.6 MB, 12.6x, Δ7.04x)

General-purpose compression on binary data

# zstd identifies patterns:
Pattern 1: "trace_id" appears 141,531 times
  → Dictionary entry #1 (4 bytes to reference)

Pattern 2: "api-gateway" appears 50,000 times
  → Dictionary entry #2

Pattern 3: Sequential timestamps
  → Run-length encoding

Pattern 4: Repeated tag structures
  → Reference previous occurrence

# Result: 39.6 MB → 5.6 MB (7x additional compression)

✅ 12.6x total compression - biggest single jump!

Phase 4: Span Relationships (2.3 MB, 30.5x, Δ2.43x)

The Breakthrough: Exploit Trace Structure!

# Before: Every span stores full identifiers
{
  "trace_id": "trace-00000001-df734723",  # 28 bytes
  "span_id": "666ffb28a8ca45d7",          # 16 bytes
  "parent_span_id": "8a89dc3a1a204434",   # 16 bytes
  "service_name": "api-gateway",          # 11 bytes
  "operation_name": "authenticate"        # 12 bytes
}

# After: Dictionary + sequential IDs
{
  "services": ["api-gateway", "user-service", ...],  # 12 strings total
  "operations": ["authenticate", "route_request", ...],  # 12 strings
  "traces": [
    {
      "trace_id": "trace-00000001-df734723",
      "spans": [
        {"svc": 0, "op": 0, "parent": -1, "duration": 62000000},  # Root
        {"svc": 1, "op": 2, "parent": 0, "duration": 97000000},   # Child of 0
        {"svc": 2, "op": 5, "parent": 1, "duration": 31000000}    # Child of 1
      ]
    }
  ]
}

Phase 4: Relationship Compression Details

Service Deduplication

# Before:
"api-gateway" ร— 50,000 = 550 KB
"user-service" ร— 8,457 = 93 KB
Total: ~1.5 MB

# After:
12 service names = 150 bytes
50K IDs ร— 1 byte = 50 KB
Total: 50 KB (30x!)

Parent Relationships

# Before:
UUID parent_span_id ร— 141K
= 2.3 MB

# After:
Sequential index ร— 141K
= 141 KB (16x!)

Root: parent = -1
Others: parent = 0-9

Timestamp Delta Encoding

# Root span: 1761416099523944960 (store full timestamp)
# Child span 1: +10,000,000 (10ms later, store delta)
# Child span 2: +20,000,000 (20ms after child 1)
# Result: 2-3 bytes per delta vs 8 bytes absolute

✅ 30.5x compression through structure exploitation!
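Putting the three ideas together in one sketch (hypothetical field names mirroring the example spans above, not the repo's exact schema):

```python
def encode_trace(spans, services, operations):
    """Service/operation dictionaries, positional parent ids, root-relative times."""
    position = {s["span_id"]: i for i, s in enumerate(spans)}
    root_start = spans[0]["start_time"]
    return [
        {
            "svc": services.index(s["service_name"]),
            "op": operations.index(s["operation_name"]),
            "parent": position.get(s["parent_span_id"], -1),  # -1 marks the root
            "start_delta": s["start_time"] - root_start,      # small vs 8-byte absolute
            "duration": s["duration"],
        }
        for s in spans
    ]

services = ["api-gateway", "user-service"]
operations = ["authenticate", "lookup_user"]
spans = [
    {"span_id": "666ffb28a8ca45d7", "parent_span_id": None,
     "service_name": "api-gateway", "operation_name": "authenticate",
     "start_time": 1761416099523944960, "duration": 62000000},
    {"span_id": "8a89dc3a1a204434", "parent_span_id": "666ffb28a8ca45d7",
     "service_name": "user-service", "operation_name": "lookup_user",
     "start_time": 1761416099533944960, "duration": 30000000},
]
encoded = encode_trace(spans, services, operations)
print(encoded[1])  # child span: tiny integers instead of repeated strings and UUIDs
```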

Phase 5: Columnar Storage (1.9 MB, 37.3x, Δ1.22x)

Separate data by type for better compression

# Before (row-oriented):
span1: {duration: 62000000, status: 0, parent: -1}
span2: {duration: 97000000, status: 0, parent: 0}
span3: {duration: 31000000, status: 0, parent: 1}
# Mixed types, poor compression locality

# After (columnar):
{
  "durations": [62000000, 97000000, 31000000, ...],  # All similar values
  "status_codes": [0, 0, 0, 0, 1, 0, ...],           # Mostly zeros
  "parent_indices": [-1, 0, 1, 2, 1, 0, ...],        # Small integers
  "span_positions": [4, 9, 1, 1, 1, 9, 9, ...]       # Spans per trace
}

# Compressed columnar arrays:
# - durations: 708KB → 308KB (delta encoding + zstd)
# - status_codes: 141KB → 4KB (run-length: mostly zeros!)
# - parent_indices: 141KB → 11KB (small integers compress well)

✅ 37.3x total compression - columnar wins!

Traces: Compression Comparison

Phase  Format         Size     Ratio  Key Technique
0      Original JSON  71.1 MB  1.0x   None (baseline)
1      NDJSON         71.1 MB  1.0x   Line-delimited format
2      CBOR           39.6 MB  1.79x  Binary encoding
3      CBOR + zstd    5.6 MB   12.6x  Dictionary compression
4      Relationships  2.3 MB   30.5x  Service dict + sequential IDs
5      Columnar       1.9 MB   37.3x  Type-grouped arrays

From 71.1 MB → 1.9 MB with structure exploitation!

Dataset: 50,000 traces × ~2.8 avg spans = 141,531 total spans

Traces: Key Takeaways

✅ Structure Exploitation Wins

  • Service topology: 12 services across 141K spans → 22x deduplication
  • Parent-child relationships: Sequential IDs vs UUIDs → 16x compression
  • Timestamp locality: Delta from root → 2-4x compression
  • Columnar arrays: Group similar types → 1.2x additional improvement

🎯 The Progression

1x (NDJSON) → 1.8x (+ binary) → 12.6x (+ zstd) → 30.5x (+ relationships) → 37.3x (+ columnar)

📊 Format Choice Matters

Relationship format for storage, Columnar format for analytics

Part 5: Reality Check

โš–๏ธ Trade-offs & Practical Considerations

โš ๏ธ The Compression Trilemma

You Can't Optimize Everything!

๐Ÿ’พ Storage Size

Minimize disk usage

โœ… Focus of this project

โšก Ingest Speed

Fast data writing

โš ๏ธ CPU cost per write

๐Ÿ” Query Speed

Fast data reading

โš ๏ธ Decompression overhead

Not always, but very often tradeoff desicions have to be made - optimizing one hurts the others

💰 Real-World Trade-offs

💾 Optimize Storage

Analyze statistics, pick best algorithm

Example: Parquet - auto-selects dictionary/RLE/bit-packing per column based on cardinality and patterns

Cost: Slow ingest & queries

⚡ Optimize Ingest Speed

Write append-only, compress in background

Example: Grafana Loki - only lightweight general compression, no columnarization, no merging of chunks

Cost: Higher storage, write amplification

🔍 Optimize Query Speed

Build indexes to skip decompression

Example: Elasticsearch inverted indexes - skip 99% of documents without scanning

Cost: 5-15% storage overhead

Key insight: Production systems use tiered storage - different techniques for hot (recent), warm (weeks), and cold (archive) data

๐Ÿญ Production System Examples

Prometheus (Metrics)

  • Uses XOR compression (Gorilla algorithm) + delta encoding
  • Compression: ~10-20x (balanced approach)
  • Query speed: Milliseconds for recent data
  • Trade-off: Moderate compression for fast queries

Elasticsearch (Logs)

  • Uses Lucene inverted indexes + LZ4/DEFLATE compression
  • Compression: ~5-15x (optimized for search speed)
  • Query speed: Milliseconds with skip lists and indexes
  • Trade-off: Lower compression for fast full-text search

Jaeger/Tempo (Traces)

  • Use Parquet columnar format for storage
  • Compression: 10-30x depending on data
  • Trade-off: Optimized for querying by service/operation

🎓 Key Lessons Learned

✅ Domain Knowledge is Powerful

Understanding data characteristics enables order-of-magnitude better compression than generic algorithms alone

📝 The Compression Hierarchy

  1. Binary encoding - Easy win, 1.3x
  2. Generic compression - zstd gives 10-30x baseline
  3. Structure optimization - Deduplication, columnar: +2-3x
  4. Domain-specific algorithms - Pattern-aware: +2-4x

⚠️ No Free Lunch

Higher compression = Higher CPU cost. Choose based on your access patterns and scale.

🔬 Try It Yourself!

The squeezed-signals Repository

github.com/flash1293/squeezed-signals

# Clone and set up
git clone https://github.com/flash1293/squeezed-signals.git
cd squeezed-signals
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Try metrics compression
cd metrics
python main.py --size small

# Try logs compression  
cd ../logs
python main.py --size small

# Try traces compression
cd ../traces
python main.py --size small

Summary

Compression Achievements

  • 📊 Metrics: 80.8 MB → 1 MB (79.7x) via pattern-aware algorithms
  • 📝 Logs: 72.7 MB → 1.77 MB (41.2x) via template extraction (CLP)
  • 🔍 Traces: 71.1 MB → 1.9 MB (37.3x) via relationship encoding

The Universal Principle

Understand your data → Apply specialized techniques → Compound with general compression

Questions?

🗜️ github.com/flash1293/squeezed-signals

Press ? for keyboard shortcuts