๐Ÿ—œ๏ธ Squeezed Signals

The Evolution of Observability Data Storage

A deep dive into compressing metrics, logs, and traces

Press Space to navigate โ†’

๐Ÿ“‹ Agenda (45 minutes)

  1. Introduction - The observability data explosion (5 min)
  2. Metrics - Time-series compression (12 min)
  3. Logs - Structured text compression (12 min)
  4. Traces - Distributed execution compression (10 min)
  5. Reality Check - Trade-offs & practical considerations (6 min)

Part 1: Introduction

The Observability Data Problem

🔭 What is Observability?

Observability = Understanding what's happening inside your systems

📊 Metrics

Numerical measurements

  • CPU usage: 72.3%
  • Requests/sec: 1,247
  • Memory: 4.2 GB

📝 Logs

Text event records

  • "User logged in"
  • "Error: Connection timeout"
  • "Request processed in 42ms"

🔍 Traces

Distributed request flows across microservices

api-gateway → user-service → auth-service → database

🎯 Project Goals

Primary Goal

Demonstrate progressive compression techniques from naive approaches to production-grade algorithms

✅ What We'll Cover

  • Lossless compression
  • Domain-specific techniques
  • Real-world algorithms
  • Measurable improvements

⚠️ Important Note

  • Focus: Storage size
  • Not covered deeply: Query speed, Ingest CPU
  • These matter in production!

📊 Overview

Signal Type  Starting Size       Key Challenge
📊 Metrics   80.8 MB NDJSON      Temporal patterns in numbers
📝 Logs      72.7 MB Plain Text  Repeated structure with variables
🔍 Traces    71.1 MB NDJSON      Parent-child relationships

Each uses different strategies tailored to data characteristics

Part 2: Metrics

📊 Time-Series Compression Journey

What Are Metrics?

Time-series numerical data

Regular intervals, labeled dimensions

{
  "metric_name": "system.cpu.user",
  "labels": {
    "host": "server-01",
    "datacenter": "us-west-1"
  },
  "timestamp": 1698000000,
  "value": 72.34567
}

Key characteristics:

  • Timestamps are regular (e.g., every 60 seconds)
  • Values change slowly (temporal locality)
  • Labels repeat across many series

📈 Metrics: Progressive Journey

We'll explore 6 compression phases

Each phase builds on the previous, demonstrating how domain knowledge compounds with general techniques

Starting point: 80.8 MB NDJSON

Each phase will reveal new techniques and improvements

Phase 1: NDJSON Baseline (80.8 MB, 1.0x)

Newline-Delimited JSON

{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000000,"value":72.34567}
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000060,"value":72.34589}
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000120,"value":72.34612}

Observations:

  • Text-based, human-readable format
  • Repeated keys: "metric_name", "labels", "timestamp", "value"
  • Same metric name and labels repeated thousands of times
  • Timestamps are sequential (60 second intervals)
  • Values change slowly (temporal locality)

Phase 2: CBOR Binary Encoding (63.9 MB, 1.26x, Δ1.26x)

Switch to binary format

# JSON (text):
{"timestamp": 1698000000}
# 25 bytes

# CBOR (binary):
A1                        # map(1 item)
  69 74696D657374616D70   # "timestamp" (1+9 bytes)
  1A 65356C80             # unsigned(1698000000) (1+4 bytes)
# 16 bytes (36% smaller!)

# Benefits:
# - Integers stored as binary, not strings
# - Type information embedded efficiently
# - No whitespace or quotes needed
# - Still preserves all structure

✅ 1.26x compression through binary encoding
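The byte layout above can be checked in a few lines of Python. This is a hand-rolled sketch for this single record, not a general CBOR encoder; a real pipeline would likely use a library such as cbor2.

```python
import json
import struct

record = {"timestamp": 1698000000}

# JSON: plain text, digits stored as characters
json_bytes = json.dumps(record).encode("utf-8")

# CBOR: hand-encoded per RFC 8949 for this one record
cbor = bytearray()
cbor.append(0xA1)                      # map(1 item)
cbor.append(0x60 | len("timestamp"))   # 0x69: text string of length 9
cbor += b"timestamp"
cbor.append(0x1A)                      # a 4-byte unsigned integer follows
cbor += struct.pack(">I", 1698000000)  # big-endian uint32

print(len(json_bytes), len(cbor))  # 25 vs 16 bytes
```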

Phase 3: CBOR + zstd Compression (3.8 MB, 21.3x, Δ16.8x)

Add general-purpose compression

# zstd dictionary learning discovers patterns:

Pattern 1: "system.cpu.user" (appears 10k times)
  → Dictionary entry #1 (2 bytes to reference)

Pattern 2: "server-01" (appears 10k times)
  → Dictionary entry #2

Pattern 3: Sequential timestamps
  → Run-length encoding finds patterns

Pattern 4: Label structure repetition
  → Reference previous occurrences

# Result: 63.9 MB → 3.8 MB (16.8x additional compression!)
# Combined: 80.8 MB → 3.8 MB (21.3x total)

✅ Biggest single jump: 16.8x improvement from zstd!
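The effect is easy to reproduce. zstd is not in the Python standard library, so this sketch substitutes stdlib zlib (DEFLATE) as a stand-in; the mechanism is the same, and zstd at high levels compresses further.

```python
import json
import zlib

# Synthetic NDJSON resembling the metric stream above
lines = [
    json.dumps({
        "metric_name": "system.cpu.user",
        "labels": {"host": "server-01", "datacenter": "us-west-1"},
        "timestamp": 1698000000 + 60 * i,
        "value": round(72.34 + (i % 7) * 0.0001, 5),
    })
    for i in range(10_000)
]
ndjson = "\n".join(lines).encode("utf-8")

compressed = zlib.compress(ndjson, level=9)
print(f"{len(ndjson) / len(compressed):.1f}x")  # repeated keys and labels vanish
```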

Phase 4: Binary Table with String Deduplication (2.8 MB, 28.9x, Δ1.36x)

Key Insight: String Deduplication

# Before: Each row stores full strings
{name: "system.cpu.user", labels: {host: "server-01", dc: "us-west-1"}}
{name: "system.cpu.user", labels: {host: "server-01", dc: "us-west-1"}}
# "system.cpu.user" repeated 1000 times = 17,000 bytes

# After: Store strings once, use IDs
strings: ["system.cpu.user", "server-01", "us-west-1"]
rows: [{name_id: 0, label_ids: [1, 2]}, {name_id: 0, label_ids: [1, 2]}]
# 17 bytes + (1000 × 3 bytes) = 3,017 bytes

# Savings: 5.6x just from deduplication!

✅ 28.9x total - deduplication enables better zstd compression
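A minimal sketch of the interning step (illustrative names, not the repo's actual code):

```python
def deduplicate(rows):
    """Replace repeated strings with indices into a shared string table."""
    strings, index = [], {}

    def intern(s):
        # First sighting appends to the table; later sightings reuse the id
        if s not in index:
            index[s] = len(strings)
            strings.append(s)
        return index[s]

    encoded = [
        {"name_id": intern(r["name"]),
         "label_ids": [intern(v) for v in r["labels"].values()]}
        for r in rows
    ]
    return strings, encoded

rows = [{"name": "system.cpu.user",
         "labels": {"host": "server-01", "dc": "us-west-1"}}] * 1000
strings, encoded = deduplicate(rows)
print(len(strings), encoded[0])  # 3 unique strings cover all 1000 rows
```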

Phase 5: Columnar Storage (2.0 MB, 40.4x, Δ1.40x)

Key Insight: Group Similar Data

Row-oriented (Before)

row1: {ts: 1000, val: 72.3}
row2: {ts: 1060, val: 72.4}
row3: {ts: 1120, val: 72.5}

Mixed types, poor compression

Column-oriented (After)

timestamps: [1000, 1060, 1120]
values:     [72.3, 72.4, 72.5]

Similar values together = better compression

Why it works:

  • All timestamps together → delta encoding applies better
  • All values together → XOR/floating-point compression more effective
  • All string IDs together → run-length encoding finds patterns

✅ 40.4x compression - columnar layout unlocks specialized algorithms
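The transpose from rows to columns is one comprehension per field; a sketch:

```python
rows = [
    {"ts": 1000, "val": 72.3},
    {"ts": 1060, "val": 72.4},
    {"ts": 1120, "val": 72.5},
]

# Row-oriented -> column-oriented: one homogeneous array per field
columns = {
    "timestamps": [r["ts"] for r in rows],
    "values": [r["val"] for r in rows],
}

# The timestamp column is now directly amenable to delta encoding
deltas = [b - a for a, b in zip(columns["timestamps"], columns["timestamps"][1:])]
print(columns["timestamps"], deltas)  # [1000, 1060, 1120] [60, 60]
```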

Phase 6: Pattern-Aware Algorithms (1.0 MB, 79.7x, Δ1.97x)

The Breakthrough: Understand the Data!

Detect patterns and apply specialized compression:

  • Constant values: Store once + count (∞ compression!)
  • Near-constant: Base value + tiny deltas
  • Power-of-2: Store exponents instead of values
  • Mostly integers: Split integer/fractional parts
  • Periodic patterns: Template + deviations
  • Sparse data: Store only non-zero indices + values

Result: 3.19 bytes per data point (vs 8 bytes original) - 79.7x compression!

Deep Dive: Near-Constant Encoding

Exploit Values That Rarely Change

Many metrics are nearly constant with tiny deviations

# CPU usage values (varies slightly around 72%):
values = [72.34567, 72.34589, 72.34612, 72.34599, 72.34601, ...]
# Raw: 1000 values ร— 8 bytes = 8,000 bytes

# Near-constant encoding:
base_value = 72.34567
deltas = [0.00000, 0.00022, 0.00045, 0.00032, 0.00034, ...]

# Deltas are tiny! Scale and store as small integers:
scaled_deltas = [0, 22, 45, 32, 34, ...]  # × 0.00001
# Most deltas fit in 1-2 bytes instead of 8!

# Result:
# - Base value: 8 bytes
# - 1000 deltas × ~1.5 bytes = 1,500 bytes
# - Total: 1,508 bytes vs 8,000 bytes original
# - Compression: 5.3x!

# Even better with zstd: deltas have patterns too!
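The scale-and-round step can be sketched as follows (with a fixed 1e-5 scale for illustration; a real implementation would pick the scale from the data, and rounding makes this variant lossy up to half a step):

```python
def encode_near_constant(values, scale=1e-5):
    """Store one base value plus tiny scaled integer deltas."""
    base = values[0]
    deltas = [round((v - base) / scale) for v in values]
    return base, deltas

def decode_near_constant(base, deltas, scale=1e-5):
    return [base + d * scale for d in deltas]

values = [72.34567, 72.34589, 72.34612, 72.34599, 72.34601]
base, deltas = encode_near_constant(values)
print(deltas)  # [0, 22, 45, 32, 34] - each fits in a single byte
```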

Deep Dive: Delta Encoding

Store Differences, Not Absolute Values

# Regular timestamps:
timestamps = [1000, 1060, 1120, 1180, 1240]
# Each: 8 bytes = 40 bytes total

# Delta encoding:
first = 1000
deltas = [60, 60, 60, 60]
# first (8 bytes) + deltas (4×2 bytes) = 16 bytes

# Double-delta (deltas of deltas):
deltas = [60, 60, 60, 60]
delta_deltas = [0, 0, 0]  # All zeros!
# Run-length encode: "60 repeated 4 times"
# Result: 12 bytes (3x compression!)
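The double-delta step as a sketch:

```python
def double_delta(timestamps):
    """Encode regular timestamps as first value + first delta + deltas-of-deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    delta_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], delta_deltas

first, first_delta, dods = double_delta([1000, 1060, 1120, 1180, 1240])
print(first, first_delta, dods)  # 1000 60 [0, 0, 0] -> run-length encodes to almost nothing
```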

Metrics: Key Takeaways

✅ What Worked

  • Binary encoding - Simple, 1.3x gain
  • Generic compression - zstd gives 21x without domain knowledge
  • Columnar storage - Group similar data types
  • Pattern detection - Specialized algorithms per pattern type

🎯 The Progression

1x (JSON) → 21x (+ zstd) → 40x (+ columnar) → 80x (+ patterns)

Each technique builds on previous work!

Part 3: Logs

๐Ÿ“ Structured Text Compression

What Are Logs?

Semi-structured text event records

[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP LDAP SDK
[Thu Jun 09 06:07:05 2005] [error] env.createBean2(): Factory error creating vm
[Thu Jun 09 06:07:19 2005] [notice] jk2_init() Found child 2330 in scoreboard slot 0
[Thu Jun 09 06:07:20 2005] [error] [client 10.0.0.153] File does not exist: /var/www/html
[Thu Jun 09 06:07:21 2005] [error] [client 10.0.0.153] Directory index forbidden

Key characteristics:

  • Repeated templates with variable values
  • Variable types: timestamps, IPs, IDs, numbers, paths
  • Massive redundancy in structure

📈 Logs: Progressive Journey

We'll explore 6 compression phases

From plain text through advanced log-specific techniques

Dataset: OpenSSH server logs (655,147 lines)

Starting point: 72.7 MB plain text

Each phase will demonstrate new optimizations

Phase 1: Plain Text Baseline (72.7 MB, 1.0x)

Raw OpenSSH log files

Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user webmaster from 173.234.31.186
Dec 10 06:55:46 LabSZ sshd[24200]: input_userauth_request: invalid user webmaster [preauth]
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): check pass; user unknown
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=173.234.31.186
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for invalid user webmaster from 173.234.31.186 port 38926 ssh2
Dec 10 06:55:48 LabSZ sshd[24200]: Connection closed by 173.234.31.186 [preauth]
Dec 10 07:02:47 LabSZ sshd[24203]: Connection closed by 212.47.254.145 [preauth]

Observations:

  • 655,147 lines, each ~110 bytes average
  • Repeated structure: "timestamp hostname daemon[pid]: message"
  • Same message patterns with different variables
  • "Failed password for root" appears 139,818 times!

Phase 2: Plain Text + zstd (3.0 MB, 24.5x, Δ24.5x)

General compression finds patterns automatically

# zstd dictionary learning discovers repeated patterns:

Pattern 1: "Dec 10 " (appears 655k times)
  → Store once in dictionary, reference with 2 bytes

Pattern 2: " LabSZ sshd[" (appears 655k times)
  → Dictionary entry

Pattern 3: "Failed password for root from " (appears 140k times)
  → Dictionary entry

Pattern 4: "authentication failure" (appears 153k times)
  → Dictionary entry

# Compression result:
Original: 72,746,715 bytes
Compressed: 2,968,978 bytes
Ratio: 24.5x

✅ 24.5x with zero domain knowledge - impressive!

Phase 3: Template Extraction (2.0 MB, 36.3x, Δ1.48x)

The CLP Breakthrough: Separate Structure from Data!

# Original log lines (3 examples):
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for root from 173.234.31.186 port 38926 ssh2
Dec 10 07:14:32 LabSZ sshd[24205]: Failed password for root from 218.65.30.43 port 54913 ssh2
Dec 10 08:03:17 LabSZ sshd[24220]: Failed password for root from 61.174.51.214 port 58389 ssh2

# CLP separates into template + variables:
Template #16: " LabSZ sshd[]: Failed password for root from  port  ssh2"

Variables (columnar storage):
  TIMESTAMP: ["Dec 10 06:55:48", "Dec 10 07:14:32", ...]
  NUM (pid): [24200, 24205, 24220, ...]
  IP: ["173.234.31.186", "218.65.30.43", "61.174.51.214", ...]
  NUM (port): [38926, 54913, 58389, ...]

Line-to-template mapping: [16, 16, 16, 16, 16, ...]  # Just template IDs!

# This template used 139,818 times!
Template storage: 81 bytes × 1 = 81 bytes
Variables: 139,818 × (time + pid + IP + port) ≈ much smaller with type-aware compression
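A toy version of the template-extraction step, using two regexes as stand-ins for CLP's full variable grammar (real CLP also recognizes timestamps, hex IDs, paths, and more):

```python
import re

# Variable patterns, most specific first so IPs aren't eaten by <NUM>
PATTERNS = [
    ("IP", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")),
    ("NUM", re.compile(r"\b\d+\b")),
]

def extract_template(line):
    """Replace variable tokens with placeholders and collect their values."""
    variables = []
    template = line
    for name, pattern in PATTERNS:
        variables += [(name, m) for m in pattern.findall(template)]
        template = pattern.sub(f"<{name}>", template)
    return template, variables

line = "Failed password for root from 173.234.31.186 port 38926 ssh2"
template, variables = extract_template(line)
print(template)
# Failed password for root from <IP> port <NUM> ssh2
```

Lines sharing a template then store only the template ID plus their variable values.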

Phase 3: Template Statistics

Only 5,669 unique templates for 655,147 lines!

Template                                                 Count    %      Reuse
Failed password for root from <IP> port <NUM> ssh2       139,818  21.3%  139,818x
authentication failure; ... rhost=<IP> user=root         139,572  21.3%  139,572x
Connection closed by <IP> [preauth]                      68,958   10.5%  68,958x
Received disconnect from <IP>: <NUM>: Bye Bye [preauth]  46,593   7.1%   46,593x
PAM service(sshd) ignoring max retries; <NUM> > <NUM>    37,963   5.8%   37,963x
...and 5,664 more templates

Average template reuse: 655,147 lines ÷ 5,669 templates ≈ 116x per template!

Phase 4: Type-Aware Variable Encoding (1.9 MB, 38.7x, Δ1.07x)

Compress each variable type optimally

# TIMESTAMP variables (655,147 occurrences):
Raw: ["Dec 10 06:55:46", "Dec 10 06:55:46", "Dec 10 06:55:48", ...]
Delta encoded:
  Base: "Dec 10 06:55:46"
  Deltas: [0, 0, 2, 1, 58, 0, ...]  # Seconds difference
  Compression: 12,447,793 bytes → efficient storage (using numpy arrays)

# IP variables (549,454 occurrences):
Raw: ["173.234.31.186", "173.234.31.186", "212.47.254.145", ...]
Integer encoding + Dictionary:
  Convert to uint32: [2917801914, 2917801914, 3559915153, ...]
  Dictionary + indices for repeated IPs
  Compression: ~6.8 MB → 2.2 MB (3.1x)

# NUM (port/pid) variables (1,458,375 occurrences):
Raw: [24200, 38926, 24203, 54913, 24220, 58389, ...]
Numpy array (efficient integer storage):
  Compression: Uses compact numpy dtype
  Size: 11.7 MB (efficient integer packing)

# PATH variables (11 occurrences):
Raw: ["/var/log/secure", "/etc/ssh/sshd_config", ...]
Dictionary with prefix compression: 232 bytes
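The IP step can be sketched with the stdlib ipaddress module (illustrative function names, not the repo's exact code):

```python
import ipaddress
import struct

def encode_ips(ips):
    """Dictionary-encode IPs: unique addresses as packed uint32 + per-row indices."""
    unique = sorted(set(ips))
    packed = struct.pack(
        f">{len(unique)}I",
        *(int(ipaddress.IPv4Address(ip)) for ip in unique),  # 4 bytes per address
    )
    ids = {ip: i for i, ip in enumerate(unique)}
    return packed, unique, [ids[ip] for ip in ips]

ips = ["173.234.31.186", "173.234.31.186", "212.47.254.145"]
packed, unique, indices = encode_ips(ips)
print(len(packed), indices)  # 8 bytes of addresses, rows as small indices [0, 0, 1]
```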

Phase 5: Smart Row Ordering (1.9 MB, 38.7x, Δ1.00x)

# Strategy: Group by template, sort within groups by time
# Before (chronological, interleaved):
Line 1000: Template 16 "Failed password for root"
Line 1001: Template 6  "Connection closed"
Line 1002: Template 16 "Failed password for root"
Line 1003: Template 21 "authentication failure"
Line 1004: Template 16 "Failed password for root"
# Poor compression - similar patterns scattered

# After (grouped by template):
Template 16: "Failed password for root"  × 139,818 lines
  Dec 10 06:55:48 LabSZ sshd[24200]: ... from 173.234.31.186 port 38926 ssh2
  Dec 10 07:14:32 LabSZ sshd[24205]: ... from 218.65.30.43 port 54913 ssh2
  Dec 10 08:03:17 LabSZ sshd[24220]: ... from 61.174.51.214 port 58389 ssh2
  ... (all similar logs grouped)

Template 21: "authentication failure" × 139,572 lines
Template 6: "Connection closed" × 68,958 lines

# Benefits:
# - Similar variables adjacent → better delta encoding
# - Template IDs grouped → better RLE compression
# - zstd finds longer matching patterns

Phase 5: Why No Improvement?

โš ๏ธ Preserving original order is expensive!

How Order Mapping Works:

  • Raw data: 655,147 uint32 indices (2.6 MB)
  • Delta encoding: Small deltas between reordered positions
  • Varint encoding: Small numbers use fewer bytes
  • Zstd L22: Compresses patterns in deltas
  • Result: 2.6 MB โ†’ 130 KB (20x compression)

Final outcome: 1.895 MB โ†’ 1.897 MB (0.999x) - essentially no change because the ordering overhead (130 KB) is offset by better template grouping compression.

Solution: Phase 6 drops order preservation to get the full benefit!
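The delta + varint steps above can be sketched as follows (assuming positions that only increase; real reordering also produces negative deltas, which would need zigzag encoding first):

```python
def encode_varints(nums):
    """LEB128-style varints: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    for n in nums:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

# Original line positions advance in small steps once rows are grouped
positions = [0, 1, 2, 5, 6, 7, 300, 301]
deltas = [b - a for a, b in zip([0] + positions, positions)]
encoded = encode_varints(deltas)
print(len(encoded), "vs", 4 * len(positions))  # 9 vs 32 bytes as raw uint32
```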

Phase 6: Drop Order Preservation (1.77 MB, 41.2x, Δ1.07x)

Remove line ordering for maximum compression

# Before (Phase 5): Preserve line numbers
{
  "templates": [...],
  "line_to_template": [16, 16, 16, 6, 21, 16, ...],  # 655,147 entries
  "variables_per_line": {
    0: {"time": "...", "pid": ..., "ip": "...", "port": ...},
    1: {"time": "...", "pid": ..., "ip": "...", "port": ...},
    ...
  }
}

# After (Phase 6): No line numbers, just templates + variables
{
  "templates": [...],
  "template_data": {
    16: {  # Template 16: 139,818 occurrences
      "times": ["Dec 10 06:55:48", ...],         # 139,818 values
      "pids": [24200, 24205, ...],               # 139,818 values
      "ips": ["173.234.31.186", ...],            # 139,818 values
      "ports": [38926, 54913, ...]               # 139,818 values
    },
    21: {  # Template 21: 139,572 occurrences
      "times": [...],  # 139,572 values
      "pids": [...],
      "ips": [...],
    },
    ...
  }
}

# Benefits:
# - No need to store 655,147 line-to-template mappings
# - Variables perfectly grouped by template
# - Maximum compression locality
# - Additional 1.07x improvement

Logs: Compression Comparison

Phase  Format            Size     Ratio  Key Technique
1      Plain Text        72.7 MB  1.0x   None (baseline)
2      Plain + zstd      3.0 MB   24.5x  Pattern dictionary
3      Templates + zstd  2.0 MB   36.3x  CLP template extraction
4      Type-aware vars   1.9 MB   38.7x  Delta/dict per type
5      Smart ordering    1.9 MB   38.7x  Group by template
6      No ordering       1.77 MB  41.2x  Drop line numbers

Logs: Dataset Comparison

Dataset         Lines    Size     Templates  Reuse   Final Ratio
Apache (small)  56,482   5.1 MB   38         1,486x  50.7x
HDFS (big)      74,859   10.5 MB  18         4,159x  24.2x
OpenSSH (huge)  655,147  72.7 MB  5,669      116x    41.2x

Why is HDFS worse despite 4,159x template reuse?

  • High-cardinality variables: 74,859 unique block identifiers (296 KB encoded)
  • Lots of numbers: 141,631 numeric values (566 KB)
  • Many unique paths: 58,747 file paths (1.22 MB)
  • Even with timestamp delta encoding: 74,859 timestamps → 145 KB (6.7x, but still significant)
  • Total variable overhead: 3.9 MB uncompressed (despite only 18 templates!)

Key insight: Template reuse alone isn't enough - variable types, cardinality, and uniqueness dominate!

Logs: Key Takeaways

✅ CLP Algorithm is the Star

  • Template extraction: Separating structure from data enables massive compression
  • Template reuse: Key metric - OpenSSH achieves 116x average reuse (5,669 templates for 655k lines)
  • Type-aware encoding: Each variable type compressed optimally (3-6x per type)
  • Ordering optimizations: Grouping similar data improves compression locality

🎯 The Progression

1x (plain) → 24.5x (+ zstd) → 36.3x (+ templates) → 38.7x (+ type-aware) → 41.2x (+ optimizations)

Part 4: Traces

๐Ÿ” Distributed Execution Compression

What Are Traces?

Distributed request execution flows

Track a request across multiple microservices

User Request → api-gateway (50ms)
                  ↓
               user-service (30ms)
                  ↓
               auth-service (20ms)
                  ↓
               auth-db (10ms)

Each "span" contains:

  • trace_id (links all spans in one request)
  • span_id + parent_span_id (tree structure)
  • service_name, operation_name
  • start_time, end_time (nanoseconds)
  • tags, logs, status

📈 Traces: Progressive Journey

We'll explore 5 compression phases

From JSON through trace structure exploitation

Dataset: 50,000 traces, 141,531 spans, 12 services

Starting point: 71.1 MB NDJSON

Each phase will reveal new structural optimizations

Phase 0: Original JSON (71.1 MB)

OpenTelemetry-style distributed trace data

{
  "trace_id": "trace-00000001-df734723",
  "spans": [{
    "trace_id": "trace-00000001-df734723",
    "span_id": "666ffb28a8ca45d7",
    "parent_span_id": null,
    "operation_name": "authenticate",
    "service_name": "api-gateway",
    "start_time": 1761416099523944960,
    "end_time": 1761416099585944960,
    "duration": 62000000,
    "tags": {
      "service.name": "api-gateway",
      "http.method": "DELETE",
      "http.status_code": 201,
      "user.id": "user-2256"
    },
    "logs": [],
    "status_code": 0
  }]
}

โŒ Verbose: Every field spelled out, repeated service/operation names

Phase 1: NDJSON Baseline (71.1 MB, 1.0x)

One span per line (newline-delimited JSON)

{"trace_id":"trace-00000001-df734723","span_id":"666ffb28a8ca45d7",...}
{"trace_id":"trace-00000001-df734723","span_id":"8a89dc3a1a204434",...}
{"trace_id":"trace-00000001-df734723","span_id":"b2deb5edd9a74b4f",...}
{"trace_id":"trace-00000002-a9549627","span_id":"868b3925a9374271",...}

Characteristics:

  • 141,531 spans = 141,531 lines
  • Text-based, human-readable
  • Easy to process line-by-line
  • Repeated keys: "trace_id", "span_id", "service_name", etc.

✅ Baseline established: 71.1 MB

Phase 2: CBOR Binary (39.6 MB, 1.79x, Δ1.79x)

Binary encoding with type tags

# JSON (text):
{"trace_id": "trace-00000001-df734723", "span_id": "666ffb28a8ca45d7"}
# 71 bytes

# CBOR (binary):
A2                        # map(2 items)
  68 74726163655F6964     # "trace_id" (1+8 bytes)
  77 74726163652D...      # "trace-00000001-df734723" (1+23 bytes)
  67 7370616E5F6964       # "span_id" (1+7 bytes)
  70 363636666662323...   # "666ffb28a8ca45d7" (1+16 bytes)
# 59 bytes (17% smaller!)
# (full spans shrink more: numeric fields like timestamps become compact binary)

# Still verbose: All keys/values in every span
# But binary integers, efficient string encoding

✅ 1.79x compression through binary encoding

Phase 3: CBOR + zstd (5.6 MB, 12.6x, Δ7.04x)

General-purpose compression on binary data

# zstd identifies patterns:
Pattern 1: "trace_id" appears 141,531 times
  → Dictionary entry #1 (4 bytes to reference)

Pattern 2: "api-gateway" appears 50,000 times
  → Dictionary entry #2

Pattern 3: Sequential timestamps
  → Run-length encoding

Pattern 4: Repeated tag structures
  → Reference previous occurrence

# Result: 39.6 MB → 5.6 MB (7x additional compression)

✅ 12.6x total compression - biggest single jump!

Phase 4: Span Relationships (2.3 MB, 30.5x, Δ2.43x)

The Breakthrough: Exploit Trace Structure!

# Before: Every span stores full identifiers
{
  "trace_id": "trace-00000001-df734723",  # 28 bytes
  "span_id": "666ffb28a8ca45d7",          # 16 bytes
  "parent_span_id": "8a89dc3a1a204434",   # 16 bytes
  "service_name": "api-gateway",          # 11 bytes
  "operation_name": "authenticate"        # 12 bytes
}

# After: Dictionary + sequential IDs
{
  "services": ["api-gateway", "user-service", ...],  # 12 strings total
  "operations": ["authenticate", "route_request", ...],  # 12 strings
  "traces": [
    {
      "trace_id": "trace-00000001-df734723",
      "spans": [
        {"svc": 0, "op": 0, "parent": -1, "duration": 62000000},  # Root
        {"svc": 1, "op": 2, "parent": 0, "duration": 97000000},   # Child of 0
        {"svc": 2, "op": 5, "parent": 1, "duration": 31000000}    # Child of 1
      ]
    }
  ]
}

Phase 4: Relationship Compression Details

Service Deduplication

# Before:
"api-gateway" ร— 50,000 = 550 KB
"user-service" ร— 8,457 = 93 KB
Total: ~1.5 MB

# After:
12 service names = 150 bytes
50K IDs ร— 1 byte = 50 KB
Total: 50 KB (30x!)

Parent Relationships

# Before:
UUID parent_span_id ร— 141K
= 2.3 MB

# After:
Sequential index ร— 141K
= 141 KB (16x!)

Root: parent = -1
Others: parent = 0-9

Timestamp Delta Encoding

# Root span: 1761416099523944960 (store full timestamp)
# Child span 1: +10,000,000 (10ms later, store delta)
# Child span 2: +20,000,000 (20ms after child 1)
# Result: 2-3 bytes per delta vs 8 bytes absolute

✅ 30.5x compression through structure exploitation!
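Putting the three ideas together in one sketch (hypothetical field names mirroring the example spans above, not the repo's exact schema):

```python
def encode_trace(spans, services, operations):
    """Service/operation dictionaries, positional parent ids, root-relative times."""
    position = {s["span_id"]: i for i, s in enumerate(spans)}
    root_start = spans[0]["start_time"]
    return [
        {
            "svc": services.index(s["service_name"]),
            "op": operations.index(s["operation_name"]),
            "parent": position.get(s["parent_span_id"], -1),  # -1 marks the root
            "start_delta": s["start_time"] - root_start,      # small vs 8-byte absolute
            "duration": s["duration"],
        }
        for s in spans
    ]

services = ["api-gateway", "user-service"]
operations = ["authenticate", "lookup_user"]
spans = [
    {"span_id": "666ffb28a8ca45d7", "parent_span_id": None,
     "service_name": "api-gateway", "operation_name": "authenticate",
     "start_time": 1761416099523944960, "duration": 62000000},
    {"span_id": "8a89dc3a1a204434", "parent_span_id": "666ffb28a8ca45d7",
     "service_name": "user-service", "operation_name": "lookup_user",
     "start_time": 1761416099533944960, "duration": 30000000},
]
encoded = encode_trace(spans, services, operations)
print(encoded[1])  # child span: tiny integers instead of repeated strings and UUIDs
```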

Phase 5: Columnar Storage (1.9 MB, 37.3x, Δ1.22x)

Separate data by type for better compression

# Before (row-oriented):
span1: {duration: 62000000, status: 0, parent: -1}
span2: {duration: 97000000, status: 0, parent: 0}
span3: {duration: 31000000, status: 0, parent: 1}
# Mixed types, poor compression locality

# After (columnar):
{
  "durations": [62000000, 97000000, 31000000, ...],  # All similar values
  "status_codes": [0, 0, 0, 0, 1, 0, ...],           # Mostly zeros
  "parent_indices": [-1, 0, 1, 2, 1, 0, ...],        # Small integers
  "span_positions": [4, 9, 1, 1, 1, 9, 9, ...]       # Spans per trace
}

# Compressed columnar arrays:
# - durations: 708KB → 308KB (delta encoding + zstd)
# - status_codes: 141KB → 4KB (run-length: mostly zeros!)
# - parent_indices: 141KB → 11KB (small integers compress well)

✅ 37.3x total compression - columnar wins!

Traces: Compression Comparison

Phase  Format         Size     Ratio  Key Technique
0      Original JSON  71.1 MB  1.0x   None (baseline)
1      NDJSON         71.1 MB  1.0x   Line-delimited format
2      CBOR           39.6 MB  1.79x  Binary encoding
3      CBOR + zstd    5.6 MB   12.6x  Dictionary compression
4      Relationships  2.3 MB   30.5x  Service dict + sequential IDs
5      Columnar       1.9 MB   37.3x  Type-grouped arrays

From 71.1 MB → 1.9 MB with structure exploitation!

Dataset: 50,000 traces × ~2.8 avg spans = 141,531 total spans

Traces: Key Takeaways

✅ Structure Exploitation Wins

  • Service topology: 12 services across 141K spans → 22x deduplication
  • Parent-child relationships: Sequential IDs vs UUIDs → 16x compression
  • Timestamp locality: Delta from root → 2-4x compression
  • Columnar arrays: Group similar types → 1.2x additional improvement

🎯 The Progression

1x (NDJSON) → 1.8x (+ binary) → 12.6x (+ zstd) → 30.5x (+ relationships) → 37.3x (+ columnar)

📊 Format Choice Matters

Relationship format for storage, Columnar format for analytics

Part 5: Reality Check

โš–๏ธ Trade-offs & Practical Considerations

โš ๏ธ The Compression Trilemma

You Can't Optimize Everything!

๐Ÿ’พ Storage Size

Minimize disk usage

โœ… Focus of this project

โšก Ingest Speed

Fast data writing

โš ๏ธ CPU cost per write

๐Ÿ” Query Speed

Fast data reading

โš ๏ธ Decompression overhead

Not always, but very often tradeoff desicions have to be made - optimizing one hurts the others

💰 Real-World Trade-offs

💾 Optimize Storage

Analyze statistics, pick best algorithm

Example: Parquet - auto-selects dictionary/RLE/bit-packing per column based on cardinality and patterns

Cost: Slow ingest & queries

⚡ Optimize Ingest Speed

Write append-only, compress in background

Example: Grafana Loki - only lightweight general compression, no columnarization, no merging of chunks

Cost: Higher storage, write amplification

🔍 Optimize Query Speed

Build indexes to skip decompression

Example: Elasticsearch inverted indexes - skip 99% of documents without scanning

Cost: 5-15% storage overhead

Key insight: Production systems use tiered storage - different techniques for hot (recent), warm (weeks), and cold (archive) data

๐Ÿญ Production System Examples

Prometheus (Metrics)

  • Uses XOR compression (Gorilla algorithm) + delta encoding
  • Compression: ~10-20x (balanced approach)
  • Query speed: Milliseconds for recent data
  • Trade-off: Moderate compression for fast queries

Elasticsearch (Logs)

  • Uses Lucene inverted indexes + LZ4/DEFLATE compression
  • Compression: ~5-15x (optimized for search speed)
  • Query speed: Milliseconds with skip lists and indexes
  • Trade-off: Lower compression for fast full-text search

Jaeger/Tempo (Traces)

  • Use Parquet columnar format for storage
  • Compression: 10-30x depending on data
  • Trade-off: Optimized for querying by service/operation

🎓 Key Lessons Learned

✅ Domain Knowledge is Powerful

Understanding data characteristics enables order-of-magnitude better compression than generic algorithms alone

📝 The Compression Hierarchy

  1. Binary encoding - Easy win, 1.3x
  2. Generic compression - zstd gives 10-30x baseline
  3. Structure optimization - Deduplication, columnar: +2-3x
  4. Domain-specific algorithms - Pattern-aware: +2-4x

⚠️ No Free Lunch

Higher compression = Higher CPU cost. Choose based on your access patterns and scale.

🔬 Try It Yourself!

The squeezed-signals Repository

github.com/flash1293/squeezed-signals

# Clone and set up
git clone https://github.com/flash1293/squeezed-signals.git
cd squeezed-signals
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Try metrics compression
cd metrics
python main.py --size small

# Try logs compression  
cd ../logs
python main.py --size small

# Try traces compression
cd ../traces
python main.py --size small

Summary

Compression Achievements

  • 📊 Metrics: 80.8 MB → 1 MB (79.7x) via pattern-aware algorithms
  • 📝 Logs: 72.7 MB → 1.77 MB (41.2x) via template extraction (CLP)
  • 🔍 Traces: 71.1 MB → 1.9 MB (37.3x) via relationship encoding

The Universal Principle

Understand your data → Apply specialized techniques → Compound with general compression

Questions?

🗜️ github.com/flash1293/squeezed-signals

Press ? for keyboard shortcuts