A deep dive into compressing metrics, logs, and traces
The Observability Data Problem
Observability = Understanding what's happening inside your systems
Numerical measurements
Text event records
Distributed request flows across microservices
api-gateway → user-service → auth-service → database
Demonstrate progressive compression techniques from naive approaches to production-grade algorithms
| Signal Type | Starting Size | Key Challenge |
|---|---|---|
| 📈 Metrics | 80.8 MB NDJSON | Temporal patterns in numbers |
| 📝 Logs | 72.7 MB Plain Text | Repeated structure with variables |
| 🔗 Traces | 71.1 MB NDJSON | Parent-child relationships |
Each uses different strategies tailored to data characteristics
📈 Time-Series Compression Journey
Regular intervals, labeled dimensions
{
"metric_name": "system.cpu.user",
"labels": {
"host": "server-01",
"datacenter": "us-west-1"
},
"timestamp": 1698000000,
"value": 72.34567
}
Key characteristics:
Each phase builds on the previous, demonstrating how domain knowledge compounds with general techniques
Starting point: 80.8 MB NDJSON
Each phase will reveal new techniques and improvements
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000000,"value":72.34567}
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000060,"value":72.34589}
{"metric_name":"system.cpu.user","labels":{"host":"server-01","datacenter":"us-west-1"},"timestamp":1698000120,"value":72.34612}
Observations:
# JSON (text):
{"timestamp": 1698000000}
# 25 bytes
# CBOR (binary):
A1 # map(1 item)
69 74696D657374616D70 # "timestamp" (9 bytes)
1A 65356C80 # unsigned(1698000000) (5 bytes)
# 16 bytes (36% smaller!)
# Benefits:
# - Integers stored as binary, not strings
# - Type information embedded efficiently
# - No whitespace or quotes needed
# - Still preserves all structure
→ 1.26x compression through binary encoding
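To see the size difference directly, here is a minimal sketch using the third-party `cbor2` package (an assumption; the repo may encode CBOR differently):

```python
# Minimal JSON-vs-CBOR size comparison; assumes `pip install cbor2`.
import json
import cbor2

record = {"timestamp": 1698000000}

json_bytes = json.dumps(record).encode()  # text: 25 bytes
cbor_bytes = cbor2.dumps(record)          # binary: 16 bytes

print(len(json_bytes), len(cbor_bytes))
```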
# zstd dictionary learning discovers patterns:
Pattern 1: "system.cpu.user" (appears 10k times)
→ Dictionary entry #1 (2 bytes to reference)
Pattern 2: "server-01" (appears 10k times)
→ Dictionary entry #2
Pattern 3: Sequential timestamps
→ Run-length encoding finds patterns
Pattern 4: Label structure repetition
→ Reference previous occurrences
# Result: 63.9 MB → 3.8 MB (16.8x additional compression!)
# Combined: 80.8 MB → 3.8 MB (21.3x total)
→ Biggest single jump: 16.8x improvement from zstd!
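The zstd step itself is a few lines; a minimal sketch assuming the `zstandard` package and a hypothetical input file name:

```python
# Compress the encoded payload with zstd; level 19 is illustrative.
import zstandard

with open("metrics.cbor", "rb") as f:  # hypothetical file name
    raw = f.read()

compressed = zstandard.ZstdCompressor(level=19).compress(raw)
print(f"{len(raw) / len(compressed):.1f}x smaller")
```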
# Before: Each row stores full strings
{name: "system.cpu.user", labels: {host: "server-01", dc: "us-west-1"}}
{name: "system.cpu.user", labels: {host: "server-01", dc: "us-west-1"}}
# "system.cpu.user" repeated 1000 times = 17,000 bytes
# After: Store strings once, use IDs
strings: ["system.cpu.user", "server-01", "us-west-1"]
rows: [{name_id: 0, label_ids: [1, 2]}, {name_id: 0, label_ids: [1, 2]}]
# 17 bytes + (1000 × 3 bytes) = 3,017 bytes
# Savings: 5.6x just from deduplication!
→ 28.9x total - deduplication enables better zstd compression
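A minimal sketch of the deduplication pass, using the field names from the sample records above:

```python
# Replace repeated strings with small integer IDs into a shared table.
def deduplicate(rows):
    strings, index = [], {}

    def string_id(s):
        if s not in index:            # first occurrence: append to the table
            index[s] = len(strings)
            strings.append(s)
        return index[s]               # later occurrences: tiny integer ID

    encoded = [
        {"name_id": string_id(r["metric_name"]),
         "label_ids": [string_id(v) for v in r["labels"].values()]}
        for r in rows
    ]
    return strings, encoded
```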
row1: {ts: 1000, val: 72.3}
row2: {ts: 1060, val: 72.4}
row3: {ts: 1120, val: 72.5}
Mixed types, poor compression
timestamps: [1000, 1060, 1120]
values: [72.3, 72.4, 72.5]
Similar values together = better compression
Why it works:
→ 40.4x compression - columnar layout unlocks specialized algorithms
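The transposition itself is a one-liner per column; a sketch:

```python
# Row-oriented input ...
rows = [
    {"ts": 1000, "val": 72.3},
    {"ts": 1060, "val": 72.4},
    {"ts": 1120, "val": 72.5},
]

# ... transposed into homogeneous columns that compress far better.
columns = {
    "timestamps": [r["ts"] for r in rows],
    "values": [r["val"] for r in rows],
}
```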
Detect patterns and apply specialized compression:
Result: 3.19 bytes per data point (vs 8 bytes original) - 79.7x compression!
Many metrics are nearly constant with tiny deviations
# CPU usage values (varies slightly around 72%):
values = [72.34567, 72.34589, 72.34612, 72.34599, 72.34601, ...]
# Raw: 1000 values × 8 bytes = 8,000 bytes
# Near-constant encoding:
base_value = 72.34567
deltas = [0.00000, 0.00022, 0.00045, 0.00032, 0.00034, ...]
# Deltas are tiny! Scale and store as small integers:
scaled_deltas = [0, 22, 45, 32, 34, ...] # × 0.00001
# Most deltas fit in 1-2 bytes instead of 8!
# Result:
# - Base value: 8 bytes
# - 1000 deltas × ~1.5 bytes = 1,500 bytes
# - Total: 1,508 bytes vs 8,000 bytes original
# - Compression: 5.3x!
# Even better with zstd: deltas have patterns too!
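A minimal sketch of near-constant encoding (the 1e-5 scale factor is an illustrative choice; quantization is lossy beyond that precision):

```python
def encode_near_constant(values, scale=1e-5):
    base = values[0]
    # Quantize each deviation from the base into a small integer.
    deltas = [round((v - base) / scale) for v in values]
    return base, scale, deltas

def decode_near_constant(base, scale, deltas):
    return [base + d * scale for d in deltas]

base, scale, deltas = encode_near_constant([72.34567, 72.34589, 72.34612])
print(deltas)  # [0, 22, 45] -- each fits in 1-2 bytes instead of 8
```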
# Regular timestamps:
timestamps = [1000, 1060, 1120, 1180, 1240]
# Each: 8 bytes = 40 bytes total
# Delta encoding:
first = 1000
deltas = [60, 60, 60, 60]
# first (8 bytes) + deltas (4×2 bytes) = 16 bytes
# Double-delta (deltas of deltas):
deltas = [60, 60, 60, 60]
delta_deltas = [0, 0, 0] # All zeros!
# Run-length encode: "60 repeated 4 times"
# Result: 12 bytes (3x compression!)
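A minimal sketch of delta and double-delta encoding:

```python
def double_delta(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    delta_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], delta_deltas

first, step, dd = double_delta([1000, 1060, 1120, 1180, 1240])
print(first, step, dd)  # 1000 60 [0, 0, 0] -- a run of zeros, ideal for RLE
```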
1x (JSON) → 21x (+ zstd) → 40x (+ columnar) → 80x (+ patterns)
Each technique builds on previous work!
📝 Structured Text Compression
[Thu Jun 09 06:07:04 2005] [notice] LDAP: Built with OpenLDAP LDAP SDK
[Thu Jun 09 06:07:05 2005] [error] env.createBean2(): Factory error creating vm
[Thu Jun 09 06:07:19 2005] [notice] jk2_init() Found child 2330 in scoreboard slot 0
[Thu Jun 09 06:07:20 2005] [error] [client 10.0.0.153] File does not exist: /var/www/html
[Thu Jun 09 06:07:21 2005] [error] [client 10.0.0.153] Directory index forbidden
Key characteristics:
From plain text through advanced log-specific techniques
Dataset: OpenSSH server logs (655,147 lines)
Starting point: 72.7 MB plain text
Each phase will demonstrate new optimizations
Dec 10 06:55:46 LabSZ sshd[24200]: reverse mapping checking getaddrinfo for ns.marryaldkfaczcz.com [173.234.31.186] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 10 06:55:46 LabSZ sshd[24200]: Invalid user webmaster from 173.234.31.186
Dec 10 06:55:46 LabSZ sshd[24200]: input_userauth_request: invalid user webmaster [preauth]
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): check pass; user unknown
Dec 10 06:55:46 LabSZ sshd[24200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=173.234.31.186
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for invalid user webmaster from 173.234.31.186 port 38926 ssh2
Dec 10 06:55:48 LabSZ sshd[24200]: Connection closed by 173.234.31.186 [preauth]
Dec 10 07:02:47 LabSZ sshd[24203]: Connection closed by 212.47.254.145 [preauth]
Observations:
# zstd dictionary learning discovers repeated patterns:
Pattern 1: "Dec 10 " (appears 655k times)
→ Store once in dictionary, reference with 2 bytes
Pattern 2: " LabSZ sshd[" (appears 655k times)
→ Dictionary entry
Pattern 3: "Failed password for root from " (appears 140k times)
→ Dictionary entry
Pattern 4: "authentication failure" (appears 153k times)
→ Dictionary entry
# Compression result:
Original: 72,746,715 bytes
Compressed: 2,968,978 bytes
Ratio: 24.5x
→ 24.5x with zero domain knowledge - impressive!
# Original log lines (3 examples):
Dec 10 06:55:48 LabSZ sshd[24200]: Failed password for root from 173.234.31.186 port 38926 ssh2
Dec 10 07:14:32 LabSZ sshd[24205]: Failed password for root from 218.65.30.43 port 54913 ssh2
Dec 10 08:03:17 LabSZ sshd[24220]: Failed password for root from 61.174.51.214 port 58389 ssh2
# CLP separates into template + variables:
Template #16: "<TIMESTAMP> LabSZ sshd[<NUM>]: Failed password for root from <IP> port <NUM> ssh2"
Variables (columnar storage):
TIMESTAMP: ["Dec 10 06:55:48", "Dec 10 07:14:32", ...]
NUM (pid): [24200, 24205, 24220, ...]
IP: ["173.234.31.186", "218.65.30.43", "61.174.51.214", ...]
NUM (port): [38926, 54913, 58389, ...]
Line-to-template mapping: [16, 16, 16, 16, 16, ...] # Just template IDs!
# This template used 139,818 times!
Template storage: 81 bytes × 1 = 81 bytes
Variables: 139,818 × (time + pid + IP + port) → much smaller with type-aware compression
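A rough, illustrative approximation of the template extraction (real CLP tokenizes far more carefully; these two regexes are assumptions):

```python
import re

# Mask the most variable-looking tokens first: IPs, then bare numbers.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def extract(line, templates):
    template, variables = line, []
    for pattern, placeholder in PATTERNS:
        variables += pattern.findall(template)
        template = pattern.sub(placeholder, template)
    template_id = templates.setdefault(template, len(templates))
    return template_id, variables

templates = {}
print(extract("Failed password for root from 173.234.31.186 port 38926 ssh2",
              templates))
# (0, ['173.234.31.186', '38926'])
```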
| Template | Count | % | Reuse |
|---|---|---|---|
| Failed password for root from <IP> port <NUM> ssh2 | 139,818 | 21.3% | 139,818x |
| authentication failure; ... rhost=<IP> user=root | 139,572 | 21.3% | 139,572x |
| Connection closed by <IP> [preauth] | 68,958 | 10.5% | 68,958x |
| Received disconnect from <IP>: <NUM>: Bye Bye [preauth] | 46,593 | 7.1% | 46,593x |
| PAM service(sshd) ignoring max retries; <NUM> > <NUM> | 37,963 | 5.8% | 37,963x |
| ...and 5,664 more templates | | | |
# TIMESTAMP variables (655,147 occurrences):
Raw: ["Dec 10 06:55:46", "Dec 10 06:55:46", "Dec 10 06:55:48", ...]
Delta encoded:
Base: "Dec 10 06:55:46"
Deltas: [0, 0, 2, 1, 58, 0, ...] # Seconds difference
Compression: 12,447,793 bytes → efficient storage (using numpy arrays)
# IP variables (549,454 occurrences):
Raw: ["173.234.31.186", "173.234.31.186", "212.47.254.145", ...]
Integer encoding + Dictionary:
Convert to uint32: [2917801914, 2917801914, 3559915153, ...]
Dictionary + indices for repeated IPs
Compression: ~6.8 MB → 2.2 MB (3.1x)
# NUM (port/pid) variables (1,458,375 occurrences):
Raw: [24200, 38926, 24203, 54913, 24220, 58389, ...]
Numpy array (efficient integer storage):
Compression: Uses compact numpy dtype
Size: 11.7 MB (efficient integer packing)
# PATH variables (11 occurrences):
Raw: ["/var/log/secure", "/etc/ssh/sshd_config", ...]
Dictionary with prefix compression: 232 bytes
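The IP trick needs only the standard library; a minimal sketch:

```python
import socket
import struct

def ip_to_u32(ip: str) -> int:
    # "173.234.31.186" -> 2917801914: 4 bytes instead of up to 15 characters
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def u32_to_ip(n: int) -> str:
    return socket.inet_ntoa(struct.pack("!I", n))
```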
# Strategy: Group by template, sort within groups by time
# Before (chronological, interleaved):
Line 1000: Template 16 "Failed password for root"
Line 1001: Template 6 "Connection closed"
Line 1002: Template 16 "Failed password for root"
Line 1003: Template 21 "authentication failure"
Line 1004: Template 16 "Failed password for root"
# Poor compression - similar patterns scattered
# After (grouped by template):
Template 16: "Failed password for root" × 139,818 lines
Dec 10 06:55:48 LabSZ sshd[24200]: ... from 173.234.31.186 port 38926 ssh2
Dec 10 07:14:32 LabSZ sshd[24205]: ... from 218.65.30.43 port 54913 ssh2
Dec 10 08:03:17 LabSZ sshd[24220]: ... from 61.174.51.214 port 58389 ssh2
... (all similar logs grouped)
Template 21: "authentication failure" × 139,572 lines
Template 6: "Connection closed" × 68,958 lines
# Benefits:
# - Similar variables adjacent → better delta encoding
# - Template IDs grouped → better RLE compression
# - zstd finds longer matching patterns
Final outcome: 1.895 MB → 1.897 MB (0.999x) - essentially no change: the gains from grouping by template are cancelled out by the ~130 KB needed to record the original line order.
Solution: Phase 6 drops order preservation to get the full benefit!
# Before (Phase 5): Preserve line numbers
{
"templates": [...],
"line_to_template": [16, 16, 16, 6, 21, 16, ...], # 655,147 entries
"variables_per_line": {
0: {"time": "...", "pid": ..., "ip": "...", "port": ...},
1: {"time": "...", "pid": ..., "ip": "...", "port": ...},
...
}
}
# After (Phase 6): No line numbers, just templates + variables
{
"templates": [...],
"template_data": {
16: { # Template 16: 139,818 occurrences
"times": ["Dec 10 06:55:48", ...], # 139,818 values
"pids": [24200, 24205, ...], # 139,818 values
"ips": ["173.234.31.186", ...], # 139,818 values
"ports": [38926, 54913, ...] # 139,818 values
},
21: { # Template 21: 139,572 occurrences
"times": [...], # 139,572 values
"pids": [...],
"ips": [...],
},
...
}
}
# Benefits:
# - No need to store 655,147 line-to-template mappings
# - Variables perfectly grouped by template
# - Maximum compression locality
# - Additional 1.07x improvement
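A minimal sketch of building the order-free layout from parsed lines:

```python
from collections import defaultdict

def build_template_data(parsed_lines):
    # parsed_lines: iterable of (template_id, {"time": ..., "pid": ..., ...})
    template_data = defaultdict(lambda: defaultdict(list))
    for template_id, variables in parsed_lines:
        for name, value in variables.items():
            template_data[template_id][name].append(value)
    return template_data  # no line-to-template mapping stored anywhere
```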
| Phase | Format | Size | Ratio | Key Technique |
|---|---|---|---|---|
| 1 | Plain Text | 72.7 MB | 1.0x | None (baseline) |
| 2 | Plain + zstd | 3.0 MB | 24.5x | Pattern dictionary |
| 3 | Templates + zstd | 2.0 MB | 36.3x | CLP template extraction |
| 4 | Type-aware vars | 1.9 MB | 38.7x | Delta/dict per type |
| 5 | Smart ordering | 1.9 MB | 38.7x | Group by template |
| 6 | No ordering | 1.77 MB | 41.2x | Drop line numbers |
| Dataset | Lines | Size | Templates | Reuse | Final Ratio |
|---|---|---|---|---|---|
| Apache (small) | 56,482 | 5.1 MB | 38 | 1,486x | 50.7x |
| HDFS (big) | 74,859 | 10.5 MB | 18 | 4,159x | 24.2x |
| OpenSSH (huge) | 655,147 | 72.7 MB | 5,669 | 116x | 41.2x |
Why is HDFS worse despite 4,159x template reuse?
Key insight: Template reuse alone isn't enough - variable types, cardinality, and uniqueness dominate!
1x (plain) → 24.5x (+ zstd) → 36.3x (+ templates) → 38.7x (+ type-aware) → 41.2x (+ optimizations)
🔗 Distributed Execution Compression
Track a request across multiple microservices
User Request → api-gateway (50ms)
                  ↓
               user-service (30ms)
                  ↓
               auth-service (20ms)
                  ↓
               auth-db (10ms)
Each "span" contains:
From JSON through trace structure exploitation
Dataset: 50,000 traces, 141,531 spans, 12 services
Starting point: 71.1 MB NDJSON
Each phase will reveal new structural optimizations
{
"trace_id": "trace-00000001-df734723",
"spans": [{
"trace_id": "trace-00000001-df734723",
"span_id": "666ffb28a8ca45d7",
"parent_span_id": null,
"operation_name": "authenticate",
"service_name": "api-gateway",
"start_time": 1761416099523944960,
"end_time": 1761416099585944960,
"duration": 62000000,
"tags": {
"service.name": "api-gateway",
"http.method": "DELETE",
"http.status_code": 201,
"user.id": "user-2256"
},
"logs": [],
"status_code": 0
}]
}
→ Verbose: Every field spelled out, repeated service/operation names
{"trace_id":"trace-00000001-df734723","span_id":"666ffb28a8ca45d7",...}
{"trace_id":"trace-00000001-df734723","span_id":"8a89dc3a1a204434",...}
{"trace_id":"trace-00000001-df734723","span_id":"b2deb5edd9a74b4f",...}
{"trace_id":"trace-00000002-a9549627","span_id":"868b3925a9374271",...}
Characteristics:
→ Baseline established: 71.1 MB
# JSON (text):
{"trace_id": "trace-00000001-df734723", "span_id": "666ffb28a8ca45d7"}
# 71 bytes
# CBOR (binary):
A2 # map(2 items)
68 74726163655F6964 # "trace_id" (8 bytes)
77 74726163652D... # "trace-00000001-df734723" (23 bytes)
67 7370616E5F6964 # "span_id" (7 bytes)
70 363636666662323... # "666ffb28a8ca45d7" (16 bytes)
# 59 bytes (~17% smaller here; binary numbers push the full dataset to 44% smaller)
# Still verbose: All keys/values in every span
# But binary integers, efficient string encoding
→ 1.79x compression through binary encoding
# zstd identifies patterns:
Pattern 1: "trace_id" appears 141,531 times
→ Dictionary entry #1 (4 bytes to reference)
Pattern 2: "api-gateway" appears 50,000 times
→ Dictionary entry #2
Pattern 3: Sequential timestamps
→ Run-length encoding
Pattern 4: Repeated tag structures
→ Reference previous occurrence
# Result: 39.6 MB → 5.6 MB (7x additional compression)
→ 12.6x total compression - biggest single jump!
# Before: Every span stores full identifiers
{
"trace_id": "trace-00000001-df734723", # 28 bytes
"span_id": "666ffb28a8ca45d7", # 16 bytes
"parent_span_id": "8a89dc3a1a204434", # 16 bytes
"service_name": "api-gateway", # 11 bytes
"operation_name": "authenticate" # 12 bytes
}
# After: Dictionary + sequential IDs
{
"services": ["api-gateway", "user-service", ...], # 12 strings total
"operations": ["authenticate", "route_request", ...], # 12 strings
"traces": [
{
"trace_id": "trace-00000001-df734723",
"spans": [
{"svc": 0, "op": 0, "parent": -1, "duration": 62000000}, # Root
{"svc": 1, "op": 2, "parent": 0, "duration": 97000000}, # Child of 0
{"svc": 2, "op": 5, "parent": 1, "duration": 31000000} # Child of 1
]
}
]
}
# Before:
"api-gateway" × 50,000 = 550 KB
"user-service" × 8,457 = 93 KB
Total: ~1.5 MB
# After:
12 service names = 150 bytes
50K IDs × 1 byte = 50 KB
Total: 50 KB (30x!)
# Before:
UUID parent_span_id × 141K
= 2.3 MB
# After:
Sequential index × 141K
= 141 KB (16x!)
Root: parent = -1
Others: parent = 0-9
# Root span: 1761416099523944960 (store full timestamp)
# Child span 1: +10,000,000 (10ms later, store delta)
# Child span 2: +20,000,000 (20ms after child 1)
# Result: 2-3 bytes per delta vs 8 bytes absolute
→ 30.5x compression through structure exploitation!
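A minimal sketch of the relationship encoding, using the span fields shown earlier (the output layout is illustrative):

```python
def encode_trace(spans):
    services, operations = [], []

    def intern(table, value):           # tiny dictionary encoder
        if value not in table:
            table.append(value)
        return table.index(value)

    span_index = {s["span_id"]: i for i, s in enumerate(spans)}
    prev_start, encoded = None, []
    for s in spans:
        start = s["start_time"]
        encoded.append({
            "svc": intern(services, s["service_name"]),
            "op": intern(operations, s["operation_name"]),
            # Root spans (parent_span_id null) become -1.
            "parent": span_index.get(s["parent_span_id"], -1),
            # First span keeps the absolute timestamp; the rest store deltas.
            "start": start if prev_start is None else start - prev_start,
        })
        prev_start = start
    return services, operations, encoded
```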
# Before (row-oriented):
span1: {duration: 62000000, status: 0, parent: -1}
span2: {duration: 97000000, status: 0, parent: 0}
span3: {duration: 31000000, status: 0, parent: 1}
# Mixed types, poor compression locality
# After (columnar):
{
"durations": [62000000, 97000000, 31000000, ...], # All similar values
"status_codes": [0, 0, 0, 0, 1, 0, ...], # Mostly zeros
"parent_indices": [-1, 0, 1, 2, 1, 0, ...], # Small integers
"span_positions": [4, 9, 1, 1, 1, 9, 9, ...] # Spans per trace
}
# Compressed columnar arrays:
# - durations: 708KB → 308KB (delta encoding + zstd)
# - status_codes: 141KB → 4KB (run-length: mostly zeros!)
# - parent_indices: 141KB → 11KB (small integers compress well)
→ 37.3x total compression - columnar wins!
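The status_codes win comes from run-length encoding; a minimal sketch:

```python
from itertools import groupby

def rle(values):
    # Collapse each run of identical values into a (value, count) pair.
    return [(v, len(list(run))) for v, run in groupby(values)]

print(rle([0, 0, 0, 0, 1, 0]))  # [(0, 4), (1, 1), (0, 1)]
```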
| Phase | Format | Size | Ratio | Key Technique |
|---|---|---|---|---|
| 0 | Original JSON | 71.1 MB | 1.0x | None (baseline) |
| 1 | NDJSON | 71.1 MB | 1.0x | Line-delimited format |
| 2 | CBOR | 39.6 MB | 1.79x | Binary encoding |
| 3 | CBOR + zstd | 5.6 MB | 12.6x | Dictionary compression |
| 4 | Relationships | 2.3 MB | 30.5x | Service dict + sequential IDs |
| 5 | Columnar | 1.9 MB | 37.3x | Type-grouped arrays |
From 71.1 MB → 1.9 MB through structure exploitation!
Dataset: 50,000 traces × ~2.8 spans on average ≈ 141,531 total spans
1x (NDJSON) → 1.8x (+ binary) → 12.6x (+ zstd) → 30.5x (+ relationships) → 37.3x (+ columnar)
Relationship format for storage, Columnar format for analytics
⚖️ Trade-offs & Practical Considerations
Minimize disk usage
✅ Focus of this project
Fast data writing
⚠️ CPU cost per write
Fast data reading
⚠️ Decompression overhead
Not always, but very often trade-off decisions have to be made - optimizing one dimension hurts the others
Analyze statistics, pick best algorithm
Example: Parquet - auto-selects dictionary/RLE/bit-packing per column based on cardinality and patterns
Cost: Slow ingest & queries
Write append-only, compress in background
Example: Grafana Loki - only lightweight general compression, no columnarization, no merging of chunks
Cost: Higher storage, write amplification
Build indexes to skip decompression
Example: Elasticsearch inverted indexes - skip 99% of documents without scanning
Cost: 5-15% storage overhead
Key insight: Production systems use tiered storage - different techniques for hot (recent), warm (weeks), and cold (archive) data
Understanding data characteristics enables order-of-magnitude better compression than generic algorithms alone
Higher compression = Higher CPU cost. Choose based on your access patterns and scale.
github.com/flash1293/squeezed-signals
# Clone and set up
git clone https://github.com/flash1293/squeezed-signals.git
cd squeezed-signals
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Try metrics compression
cd metrics
python main.py --size small
# Try logs compression
cd ../logs
python main.py --size small
# Try traces compression
cd ../traces
python main.py --size small
Understand your data → Apply specialized techniques → Compound with general compression
github.com/flash1293/squeezed-signals