An SSTable is an immutable, sorted key-value file. Every detail of the layout exists to make one of three operations cheap: building the file once, looking up one key, or scanning a range.
MiniKV’s SSTable lives in kv/sstable.go and is
on its fourth on-disk format (“v4”). This post is a tour from
outermost to innermost.
Top-level layout
flowchart LR
H[Header
magic + version
compression byte] --> D[Data blocks
~4 KiB each]
D --> B[Bloom filter]
B --> I[Block index
sparse: first key per block]
I --> F[Footer
offsets + magic]
The reader opens the file, reads the footer (fixed size, at a known
offset from EOF), and from the footer learns where the index and the
Bloom filter live. That is the only “scan back from EOF” step; from
then on every read is a single positioned ReadAt.
A data block
Inside a block, keys are stored with prefix compression and restart points:
1 | +------------------+ \ |
- Prefix compression: each record stores how many leading bytes it shares with the previous key, then the rest. For a workload with long common prefixes (paths, UUIDs, tenant-scoped keys) the win is large.
- Restart points: every K records (default 16) we “reset” the prefix compression and record the absolute offset. To seek inside a block we binary-search the restart array, then scan forward at most K-1 records.
A point lookup is therefore:
1 | binary search index → O(log B) where B = blocks in file |
For 4 KiB blocks with 16-record restarts and 1 M-key files, that’s
roughly log2(1M/64) + log2(64) + 16 ≈ 36 string comparisons.
The Bloom filter
A point lookup that returns “not found” still pays the index search + block read above. For workloads that miss often — caches, dedup, “have I seen this key?” — that wastes most of the work.
Each SSTable carries a file-level Bloom filter
(kv/bloom.go) sized for ~1% false-positive rate.
The read path checks Bloom first; on “definitely not present”, the
read returns immediately without touching the index or any block.
You can see the effect in the README benchmarks: Get miss is
sub-microsecond per call, because nothing past the filter ran.
Compression
Config.Compression selects per-block snappy (snappy) or no
compression (none). The choice is per-block, not per-file, because:
- The block is the unit you decompress in the read path. Whole-file compression would force a full-file decompress to read one key.
- A per-block byte in the data section lets you mix uncompressed (already-tiny) blocks with compressed ones — though MiniKV currently doesn’t bother and uses the file-level setting uniformly.
CRC32C per block
Every block ends with a CRC32C over its bytes. A failing CRC turns
into a kv.ErrCorrupt from the read path. Per-block CRCs (as opposed
to per-file or per-record) hit a nice middle:
- Per-record: too many CRC computations on the write path.
- Per-file: detection only on full reads; can’t isolate damage.
- Per-block: one CRC per ~4 KiB, which is also the unit of disk read.
The footer
The footer is the file’s table-of-contents:
1 | +----------------+ |
Fixed size, at EOF. The reader fseek-s EOF - footer_size, validates
the magic, and now knows where to find every other section. This
makes opening a 1 GiB SSTable an O(1) operation (modulo loading the
index, which is small).
Why “v4”?
The format has bumped versions for: introducing restart points
(v2-ish), adding the Bloom filter (v3), and adding per-block
compression metadata (v4). The header carries the version byte and
LoadSSTableHandle (kv/sstable.go)
rejects unknown versions with ErrUnsupportedFormat. There is no
in-place migration; you compact old files into new format files.
What this layout buys you
| Operation | Cost |
|---|---|
| Open file | 1 ReadAt for footer + 1 for index |
| Point lookup (miss) | 1 Bloom check |
| Point lookup (hit) | 1 block ReadAt + decompress + scan |
| Range scan | sequential block reads, no random I/O |
| Build | single pass over sorted input, one fsync at end |
Those four lines are why every LSM engine in production looks roughly the same on disk.