The trick is not “give me a consistent view” — that’s just pinning some files. The trick is “don’t let compaction delete those files while I’m reading them”.
A snapshot in an LSM engine is conceptually trivial: capture the set of “live” SSTables at a moment in time, plus a sequence number, and filter reads to that snapshot. The hard part is the interaction with the background compaction worker, which is busy producing new files and unlinking old ones.
This post is about MiniKV’s snapshot lifecycle and the “deferred deletion” pattern that makes it safe.
The race
sequenceDiagram
participant App
participant LSM
participant Compact
participant FS
App->>LSM: Snapshot()
LSM-->>App: snap (pins L0:a, L1:b)
Compact->>Compact: merge L0:a + L1:b → L1:c
Compact->>FS: unlink L0:a, L1:b
App->>LSM: Iter via snap
LSM->>FS: ReadAt(L0:a, ...)
FS-->>LSM: ENOENT
Without coordination, the iterator dereferences a file that compaction just deleted. The same race exists for point lookups and for any in-flight read.
The fix: per-SSTable atomic refcount
Every SSTable metadata struct (kv/sstable.go)
carries an atomic.Int32 refcount. The rules:
- Snapshot creation holds the LSM read lock, snapshots the live SSTable slice, and atomically increments the refcount on each pinned file.
- Compaction runs the merge under the write lock long enough to:
- install the new files into the version (refcount=0 from the outside),
- remove the input files from the version,
- atomically decrement the input refcounts.
- Decrement to zero is the only place
unlinkis allowed. If compaction’s decrement takes the count to zero, compaction unlinks. If a snapshot’sClosedoes, the snapshot unlinks.
stateDiagram-v2
[*] --> Live: created by flush/compaction
Live --> Live: snapshot ref++ / unref
Live --> Retired: compaction removes from version
Retired --> Retired: snapshots still hold refs
Retired --> [*]: refcount → 0, unlink(2)
A “retired” file is no longer reachable from the live version (so new reads can never observe it), but it still exists on disk because at least one snapshot is pinning it.
Why this is better than “lock the read path”
The naïve alternative is “compaction takes the write lock for the
duration of unlink“. This works but it serialises compaction
against every in-flight read. With a refcount:
- Reads never block compaction.
- Compaction never blocks reads.
- The only synchronised operation is the metadata swap, which is O(1).
The cost is that disk usage can temporarily exceed steady-state by
the size of all snapshot-pinned files. For long-running snapshots
this is real; the engine exposes per-snapshot age in
KV.Stats() so an operator can spot a leak.
The index cache eviction subtlety
There’s a second invariant on the same code path. The LSM keeps an index cache keyed by SSTable filename. When compaction retires a file, the cache entry must be evicted before the refcount can possibly reach zero — otherwise a stale cache entry could be the only thing keeping the file’s index in memory, and a brand-new SSTable with the same filename (because filenames cycle) could collide.
In MiniKV the order is:
1 | 1. swap version metadata under write lock |
The cache eviction has to happen before unlink for a different
reason: the cache may hold a *os.File handle, and on some
filesystems an unlink while a handle is open leaves the file alive
until the handle is closed. That’s fine on POSIX, surprising on
Windows.
Snapshot lifecycle in code
1 | snap, _ := db.Snapshot() // increments refcount on every live SSTable |
A Snapshot doesn’t pin the MemTable — it captures a sequence number
and filters by it during the read merge. That’s why opening a
snapshot is cheap (a couple of atomics) and forgetting to close one
“only” leaks SSTable files, not memory.
What this design buys you
| Property | Without refcounts | With refcounts |
|---|---|---|
| Compaction blocks reads | yes | no |
Snapshot.Get after compaction |
unsafe | safe |
| Disk usage upper bound | steady-state | steady-state + pinned |
| Per-read overhead | mutex | one atomic load |
The pattern generalises beyond LSMs: any system where readers want a stable view of a mutable set of files (immutable segments, parquet data lakes, blob-store generations) ends up with some form of refcount or generation lease. The “atomic int per file” version is the simplest one that’s actually correct.