Unit tests can’t tell you if your WAL survives a power loss. Only killing the process and restarting it can. Here is how to make that test small enough to run on every commit.
If there’s one test that justifies its own existence in a storage
engine, it’s “ack a write, kill the process with SIGKILL, restart,
read the write back”. Everything else — replication, transactions,
TTL — is layered on top of that one promise.
MiniKV’s version of that test is
tests/faultinject_test.go. This
post is about how it’s structured and why each piece is the way it
is.
The shape
sequenceDiagram
participant T as Test
participant Bin as minikv binary
participant FS as Disk
T->>Bin: spawn (-data ./data -serve :PORT)
Bin->>FS: open WAL + replay
T->>Bin: PUT /v1/kv/k1 (HTTP)
Bin->>FS: WAL append + fsync
Bin-->>T: 204 No Content ← the ack
T->>Bin: SIGKILL
T->>Bin: spawn (same data dir)
Bin->>FS: open WAL + replay
T->>Bin: GET /v1/kv/k1
Bin-->>T: 200 + value ← the contract
Four things matter:
- Spawn the real binary. Not the in-process library. A
*kv.KVinside the test process can’t be killed mid-fsync without taking the test runner with it. The real binary as a subprocess can. - Drive over HTTP. Generic, ack-bearing, easy to time.
SIGKILL, notSIGTERM.SIGTERMinvites a graceful shutdown that may finish in-flight work.SIGKILLis the test.- The contract is per-acked write. Anything that returned 204 must be readable post-restart. Anything still in flight when the signal arrived may or may not survive — that’s correct behaviour, not a bug.
The runner
1 | // tests/faultinject_test.go (sketch) |
A few details:
t.TempDir()cleans up after the test, including post-SIGKILLleftover files.-serve :0asks for a kernel-assigned port; the test reads the chosen port from the server’s first log line.buildMiniKVOnceusessync.Onceso multiple tests in the same run share onego build.
Concurrent multi-writer variant
The single-writer version proves “the WAL is durable”. The harder version is:
While ten goroutines hammer PUTs concurrently, SIGKILL at a random moment. After restart, every response that was 204 must be readable.
This catches a class of bugs the single-writer test doesn’t: lock- ordering issues that only manifest when fsync happens for batch A while batch B is still in the MemTable. MiniKV’s test runs this with N writers and asserts the same per-ack contract.
The “ack means durable” definition
The ack contract is precise:
A 204 No Content from the HTTP server means the WAL has been appended and fsynced with this write’s record, before the response was written to the socket.
If anything in the stack relaxes that — a write returning before fsync, a buffered HTTP writer flushing late, a “wait for replication” that races with the WAL — the test catches it. That’s why the test is a much stronger statement than “the engine reads what it writes in steady state”.
Why HTTP, not the Go API
The Go API (kv.KV.Put) is a function call. You can’t SIGKILL a
function call. You can os.Exit the test, but then you lose the
test runner’s ability to verify anything.
HTTP is process-to-process. Killing the server is one syscall. The test process keeps running and gets to do the assertions.
It also tests the surface a real user touches, which is the right unit of “did I keep my promise?”.
What this test does not cover
- Filesystem reorderings. A real power loss can reorder writes that fsync hasn’t covered. The test doesn’t simulate that; it just trusts the OS fsync semantics. For a real engine targeting consumer hardware, ALICE or similar is the next step.
- Disk corruption. Bit flips inside WAL records would still pass CRC32C and corrupt the MemTable on replay. We don’t fault- inject torn writes inside records, only between them.
- Replication. This is a single-node durability test. The raft failover test is a separate concern.
For an MVP storage engine, the SIGKILL test is the floor. Without it, no test result means anything. With it, every other test gets to assume the durability layer holds.
Running it
1 | go test -race -count=1 ./tests -run TestSIGKILL -v -timeout 120s |
The race flag matters: most engine bugs that only show under crash are bugs in the same write path that the race detector would flag in steady state.