anchor compare does one LLM round-trip and returns a narrative
(post 4). That’s enough for ~80% of
drift investigations. For the remaining 20% — where the engineer
wants the agent to follow a thread — there’s anchor compare --deep.
--deep swaps the single Qwen call for a function-calling ReAct loop:
    thought → tool_call → observation → thought → tool_call → observation → … → final JSON
The planner has read-only access to four tools. It’s told to prefer
depth over breadth and stop early. A hard step cap prevents runaway
loops. This post walks through
investigator.py.
The four tools
| Tool | What it wraps | Why it exists |
|---|---|---|
| `recall_similar_drifts(signals, k, min_similarity)` | `memory.recall_similar_drifts` | The default first move when a signal feels familiar |
| `get_drift_details(drift_id)` | `memory.get_drift` | After recall, read a full past record before relying on it |
| `run_spl(spl, earliest, latest)` | `splunk_client.run_search` (capped at 50 rows) | Evidence-gathering: deploy logs, host breakdowns, audit |
| `list_recent_drifts(limit, outcome)` | `memory.list_drifts` | Situational awareness when nothing recalls |
A few principles in that list:
- Every tool wraps deterministic code. The planner can’t make up SPL that we then execute blind: `run_spl` goes through the same `splunk_client.run_search` the diff engine uses, with the same `max_count=50` cap.
- Read-only. No tool mutates KV Store. The planner can’t accidentally apply feedback or delete an anchor.
- No “give the LLM Python”. Sandboxed shell tools are powerful and dangerous. Four narrow tools beat one wide one.
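The principles above can be sketched as a small dispatch table. This is an illustration under stated assumptions, not the code in investigator.py: the real `memory` and `splunk_client` modules are replaced by in-memory stand-ins, and the names `dispatch`, `TOOLS`, `fake_run_search`, the default time range, and the error strings are all hypothetical. Only the tool names, their parameters, and the 50-row cap come from the post; `recall_similar_drifts` is left out to keep the sketch short.

```python
from typing import Any, Callable

MAX_ROWS = 50  # the row cap the post describes for run_spl

# Hypothetical in-memory stand-in for the drift store.
DRIFTS = [
    {"drift_id": "d1", "outcome": "confirmed", "summary": "checkout latency"},
    {"drift_id": "d2", "outcome": "dismissed", "summary": "cron noise"},
]


def fake_run_search(spl: str, earliest: str, latest: str, max_count: int) -> list[dict]:
    """Stand-in for the real search client; always yields 200 rows before the cap."""
    rows = [{"host": f"web-{i}", "count": i} for i in range(200)]
    return rows[:max_count]


def run_spl(args: dict[str, Any]) -> dict[str, Any]:
    spl = args.get("spl")
    if not spl:
        return {"error": "missing required argument: spl"}
    # The default time range below is an assumption made for this sketch.
    rows = fake_run_search(
        spl, args.get("earliest", "-24h"), args.get("latest", "now"), max_count=MAX_ROWS
    )
    return {"rows": rows, "row_count": len(rows)}


def get_drift_details(args: dict[str, Any]) -> dict[str, Any]:
    drift_id = args.get("drift_id")
    if drift_id is None:
        return {"error": "missing required argument: drift_id"}
    for drift in DRIFTS:
        if drift["drift_id"] == drift_id:
            return dict(drift)  # return a copy so the store is never mutated
    return {"error": f"no drift with id {drift_id!r}"}


def list_recent_drifts(args: dict[str, Any]) -> dict[str, Any]:
    outcome = args.get("outcome")
    hits = [dict(d) for d in DRIFTS if outcome is None or d["outcome"] == outcome]
    return {"drifts": hits[: args.get("limit", 10)]}


# Only names listed here are callable; there is no write tool to reach.
TOOLS: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {
    "run_spl": run_spl,
    "get_drift_details": get_drift_details,
    "list_recent_drifts": list_recent_drifts,
}


def dispatch(name: str, args: dict[str, Any]) -> dict[str, Any]:
    handler = TOOLS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(args)
```

A narrow allow-list like this means an unexpected tool name or a missing argument becomes an observation the planner can read, rather than an exception.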
What the planner sees
Initial user payload (`_initial_payload`):
    {
The “initial” fields come from the regular compare that ran first.
The planner builds on that — it doesn’t redo the diff. already_recalled
tells it which past incidents the narrator already saw, so it can
choose to dig deeper into one of them or look elsewhere.
The system prompt
    You are Anchor's deep-investigation planner.
Four things worth noting:
- A numbered strategy, not free-form. This gives the model a default branching order: it deviates when warranted, but it has somewhere to start.
- Hard cap of 6 tool calls. Default: `CONFIG.investigate_max_steps`. This is the difference between “agent” and “agent loop until your Qwen bill explodes”.
- Observations capped at 8 KB. If `run_spl` returns a giant rowset, it’s truncated and the planner is told to narrow the SPL. This prevents one fat tool call from blowing the whole context window.
- “Stop early when evidence converges.” Counter-instinct for an LLM trained on “be helpful”. Without this, the planner spends all 6 calls even when it had the answer after 2.
The loop
    for step_num in range(1, max_steps + 1):
That’s the entire loop. The step_callback is what enables the live
trace in the CLI — each step prints as it lands, so the engineer sees
the planner thinking in real time. Exceptions raised by the callback
are deliberately not caught; they indicate a consumer bug, not a
planner failure.
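The shape of such a loop can be sketched as follows. This is not the body of investigator.py, which the post does not reproduce here. It is a minimal stand-in in which one planner round-trip is a plain callable (`plan`) returning either a tool request or a final answer. The `Step` and `Result` classes, the `investigate` name, and the action dictionary format are assumptions for illustration. The sketch counts planner turns against the cap, which may differ from how the real code counts.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Step:
    num: int
    tool: Optional[str]  # None on the final-answer step
    observation: Any = None


@dataclass
class Result:
    final: Optional[dict] = None
    steps: list = field(default_factory=list)
    truncated: bool = False


def investigate(
    plan: Callable[[list], dict],
    dispatch: Callable[[str, dict], Any],
    max_steps: int = 6,
    step_callback: Optional[Callable[[Step], None]] = None,
) -> Result:
    result = Result()
    history: list = []
    for step_num in range(1, max_steps + 1):
        action = plan(history)  # stand-in for one planner round-trip
        if "final" in action:
            result.final = action["final"]
            step = Step(step_num, None)
        else:
            observation = dispatch(action["tool"], action.get("args", {}))
            history.append({"tool": action["tool"], "observation": observation})
            step = Step(step_num, action["tool"], observation)
        result.steps.append(step)
        if step_callback is not None:
            step_callback(step)  # deliberately not wrapped in try/except
        if result.final is not None:
            return result
    result.truncated = True  # cap reached without a final answer
    return result


# Usage with a scripted planner: two tool calls, then a final answer.
script = iter([
    {"tool": "recall_similar_drifts", "args": {"k": 3}},
    {"tool": "run_spl", "args": {"spl": "index=deploys"}},
    {"final": {"summary": "example"}},
])
trace: list = []
result = investigate(
    lambda history: next(script),
    lambda name, args: {"tool": name},
    step_callback=trace.append,
)
```

Passing `trace.append` as the callback is the same hook a CLI could use to print each step as it lands.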
A real (abbreviated) transcript
Here’s what anchor compare --deep looks like on the demo dataset
when the engineer is investigating checkout slowness:
    step 1
The whole loop took 4 calls, well under the cap of 6. Two of those were memory recall, one was SPL evidence-gathering, and one was the final synthesis. That distribution is typical: when the diff has obvious precedents, the planner spends most of its budget on grounding rather than exploration.
Why qwen-max-latest specifically
The narrator runs on qwen-plus. The planner defaults to
qwen-max-latest (or whatever the QWEN_PLANNER_MODEL env var is set
to). The difference matters:
- `qwen-plus` is fine for one-shot JSON narration. Cheap, fast.
- `qwen-max-latest` has noticeably better function-calling discipline: it stops earlier, picks tools more accurately, and fabricates fewer SPL arguments.
The tier-up is justifiable because --deep is an opt-in command,
typically run on a single drift the engineer cares about. If you ran
it on every compare you’d burn through your Qwen budget — exactly the
reason it’s --deep and not the default.
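The model split can be sketched as a small selector. The model names and the `QWEN_PLANNER_MODEL` variable come from the post. The function name `pick_model` and the idea of passing the environment as an argument are assumptions made so the example is easy to test. Whether the narrator model is configurable in the same way is not stated, so it is hard-coded here.

```python
import os
from typing import Mapping, Optional

NARRATOR_MODEL = "qwen-plus"
DEFAULT_PLANNER_MODEL = "qwen-max-latest"


def pick_model(deep: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model for this run: the narrator tier unless --deep was requested."""
    env = os.environ if env is None else env
    if not deep:
        return NARRATOR_MODEL
    # An unset or empty variable falls back to the default planner model.
    return env.get("QWEN_PLANNER_MODEL") or DEFAULT_PLANNER_MODEL
```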
Safety / robustness details
- Argument access uses `.get()` everywhere. If the model omits a required argument, `_dispatch` returns `{"error": "..."}` instead of crashing the loop.
- `signal_embedding` is stripped from observations. The recalled drift records have an optional 1024-dim embedding. The planner can’t reason about raw float vectors and it would eat the 8 KB observation budget. Excluded explicitly.
- Truncation is loud, not silent. `_truncate` appends `…[truncated, N more chars]` so the planner knows it didn’t see the whole thing and can choose to narrow the SPL.
- The hard cap is honored. If we hit `max_steps` without a final answer, the result is returned with `truncated=True` and whatever observations we gathered. The CLI shows that flag in the rendered report.
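Two of these details, loud truncation and embedding stripping, fit in a few lines. The sketch below is a guess at the shape, not the project's `_truncate`: the function names, the choice to count the cap in characters rather than bytes, and the JSON serialization step are assumptions. The breadcrumb format and the `signal_embedding` field name are taken from the post.

```python
import json

MAX_OBSERVATION_CHARS = 8 * 1024  # the post's 8 KB cap, counted in characters here


def truncate(text: str, limit: int = MAX_OBSERVATION_CHARS) -> str:
    """Cut text to `limit` characters and say how much was dropped."""
    if len(text) <= limit:
        return text
    dropped = len(text) - limit
    return f"{text[:limit]}…[truncated, {dropped} more chars]"


def strip_embeddings(record: dict) -> dict:
    """Return a copy of the record without the raw float vector."""
    return {key: value for key, value in record.items() if key != "signal_embedding"}


def to_observation(records: list[dict]) -> str:
    """Serialize recalled records into the text the planner will read."""
    cleaned = [strip_embeddings(record) for record in records]
    return truncate(json.dumps(cleaned, ensure_ascii=False))
```

Because the breadcrumb reports the number of dropped characters, the original length stays recoverable: it is the limit plus the reported count.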
Tests for an LLM loop
Testing a function-calling agent is hard. We don’t try to test that
“Qwen picks the right tool”; that’s not a property of our code.
Instead, the tests (`tests/test_investigator.py`) fake the OpenAI client and verify the plumbing:
- Tool dispatch routes each name to the right wrapper
- `_truncate` preserves the original length in its breadcrumb
- `step_callback` fires once per step
- The hard cap is honored
- `_parse_final` survives malformed JSON
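As a rough idea of what such a plumbing test can look like, here is a self-contained sketch. It does not reproduce tests/test_investigator.py. `FakeChatClient` is a hypothetical scripted stand-in that mimics only the `client.chat.completions.create(...)` call shape of the OpenAI Python client, and `parse_final` is an assumed tolerant parser standing in for the project's `_parse_final`, whose actual return format is not shown in the post.

```python
import json
from types import SimpleNamespace


class FakeChatClient:
    """Scripted stand-in exposing client.chat.completions.create(...)."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.calls = []  # keyword arguments of every create() call
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, **kwargs):
        self.calls.append(kwargs)
        message = SimpleNamespace(content=self._replies.pop(0), tool_calls=None)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def parse_final(text):
    """Hypothetical tolerant parser: never raises on bad model output."""
    try:
        data = json.loads(text)
    except (TypeError, json.JSONDecodeError):
        return {"parse_error": True, "raw": text}
    if not isinstance(data, dict):
        return {"parse_error": True, "raw": text}
    return data


def test_parse_final_survives_malformed_json():
    client = FakeChatClient(['{"root_cause": "deploy"}', "not json {"])
    good = client.chat.completions.create(model="fake", messages=[])
    bad = client.chat.completions.create(model="fake", messages=[])
    assert parse_final(good.choices[0].message.content) == {"root_cause": "deploy"}
    assert parse_final(bad.choices[0].message.content)["parse_error"] is True
    assert len(client.calls) == 2  # one recorded call per scripted reply
```

No network and no model are involved, so a test like this checks only the code around the LLM and runs in milliseconds.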
That’s the layer worth testing. The LLM’s judgment is best tested by
running it on real fixtures and reading the output — which is exactly
what the demo script in examples/demo_script.md
does.