FengHub

The Deployment: Splunk + Qwen on Alibaba Cloud in Three Commands

Jian Feng — Fri, 19 Jun 2026 01:00:00 GMT

The first five posts were about how Anchor works. This one is about how to put it on a server that someone other than you can talk to.

The hackathon target is Alibaba Cloud: ECS for compute, OSS for durable memory backups, DashScope for the LLM calls Anchor already makes. The walkthrough lives in deploy/alibaba-cloud.md; this post explains why each piece is shaped the way it is.

The three-command path

Once the console-only prerequisites are done (more on those below), the entire ECS install is:

ssh root@
curl -fsSL https://raw.githubusercontent.com/faketut/Anchor/main/deploy/setup_ecs.sh | bash
nano /opt/anchor/.env       # fill in SPLUNK_PASSWORD, QWEN_API_KEY, OSS_* creds
bash /opt/anchor/deploy/verify_setup.sh

That’s the entire happy path. setup_ecs.sh is idempotent — safe to re-run after editing .env. verify_setup.sh is a pre-flight checker that exits non-zero on any failure, so it’s usable as a healthcheck.

What `setup_ecs.sh` actually does

The script (deploy/setup_ecs.sh) consolidates seven steps:

OS sanity check — bail if not root, bail if not Ubuntu.
apt-get install — Docker, Compose v2, git, Python venv.
git clone || git pull — checks out Anchor to /opt/anchor.
docker compose up -d with the Alibaba overlay (deploy/docker-compose.alibaba.yml) that:
- sets restart: unless-stopped so Splunk survives ECS reboots
- binds Splunk Web (8000) to localhost only — accessed via SSH tunnel, never the public internet
- leaves the mgmt API (8089) exposed for the Anchor CLI but restricted at the security-group layer to your laptop’s IP
- removes HEC (8088) entirely (not needed for the demo flow)
- caps Splunk at 2 vCPU / 4 GB so a runaway query doesn’t OOM the box
Wait-for-Splunk loop — polls https://localhost:8089/services/server/info for up to 120 seconds before continuing. First-boot init is the slowest step; without this wait, the next command fails on a fresh install.
KV Store schema install — copies splunk/collections.conf into the container, chowns it, and restarts Splunk. This is the only step that requires docker exec choreography; everything else is host-side.
Anchor venv + pip install -e '.[alibaba]' — sets up the Python environment for the nightly OSS backup cron.

The script ends with a printed checklist of the human next steps: edit .env, run the verifier, schedule the cron.
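The wait-for-Splunk step is a plain poll-until-ready loop. The real script is shell; the following is only a Python sketch of the same idea. The names `splunk_probe` and `wait_until_ready`, the 3-second interval, and the choice to treat any HTTP status (including 401) as "up" are assumptions for illustration, not details taken from setup_ecs.sh.

```python
import ssl
import time
import urllib.error
import urllib.request
from typing import Callable

INFO_URL = "https://localhost:8089/services/server/info"


def splunk_probe(url: str = INFO_URL) -> bool:
    """Return True once the mgmt port answers at all (assumed readiness rule)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # fresh Splunk ships a self-signed cert
    ctx.verify_mode = ssl.CERT_NONE
    try:
        urllib.request.urlopen(url, context=ctx, timeout=5)
        return True
    except urllib.error.HTTPError:
        return True                     # server responded, just not with 2xx
    except OSError:
        return False                    # refused, timed out, or TLS not ready


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 120.0,
                     interval_s: float = 3.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing `sleep` and `clock` in as parameters keeps the loop testable without waiting two real minutes; on the box itself you would call `wait_until_ready(splunk_probe)`.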

What `verify_setup.sh` checks

The verifier (deploy/verify_setup.sh) runs six independent checks and reports each as PASS / FAIL / SKIP:

Check	What FAIL means
`.env` exists	You haven’t filled in credentials yet
Required env vars set	`SPLUNK_PASSWORD` / `QWEN_API_KEY` placeholders still present
Splunk mgmt API reachable	Container not running, or security group blocking
All 3 KV collections present	`collections.conf` install failed; re-run `setup_ecs.sh`
OSS bucket reachable (optional)	AK pair wrong, or wrong endpoint/bucket
`anchor list` succeeds	End-to-end smoke test — the CLI ↔ KV path works

It exits non-zero on any FAIL so you can chain it: e.g. bash deploy/verify_setup.sh && systemctl restart anchor-cron.

The “optional” tag on OSS is deliberate. You can run Anchor without the OSS backup; you just lose the durability guarantee. The verifier says SKIP, not FAIL, when OSS_* env vars aren’t set.
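The PASS / FAIL / SKIP contract is small enough to sketch. The verifier itself is a shell script; this Python version only illustrates the three-state result and the exit code, and `run_checks` and `oss_check` are hypothetical names. The OSS check here only looks at environment variables as a stand-in for a real bucket probe.

```python
import os
from typing import Callable, Mapping, Optional

# A check returns True (PASS), False (FAIL) or None (SKIP).
Check = Callable[[], Optional[bool]]

OSS_VARS = ("OSS_ACCESS_KEY_ID", "OSS_ACCESS_KEY_SECRET",
            "OSS_ENDPOINT", "OSS_BUCKET")


def oss_check(env: Mapping[str, str] = os.environ) -> Optional[bool]:
    """Optional check: SKIP when no OSS_* variable is configured at all."""
    if not any(env.get(name) for name in OSS_VARS):
        return None
    # Partially configured counts as FAIL; a real check would probe the bucket.
    return all(env.get(name) for name in OSS_VARS)


def run_checks(checks: Mapping[str, Check]) -> int:
    """Run every check, print one line each, return a process exit code."""
    failures = 0
    for name, check in checks.items():
        try:
            result = check()
        except Exception:
            result = False              # a crashing check counts as FAIL
        if result is None:
            label = "SKIP"
        elif result:
            label = "PASS"
        else:
            label = "FAIL"
            failures += 1
        print(f"[{label}] {name}")
    return 1 if failures else 0
```

Because SKIP never increments `failures`, an unconfigured OSS backup leaves the exit code at zero, which is what makes the `&&` chaining above safe on installs without OSS.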

What only humans can do

There are three things the script can’t automate, because they require the Alibaba Cloud console:

Step	Why manual
Provision the ECS instance	Account-scoped action, billing implications
Open security-group ports 22 + 8089 to your laptop IP	Requires knowing your client IP — different for every dev
Create the OSS bucket + RAM user with `AliyunOSSFullAccess`	Account-scoped; RAM user creation needs human review

The walkthrough in deploy/alibaba-cloud.md specifies exact values:

ecs.g7.large (2 vCPU, 8 GB), Singapore or Hong Kong
Ubuntu 24.04 LTS, 60 GB ESSD
Security group: TCP 22 + TCP 8089 from your laptop’s /32, that’s it
OSS bucket: private ACL, versioning enabled so backups are immutable

The Singapore / Hong Kong region choice matters: DashScope’s international endpoint has the lowest latency from those regions, and keeps the ECS-to-Qwen call sub-100 ms.

OSS as the durability layer

KV Store data lives on the ECS instance disk. ECS disks are durable (triple-replicated), but the blast radius is one instance. If you accidentally run docker compose down -v (the -v removes volumes), your anchors and drift history are gone.

deploy/backup_kv_to_oss.py solves that with a 60-line script:

def dump_kv() -> dict:
    """Snapshot all three collections to a single JSON dict."""
    svc = connect()
    return {
        "anchors":        list(svc.kvstore["anchors"].data.query()),
        "drift_history":  list(svc.kvstore["drift_history"].data.query()),
        "signal_weights": list(svc.kvstore["signal_weights"].data.query()),
        "snapshot_at":    datetime.now(timezone.utc).isoformat(),
    }

def upload_to_oss(payload: dict) -> str:
    auth   = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"],
                       os.environ["OSS_ACCESS_KEY_SECRET"])
    bucket = oss2.Bucket(auth, os.environ["OSS_ENDPOINT"],
                                os.environ["OSS_BUCKET"])
    key = f"anchor-memory/{datetime.now(timezone.utc).isoformat()}.json"
    bucket.put_object(key, json.dumps(payload).encode("utf-8"),
                      headers={"x-oss-server-side-encryption": "AES256"})
    return key

The wiring: oss2.Auth → oss2.Bucket → put_object with server-side AES256 encryption. That’s the Alibaba Cloud API usage the hackathon rules want as proof — a single file that imports the Alibaba SDK, authenticates with RAM credentials, and pushes data into the platform.

Scheduled via cron:

0 3 * * * cd /opt/anchor && set -a && . ./.env && set +a && .venv/bin/python deploy/backup_kv_to_oss.py >> /var/log/anchor-backup.log 2>&1

With bucket versioning enabled, every nightly run is preserved. Restore is ossutil cp oss://... ./ plus a small replay script that reads the JSON and writes each row back via kv_insert. The repo doesn’t ship the restore script because nobody needs it for the demo; it’s ~30 lines of Python when someone does.
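For a sense of scale, a replay script along those lines might look like the sketch below. It is not the repo's code: `replay_snapshot` is a hypothetical name, and the `insert(collection, row)` callable stands in for kv_insert, whose real signature isn't shown in this post.

```python
import json
from typing import Callable, Dict, Iterable

COLLECTIONS = ("anchors", "drift_history", "signal_weights")


def replay_snapshot(snapshot_json: str,
                    insert: Callable[[str, dict], None],
                    collections: Iterable[str] = COLLECTIONS) -> Dict[str, int]:
    """Write every row of a backup snapshot back through `insert`.

    Returns the number of rows replayed per collection. Keys that are not
    collections (such as snapshot_at) are ignored.
    """
    snapshot = json.loads(snapshot_json)
    restored: Dict[str, int] = {}
    for name in collections:
        rows = snapshot.get(name, [])
        for row in rows:
            insert(name, row)           # assumed helper: one KV write per row
        restored[name] = len(rows)
    return restored
```

Usage would be roughly `replay_snapshot(open(path).read(), kv_insert)` after downloading the chosen snapshot, assuming kv_insert accepts a collection name and a row.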

The three Qwen Cloud surfaces, on the same backend

The ECS install gets you the CLI. The MCP server and Qwen Custom Skill share the same Splunk backend — they’re just different transports:

Surface	How to bring it up
CLI	already installed by `setup_ecs.sh`
MCP server (stdio)	`pip install -e '.[mcp]' && anchor-mcp` — plug into Claude Desktop or Cursor
Custom Skill (HTTP)	`pip install -e '.[skill]' && uvicorn anchor.skill_server:app` — register `deploy/qwen_skill/anchor-skill.yaml` in Qwen Cloud → Application Center

That’s by design. The “application” is the SPL + KV layer (posts 2 and 3); the surfaces are interchangeable. Adding a Slack bot, a Discord bot, or a PagerDuty webhook is the same pattern: thin transport, call into agent.compare(), render the CompareResult.

What’s deliberately not in the demo deploy

A short list of things that would be on the checklist for a real production deploy but were intentionally cut for the hackathon:

Skipped	Why
TLS on the mgmt API (Caddy / Let’s Encrypt)	`SPLUNK_VERIFY_SSL=false` is fine for a 30-second judge demo. Documented in `alibaba-cloud.md` as a real-deploy requirement.
systemd unit for the cron	Crontab is fine for a once-a-day backup. Adding systemd adds a unit file with zero behavioral change.
RAM policy scoped to the single bucket	`AliyunOSSFullAccess` is broader than necessary. Real deploy should scope to `oss:PutObject` on `anchor-memory/*`.
Multi-AZ Splunk replication	Single ECS instance is fine for a demo. Splunk SHC is several weeks of work.
Skill-server behind a reverse proxy	The skill server has bearer-token auth via `secrets.compare_digest`, which is the right primitive — but it’s running on `0.0.0.0:8000`. For real use, front it with Caddy + TLS.

The principle: ship the simplest thing that demonstrates the capability. Document the production gaps honestly.
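The bearer-token primitive mentioned in the table is worth seeing once. This is a minimal sketch of a constant-time header check built on `secrets.compare_digest`; the function name `is_authorized` and the header parsing are assumptions, not the skill server's actual code.

```python
import secrets
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Constant-time check of an `Authorization: Bearer <token>` header."""
    if not authorization_header or not expected_token:
        return False                    # an unset token must never match
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return secrets.compare_digest(presented.strip().encode("utf-8"),
                                  expected_token.encode("utf-8"))
```

A plain `==` would also return the right answer; the point of `compare_digest` is that its running time does not depend on where the first mismatching byte sits.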

Validating the proof

The hackathon submission asks for two things:

A URL to a code file demonstrating Alibaba Cloud API usage. That file is deploy/backup_kv_to_oss.py. 60 lines, imports oss2, authenticates with RAM AK, uploads with server-side encryption. Direct mapping from the rules to the code.
Evidence the backend runs on Alibaba Cloud. The 30-second demo video walks: Alibaba Cloud console showing the ECS instance → SSH into it → docker ps showing Splunk → curl https://localhost:8089/services/server/info returning a 200 → OSS console showing the anchor-memory/*.json objects → laptop-side anchor compare against the ECS public IP.

That second list is the checklist at the bottom of deploy/alibaba-cloud.md.

Wrapping the series

Six posts in:

Post 1 — the on-call problem and the three-memory framing.
Post 2 — five SPL queries → one KV row.
Post 3 — the diff engine and decay toward 1.0.
Post 4 — the LLM only narrates.
Post 5 — the optional ReAct planner.
Post 6 — the deploy you’re reading.

The common thread across all six: most of the agent’s value lives in the deterministic core; the LLM is an edge component. That’s why ranking, recall, decay, and the planner’s tool restrictions all matter more than the prompt itself.

If you’d build this differently — or you’ve shipped something similar and want to compare notes — open an issue on faketut/Anchor.

The Planner: Function-Calling for SRE Drill-Down

Jian Feng — Thu, 18 Jun 2026 01:00:00 GMT

anchor compare does one LLM round-trip and returns a narrative (post 4). That’s enough for ~80% of drift investigations. For the remaining 20% — where the engineer wants the agent to follow a thread — there’s anchor compare --deep.

--deep swaps the single Qwen call for a function-calling ReAct loop:

thought → tool_call → observation → thought → tool_call → observation → … → final JSON

The planner has read-only access to four tools. It’s told to prefer depth over breadth and stop early. A hard step cap prevents runaway loops. This post walks through investigator.py.

The four tools

Tool	What it wraps	Why it exists
`recall_similar_drifts(signals, k, min_similarity)`	`memory.recall_similar_drifts`	The default first move when a signal feels familiar
`get_drift_details(drift_id)`	`memory.get_drift`	After recall, read a full past record before relying on it
`run_spl(spl, earliest, latest)`	`splunk_client.run_search` (capped at 50 rows)	Evidence-gathering: deploy logs, host breakdowns, audit
`list_recent_drifts(limit, outcome)`	`memory.list_drifts`	Situational awareness when nothing recalls

A few principles in that list:

Every tool wraps deterministic code. The planner can’t make up SPL that we then execute blind — run_spl goes through the same splunk_client.run_search the diff engine uses, with the same max_count=50 cap.
Read-only. No tool mutates KV Store. The planner can’t accidentally apply feedback or delete an anchor.
No “give the LLM Python”. Sandboxed shell tools are powerful and dangerous. Four narrow tools beat one wide one.

What the planner sees

Initial user payload (_initial_payload):

{
  "planner_version": 1,
  "anchor_name": "Healthy Week",
  "compare_window": {"start": "...", "end": "..."},
  "initial_summary": "p95 latency tripled and a new PaymentGateway template appeared...",
  "initial_hypothesis": "downstream payment-svc degradation",
  "top_diffs": [
    {"signal": "...", "severity": "HIGH", "delta_pct": 299.4, "note": "..."}
    // up to 10
  ],
  "already_recalled": [
    {"id": "7db2d8aa", "outcome": "resolved", "similarity": 0.71}
  ]
}

The “initial” fields come from the regular compare that ran first. The planner builds on that — it doesn’t redo the diff. already_recalled tells it which past incidents the narrator already saw, so it can choose to dig deeper into one of them or look elsewhere.

The system prompt

You are Anchor's deep-investigation planner.

You receive an initial CompareResult: anchor name + top diffs + an
initial narration. Your job is to deepen the investigation using the
tools provided, then return a tighter root-cause hypothesis with an
evidence chain.

Strategy:
1. If diffs contain a new template or a metric spike, call
   recall_similar_drifts on those signals to find precedents.
2. If a precedent has outcome=resolved with a confirmed_reason, call
   get_drift_details to read its full record before relying on it.
3. If you suspect a deploy/config change, call run_spl against relevant
   indexes (e.g. deploy_log, config_change, audit) within the compare
   window.
4. Stop and finalize as soon as you have a defensible hypothesis. You
   have up to 6 tool calls — prefer depth over breadth and stop early
   when evidence converges.

Tool observations are clipped at ~8 KB; if you need more, narrow the SPL.

Four things worth noting:

A numbered strategy, not free-form — gives the model a default branching order. It deviates when warranted, but it has somewhere to start.
Hard cap of 6 tool calls. Default CONFIG.investigate_max_steps. This is the difference between “agent” and “agent loop until your Qwen bill explodes”.
Observations capped at 8 KB. If run_spl returns a giant rowset, it’s truncated and the planner is told to narrow the SPL. This prevents one fat tool call from blowing the whole context window.
“Stop early when evidence converges.” Counter-instinct for an LLM trained on “be helpful”. Without this, the planner spends all 6 calls even when it had the answer after 2.

The loop

for step_num in range(1, max_steps + 1):
    rsp = client.chat.completions.create(
        model=model, messages=messages,
        tools=TOOLS, tool_choice="auto", temperature=0.1,
    )
    msg = rsp.choices[0].message
    messages.append(_serialize_assistant(msg))

    tool_calls = getattr(msg, "tool_calls", None)
    if not tool_calls:
        return _parse_final(msg.content or "", steps, truncated=False)

    # dispatch each tool call, append observation as a `tool` message
    for tc in tool_calls:
        observation = _dispatch(tc.function.name, json.loads(tc.function.arguments))
        messages.append({
            "role": "tool", "tool_call_id": tc.id,
            "name": tc.function.name, "content": _truncate(observation),
        })
        steps.append(InvestigationStep(...))
        if step_callback:
            step_callback(steps[-1])

That’s the entire loop. The step_callback is what enables the live trace in the CLI — each step prints as it lands, so the engineer sees the planner thinking in real time. Exceptions raised by the callback are deliberately not caught; they indicate a consumer bug, not a planner failure.

A real (abbreviated) transcript

Here’s what anchor compare --deep looks like on the demo dataset when the engineer is investigating checkout slowness:

step 1
  thought:   "p95 latency tripled and a new PaymentGatewayTimeout
              template appeared. Both look related to the payment
              path. Let me recall past drifts with overlapping signals."
  tool:      recall_similar_drifts(
               signals=["template:appeared:PaymentGatewayTimeout #4a",
                        "metric:latency_ms:p95"], k=5)
  obs:       [{"id":"7db2d8aa","outcome":"resolved","similarity":0.71,
               "confirmed_reason":"payment-svc 2.4.1 timeout regression"}]

step 2
  thought:   "High overlap with a resolved deploy regression. Reading
              the full record to see what evidence they used."
  tool:      get_drift_details(drift_id="7db2d8aa")
  obs:       {"top_diffs":[...], "suggested_spl":
              "index=deploy_log service=payment-svc"}

step 3
  thought:   "If this is the same class of incident I should see a
              deploy event for payment-svc in the compare window."
  tool:      run_spl(spl="index=deploy_log service=payment-svc",
                     earliest="2026-06-02T00:00:00",
                     latest="2026-06-03T00:00:00")
  obs:       {"row_count": 1, "rows":[{"_time":"2026-06-02T17:42Z",
              "service":"payment-svc","version":"2.4.7","actor":"ci"}]}

step 4  (final, no tool_calls)
  {
    "summary": "p95 latency tripled at 17:42, coincident with payment-svc
                deploy 2.4.7. New PaymentGatewayTimeout template matches
                resolved incident 7db2d8aa (payment-svc 2.4.1 timeout
                regression).",
    "hypothesis": "deploy regression in payment-svc 2.4.7",
    "evidence": [
      "recall: incident 7db2d8aa had Jaccard 0.71 on payment timeout signals",
      "deploy_log: payment-svc 2.4.7 deployed at 2026-06-02T17:42Z within
       compare window"
    ],
    "confidence": 0.8
  }

The whole loop took 4 calls, well under the 6 cap. Two of those were memory recall, one was SPL evidence-gathering, one was the final synthesis. That distribution is typical — when the diff has obvious precedents, the planner spends most of its budget on grounding rather than exploration.

Why `qwen-max-latest` specifically

The narrator runs on qwen-plus. The planner defaults to qwen-max-latest (or whatever the QWEN_PLANNER_MODEL env var is set to). The difference matters:

qwen-plus is fine for one-shot JSON narration. Cheap, fast.
qwen-max-latest has noticeably better function-calling discipline — it stops earlier, picks tools more accurately, and fabricates fewer SPL arguments.

The tier-up is justifiable because --deep is an opt-in command, typically run on a single drift the engineer cares about. If you ran it on every compare you’d burn through your Qwen budget — exactly the reason it’s --deep and not the default.

Safety / robustness details

Argument access uses .get() everywhere. A model that hallucinates a missing required argument returns {"error": "..."} from _dispatch instead of crashing the loop.
signal_embedding is stripped from observations. The recalled drift records have an optional 1024-dim embedding. The planner can’t reason about raw float vectors and it would eat the 8 KB observation budget. Excluded explicitly.
Truncation is loud, not silent. _truncate appends …[truncated, N more chars] so the planner knows it didn’t see the whole thing and can choose to narrow the SPL.
The hard cap is honored — if we hit max_steps without a final answer, the result is returned with truncated=True and whatever observations we gathered. The CLI shows that flag in the rendered report.
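Three of those details (errors as observations, stripping the embedding, loud truncation) fit in a short sketch. This is an illustration under stated assumptions, not investigator.py: the function names, the 8,000-character limit (the post only says "~8 KB"), and the fake drift record are all hypothetical.

```python
import json
from typing import Any, Callable, Dict

MAX_OBS_CHARS = 8_000   # assumption: the post only says "~8 KB"


def truncate(text: str, limit: int = MAX_OBS_CHARS) -> str:
    """Clip an observation and say how much was dropped."""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}…[truncated, {len(text) - limit} more chars]"


def dispatch(tools: Dict[str, Callable[[dict], Any]], name: str, raw_args: str) -> str:
    """Run one tool call; every failure becomes an {"error": ...} observation."""
    try:
        args = json.loads(raw_args or "{}")
        if not isinstance(args, dict):
            raise ValueError("arguments must be a JSON object")
        tool = tools.get(name)
        if tool is None:
            raise ValueError(f"unknown tool {name!r}")
        result = tool(args)
    except Exception as exc:
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return truncate(json.dumps(result, default=str))


def get_drift_details(args: dict) -> dict:
    """Toy tool wrapper: .get() access, and no raw vectors in the output."""
    drift_id = args.get("drift_id")
    if not drift_id:
        return {"error": "drift_id is required"}
    record = {"id": drift_id, "outcome": "resolved",
              "signal_embedding": [0.0] * 1024}     # fake record for the demo
    record.pop("signal_embedding", None)            # would eat the budget
    return record
```

The useful property is that `dispatch` always returns a string the loop can append as a tool message, so a hallucinated tool name or malformed argument JSON costs the planner one step instead of crashing the run.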

Tests for an LLM loop

Testing a function-calling agent is hard. We don’t try to test that “Qwen picks the right tool”; that’s not a property of our code. Instead, the tests (tests/test_investigator.py) fake the OpenAI client and verify the plumbing:

Tool dispatch routes each name to the right wrapper
_truncate preserves the original length in its breadcrumb
step_callback fires once per step
The hard cap is honored
_parse_final survives malformed JSON

That’s the layer worth testing. The LLM’s judgment is best tested by running it on real fixtures and reading the output — which is exactly what the demo script in examples/demo_script.md does.
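Faking the client for that kind of plumbing test takes very little code. The sketch below is not taken from tests/test_investigator.py; `FakeClient`, `run_loop`, and the helper constructors are hypothetical, and `run_loop` is a deliberately tiny stand-in for the planner loop so the example stays self-contained.

```python
from types import SimpleNamespace


def make_message(content=None, tool_calls=None):
    """Shape-compatible stand-in for an assistant message."""
    return SimpleNamespace(content=content, tool_calls=tool_calls)


def make_tool_call(call_id, name, arguments):
    return SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(name=name, arguments=arguments),
    )


class FakeClient:
    """Mimics client.chat.completions.create() by replaying scripted messages."""

    def __init__(self, scripted):
        self._scripted = list(scripted)
        self.requests = []
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create))

    def _create(self, **kwargs):
        self.requests.append(kwargs)
        # The last scripted message repeats forever, so cap tests never run dry.
        if len(self._scripted) > 1:
            msg = self._scripted.pop(0)
        else:
            msg = self._scripted[0]
        return SimpleNamespace(choices=[SimpleNamespace(message=msg)])


def run_loop(client, max_steps, on_step=None):
    """Tiny stand-in for the planner loop: returns (final_text, truncated)."""
    for step in range(1, max_steps + 1):
        rsp = client.chat.completions.create(model="fake", messages=[])
        msg = rsp.choices[0].message
        if not msg.tool_calls:
            return msg.content, False
        if on_step:
            on_step(step)
    return None, True
```

A client scripted with a single tool-calling message never finishes on its own, which is exactly the fixture needed to check that the cap is honored and that the callback fires once per step.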

The Narrator: Putting the LLM Only at the Edge

Jian Feng — Wed, 17 Jun 2026 01:00:00 GMT

By the time Qwen sees a request, the hard work is already done. The diff engine (post 3) has ranked the top diffs. The recall system has fetched the top-3 most similar past incidents. The LLM’s job is narration: turn structured rows into a 2-4 sentence summary, a hypothesis, and one drill-in SPL query.

This post walks through narrator.py and the design choices that keep it cheap, reproducible, and easy to audit.

The full prompt, verbatim

SYSTEM_PROMPT = """You are Anchor, an observability assistant for Splunk.
You are given a set of statistical diffs between a HEALTHY baseline window
(the "anchor") and a CURRENT window being investigated. You may also be
given PAST_INCIDENTS — previously-investigated drifts with confirmed
outcomes whose signals overlap with the current one.

Your job:
1. Write a 2-4 sentence SUMMARY in plain English describing what changed.
   Lead with the highest-severity diffs. Quantify deltas.
2. Propose a single best HYPOTHESIS for the likely cause class
   (e.g. "downstream service degradation", "new error class", "traffic shift",
    "deploy regression"). If a PAST_INCIDENT with outcome=resolved has high
   signal overlap, you SHOULD reference it (by its short id) and lean on its
   confirmed_reason. If the past incident was a false_positive, downweight
   your concern accordingly.
3. Suggest one DRILL_IN SPL query the engineer should run next to confirm.

Be concise. Do NOT invent diffs not in the input. Do NOT claim root cause
with certainty — use words like "likely", "suggests", "consistent with".

Respond as a JSON object with exactly these keys:
  summary (string), hypothesis (string or null), drill_in_spl (string or null).
"""

A few things deliberately not in this prompt:

No examples / few-shot. The output schema is strict JSON; examples bloat the prompt without changing quality.
No “think step by step”. The deterministic core already did the thinking. We want narration, not chain-of-thought.
No persona (“You are an expert SRE…”). The role is system; that’s the persona. Verbose personas pull the model toward filler.
No claim of certainty. The “use words like ‘likely’, ‘suggests’” instruction is the cheapest hallucination-mitigation we have.

What the model sees as input

The user message is JSON, not prose (_payload()):

{
  "prompt_version": 2,
  "anchor_name": "Healthy Week",
  "diffs": [
    {
      "signal": "template:appeared:PaymentGatewayTimeout #4a",
      "kind": "template", "severity": "HIGH",
      "anchor_val": 0.0, "current_val": 148,
      "delta_pct": null,
      "note": "new pattern (_json): timeout calling stripe.payment.charge"
    },
    {
      "signal": "metric:latency_ms:p95",
      "kind": "metric", "severity": "HIGH",
      "anchor_val": 312.4, "current_val": 1247.8,
      "delta_pct": 299.4, "note": ""
    }
    // up to 15 diffs
  ],
  "past_incidents": [
    {
      "id": "7db2d8aa",
      "when": "2026-04-12T19:03Z",
      "outcome": "resolved",
      "confirmed_reason": "payment-svc 2.4.1 timeout regression, rolled back",
      "signal_overlap": 0.71,
      "signals": ["template:appeared:PaymentGatewayTimeout #4a", "metric:latency_ms:p95"]
    }
    // up to 3 past incidents
  ],
  "focus": "checkout slowness"
}

Three small choices worth flagging:

prompt_version: 2 in the payload. When the prompt or schema changes, the version bumps. Drift records store this implicitly via the response shape, so audits can reproduce “which prompt produced this hypothesis?”.
anchor_val / current_val are raw numbers, not formatted strings. Lets the model quantify deltas without us pre-baking “3.0×” prose.
past_incidents is bounded at 3. Not 10, not “all relevant”. The Track-1 requirement is recalling critical memories within limited context. Three is enough for grounding without crowding out the diffs.
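Those three choices translate into a small payload builder. This is a sketch rather than the repo's `_payload()`: the name `build_payload`, the assumption that diffs arrive already ranked, and the re-sort of past incidents by signal_overlap are all illustrative.

```python
import json
from typing import List, Optional

PROMPT_VERSION = 2
MAX_DIFFS = 15              # "up to 15 diffs" in the payload shown earlier
MAX_PAST_INCIDENTS = 3      # "up to 3 past incidents"


def build_payload(diffs: List[dict], past_incidents: List[dict],
                  anchor_name: str, focus: Optional[str] = None) -> str:
    """Serialize a bounded, versioned user message for the narrator."""
    # Assumption: keep the incidents with the highest overlap when over budget.
    closest = sorted(past_incidents,
                     key=lambda p: p.get("signal_overlap", 0.0),
                     reverse=True)
    payload = {
        "prompt_version": PROMPT_VERSION,
        "anchor_name": anchor_name,
        "diffs": list(diffs[:MAX_DIFFS]),       # assumed already ranked
        "past_incidents": closest[:MAX_PAST_INCIDENTS],
        "focus": focus,
    }
    return json.dumps(payload)                  # numbers stay numbers
```

Because both lists are sliced before serialization, the prompt size has a fixed ceiling no matter how noisy the window or how long the drift history is.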

What the model has to return

rsp = client.chat.completions.create(
    model=model,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": _payload(diffs, focus, anchor_name, past_incidents)},
    ],
    temperature=0.2,
)

The combination of response_format={"type": "json_object"} and temperature=0.2 is the whole reliability story:

JSON mode means Qwen returns syntactically valid JSON every time. No retry loop, no markdown-fence stripping, no regex extraction.
Low temperature keeps the narration boring in a good way. The same diffs produce the same summary across runs. SREs are not looking for creative writing.

The parsing on the other side is correspondingly mundane:

data = json.loads(raw)
return NarratorResponse(
    summary=(data.get("summary") or "").strip() or "(empty)",  # tolerate null
    hypothesis=(data.get("hypothesis") or None),
    drill_in_spl=(data.get("drill_in_spl") or None),
)

No data["summary"] — every key uses .get(..., default). If Qwen gets weird, we degrade to a sensible empty value instead of throwing.
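A fully defensive version of that parse step could look like the sketch below. It assumes the fallback should also cover unparseable or non-object JSON, which the snippet above does not show; `parse_narration` is a hypothetical name and the dataclass is only a stand-in for the repo's NarratorResponse.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class NarratorResponse:         # stand-in for the repo's model of the same name
    summary: str
    hypothesis: Optional[str] = None
    drill_in_spl: Optional[str] = None


def parse_narration(raw: str) -> NarratorResponse:
    """Parse the model output, degrading to empty values instead of raising."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}

    def text(key: str) -> Optional[str]:
        value = data.get(key)
        if not isinstance(value, str):
            return None                 # null, numbers, lists: treat as absent
        return value.strip() or None

    return NarratorResponse(
        summary=text("summary") or "(empty)",
        hypothesis=text("hypothesis"),
        drill_in_spl=text("drill_in_spl"),
    )
```

With this shape, the worst case for the engineer is a report whose narration reads "(empty)" next to intact structured diffs.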

Provider abstraction (the small one)

Qwen and Gemini both expose OpenAI-compatible chat completions endpoints. So Anchor’s “multi-provider” support is one function parameterized over base URL, API key, and model name:

def _openai_compat_narrate(diffs, focus, anchor_name, *,
                           api_key, base_url, model, ...):
    from openai import OpenAI
    client = OpenAI(api_key=api_key, base_url=base_url, timeout=LLM_TIMEOUT_S)
    ...

narrate() is a five-line switch picking which _openai_compat_narrate to call. There used to be a third branch for a hypothetical Splunk-hosted model; it was dead code and got deleted in code review. The principle: if no path through your code is exercised, the code is wrong.

Where the LLM is in the bigger picture

Looking at the system overview from the project README, the LLM sits at the edge of the data pipeline, never in the middle:

Splunk → fingerprint → diff (weighted) → recall (Jaccard or cosine) → Qwen → user
                       ^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^
                       deterministic     deterministic                  narration

That layering buys us four things, none of which a pure-LLM agent gets:

Property	Why we get it
Reproducibility	Same window → same top diffs → same prompt input
Bounded cost	One LLM call per compare, fixed-size payload
Auditability	`drift_history` stores both the structured diffs and the prose; if Qwen was wrong, the structured data is still there
Graceful degradation	If Qwen is down, you still see the ranked diffs in the rendered report; the narration just says “(empty)”

The same principle applies to the optional --deep planner in post 5: the model only gets to call tools that wrap deterministic code. It never gets to make up SPL that we then execute blind.

What we deliberately don’t do

No streaming. The CLI waits for the full JSON response. Streaming partial JSON is a parsing headache and the time saved is dwarfed by the SPL queries that ran before the LLM call anyway.
No re-ranking by the model. We send the top-15 already ranked. We don’t ask the model to re-rank; we ask it to narrate the existing ranking. The diff engine is the source of truth, not Qwen.
No tool calls in the basic narrator. That’s the planner’s job (post 5). Keeping the basic narrator tool-free means anchor compare is always one LLM round-trip and the latency is predictable.
No retries on JSON parse failure. With JSON mode + temperature 0.2 this hasn’t happened in months of testing. If it ever does, the fallback returns (empty) and the engineer sees the structured diffs. Better than a hidden retry loop adding latency.

The cost shape

For a normal anchor compare on the demo dataset:

Component	Approximate cost
5 SPL queries	~250 ms total
Diff engine (pure Python)	< 10 ms
Recall (Jaccard over ~500 rows)	< 50 ms
One Qwen `qwen-plus` call	~1.5-3 s
KV write of new drift record	~30 ms

The LLM is the dominant tail. Everything else is well below human perception. If you wanted to speed Anchor up, you’d move from qwen-plus to qwen-turbo — not refactor the pipeline.

The Diff: Ranking Severity by What We've Learned Matters

Jian Feng — Tue, 16 Jun 2026 01:00:00 GMT

The diff engine (diff.py) is the most boring file in the repo on purpose. It’s ~250 lines of pure functions with zero LLM calls, zero network I/O, and zero hidden state. Given an anchor fingerprint and a current one, it returns a ranked list of DiffEntry rows. That’s it.

The interesting part is what gets multiplied on top of those rows right before ranking.

Three diffs in, ranked list out

def diff_all(anchor, current, weights=None, *, limit=20):
    weights = weights or {}
    entries = (
        volume_diff(anchor, current)
        + template_diff(anchor, current)
        + metric_diff(anchor, current)
    )
    def rank(e):
        base = SEV_ORDER[e.severity]
        w = weights.get(e.signal, SignalWeight(signal_name=e.signal)).weight
        mag = abs(e.delta_pct or 0.0) / 100.0
        return base * w + mag * 0.01
    entries.sort(key=rank, reverse=True)
    return entries[:limit]

That base * w + mag * 0.01 is the entire learned-ranking story:

base is a 1 / 2 / 3 score from LOW / MEDIUM / HIGH.
w is the learned weight for this signal (default 1.0, floor 0.1, cap 3.0).
mag is a tiny tiebreaker so a 500% change ranks above a 51% change at the same severity tier.

The result: if template:payment_4xx_upstream has been confirmed five times in the last quarter, its w is around 1.5. When it shows up again at MEDIUM severity it scores about 2 × 1.5 = 3.0, level with a HIGH-severity volume:foo change at w = 1.0 (3 × 1.0 = 3.0), so the magnitude tiebreaker decides between them; any weight above 1.5 puts it ahead on the main term. That’s the system telling you: “You’ve cared about this before. Look here first.”
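The arithmetic is easy to check by hand. The helper below restates the `rank` closure from `diff_all` as a standalone function (the name `rank_score` is made up for this example) so a few cases can be compared directly.

```python
from typing import Optional

SEV_ORDER = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


def rank_score(severity: str, weight: float = 1.0,
               delta_pct: Optional[float] = None) -> float:
    """base * w + mag * 0.01, with a missing delta counted as zero magnitude."""
    mag = abs(delta_pct or 0.0) / 100.0
    return SEV_ORDER[severity] * weight + mag * 0.01


# Same tier, same weight: magnitude alone breaks the tie.
big = rank_score("HIGH", 1.0, 500.0)      # 3 + 0.05
small = rank_score("HIGH", 1.0, 51.0)     # 3 + 0.0051

# A well-confirmed MEDIUM signal can outrank an unweighted HIGH one.
learned = rank_score("MEDIUM", 2.0, 80.0)  # 4 + 0.008
```

A MEDIUM signal needs a weight above 1.5 to beat an unweighted HIGH one on the main term; at exactly 1.5 both sit at 3.0 and the magnitude term decides.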

The three classes of diff

Volume diff

Per-sourcetype event counts. Two interesting edge cases:

if a == 0 and c == 0:
    continue                        # both zero — not interesting
delta = _pct_change(a, c)
if delta is None:                   # anchor was 0, current > 0
    out.append(DiffEntry(..., delta_pct=None, severity="HIGH",
                         note="new sourcetype"))

The “anchor was 0, current is positive” case used to return some magic percent. That’s been wrong since the first review — there’s no honest percent change from zero. The fix:

Return None for delta_pct and have the renderer print new instead of a fabricated number.

The diff engine’s only job is to surface signal; lying about divisions-by-zero adds noise.
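The helper `_pct_change` is referenced but not shown in this post. A minimal version consistent with the behavior described (None when the anchor is zero) might look like this; both function names and the rendering format are assumptions.

```python
from typing import Optional


def pct_change(anchor: float, current: float) -> Optional[float]:
    """Percent change from anchor to current, or None when the anchor is 0."""
    if anchor == 0:
        return None             # no honest percentage exists from a zero base
    return (current - anchor) / abs(anchor) * 100.0


def render_delta(delta_pct: Optional[float]) -> str:
    """What a report column might print for a delta."""
    return "new" if delta_pct is None else f"{delta_pct:+.1f}%"
```

Applied to the latency numbers used elsewhere in this series, `pct_change(312.4, 1247.8)` comes out at roughly 299.4, matching the delta_pct shown in the narrator payload.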

Template diff

Three sub-cases against the log_patterns list from post 2:

Set operation	Signal name	Severity
in current, not in anchor	`template:appeared:`	HIGH if `count > 10`, else MEDIUM
in anchor, not in current	`template:disappeared:`	MEDIUM
in both, frequency shifted ≥ 50%	`template:shifted:`	derived from delta

The part of the signal name after the last colon is a stable id. It’s the first 32 chars of the template plus a 6-char MD5 suffix:

def _short(template: str, n: int = 32) -> str:
    suffix = hashlib.md5(
        template.encode("utf-8", errors="replace"),
        usedforsecurity=False,
    ).hexdigest()[:6]
    head = template[:n] if len(template) <= n else template[:n] + "..."
    return f"{head}#{suffix}"

Two distinct templates that share a 32-char prefix used to collapse into the same signal name (and therefore the same learned weight). The MD5 suffix fixes that without losing the human-readable head. (usedforsecurity=False placates Bandit; MD5 here is a hash, not a crypto primitive.)
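To see the collision fix in action, here is the same construction applied to two templates that share their first 32 characters. The function is copied under a different name (`short_id`) so the example is self-contained, and the two template strings are invented for the demonstration.

```python
import hashlib


def short_id(template: str, n: int = 32) -> str:
    """Same construction as _short: readable head plus 6 hex chars of MD5."""
    suffix = hashlib.md5(
        template.encode("utf-8", errors="replace"),
        usedforsecurity=False,
    ).hexdigest()[:6]
    head = template if len(template) <= n else template[:n] + "..."
    return f"{head}#{suffix}"


# Two made-up templates with an identical 32-character prefix.
a = "timeout calling stripe.payment.charge after 30s"
b = "timeout calling stripe.payment.charge_refund after 30s"

id_a, id_b = short_id(a), short_id(b)
same_head = id_a.split("#")[0] == id_b.split("#")[0]
distinct = id_a != id_b
```

The heads match, so both ids still read as the same kind of timeout to a human, while the suffixes keep their learned weights separate.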

Metric diff

For each metric named in --metric latency_ms, we already captured p50/p95/p99 in the fingerprint. The diff compares each percentile individually:

for pct in ("p50", "p95", "p99"):
    a_val = getattr(a_stats, pct)
    c_val = getattr(c_stats, pct)
    delta = _pct_change(a_val, c_val)
    if delta is None or abs(delta) < LOW_DELTA:
        continue
    out.append(DiffEntry(signal=f"metric:{name}:{pct}", ...))

That < LOW_DELTA (50%) filter is intentional. A p95 that moved 12% is statistical noise on a one-day window; we don’t want to fill the top diffs with it.

The weights: how Anchor learns

Three constants govern the entire feedback loop (memory.py):

WEIGHT_DELTA = 0.1     # +0.1 on confirmed, -0.2 on false_positive
WEIGHT_MIN   = 0.1     # never zero — a signal can always recover
WEIGHT_MAX   = 3.0     # never dominant — diversity matters

When you run anchor feedback --outcome resolved, every signal in that drift’s top_diffs gets weight += 0.1. On --outcome false_positive, every signal gets weight -= 0.2 (the asymmetry is deliberate — false positives are more painful than missed catches, so the penalty bites harder).

That alone would be enough to learn. The harder problem is forgetting.
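The update rule fits in a few lines. This sketch assumes the false-positive penalty is expressed as twice WEIGHT_DELTA and that results are clamped to the min/max bounds after every update; `apply_feedback` is a hypothetical name, and the rounding is only there to keep floating-point noise out of the example.

```python
WEIGHT_DELTA = 0.1
WEIGHT_MIN = 0.1
WEIGHT_MAX = 3.0


def apply_feedback(weight: float, outcome: str) -> float:
    """Return the new weight for one signal after one feedback event."""
    if outcome == "resolved":
        weight += WEIGHT_DELTA
    elif outcome == "false_positive":
        weight -= 2 * WEIGHT_DELTA      # assumed encoding of the -0.2 penalty
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    clamped = min(WEIGHT_MAX, max(WEIGHT_MIN, weight))
    return round(clamped, 6)            # avoid 1.2000000000000002 style noise


# Five confirmations take a neutral signal from 1.0 to 1.5.
w = 1.0
for _ in range(5):
    w = apply_feedback(w, "resolved")
```

One false positive undoes two confirmations, so a signal has to be right about twice as often as it is wrong just to hold its weight.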

Timely forgetting: weights decay halfway every 30 days

The Track-1 hackathon requirement says “timely forgetting of outdated information”. Anchor implements that as exponential decay toward the neutral value 1.0:

DECAY_HALF_LIFE_DAYS = 30.0
DECAY_SKIP_RECENT_HOURS = 24    # the 24-hour grace window

def decay_weights(now, half_life_days=DECAY_HALF_LIFE_DAYS):
    skip_cutoff = now - timedelta(hours=DECAY_SKIP_RECENT_HOURS)
    for d in kv_all("signal_weights"):
        w = SignalWeight.model_validate(d)
        if w.last_updated is None:
            continue                                # legacy row, age unknown, can't decay
        if w.last_updated > skip_cutoff:
            continue                                # too fresh, don't decay
        age_days = (now - w.last_updated).total_seconds() / 86400.0
        factor = 0.5 ** (age_days / half_life_days) # 0.5 after one half-life
        new_weight = 1.0 + (w.weight - 1.0) * factor
        ...

Read that line by line:

factor = 0.5 ** (age_days / half_life_days): classic half-life. After 30 idle days, factor is 0.5. After 60 days, 0.25. After 90 days, 0.125.
new_weight = 1.0 + (w.weight - 1.0) * factor: pulls the weight toward 1.0, never past it. A weight at 1.5 decays to 1.25 at 30 days, 1.125 at 60 days, etc.
w.last_updated > skip_cutoff: the 24-hour grace window prevents a freshly-confirmed signal from being immediately washed out by decay on the next compare.
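To check the arithmetic, here is the same formula as a standalone pure function (the name decayed is illustrative; the real code reads and writes KV Store rows), applied to a boosted weight and to a suppressed one:

```python
def decayed(weight: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Pull a weight toward the neutral value 1.0 with an exponential half-life."""
    factor = 0.5 ** (age_days / half_life_days)
    return 1.0 + (weight - 1.0) * factor


if __name__ == "__main__":
    for start in (1.5, 0.1):
        trail = [round(decayed(start, days), 4) for days in (30, 60, 90)]
        print(start, "->", trail)
```

The formula is symmetric around 1.0: a boosted 1.5 and a suppressed 0.1 both close half of their remaining gap every 30 days, so neither ever crosses the neutral value.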

There’s a subtle invariant in the caller (agent.compare) worth flagging:

Always call get_weights() BEFORE bump_appearance().

get_weights() triggers decay-and-write. bump_appearance() then writes appearance counters. If you reverse them, you’d overwrite the decayed weight value with a stale snapshot. The docstring on bump_appearance calls this out explicitly because it’s the kind of bug a future refactor would silently reintroduce.

A small operational detail

A previous schema didn’t have last_updated. Rows written under that schema can’t decay (we don’t know when they were last touched). Rather than fabricate a date, decay_weights counts them and emits a one-shot breadcrumb to stderr:

anchor: 4 signal_weights row(s) have no `last_updated` and will not decay; run `anchor feedback` on the corresponding signal once to backfill.

The first feedback call on each backfills last_updated. After that they participate in decay like everyone else. No migration script needed: the system heals itself in normal use.

What this looks like to the engineer

Run anchor learned to see the current weight table sorted by deviation from 1.0:

SIGNAL                                       WEIGHT  CONFIRMED  FALSE_POS  LAST_UPDATED
template:appeared:PaymentGatewayTimeout #4a  2.10    9          0          2026-06-12 14:30Z
metric:latency_ms:p95                        1.45    4          0          2026-06-14 09:11Z
template:shifted:GC_pause_long #d2           0.62    0          3          2026-06-08 22:04Z
template:appeared:DebugLogEntry #91          0.10    0          7          2026-05-29 17:55Z

That table is the system’s memory in human-readable form. The first two are learned signal: pay attention. The last two are learned noise: please stop alerting on this. Without decay, the noise rows would stay at 0.1 forever even after the underlying issue is fixed. With decay, they’ll drift back toward 1.0 over a few months, and the next time the same template legitimately appears, the engineer’s feedback re-biases it from scratch.

Why this matters for the LLM

The narrator in post 4 only sees the top 15 diffs (diff_all(..., limit=15)). So the ranking is the most consequential piece of state in the whole pipeline. Get the ranking right and the LLM has a fighting chance. Get it wrong (by leaving weights flat at 1.0 forever, say) and Qwen ends up narrating noise.
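To make the effect of the weights concrete, here is a minimal ranking sketch. It assumes the score is severity × weight, that unseen signals default to the neutral weight 1.0, and that the cut is a plain top-N slice; the Diff class, rank_diffs, and the severities are invented for illustration and are not diff_all itself.

```python
from dataclasses import dataclass
from typing import Dict, List

NEUTRAL_WEIGHT = 1.0  # assumed default for a signal with no feedback yet


@dataclass
class Diff:
    signal: str
    severity: float


def rank_diffs(diffs: List[Diff], weights: Dict[str, float], limit: int = 15) -> List[Diff]:
    """Sort by severity * learned weight, highest first, and keep the top `limit`."""
    def score(d: Diff) -> float:
        return d.severity * weights.get(d.signal, NEUTRAL_WEIGHT)
    return sorted(diffs, key=score, reverse=True)[:limit]


if __name__ == "__main__":
    diffs = [
        Diff("template:appeared:DebugLogEntry", severity=0.9),
        Diff("template:appeared:PaymentGatewayTimeout", severity=0.5),
        Diff("metric:latency_ms:p95", severity=0.6),
    ]
    weights = {
        "template:appeared:DebugLogEntry": 0.10,
        "template:appeared:PaymentGatewayTimeout": 2.10,
        "metric:latency_ms:p95": 1.45,
    }
    for d in rank_diffs(diffs, weights, limit=2):
        print(d.signal)
```

With flat weights the debug-log template would rank first on raw severity; with the learned weights it falls out of the top two entirely.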

The weight table is the system getting smarter across sessions.

The Fingerprint: Turning a Healthy Week into a Row in KV Store

Jian Feng — Mon, 15 Jun 2026 01:00:00 GMT

When you run

anchor capture --name "Healthy Week" \
  --from 2026-05-20T00:00:00 --to 2026-05-27T00:00:00 \
  --index main --metric latency_ms

…the CLI does two things: it runs five SPL queries against Splunk to characterize the window, and it writes one document into the anchors KV Store collection. This post unpacks both halves.

What’s in a fingerprint

The Fingerprint model (models.py) carries five fields:

Field	What it captures	SPL flavor
`event_volume`	per-sourcetype counts, total, hourly profile	`stats count by sourcetype` + `bin _time span=1h`
`log_patterns`	top-N “shape” buckets via Splunk’s built-in `punct` field	`stats count, values(_raw) by punct \| sort -count \| head 50`
`error_rates`	error / warn / info ratio per sourcetype	`eval _lvl=case(...) \| stats sum(eval(...))`
`key_metrics`	p50/p95/p99/mean/stddev for named numeric fields	`stats perc50(x) as x_p50, perc95(x) as x_p95, ...`
`top_hosts`	top-20 hosts by event count	`top limit=20 host`

That’s deliberately a small feature set. Anchor isn’t trying to be an ML platform; it’s trying to capture the cheapest possible summary that still discriminates “yesterday looked like the baseline” from “yesterday is different and here’s how”.

Why `punct` instead of clustering

The cheapest log-template proxy in Splunk is the built-in punct field: the punctuation skeleton of the event, computed at index time. [ERROR] payment 4xx upstream svc=stripe id=... and the same line with a different request id collapse to the same punct. No clustering library, no Levenshtein, no LLM call.
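To see why two such lines collapse, here is a rough Python approximation of a punctuation skeleton. It is not Splunk's exact algorithm: dropping alphanumerics, turning spaces into underscores, and the 30-character cap are assumptions made for illustration, and punct_skeleton is a made-up name.

```python
def punct_skeleton(event: str, max_len: int = 30) -> str:
    """Approximate a punct-style skeleton for the first line of an event."""
    first_line = event.splitlines()[0] if event else ""
    out = []
    for ch in first_line:
        if ch.isalnum():
            continue                              # letters and digits carry no "shape"
        out.append("_" if ch == " " else ch)      # keep spaces visible as underscores
    return "".join(out)[:max_len]


if __name__ == "__main__":
    a = "[ERROR] payment 4xx upstream svc=stripe id=ab12cd"
    b = "[ERROR] payment 4xx upstream svc=stripe id=ff90aa"
    print(punct_skeleton(a))
    print(punct_skeleton(a) == punct_skeleton(b))
```

The request id changes but the skeleton does not, which is what lets a plain stats ... by punct act as a template bucket.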

That decision shows up directly in the SPL builder (fingerprint.py):

def _spl_patterns(scope: Scope) -> str:
    base = _index_filter(scope)
    return (
        f"{base} | eval _punct=if(isnull(punct),\"\",punct) "
        f"| stats count, values(sourcetype) as sourcetype, "
        f"       values(_raw) as examples by _punct "
        f"| sort -count | head 50 "
        f"| eval example=mvindex(examples,0), "
        f"       sourcetype=mvindex(sourcetype,0)"
    )

The head 50 cap is intentional: we want the top-N representative patterns, not every long-tail one-off. If a new pattern enters the top-50 in a future window, that’s a template:appeared:... signal in post 3’s diff engine. If a known pattern falls out of the top-50, that’s template:disappeared:....
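The appeared / disappeared logic boils down to two set differences over the top-50 lists. This is a sketch of the idea only: the function name and the plain-string signal format are assumptions, and the real diff engine in post 3 also attaches severities and hash suffixes.

```python
from typing import List, Set


def template_signals(baseline_top: Set[str], current_top: Set[str]) -> List[str]:
    """Name the templates that entered or left the top-N between two windows."""
    appeared = sorted(current_top - baseline_top)      # new in the compare window
    disappeared = sorted(baseline_top - current_top)   # gone from the compare window
    return (
        [f"template:appeared:{t}" for t in appeared]
        + [f"template:disappeared:{t}" for t in disappeared]
    )


if __name__ == "__main__":
    baseline = {"[]____=_=", "__/_"}
    current = {"[]____=_=", "[]_:_"}
    for signal in template_signals(baseline, current):
        print(signal)
```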

The trust boundary: SPL injection

The CLI accepts --index foo --sourcetype bar --metric x. Those tokens get spliced into SPL strings. That’s exactly the place a malicious value like 'foo;|delete' could try to escape the search context.

Defense in depth: a whitelist of safe identifier characters, applied to every token before it touches SPL:

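import re

# fullmatch (rather than match with ^...$) also rejects a trailing newline.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_*\-]+")

def _safe_token(s: str, kind: str) -> str:
    """Return s unchanged if it is a safe SPL identifier, else raise."""
    if not _TOKEN_RE.fullmatch(s):
        raise ValueError(f"unsafe {kind} token: {s!r}")
    return s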

The CLI is the trust boundary, but a defense-in-depth whitelist costs two lines and closes a footgun.
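A quick way to see what a whitelist like this accepts and rejects. The sketch below is a standalone predicate (is_safe_token is a hypothetical name) built on re.fullmatch, which also rejects a trailing newline that a ^...$ pattern checked with match would let through:

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9_*\-]+")


def is_safe_token(s: str) -> bool:
    """True only if every character of s is in the identifier whitelist."""
    return TOKEN_RE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ("main", "app-*", "foo;|delete", "main\n", ""):
        print(repr(candidate), is_safe_token(candidate))
```

Allowing * keeps wildcard scopes such as app-* usable while still excluding pipes, semicolons, quotes, and whitespace.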

From `Fingerprint` to KV row

The persistence layer (memory.py) wraps the fingerprint in an Anchor envelope, assigns a UUID, and writes one document:

def save_anchor(name, start, end, scope, fp) -> Anchor:
    ensure_collections()
    anchor = Anchor(
        id=str(uuid.uuid4()),
        name=name,
        created_at=datetime.now(timezone.utc),
        created_by=getpass.getuser(),
        time_range=TimeRange(start=start, end=end),
        scope=scope,
        fingerprint=fp,
    )
    doc = json.loads(anchor.model_dump_json())
    doc["_key"] = anchor.id
    kv_insert("anchors", doc)
    return anchor

Two small things worth noting:

ensure_collections() is idempotent. The first anchor capture on a fresh Splunk creates the three collections; subsequent calls are a no-op. This is what makes the setup_ecs.sh install in post 6 survive re-runs.
_key = anchor.id. KV Store auto-assigns a key if you don’t, but we want the UUID to be the key so kv_get("anchors", id) is a direct lookup rather than a query.

What an anchor looks like in JSON

Roughly:

{
  "_key": "8d3a...",
  "id":   "8d3a...",
  "name": "Healthy Week",
  "created_at": "2026-05-27T18:42:11Z",
  "created_by": "fenjian",
  "time_range": {"start": "2026-05-20T00:00:00Z", "end": "2026-05-27T00:00:00Z"},
  "scope": {"indexes": ["main"], "sourcetypes": []},
  "fingerprint": {
    "event_volume":  {"per_source": {"_json": 412380}, "total": 412380, "hourly_profile": [...]},
    "log_patterns":  [{"template": "...", "frequency_pct": 18.4, "count": 75880, ...}, ...],
    "error_rates":   {"_json": {"error_count": 142, "warn_count": 503, "total": 412380}},
    "key_metrics":   {"latency_ms": {"p50": 78.1, "p95": 312.4, ...}},
    "top_hosts":     [{"host": "checkout-7d4b...", "event_count": 41280}, ...]
  }
}

You can inspect this in the Splunk Web UI under Settings → Lookups → KV store collections → anchors. Useful sanity check on first install.

What this buys us

Two superpowers, one each for the next two posts:

Post 3: every later compare re-runs the same five SPL queries on a different window, produces a second Fingerprint, and the diff engine subtracts the two. Pure functions, no LLM.
Post 4: when the LLM eventually does see the data, it sees the ranked diff, not raw logs. That keeps the prompt small and the cost bounded.

What we didn’t include (and why)

No raw events. Anchor stores statistics, not log payloads. PII stays in your indexers; the fingerprint is safe to ship anywhere.
No embeddings on the anchor itself. We embed signals (post 3), not raw text. One embedding per drift, not per event.
No “trends”. A baseline is a single window. If you want to capture weekly seasonality, capture multiple baselines and pick the one whose scope matches the compare window. Simpler than generalizing.

Why a MemoryAgent for on-call

Jian Feng — Sun, 14 Jun 2026 01:00:00 GMT

The 2 a.m. problem

Every on-call engineer has had this moment: pager goes off, you open Splunk, you stare at a wall of graphs, and the first 15–20 minutes evaporate into the same question:

“Wait — what does normal even look like for this service?”

You’d think tools would solve this by now. They don’t, because they solve adjacent problems:

Anomaly detection trains on a sliding window of recent history. If your service has been quietly degrading for a week, “recent” is already drifted; the model thinks today’s badness is normal.
LLM chatbots answer “is this weird?” once, then forget. The next compare starts from zero.
Static dashboards show you the numbers but don’t say what’s different. You’re still the one doing the diff in your head.

Anchor’s bet is that what an SRE actually wants is closer to git diff than to kibana --auto-detect. Pick a reference state, compare a window against it, and get a narrative about the delta, not just the delta itself.

The three memories

For that narrative to get better over time, the agent has to remember three things. Each lives in a separate Splunk KV Store collection:

Memory	Collection	What it does
What “healthy” looked like	`anchors`	A human-curated baseline. Survives raw-log retention. Diff against this, not against yesterday.
Which signals actually matter	`signal_weights`	Re-ranks diffs by accumulated feedback. Confirmed signals weigh more; false positives weigh less.
What we did about it last time	`drift_history`	Every past compare, with engineer-confirmed reasons attached. Recall the most similar one on every new compare.

This is the MemoryAgent loop in one sentence:

Each compare reads signal_weights (learned ranking) and drift_history (recalled past incidents) before calling the LLM, then writes a new drift record. Each feedback updates signal_weights.

Where the LLM fits

The LLM is not the decision layer. Look at the compare lifecycle:

CLI → KV: load anchor + weights (apply decay)
CLI → Splunk: SPL (fingerprint queries)
CLI → CLI: diff + rank (severity × weight)         ← deterministic
CLI → KV: recall top-3 similar past drifts          ← deterministic
CLI → Qwen: ranked diffs + past incidents
Qwen → CLI: summary + hypothesis + SPL             ← LLM only here
CLI → KV: save new drift record

By the time Qwen sees the request, the data is already structured, ranked, and accompanied by precedent. The LLM’s job is narration, not detection. That keeps the system:

Reproducible. The same window always produces the same top diffs.
Cheap. One Qwen call per investigation, not per data point.
Debuggable. When a hypothesis is wrong, you can inspect the ranked diffs and decide whether the diff engine or the LLM was the weak link.

We’ll come back to this in post 4.

Why on top of Splunk?

Three pragmatic reasons:

Most SREs already have it. Anchor doesn’t ship a new database; it uses KV Store, which ships with Splunk. No Lambda, no VPC, no extra monthly bill.
SPL is already the lingua franca for “show me events with these shapes in this window”. The fingerprint extractor in fingerprint.py is, fundamentally, five SPL queries.
KV Store survives log retention. Your raw logs roll off in 90 days; your healthy anchor doesn’t.

We’ll look at how a single anchor capture call becomes one row in KV Store in post 2.

The hackathon framing (briefly)

Anchor was built for the Qwen Cloud × Splunk hackathon’s MemoryAgent track. The track asks for five specific properties:

Track-1 requirement	Anchor implementation
Persistent memory	three KV Store collections, nightly OSS backups
Accumulates experience	`apply_feedback()` mutates `signal_weights`
Better decisions across sessions	`diff_all()` ranks by `severity × weight`
Timely forgetting	`decay_weights()` pulls weights halfway to 1.0 every 30 days
Bounded recall under context limit	`recall_similar_drifts()` returns top-3

Posts 3 and 5 dig into the math behind two of those: timely forgetting and bounded recall.

What you’ll get from the rest of the series

Post 2: how five SPL queries become a Fingerprint object and one KV row.
Post 3: the diff engine and the decay-toward-1.0 trick that lets the agent forget on a schedule.
Post 4: what we send to Qwen, what we get back, and why JSON-mode + low temperature beat free-text.
Post 5: the optional --deep function-calling planner, with a real transcript.
Post 6: three commands to bring the backend up on Alibaba Cloud ECS, with OSS backups.

Why I Wrote My Paper in Typst Instead of LaTeX

Jian Feng — Mon, 08 Jun 2026 01:00:00 GMT

The QMJ-TSX working paper is written in Typst, not LaTeX. I expected this to be a minor technical choice and got mildly surprised by how much it changed the writing experience. This is a short post on what I gained, what I gave up, and when I’d make the same choice again.

The short version

Typst is a modern typesetting system in the LaTeX tradition: same problem (turn marked-up text into a beautiful PDF), same audience (scientific writing), incomparably better tooling underneath. It compiles in milliseconds instead of seconds, the error messages point at the line you actually wrote, and the source file looks like Markdown with a stricter dialect rather than like a 1980s macro language.

For a solo working paper with a handful of tables, figures, and references, the answer is: just use it.

What I gained

Compile speed. typst compile paper/main.typ runs in roughly 100 ms on this project. typst watch recompiles on save with no human-perceptible latency. Writing with a live preview pane next to the source file is the closest I have ever come to “writing a paper feels like writing code.” I edited a sentence, glanced right, kept typing. That feedback loop matters more than I’d have guessed.

Readable source. The paper/main.typ driver fits on a screen. A section file looks like this:

= Results 

== Replication baseline (AQR QMJ-Canada)
...the Sharpe ratio is $0.64$. The maximum drawdown is $-37.0%$.
A Carhart-style four-factor regression... yields a monthly alpha of
$0.70%$ ($t = 4.46$).

Compare to the same content in LaTeX, where the equivalent prose is interrupted by \section{}, \label{}, \textbf{}, $\backslash$, and \% everywhere. Typst’s = for headings, *bold*, and bare $math$ get out of the way of the words.

Error messages that point at the line you wrote. LaTeX errors are famously cryptic because the macro expansion has already happened by the time the compiler complains. Typst errors say “there is a problem on line 47 of results.typ, here’s the offending token.” This is not a small thing when you are debugging at 11pm the night before a deadline.

A real module system. paper/main.typ just #includes sections/*.typ and tables/*.typ. Each table lives in its own file. There is no \input weirdness, no preamble bloat, no fragile \newcommand resolution order. The project structure mirrors what I would do in any other codebase.

Native programmability without TeX-flavoured pain. Typst is a real expression language. Generating a parameterised table or a caption from a value is #let x = 0.64 ... #x rather than \def\x{0.64} ... \x. I don’t lean on this much in this paper, but the headroom is there if I want to wire numerical outputs from the Python pipeline straight into the paper later.

Single binary, no distribution. brew install typst and you’re done. No 4 GB MacTeX install, no tlmgr update --self --all, no fighting with which TeX distribution shipped which version of which package. The Makefile’s make paper target is one line.

What I gave up

Citation styles. Typst’s bibliography system handles Chicago author-date out of the box (which is what I use), but if your journal requires an obscure custom .bst file, you may still want LaTeX. Less of an issue for working papers than for journal submissions.

Journal templates. Many journals provide LaTeX templates and no Typst equivalent. Not a constraint for a working paper hosted on a personal site, but if you are submitting to JFE on day one, this matters.

Ecosystem maturity. TikZ, pgfplots, and the long tail of LaTeX packages have no Typst equivalent of comparable maturity yet. For a paper with simple tables and externally generated PDF figures (as mine is), this never bit me. For a paper that relies on intricate in-document diagrams, your mileage will vary.

Should you switch?

A short decision tree:

Working paper, preprint, personal site, blog series: switch. The compile speed alone changes how you write.
Thesis, technical report, internal document: switch. Same reasons. The tooling pays back the migration cost within a week.
Journal submission to a venue that mandates a LaTeX template: stay with LaTeX for the final submission. You can still draft in Typst and port at the end if the prose-to-typesetting ratio justifies it; for shorter papers it usually doesn’t.

For QMJ-TSX, the calculus was easy: working paper, hosted on GitHub, regenerated by make paper as part of the pipeline, no journal constraints. Typst was strictly better on every axis I cared about.

One thing that surprised me

I write more in Typst than I did in LaTeX, because revising is cheaper. A revised sentence in LaTeX implies a 2–5 second compile and possibly a chain of \ref warnings to chase. A revised sentence in Typst is invisible: the preview updates as you type. The cost of editing collapsing toward zero changes how willing you are to rewrite a paragraph.

That alone is probably worth the switch.

Source for the paper: paper/main.typ on GitHub. Compile with typst compile paper/main.typ or make paper.

What yfinance Survivorship Does to a TSX Small-Cap Backtest

Jian Feng — Mon, 08 Jun 2026 01:00:00 GMT

In the QMJ-TSX paper I flag yfinance survivorship as a likely contaminant of the paper-Q extension’s results. This post unpacks what that actually means, why it’s worse for small-caps than for large-caps, and what an honest researcher with a free-data constraint can do about it.

The short version: if you build a TSX small-cap universe from “tickers that still trade today” and then backtest over 2011–2025, your historical universe is silently missing every delisting, every reverse-merger reshuffle, and every junior-resource zero. Whatever strategy you test will look better than it would have on the real cross-section that was investable in 2011.

What yfinance gives you

yfinance is a Yahoo Finance scraper. For any ticker that currently has a Yahoo page (e.g. XYZ.TO for TSX, XYZ.V for TSX Venture), it returns historical daily OHLCV back to the listing date. That is genuinely useful and has the unbeatable property of being free.

What it does not give you:

Delisted tickers. A ticker that traded on TSX in 2014 and delisted in 2017, for any reason, generally has no Yahoo page today. yfinance returns “no data” or silently skips.
Reverse-merged or renamed tickers without a careful symbol trail. A junior that was acquired, reverse-merged, or rolled into a SPAC becomes effectively invisible.
Point-in-time index membership. “The TSX small-cap universe in 2014” is not a thing yfinance can tell you. The best you can do is “tickers in the small-cap bucket today that have data going back to 2014,” which is exactly the survivorship trap.

For US large-caps, the gap between (1)–(3) and reality is small enough that yfinance is a defensible free data source. For TSX small/mid-caps it is structurally large.

Why small-caps are the worst case

Three reasons compound:

Base rate of delisting is high. Junior energy, mining, and biotech names (which dominate the TSX small/mid-cap universe) fail at rates that have nothing in common with S&P 500 attrition. A universe drawn from “what survived” is not a random subsample of “what was investable”; it is the right tail.

Index reconstitutions are large and frequent. The TSX small/mid-cap bucket has meaningful turnover every year. Even if you somehow had a perfect snapshot today, projecting it backward implies an unrealistic constancy of membership.

The risk being measured is asymmetric. This is the one that matters most for a Quality / Safety-style strategy. A “low volatility” name that quietly delisted in 2018 doesn’t show up in your backtest’s loss distribution. The survivors you do test on include disproportionately many names whose volatility was low because they didn’t blow up. Your backtest gets the reward of defensiveness without paying the tail cost. The whole construct’s premium comes from avoiding tail events, so this is exactly the worst place to have survivorship.

What it does to paper-Q specifically

The paper-Q long-short on TSX small/mid-caps over 2011-12 to 2025-11 has a full-sample Sharpe near zero. My honest assessment is that the true number, on a survivorship-corrected universe, is probably worse, not better, for a structural reason:

A defensiveness-tilted long leg benefits most from removing the worst-performing junior resource and biotech names.
yfinance survivorship removes exactly those names from the universe.
So the long leg of paper-Q is the part most contaminated by survivorship; the short leg less so.
Removing survivorship would tend to hurt the long leg’s measured Sharpe and improve the short leg’s.

Net direction on a long-short: roughly negative. The null result likely understates how badly the strategy actually performs.

That is an uncomfortable thing to say in a paper, which is why I say it. The pre-COVID +0.47 Sharpe is the one I trust least on this account: the bull run of 2011–2019 generated a lot of names that quietly disappeared by 2025 and are missing from my universe today.

What you can do with a free-data constraint

Two practical interventions worth the effort, two not worth it:

Worth it.

Snapshot the universe at multiple historical dates if you can. Archive.org and historical TSX/TMX bulletins sometimes preserve old constituent lists. Even three or four historical snapshots, used as additional “as-of” universes, expose how much the surviving-today list misses.
Report the cross-section size over time. If your 2011 “universe” has the same 109 tickers as your 2025 universe, you are not running a 2011 backtest: you are running a 2025 backtest on 2011 prices. Putting the cross-section count on a chart per month makes this visible to readers.
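The cross-section count is cheap to compute. A minimal standard-library sketch, assuming prices arrive as (month, ticker, price) rows with None for a missing print; the function name and the tickers in the demo are made up:

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, Optional[float]]  # ("YYYY-MM", ticker, price or None)


def cross_section_size(rows: Iterable[Row]) -> Dict[str, int]:
    """Count distinct tickers with a usable price in each month."""
    tickers_by_month = defaultdict(set)
    for month, ticker, price in rows:
        if price is not None:
            tickers_by_month[month].add(ticker)
    # Months with no priced ticker at all are simply absent from the result.
    return {month: len(t) for month, t in sorted(tickers_by_month.items())}


if __name__ == "__main__":
    rows = [
        ("2011-12", "AAA.TO", 10.0),
        ("2011-12", "BBB.V", None),   # listed later, so no price yet
        ("2025-11", "AAA.TO", 14.2),
        ("2025-11", "BBB.V", 0.35),
    ]
    print(cross_section_size(rows))
```

A count that never shrinks going backward, or that is flat at the full universe size, is the visible fingerprint of a survivors-only list.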

Not worth it for a working paper.

Don’t fake a survivorship correction. Without delisting prices and dates, you cannot impute returns for missing tickers honestly. A made-up −90% return on a delisted name is research fraud in a coat.
Don’t buy CRSP / Compustat just for this. If you can, great. But the right move for a free-data paper is to flag the limitation honestly and design the conclusions around it, not to pretend you have data you don’t.

The honest framing

The paper’s framing reflects the constraint. The headline claim is not “paper-Q doesn’t work on TSX small-caps.” It is “in a universe constructed from currently-listed TSX small-caps with free price data, a fundamentals-free price proxy fails to recover the AQR QMJ-Canada premium, and the failure is concentrated in the post-COVID low-volatility unwind.” Every clause in that sentence is true on the data I actually have.

If you are doing a free-data quant project: name your data limitations specifically, and design your claims so they survive the limitations being real. That is the cheap version of academic honesty, and it is much more useful than a confident claim built on data that can’t support it.

Universe file: data/raw/universe/tsx_smallcap.csv. A companion post on how that universe was built covers the construction in more detail.

Building a Free-Data Canadian Small-Cap Universe: 109 Tickers, Three Sources, Zero Subscriptions

Jian Feng — Mon, 08 Jun 2026 01:00:00 GMT

This is the last post in the QMJ-TSX series. It’s the most operational of the lot: how I assembled the data layer for a Canadian small-cap factor paper using only free, public sources, and what that constraint forced me to accept.

The pitch is simple. Three data sources, all free, all on the public internet, all cacheable as parquet:

Prices. Yahoo Finance via yfinance (.TO for TSX, .V for TSX Venture).
AQR factor series. AQR Datasets: QMJ + BAB Equity monthly, all countries including Canada.
Fama–French factors. Ken French data library: Developed 5-factor + Developed momentum, monthly.

And one hand-curated universe file: 109 TSX small/mid-cap tickers in data/raw/universe/tsx_smallcap.csv.

That’s it. No Bloomberg, no Refinitiv, no CRSP, no Compustat, no S&P Capital IQ. Whether that is sufficient depends on what you want to claim, which is the rest of the post.

The three sources, briefly

yfinance for prices

import yfinance as yf
df = yf.download("SU.TO", start="2011-01-01", interval="1mo")

Cached to data/raw/prices/{ticker}.parquet. Monthly is sufficient for a factor paper at this horizon and keeps the cache tiny (~1 MB total for 109 tickers). The .TO suffix is mandatory for TSX names; .V for Venture. Without the suffix you’ll silently get the US listing of a same-symbol unrelated company, which is its own special failure mode.

Honest limitations:

Survivorship (see the previous post).
Adjusted close handling is yfinance’s, not yours. For monthly rebalanced long-shorts this is fine; for intraday strategies it is not.
Some thinly-traded .V names have suspect prints. The 109-ticker universe was filtered partly on data sanity.

AQR datasets for the benchmark

AQR publishes country-level QMJ and BAB series as monthly CSVs. The QMJ-Canada column is the benchmark for the entire paper: without it I have nothing to replicate against. The CSV layout is stable across releases (date column, country columns, one row per month) and the file is small enough to vendor under data/raw/aqr/.

The replication gate (Sharpe within 0.30 of AFP 2019 Table II) is defined against this series. The cross-check against Ken French exists precisely because both benchmark series are public and disagree slightly on what “the Canadian factor cross-section” is.

Ken French for the cross-check factor library

The Developed 5-factor plus Developed momentum, monthly. Canada is too small a market to have its own French-style factor library, so Developed-region is the standard substitute when cross-checking a Canadian series. The RMW (operating profitability) factor in this library is the one that QMJ-CAN should load on if the construct is intact; in the paper it loads at β = +0.61 (t = 4.16). That cross-check is what turns the replication from “the numbers approximately match” into “the construct approximately matches.”

The universe file

data/raw/universe/tsx_smallcap.csv is a hand-curated list of 109 tickers spanning the TSX small/mid-cap range circa late 2025. The construction principles, in rough order:

Currently listed on TSX or TSX Venture with a Yahoo ticker. This is the survivorship-introducing step. There is no free alternative.
Market cap roughly in the small/mid range. No hard cutoff: TSX small-cap definitions vary across providers and I wasn’t going to invent one. Names that were unambiguously large-cap (the big banks, the integrated energy majors) were excluded.
At least ~10 years of monthly data in yfinance. This is what bounded the sample to 2011-12 onward. Newer listings were excluded so that the cross-section per month was reasonably stable.
Sector diversity within what TSX actually is. Which is to say: a lot of energy and materials, some industrials, some tech and healthcare, very little consumer. The universe reflects the index’s actual sectoral skew rather than fighting it.
Sanity-check on prices. Tickers with obvious data pathologies in yfinance (long flat stretches, single-print spikes, missing months) were dropped at curation time.

None of those steps is forecast-aware: none of them required peeking at returns. But step 1 is the survivorship door, and I’m upfront about it in the paper.

Why no fundamentals?

The AQR QMJ construction uses gross profitability, accruals, leverage, payout ratios: accounting fundamentals at point-in-time fidelity for the entire cross-section. The free-data options for Canadian small-caps are:

SEDAR+ filings. Available, but unparsed and inconsistent. Parsing PDF financial statements at production quality for 100+ small-cap tickers is a separate paper’s worth of engineering.
Yahoo Finance fundamentals. Available via yfinance, but point-in-time-incorrect (the values are as-restated, not as-reported on the original filing date). Using them would introduce look-ahead bias.
SimFin / EOD historical data. Coverage of TSX small-caps is thin and gappy. I checked.

The cost of getting fundamentals right for this universe is roughly an order of magnitude greater than the cost of doing everything else combined. That is what drove the entire paper-Q (“price-based proxy”) detour: the negative result on paper-Q is also a measurement of how far you can get without paying that cost. Answer: not as far as you’d hope.

What it costs to do this right

If I were redoing this with a budget, the priority order would be:

Point-in-time Canadian fundamentals (Compustat, FactSet, or equivalent). Unlocks the actual AFP construction. Highest marginal value.
Delisting prices and dates. Kills the survivorship caveat from the previous post. Second-highest marginal value.
Index constituent histories (S&P/TSX SmallCap or TMX equivalent, monthly). Lets the universe be reconstituted point-in-time rather than as a hand-curated snapshot.

You can publish a credible free-data paper without (1)–(3), provided you scope the claims to what the data actually supports. That is what this project tried to do.

Closing the series

If you’ve read all eight posts: thank you, that’s more attention than most academic papers get. The whole project (paper, code, data manifests, blog series) is at github.com/faketut/qmj-tsx. make all regenerates the paper from a clean clone. The pull request template is open if you want to extend the universe, swap in a better data source, or rerun the per-component decomposition on a different market.

The single sentence I’d leave you with, across the whole series: a pre-registered null result on a free-data universe, with the decomposition that explains the null, is a more honest research contribution than a positive result you can’t reproduce.

Series index: README.md.

PCA as a Diagnostic, Not a Rescue

Jian Feng — Sun, 07 Jun 2026 01:00:00 GMT

This is the third post in the QMJ-TSX series. The previous post showed that four of the five components of my paper-Q composite are mechanically the same low-volatility signal, all flipping sign post-COVID; the fifth (rolling Sharpe) behaves differently. A natural follow-up is: just throw PCA at it.

This post is about what that actually does — and what it doesn’t do.

The premise

If your components are near-collinear, equal-weighting them is wrongboth in theory (it understates the effective number of bets) and inpractice (it dilutes whichever component is actually unique). Thedisciplined fix is to project the components onto an orthogonal basisand price each axis separately.

Concretely: stack the five-component panel as a $(\text{date} \times \text{ticker}) \times 5$ matrix, take the principal components, and run each PC as its own long-short.
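As a rough illustration of that stacking step, here is a minimal NumPy sketch. It is not the repo's code: the function name `pca_components` and the synthetic panel (four noisy copies of one factor plus one independent column) are assumptions made for the example, and the real pipeline may standardise or weight observations differently.

```python
import numpy as np


def pca_components(panel: np.ndarray):
    """PCA of an (n_obs x k) panel whose rows are (date, ticker) observations.

    Returns (explained_ratio, loadings, scores); column j of `loadings` is PC j.
    """
    centred = panel - panel.mean(axis=0)
    # SVD of the centred panel: the right singular vectors are the PC loadings.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    loadings = vt.T
    scores = centred @ loadings  # one orthogonal signal per PC
    return explained, loadings, scores


# Synthetic stand-in (assumption): four near-collinear columns, one independent.
rng = np.random.default_rng(0)
common = rng.standard_normal(2000)
panel = np.column_stack(
    [common + 0.3 * rng.standard_normal(2000) for _ in range(4)]
    + [rng.standard_normal(2000)]
)
explained, loadings, scores = pca_components(panel)
```

Each column of `scores` would then be sorted into terciles and traded as its own long-short, which is the step PCA itself says nothing about.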

What PCA finds

PC	Variance explained	Interpretation
PC1	60%	Roughly uniform positive loadings across all five components — a clean general-defensive axis.
PC2	22%	A contrast between rolling Sharpe (−0.80) and the beta-flavoured direction (+0.53). The momentum-vs-low-vol split that the horse race already surfaced.

This is the encouraging part. PCA recovers exactly the structure the per-component decomposition already suggested: one dominant low-vol factor, plus a second axis that is essentially “rolling Sharpe minus the rest.” The five-dimensional design space is really two-dimensional, and the two dimensions have economic interpretations.

What PCA does not find

Run each PC as a standalone VW tercile long-short, same 10 bps cost model:

Signal	Full Sharpe	Pre-COVID	Post-COVID
PC1 (low-vol axis)	−0.23	+0.34	−0.85
PC2 (momentum-vs-low-vol)	−0.11	−0.14	−0.08
2-PC composite (EW)	−0.16	—	—
Unorthogonalised paper-Q	+0.03	+0.47	−0.60

Three things to notice.

One. PC1 reproduces the regime flip cleanly. Pre-COVID +0.34, post-COVID −0.85. This is the smoking gun for the previous post’s claim that the post-pandemic break is a one-factor phenomenon, not an artefact of equal-weighting correlated proxies. When you collapse the five proxies onto their dominant common axis, the regime story becomes more visible, not less.

Two. PC2 has no pricing content. Sharpe is essentially zero in every subperiod. So the +0.32 full-sample Sharpe of standalone rolling Sharpe from the horse race was in fact largely riding its positive correlation with the PC1 low-vol axis. Once that correlation is purged, the residual rolling-Sharpe-minus-low-vol contrast does not price on its own in this universe.

Three. Equal-weighting the two PCs underperforms the unorthogonalised composite. This is the expected consequence of (1) and (2): averaging a pricing axis with a non-pricing axis dilutes signal. The “clean” orthogonal composite is worse than the naive average it was supposed to fix.

The takeaway

PCA didn’t rescue paper-Q. What it did was something more useful for an honest paper: it collapsed the post-COVID story into a single dimension. The TSX small/mid-cap low-volatility long-short broke in 2020 and has not recovered. That is a cleaner, more falsifiable claim than “an equal-weighted composite of five price-derived Safety proxies has a null Sharpe.”

Two generalisable lessons:

PCA can explain a strategy without saving it. If your components are collinear, PCA tells you what the underlying dimensions are. Whether those dimensions price is an empirical question PCA cannot answer for you. Don’t confuse “I now understand my signal” with “my signal works.”
Don’t equal-weight PCs either. Same trap as equal-weighting raw components, one level up. If PC1 prices and PC2 doesn’t, you want PC1, not their average. Variance-explained is not a proxy for pricing content.

The honest version of “throw PCA at it” is: use PCA as a diagnostic for what the signal actually is, then make a separate, deliberate decision about which axes (if any) to trade.

All PCA outputs, loading tables, and PC long-short Sharpes are in the paper’s robustness section and regenerate from make robust. Code: github.com/faketut/qmj-tsx.

How I Made a Quant Paper Reproducible in `make all` Under a Minute

Jian Feng — Sun, 07 Jun 2026 01:00:00 GMT

The QMJ-TSX project has a hard constraint baked into the design: a fresh clone, on a normal laptop, with no subscriptions, should regenerate every number in the paper — and the paper PDF itself — in under a minute. This post is about how the project meets that constraint and why it was worth treating reproducibility as a design parameter rather than a politeness.

The acceptance test

git clone https://github.com/faketut/qmj-tsx
cd qmj-tsx
uv sync
make all

If, at the end of make all, paper/main.pdf exists and its headline numbers match the version on GitHub, the project is working. That is the acceptance test, and CI enforces it. Everything below is in service of keeping that loop short and unambiguous.

The four pieces

1. `uv` for the Python environment

uv replaces pip + venv + pip-tools with a single fast resolver. uv sync reads pyproject.toml and uv.lock, builds a hermetic venv, and is done in seconds on a warm cache. There is no requirements.txt, no Conda, no Docker. Two reasons this matters:

A reader who is bouncing off your repo will not install Conda or Docker to read your paper. They will close the tab.
A locked resolver means the numbers I report today will still resolve to the same library versions in two years. That is the whole point of a lock file.

2. `make` as the command surface

The Makefile is the canonical entry point:

Target	Produces
`make data`	Cached parquets: prices, AQR benchmarks, Ken French FF5+UMD
`make signals`	paper-Q monthly panel
`make backtest`	Long–short returns + summary
`make robust`	Headline sweep, sector-exclusion, per-component horse race
`make figures`	`paper/figures/cumret.pdf`
`make paper`	`paper/main.pdf` (via Typst)
`make all`	All of the above
`make test`	Unit + invariant tests

make is not glamorous, but it is the lowest-common-denominator build tool. Everyone has it. Targets compose. Failed targets stop the pipeline at the failure site, which is exactly what you want for a research build.

3. A typed CLI surface, not notebook cells

Underneath make, every step is a qmj subcommand:

qmj data prices                   # yfinance monthly parquets
qmj data benchmarks               # AQR QMJ/BAB-Canada
qmj data ken-french               # FF5-DEV + UMD-DEV
qmj replicate                     # AQR QMJ-CAN baseline + FF5 cross-check
qmj signals paper-q               # price-based Quality panel
qmj backtest                      # long–short portfolio + summary
qmj robust                        # weighting × buckets × subperiod × cost sweep
qmj figure cumret                 # cumulative-return figure

The CLI exists so that every number in the paper has a deterministic, single-command provenance. The number for the post-COVID Sharpe came from qmj robust, not from a notebook cell I ran in some order I can no longer remember. Notebooks are great for exploration and terrible for archival. Promote anything you intend to cite into a CLI command.

4. Parquet caches under `data/`

Raw downloads (yfinance prices, AQR CSVs, Ken French ZIPs) land in data/raw/ and are gitignored. Processed monthly panels are parquet under data/processed/. Steps downstream of data are fully offline. Two practical wins:

make all after make data runs in seconds because nothing re-hits the network.
A future reader whose internet is broken (or whose data source has rotted) can still reproduce everything from the released parquet bundle.
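The cache-then-offline pattern behind both wins can be sketched in a few lines of standard-library Python. This is an illustration only: `cached_bytes` and the `fetch` callback are hypothetical names, and the actual project stores parquet files rather than raw bytes.

```python
from pathlib import Path
from typing import Callable


def cached_bytes(path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return the file at `path`, calling `fetch` only on a cache miss."""
    if path.exists():
        return path.read_bytes()  # offline path: no network call at all
    payload = fetch()  # first run only, e.g. a download
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return payload
```

Every downstream step reads through the cache, so a second run never touches the network and a shipped cache directory is enough to reproduce the results.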

The paper compiles too

The paper is in Typst (paper/main.typ + paper/sections/*.typ + paper/tables/*.typ). make paper runs typst compile on it and produces paper/main.pdf. There is no separate “build the paper” ritual disconnected from “build the numbers.” The same make all that regenerates the backtest also re-compiles the paper that cites the backtest. (More on the Typst choice in a later post.)

What this buys you

Three things that compound:

Reviewers and readers can verify you. Anyone who suspects a number can reproduce it without asking me a single question. That is — and this is the dirty secret of empirical finance — far from the default.
Future-you can extend without archaeology. Six months from now, when I want to add a new robustness cell, I add a CLI subcommand and a make target. I do not re-derive what paper-Q was.
The repo is its own demo. A hiring manager reading the README sees the acceptance test and either runs it or doesn’t. Either way the bar is concrete.

What I would skip if you’re starting from scratch

Don’t bother with Docker for a project this size. uv plus pinned Python in pyproject.toml is enough.
Don’t ship notebooks as primary deliverables. Ship a CLI and a make target. A notebook can be a demo of the CLI; it cannot be the canonical source of any number that ends up in your paper.
Don’t over-engineer the data layer. Parquet files in a flat data/processed/ directory, named after what produced them. No database. No DVC. You can graduate to those when the dataset outgrows a laptop.

The whole pipeline is maybe 1,500 lines of Python plus a hundred lines of Typst. The point isn’t that the project is small — it’s that reproducibility doesn’t require it to be large.

Repo: github.com/faketut/qmj-tsx. The acceptance test is make all after uv sync.

Pre-Registering Replication Gates for a Solo Quant Project

Jian Feng — Sun, 07 Jun 2026 01:00:00 GMT

The QMJ-TSX paper has two findings: a successful replication of the AQR QMJ-Canada series, and an unsuccessful extension to a price-based proxy on TSX small-caps. The reason I’m comfortable publishing the negative result is that I wrote down the bar for “success” before I ran the test.

This is what people mean by pre-registration. In medicine and psychology it is a formal mechanism; in solo quant work it is usually just a habit, and a rare one. This post is the case for adopting it even when nobody is forcing you to.

The two gates

I committed to two numerical gates in writing before the analysis:

Gate 1 — Replication tolerance. The Sharpe ratio of my recomputed AQR QMJ-Canada series, over the comparable sample, must fall within ±0.30 of the 0.65 figure reported in AFP (2019) Table II for Canada.

Outcome: replicated Sharpe = 0.64. Within tolerance. Pass.

Gate 2 — Calibration of the extension. The Spearman rank correlation between my fundamentals-free paper-Q long-short and the AQR QMJ-Canada series, over the common sample, must be ≥ 0.3 for me to claim paper-Q “captures the same construct.”

Outcome: contemporaneous correlation = −0.03, regression β = −0.08 (t = −0.38), R² ≈ 0. Fail.

These are not p-values. They are pre-committed numerical bands on the actual quantities the paper is making claims about. Setting them in advance is the whole point.
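One way to keep such gates from drifting is to encode them as code with the thresholds frozen as defaults. The sketch below is hypothetical (the names `GateResult`, `replication_gate` and `calibration_gate` are not from the repo); it only replays the two comparisons with the numbers quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def replication_gate(observed: float, published: float = 0.65,
                     tol: float = 0.30) -> GateResult:
    """Gate 1: replicated Sharpe must sit within +/- tol of the published figure."""
    gap = abs(observed - published)
    return GateResult("replication", gap <= tol,
                      f"gap {gap:.2f} vs tolerance {tol:.2f}")


def calibration_gate(observed_corr: float, floor: float = 0.30) -> GateResult:
    """Gate 2: correlation with the benchmark series must reach the floor."""
    return GateResult("calibration", observed_corr >= floor,
                      f"correlation {observed_corr:+.2f} vs floor {floor:.2f}")


gate1 = replication_gate(0.64)   # replicated Sharpe from the post
gate2 = calibration_gate(-0.03)  # observed correlation from the post
```

Here `gate1.passed` is true and `gate2.passed` is false, matching the outcomes reported above. Changing a threshold means editing a default in version control, which leaves exactly the kind of trace the gates are meant to create.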

Why this matters more for a solo project, not less

The standard argument for pre-registration is to defend against researcher degrees of freedom — the small choices (sample window, weighting, winsorisation, sector exclusions) that, taken together, let you nudge a borderline result into significance. In a team setting there is at least social friction against this. In a solo setting there is none. You can rerun any cell any number of times, and the only person who would notice is you.

A pre-registered gate creates artificial friction. Once it is written down, moving it requires you to consciously admit you are moving it. That is a low bar, but it turns out to be a meaningful one.

The asymmetry that makes nulls publishable

There is a respectable version of “my strategy didn’t work” and a disrespectable one. The disrespectable version reads:

I tried a bunch of variants. None of them were significant. I’m calling that a negative result.

The respectable version reads:

I committed in advance to the following falsifiable test. The test failed in this specific way. Here is what we learn from the failure.

Only the second version contains information. The first is indistinguishable from a strategy that almost worked, dressed up as humility.

For the paper-Q work, gate 2 is what makes the null informative. The pre-committed claim — “if a price-based proxy captures the same underlying Quality construct, rank correlation with the fundamentals-based version should be at least 0.3” — is the thing being tested. The observed correlation of −0.03 is not “small and inconclusive.” It is “comprehensively below the bar I set.” That is publishable evidence about the limits of fundamentals-free proxies, not a strategy I am still fishing for.

How to actually do it, lightly

Solo pre-registration does not require ritual. A few things that worked for me:

Write the gates into the project plan, not just into your head. I keep them in memories/session/plan.md with timestamps. Anything not in writing didn’t happen.
Make the gates numerical. “Reasonable replication” is not a gate; “Sharpe within ±0.30 of the published number” is.
State the consequence in advance. “If gate 2 fails, the paper’s claim shifts from ‘paper-Q recovers QMJ’ to ‘paper-Q fails to recover QMJ, and here is the per-component decomposition that tells us why.’” The fallback analysis is part of the pre-registration, not a post-hoc rescue.
Don’t tune the gate to the data. The temptation is real. If the observed Spearman is 0.18 and your gate was 0.3, the answer is “fail,” not “0.15 is fine, actually.”
Report the result against the pre-committed gate in the paper. Not just the number — the comparison to the bar.

That is the whole methodology. Five sentences in a markdown file and a discipline about not editing them after the fact.

A second-order benefit

A pleasant side effect of gate 2 failing is that the per-component decomposition and the PCA analysis became the most interesting parts of the paper. If the gate had passed I would have written a competent replication-plus-extension paper that nobody would have cared about. Because it failed, I was forced to ask why it failed — and the answer (“a one-factor post-COVID low-vol unwind that the composite was masking”) is the generalisable finding.

Pre-registration didn’t just protect me from a soft positive result. It pointed me at the real one.

Paper, gates, and the full results table: github.com/faketut/qmj-tsx.

When a Famous Anomaly Refuses to Travel: QMJ on TSX Small-Caps

Jian Feng — Sat, 06 Jun 2026 01:00:00 GMT

The Quality Minus Junk (QMJ) factor of Asness, Frazzini, and Pedersen (2019) is one of the better-documented anomalies of the past decade: high-quality firms — profitable, growing, safe, well-managed — earn persistently higher risk-adjusted returns than low-quality firms across 24 developed markets. AQR even publishes the monthly QMJ-Canada series on its datasets page, so the headline is independently verifiable by anyone with a spreadsheet.

What AQR does not publish is the underlying long-short on TSX small-caps. That universe is where I wanted to deploy the strategy — and the fundamentals AQR uses (gross profitability, accruals, leverage, payout ratios) are not free at the coverage or point-in-time fidelity the construction requires.

So I asked a narrower question: can a price-derived proxy recover the QMJ premium on TSX small-caps? This post is the headline answer. Spoiler: no, and the way it fails turns out to be more interesting than a clean replication would have been.

Step 1: replicate what we can replicate

Before extending anything, the replication gate. Using the public AQR QMJ-Canada series (1989-07 to 2026-03, 441 monthly observations):

Statistic	Value
Annualised excess return	8.6%
Annualised volatility	13.4%
Sharpe	0.64
Max drawdown	−37.0%
Carhart-CAN 4-factor monthly α	0.70% (t = 4.46)
→ annualised α	≈ 8.8%

The Sharpe of 0.64 sits within 0.01 of the 0.65 reported in AFP 2019 Table II for Canada — comfortably inside my pre-registered ±0.30 tolerance. As an external cross-check, regressing the same series on Ken French’s Developed FF5 + momentum panel keeps α positive and significant (0.52%/month, t = 3.00) and produces the predicted loading on the profitability factor RMW (β = +0.61, t = 4.16). The construct is intact. The published premium is real. Replication gate passed.

Step 2: the extension that doesn’t work

To deploy on TSX small-caps without fundamentals, I built paper-Q — a fundamentals-free quality proxy from five price- and return-derived components, sign-aligned to AFP’s Safety leg:

idiosyncratic volatility,
market beta,
maximum drawdown,
rolling Sharpe,
downside semi-deviation.

Cross-sectionally z-scored, equal-weight composited, value-weighted tercile long-short, monthly rebalance. 109-ticker hand-curated TSX small/mid-cap universe. Sample 2011-12 to 2025-11 (168 months).
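For readers who want the construction spelled out, here is a one-month sketch in plain Python. The function names and the six-ticker toy data are assumptions for illustration; the sketch ignores costs, missing data and ties, and is not the repo's implementation.

```python
from statistics import mean, pstdev


def zscore(values):
    """Cross-sectional z-score for one component on one date."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def composite(components):
    """Equal-weight average of z-scored, already sign-aligned components."""
    z = [zscore(c) for c in components]
    return [mean(per_ticker) for per_ticker in zip(*z)]


def tercile_long_short(score, next_ret, mcap):
    """Value-weighted return of the top score tercile minus the bottom one."""
    order = sorted(range(len(score)), key=lambda i: score[i])
    n = len(order) // 3
    low, high = order[:n], order[-n:]

    def vw(idx):
        total = sum(mcap[i] for i in idx)
        return sum(mcap[i] * next_ret[i] for i in idx) / total

    return vw(high) - vw(low)


# Toy month (assumption): two perfectly collinear components, six tickers.
score = composite([[1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12]])
ls = tercile_long_short(score,
                        next_ret=[0.01, 0.02, 0.0, 0.0, 0.03, 0.05],
                        mcap=[1, 1, 1, 1, 1, 3])
```

Repeating this every month and chaining the `ls` values gives the long-short return series whose statistics appear in the table below.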

Headline:

Statistic	Value
Annualised gross return (VW)	+1.0%
Annualised volatility	30.6%
Sharpe (VW)	0.03
Sharpe (EW)	−0.33
Avg. monthly leg turnover	7.4%

The key diagnostic — does paper-Q capture the same construct as AQR QMJ? — is also clean and disappointing. Regressing paper-Q on QMJ-CAN gives β = −0.08 (t = −0.38), R² ≈ 0, contemporaneous correlation −0.03. My pre-registered calibration gate (Spearman ρ ≥ 0.3) is not met. A Carhart-CAN regression of paper-Q itself produces an insignificant α (t = 0.26).

The price-derived proxy, in this universe, is essentially uncorrelated with fundamentals-based Quality. Falsification.

Why the null is the result

A null that you pre-registered against is a different object from a null you stumbled into. I committed in advance to a tolerance band on the replication Sharpe and a calibration floor on the paper-Q-vs-QMJ-CAN correlation. The replication passed; the extension failed. That is publishable evidence about the limits of fundamentals-free proxies in resource-heavy small-cap universes, not a strategy I’m now going to fish for.

There are at least three plausible mechanisms behind the failure:

Sectoral contamination. Junior energy and mining names dominate the TSX small-cap universe. The “low-volatility” leg of any price-based Safety proxy ends up holding defensives whose risk is structurally distinct from operational Quality.
Accounting inputs that don’t have price analogues. Accruals and payout ratios depend on balance-sheet flows whose price proxies are dominated by sector exposure.
Survivorship in the free data. yfinance only shows me names that still trade — likely biasing toward winners and blunting any defensive premium. (Separate post coming on this.)

What’s actually interesting

The full-sample null masks a clean regime break around COVID:

Period	Annualised return	Net Sharpe
2011-12 → 2020-02	+14.3%	+0.47
2020-03 → 2025-11	−18.1%	−0.60

That flip is what the next two posts in this series are about. A sector-exclusion cut (dropping Energy + Materials) only recovers about a third of the post-COVID damage — so this is not purely a resource-sector story. A per-component decomposition shows that four of the five paper-Q components are essentially the same low-volatility signal in different statistical clothing, and they all turned over together. That’s the real finding hiding inside the composite, and it is what I think generalises beyond this paper.

Paper, code, and reproducible pipeline: github.com/faketut/qmj-tsx. make all regenerates every number above in under a minute on a modern laptop.

A Low-Vol Unwind Hiding Inside a Composite Signal

Jian Feng — Sat, 06 Jun 2026 01:00:00 GMT

This is the second post in a series on a price-based Quality factor (“paper-Q”) I built for TSX small-caps. The headline result — that paper-Q does not recover the AQR QMJ-Canada premium — is in the previous post. Here I want to talk about what I found when I cracked the composite open.

If you build any composite signal by averaging z-scored components, the most boring failure mode is also the one that’s easiest to overlook: your components secretly all measure the same thing. A null result on the composite then tells you nothing about whether the underlying construct works — it just tells you that the average of N copies of one signal is, surprise, that signal.

That is essentially what happened to paper-Q.

The setup

paper-Q averages five sign-aligned, z-scored components:

idiosyncratic volatility,
market beta,
maximum drawdown,
rolling Sharpe,
downside semi-deviation.

Each is supposed to be a proxy for some part of AFP’s Safety leg. Four of them — idio vol, beta, max drawdown, downside semi-dev — are volatility-flavoured. The fifth, rolling Sharpe, is the only one with a price-momentum flavour.

To see which components were actually driving the composite, I ran a per-component horse race: each component as a standalone value-weighted tercile long-short, same 10 bps round-trip cost, three windows (full sample, pre-COVID, post-COVID).
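The three-window comparison reduces to computing an annualised Sharpe on slices of one monthly return series. A minimal sketch, assuming a dict keyed by "YYYY-MM" strings and a split at 2020-03 (the first post-COVID month in this series); the helper names are hypothetical and transaction costs are left out.

```python
from math import sqrt
from statistics import mean, stdev


def annualised_sharpe(monthly_returns):
    """Mean over sample standard deviation, scaled from monthly to annual."""
    return mean(monthly_returns) / stdev(monthly_returns) * sqrt(12)


def sharpe_by_window(returns_by_month, split="2020-03"):
    """Sharpe for the full sample and for the months before / from `split`."""
    months = sorted(returns_by_month)
    windows = {
        "full": months,
        "pre": [m for m in months if m < split],
        "post": [m for m in months if m >= split],
    }
    return {name: annualised_sharpe([returns_by_month[m] for m in ms])
            for name, ms in windows.items()}


# Toy series (assumption): a signal that works before the split and reverses after.
toy = {"2019-11": 0.02, "2019-12": 0.01, "2020-01": 0.03, "2020-02": 0.02,
       "2020-03": -0.02, "2020-04": -0.01, "2020-05": -0.03, "2020-06": -0.02}
result = sharpe_by_window(toy)
```

On the toy series the full-sample Sharpe is zero even though both halves are far from zero, which is the same masking effect the table below shows for the real components.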

The result

The pattern is sharp.

Component	Full Sharpe	Pre-COVID	Post-COVID
Idiosyncratic vol	low / negative	+ (0.12 to 0.57 band)	− (−0.53 to −0.92 band)
Market beta	low / negative	+	−
Max drawdown	low / negative	+	−
Downside semi-dev	low / negative	+	−
Rolling Sharpe	+0.32	+0.66	−0.10

Two things jump out.

One. Four of the five components are mechanically the same underlying signal — the cross-section of price volatility, viewed through slightly different statistics. All four post positive pre-COVID Sharpes between roughly +0.12 and +0.57. All four collapse to comparable negative numbers post-COVID, between −0.53 and −0.92. This is one factor turning over, not four independent signals coincidentally agreeing.

Two. Rolling Sharpe — the only price-momentum-flavoured component — behaves qualitatively differently. It has the highest full-sample Sharpe in the set (+0.32), the highest pre-COVID number (+0.66), and the least negative post-COVID Sharpe (−0.10).

So the composite’s full-sample ≈ 0 Sharpe is, mechanically, the average of one positive signal and four highly correlated negative ones. Equal-weighting masked the heterogeneity entirely.

Why this matters beyond paper-Q

The narrow conclusion is about this strategy: the post-COVID failure of paper-Q is specifically a low-volatility unwind, not a generic breakdown of price-based signals on TSX small-caps. Every volatility-flavoured price statistic in my set turned over together in March 2020 and has not recovered; the price-momentum component was comparatively unaffected.

The general conclusion is about signal construction. Two practical rules of thumb that I’d defend more strongly now than I would have before this exercise:

Always run components standalone before you composite them. The cost is N extra backtests. The benefit is that you find out before publication whether you have one factor or N factors. Equal-weighting near-collinear components is not “diversification” — it’s just a noisier version of the underlying signal, with a worse story attached.
Decompose first, then design the weighting. If four of five components turn out to be one factor, the right composite weights them by uniqueness, not equally. PCA residuals are the obvious next move — and the next post in this series will work through what PCA actually does and does not buy you here. (Short version: it explains the regime break cleanly, but it doesn’t rescue the strategy.)

A meta-lesson

A composite signal with five inputs and a null full-sample result looks like a clean negative finding. It almost wasn’t. The clean negative finding is the per-component table above. Without it, I would have written a paper claiming “fundamentals-free Quality doesn’t work on TSX small-caps” when what I had actually shown was “a particular equal-weighted low-vol composite doesn’t work on TSX small-caps, in a way that says nothing about rolling Sharpe.”

Decomposition is cheap. Run it.

Code and the full robustness battery (including the per-component table above): github.com/faketut/qmj-tsx. Reproduce with make robust.

Beating the reference compiler by 5×: a WLP4 → ARM64 optimization journey

Jian Feng — Sun, 17 May 2026 19:00:00 GMT

TL;DR. A four-pass compiler for the CS241 teaching language WLP4, targeting a restricted ARM64 subset. After a handful of focused codegen tricks and a CI loop that measures every push, our .com outputs come out 79.6% smaller than those of the course’s reference compiler wlp4c across a 65-program benchmark — every single test is smaller, with reductions ranging from 63% (heap-intensive) to 92% (pure arithmetic).
What follows is the engineering log, not just the numbers: which optimizations actually mattered, which I deliberately didn’t do, and why the development loop is structured the way it is.

Setting and constraints
The starting point
Phase 1 — Building a safety net
Phase 2A — Diagnostics that survive contact with reality
The optimizations that earned their keep
Phase 5 — Measuring like we mean it
Phase 4 — CMake + a real driver
What I deliberately did not do
Lessons
Appendix: full benchmark table

1. Setting and constraints

WLP4 is a very small C-flavored language used in a university compilers course:

Two scalar types: long and long*.
Entry point is wain(a, b), not main.
Procedures, locals, if/else, while, *p / &x, new[], delete[], println, putchar, getchar.
No early return, no for, no structs, no globals.

The target is a restricted ARM64 subset — only a curated handful of instructions are accepted by the course emulator (bin_ref/arm64emu): essentially add/sub/mul/smulh/umulh/sdiv/udiv, cmp, b/b.cond/br/blr, ldur/stur with 9-bit signed immediates, and a PC-relative ldr xN, imm. No mov, no movz/movk, no ldp/stp, no register-immediate add. Constants come exclusively from a PC-relative literal pool. This sounds annoying but it’s actually the source of most of the size win — see §5.3.

The “oracle” we’re racing against is bin_ref/wlp4c, the canonical course compiler. Both produce the same .com file format (header + program + relocation/import/export footer), both go through the same assembler bin_ref/linkasm, both are run under the same arm64emu. So wc -c program.com is a clean apples-to-apples comparison.

2. The starting point

Before this session, the compiler was already past “naive”. The previous commit (6a9ed5d wlp4gen: trivial-leaf frame elision + tail-call optimization) had two big wins in place:

Trivial-leaf frame elision — procedures with no locals, no &param, no calls, and whose body is a single return expr; skip prologue/epilogue entirely. Just compute expr, br x30.
Tail-call optimization — return f(...) reuses the caller’s frame.

What was missing was:

A regression net I could trust before changing codegen.
Hard numbers on whether the optimizations were actually paying off.
A way to develop on macOS without manually round-tripping every test through a Linux VM (the course tools are x86-64 ELF only).

So the work split into two threads: infrastructure first, then codegen.

3. Phase 1 — Building a safety net

Commit a6f4a75. 12 new test programs, a portable test runner, a GitHub Actions workflow.

The expanded corpus was deliberately picked to cover the parts of wlp4gen most likely to silently regress:

Parameter-count boundary cases (four_args, six_args, eight_args) — the ARM64 calling convention uses x0–x7 for the first 8 args; the 9th lives on the stack. The code path that spills overflow params is exercised exactly once in normal usage; without an explicit test it’s easy to break.
Pointer arithmetic (ptr_arith_sub, ptr_ptr_sub) — long* − long* returns the element distance, not the byte distance. Three different sites need to agree on that fact.
Heap with loops (alloc_loop) — exercises the runtime init/new/delete imports plus the linker.
Nested calls (nested_call) — call-saves around argument evaluation.

The portable runner

scripts/run-tests.sh does one branch at the top:

if [[ "$(uname -s)" == "Darwin" ]]; then
  exec colima ssh -- bash -lc "cd '$ROOT' && bash scripts/run-tests.sh"
fi

On macOS, it re-exec’s itself inside colima (a small Lima VM running x86_64 Ubuntu) so the course’s Linux binaries Just Work. On Linux (CI), it runs directly. This costs ~5 seconds of VM ssh overhead per local run, and zero in CI. No code duplication, no environment matrix, no flaky cross-compilation.

One subtle bug I hit later

A few hours in I started seeing intermittent failures — different tests would fail on each invocation. Classic race. The fix was a two-line change in Phase 2A:

# Per-invocation scratch dir so concurrent runs don't clobber each other.
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT

The original script used hardcoded /tmp/got.wlp4ti, /tmp/our.com, etc. Fine for serial runs, broken the moment two colima ssh sessions overlap (which happens whenever an editor agent kicks off a verification while a previous one is still draining). Lesson: even for a “one-developer test script”, mktemp -d is one line for an unbounded amount of debugging avoided.

4. Phase 2A — Diagnostics that survive contact with reality

Commit c254eac.

The original scanner printed ERROR: unexpected character. Type errors printed ERROR: type mismatch. Useless once your program is more than 20 lines.

The fix in src/wlp4scan.cc is mechanical but worth describing because the shape of the change matters:

size_t line_no = 1;
size_t lineStart = 0;
auto advancePos = [&](size_t k) {
    for (size_t i = 0; i < k; ++i) {
        if (input[pos + i] == '\n') {
            ++line_no;
            lineStart = pos + i + 1;
        }
    }
    pos += k;
};

Two state variables, one lambda. Every pos += longest became advancePos(longest); whitespace and comment skips update inline. Now errors say ERROR scan at line 14 col 9: unexpected '@'. The point is: this is non-invasive instrumentation. The scanner’s hot path got one extra if (input[pos+i] == '\n') per character — negligible — and the rest of the file is untouched.

For wlp4type, the trick was a single file-scoped static string g_curProc; that gets set at the top of each for (Node* proc : procedureNodes) iteration. Every existing reportError(detail) site now produces ERROR type in foo(): without touching the dozens of error sites individually. Surgical change > rewrite.

5. The optimizations that earned their keep

Now the meaty part. The codegen in src/wlp4gen.cc is 1300 lines. Here are the ideas that actually moved the benchmark needle, in approximate order of contribution.

5.1 Trivial-leaf frame elision (pre-existing)

For a procedure like:

long add(long a, long b) { return a + b; }

Reference compiler emits ~30 instructions: prologue with sub sp, stur x29, stur x30, body, epilogue with restores, br x30. We emit:

Padd:
  add x0, x0, x1
  br x30

Two instructions. The frame setup is dead weight when the function has no locals, no &p, no calls, and a body that is just return expr; with nothing before it. Walking the AST once at the top of emitProcedure to check this is cheap and unlocks a 90%+ size win on any small leaf — which is the majority of WLP4 test programs.

This was inherited from the prior commit, but I want to flag it because the benchmark would be ~−40% instead of ~−80% without it. The biggest optimization is the one you can avoid emitting code for entirely.
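The eligibility check is just a conjunction of four facts about the procedure. As a hedged sketch in Python (the compiler itself is C++, and the `Proc` record and its field names are invented here for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    name: str
    n_locals: int = 0
    takes_address: bool = False   # any &p in the body
    n_calls: int = 0
    statements: list = field(default_factory=list)  # statements before the final return


def is_trivial_leaf(p: Proc) -> bool:
    """True when prologue and epilogue can be skipped entirely."""
    return (p.n_locals == 0
            and not p.takes_address
            and p.n_calls == 0
            and not p.statements)


add = Proc("add")                  # long add(long a, long b) { return a + b; }
caller = Proc("caller", n_calls=1)
```

`is_trivial_leaf(add)` holds, while a single call is enough to disqualify `caller`, since the link register would then need saving.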

5.2 Tail-call optimization (pre-existing)

return f(args) reuses the caller’s frame: jump to f with b Pf instead of blr Pf; br x30. Combined with §5.1, recursive functions like fact end up tight loops.

5.3 The per-procedure literal pool with dedup

This is the most architecturally interesting piece, because the ARM64 subset forces it: there’s no immediate-form add or mov. You cannot say add x0, x0, #4. You can only add register to register. So every numeric constant has to come from memory, loaded via PC-relative ldr xN, imm.

The reference compiler’s approach: every time you need a constant, emit a 5-instruction sequence (ldr xN, 8; b 12; .8byte K; ...) that hops over an inline 8-byte literal. Three uses of 4 → three inline literals → 60 bytes.

Our approach, split between emitLoadLitPayload and finalizeLiteralPool:

void emitLoadLitPayload(int reg, const string& payload) {
    auto [it, inserted] = payloadToId.try_emplace(payload, idToPayload.size());
    if (inserted) idToPayload.push_back(payload);
    string tag = fmt("PFIX", it->second, "!");      // sentinel for patching
    fixups.push_back({tag, payload});
    emit(fmt("  ldr x", reg, ", ", tag));
}

Each unique constant gets one .8byte slot at the end of the procedure, and every load is a single ldr xN, whose offset gets patched in finalizeLiteralPool after we know the final layout. The patching is a simple two-pass linear scan: first pass records the byte address of every emitted line, second pass replaces the PFIXn! tag with the computed signed offset.
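The two-pass patch is easy to model outside the compiler. The Python sketch below is an approximation under stated assumptions: every non-label line is a 4-byte instruction, labels occupy no space, the pool starts right after the last instruction, and pool alignment is ignored. The function name mirrors the post, but the body is illustrative rather than a port of the C++.

```python
import re


def finalize_literal_pool(lines, payloads):
    """Replace PFIX<n>! tags with PC-relative byte offsets and append the pool."""
    # Pass 1: record the byte address of every emitted line.
    addresses, pc = [], 0
    for line in lines:
        addresses.append(pc)
        if not line.strip().endswith(":"):  # labels take no space
            pc += 4
    pool_start = pc  # slot n lives at pool_start + 8 * n

    # Pass 2: swap each tag for (slot address - address of the load itself).
    def patch(line, here):
        return re.sub(
            r"PFIX(\d+)!",
            lambda m: str(pool_start + 8 * int(m.group(1)) - here),
            line,
        )

    patched = [patch(line, here) for line, here in zip(lines, addresses)]
    return patched + [f"  .8byte {p}" for p in payloads]


out = finalize_literal_pool(
    ["Pf:", "  ldr x0, PFIX0!", "  ldr x1, PFIX1!", "  ldr x2, PFIX0!",
     "  add x0, x0, x1", "  br x30"],
    ["4", "8"],
)
```

The trailing `!` in the tag is what keeps `PFIX1!` from matching inside `PFIX10!`, so plain textual substitution stays safe as the pool grows.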

Concrete impact: a 5-call program with 4 used 5 times costs us 8 bytes for the slot + 5 × 4 bytes for the ldr = 28 bytes. The reference: 5 × ~20 bytes = 100 bytes. And literal pools amortize across the whole procedure, so the savings compound with size.
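The arithmetic generalises to any mix of constants. A small sketch, using the post's own figures (4 bytes per ldr, 8 bytes per pool slot, roughly 20 bytes per inline load in the reference scheme); the function names are made up for the example.

```python
INSTR_BYTES = 4          # one ARM64 instruction
SLOT_BYTES = 8           # one .8byte pool entry
INLINE_LOAD_BYTES = 20   # the post's rough per-load cost for the reference scheme


def pooled_size(constants_loaded):
    """Bytes spent on constants with a deduplicated per-procedure pool."""
    unique = len(set(constants_loaded))
    return unique * SLOT_BYTES + len(constants_loaded) * INSTR_BYTES


def inline_size(constants_loaded):
    """Bytes spent when every load carries its own inline literal."""
    return len(constants_loaded) * INLINE_LOAD_BYTES


five_fours = [4] * 5
```

`pooled_size(five_fours)` gives 28 and `inline_size(five_fours)` gives 100, the two figures quoted above. The gap widens with every repeated constant, because a repeat costs 4 bytes pooled and about 20 inline.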

This single mechanism is, I’d estimate, half of the total benchmark win.
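
The two-pass patching can be sketched in a few lines of Python. This is an illustration, not the project's C++: it assumes every emitted line is one 4-byte instruction, ignores pool alignment, and uses made-up helper names (finalize_literal_pool, pooled_bytes, inline_bytes).

```python
import re

TAG = re.compile(r"PFIX(\d+)!")  # same sentinel shape as in the post


def finalize_literal_pool(lines, payloads):
    """Patch PFIXn! tags with PC-relative byte offsets and append the pool.

    Assumes each entry of `lines` is one 4-byte instruction and that
    payloads[n] is the constant behind tag PFIXn!. Alignment is ignored.
    """
    # Pass 1: line i lives at byte 4*i, so pool slot n lives right after the code.
    pool_base = 4 * len(lines)
    patched = []
    # Pass 2: replace each tag with (slot address - address of the load).
    for i, line in enumerate(lines):
        match = TAG.search(line)
        if match:
            slot_addr = pool_base + 8 * int(match.group(1))
            line = TAG.sub(str(slot_addr - 4 * i), line)
        patched.append(line)
    return patched + [f"  .8byte {p}" for p in payloads]


def pooled_bytes(uses, unique):
    return 8 * unique + 4 * uses  # one slot per constant, one ldr per use


def inline_bytes(uses):
    return 20 * uses  # roughly 20 bytes per use, per the figure above


asm = finalize_literal_pool(
    ["  ldr x0, PFIX0!", "  ldr x1, PFIX0!", "  br x30"], ["4"]
)
print("\n".join(asm))
print(pooled_bytes(5, 1), inline_bytes(5))  # 28 100
```

Two loads of the same constant share one slot; only the offsets differ because each load sits at a different address.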

5.4 Constant folding in `isConst`

The arithmetic chain templates in the generated corpus show the most extreme ratio: arith_chain_8 is 91.6% smaller. Why?

long wain(long a, long b) {
  return (((((((28 * 41) - 18) + 38) * 23) + 49) * 50) + 12);
}

isConst walks the expression tree and folds + − × ÷ % into a single literal. Combined with §5.1 (trivial-leaf elision), the entire procedure becomes:

Pwain:
  ldr x0, 8
  br x30
  .8byte

Two instructions plus one 8-byte literal: 16 bytes in total. The reference compiler builds the full AST evaluator at runtime: load 28, load 41, mul, …, one constant-load + arithmetic chain per operator. For an 8-operand chain that’s ~50 instructions of dead work.

Note that isConst only folds when both operands are themselves constant — it doesn’t try partial evaluation, it doesn’t reorder for associativity, it doesn’t fold across pointer types. The simple cases handle 90% of opportunities.
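
A minimal Python sketch of the same rule, to make "folds only when both operands are constant" concrete. The tuple-based AST and the name fold are assumptions for illustration; the real isConst works on the compiler's own tree, and this sketch does not model integer overflow.

```python
def fold(node):
    """Return the integer value of a fully constant expression, else None.

    A node is an int literal, a variable name (str), or (op, lhs, rhs).
    """
    if isinstance(node, int):
        return node
    if isinstance(node, str):  # a variable: not a compile-time constant
        return None
    op, lhs, rhs = node
    a, b = fold(lhs), fold(rhs)
    if a is None or b is None:  # no partial evaluation
        return None
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op in ("/", "%") and b != 0:
        # C-style division truncates toward zero, unlike Python's //.
        sign = 1 if (a < 0) == (b < 0) else -1
        quotient = sign * (abs(a) // abs(b))
        return quotient if op == "/" else a - quotient * b
    return None  # unknown operator or division by zero: leave it to runtime


chain = ("+", ("*", ("+", ("*", ("+", ("-", ("*", 28, 41), 18), 38), 23), 49), 50), 12)
print(fold(chain))            # 1345662
print(fold(("+", "a", 1)))    # None
```

The first call folds the wain body shown above into one literal; the second stays unfolded because one operand is a variable.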

5.5 Parameter / local promotion to callee-saved registers

emitPrologue checks: if the procedure uses ≤ 9 named values (params + locals) and never takes & of any of them, all of them get assigned to x19..x27 instead of stack slots. The epilogue only saves/restores the registers actually used. The frame, if no other reason exists for it, gets a smaller belowFpBytes.

This is the single optimization most likely to break things — register allocation has to stay consistent across calls (save before, restore after) and across if/while branches. The way I keep it tractable: a single regTab: id → reg map per procedure built during prologue, consulted everywhere a local is read/written, and no further changes to the rest of the codegen. Either an id is in regTab (use the register) or it’s not (use frame offsets). One state machine, no per-statement bookkeeping.
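
The all-or-nothing decision and the single lookup can be sketched like this. It is a Python illustration under assumptions: the names build_reg_tab and location are hypothetical, and x29 as the frame pointer is the usual ARM64 convention rather than something stated above.

```python
CALLEE_SAVED = [f"x{n}" for n in range(19, 28)]  # x19..x27, nine registers


def build_reg_tab(names, address_taken):
    """Map every param/local to a register, or return {} to keep using the frame."""
    if len(names) > len(CALLEE_SAVED):
        return {}
    if any(name in address_taken for name in names):
        return {}
    return dict(zip(names, CALLEE_SAVED))


def location(name, reg_tab, frame_offsets):
    """The one question the rest of codegen asks: register or frame slot?"""
    if name in reg_tab:
        return reg_tab[name]
    return f"[x29, {frame_offsets[name]}]"


reg_tab = build_reg_tab(["a", "b", "i"], address_taken=set())
print(reg_tab)                                  # {'a': 'x19', 'b': 'x20', 'i': 'x21'}
print(build_reg_tab(["a", "p"], {"p"}))         # {}
```

Because the table is built once and never mutated, branches and calls cannot disagree about where a value lives.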

6. Phase 5 — Measuring like we mean it

Commit 55a2576 added tools/bench.sh. Three things matter about how it’s structured:

It runs the full compile + link pipeline on both sides, then wc -c the resulting .com files. Both go through the same linkasm, so footer/header overhead cancels out — the delta is the program section.
It re-execs into colima on macOS automatically, same trick as the test runner. Zero friction to run locally.
It generates CSV rather than a pretty table, so I can pipe it into anything (the GitHub Actions step posts a summary to $GITHUB_STEP_SUMMARY and uploads the raw CSV as a downloadable artifact).

Commit 2350dc4 added tools/gen_random_wlp4.py: 8 parametric templates (arithmetic chains, local sums, if-ladders, while-sums, multi-arg procs, nested calls, pointer walks, recursive fib) seeded deterministically. This grew the benchmark from 25 hand-written tests to 65 programs.

The data

Corpus	Files	Ours (bytes)	Reference (bytes)	Delta
Hand-written	25	7,212	36,668	−80.33%
Hand + generated	65	19,588	95,976	−79.59%

The −80% holds steady when corpus size and shape change. That’s the validation I wanted before claiming the result generalizes.

Picking apart a single program

For arith_chain_4:

	Ours	Reference
`.com` total bytes	124	1,328
Literal pool bytes (our side)	32	—
Reduction	−90.7%

124 bytes is essentially: ARMCOM header (20) + 6 instructions (24) + an aligned literal slot (8) + footer (~70). The compiler is at the floor; the remaining bytes are format overhead.

One amusing failure of the generator

My first cut of t_recursive produced:

long f(long n) {
    if (n <= 0) { return n; }
    else { return f(n - 1) + f(n - 2); }
}

…which is invalid WLP4. The grammar requires exactly one trailing return per procedure body, never inside if/else. I caught it because the parser flagged 7 out of 40 generated programs as unexpected token 'RETURN' (#15) in state 131. Three-line fix using a result variable; the generator now emits valid WLP4 100% of the time.

Lesson: a noisy parser is a feature, not a bug. If you can’t tell what’s wrong from the error, your generator/optimizer/refactor will silently swallow problems for hours. The Phase 2A line:col work paid for itself on the first non-trivial use.

7. Phase 4 — CMake + a real driver

Commit ebd2f47. The repo had build-toolchain.sh (4 invocations of g++) which was fine — but for anyone running the project from an IDE that has CMake integration, it was friction. Adding a minimal CMakeLists.txt was 30 lines of cmake + a WLP4_WERROR option for CI. The shell script stays as the zero-dependency fast path.

The bin/wlp4 driver is more interesting. It does:

wlp4 [-S | -c] [-o OUT] SRC.wlp4

with the macOS routing trick for -c:

if [[ "$uname_s" == "Darwin" ]] && command -v colima >/dev/null 2>&1; then
    colima ssh -- bash -lc "cat > /tmp/.wlp4-in.asm && \
        '$ROOT/bin_ref/linkasm' < /tmp/.wlp4-in.asm" < "$asm_tmp" > "$out"
else
    # native path
    "$LINKASM" < "$asm_tmp" > "$out"
fi

bin/wlp4 -c test/procedures/proc.wlp4 now Just Works on either host. This is not a big feature, but it removed a per-test mental tax that was discouraging quick experimentation.

8. What I deliberately did not do

Equally important. The original plan had nine work items; only six landed. Here’s what was cut and why.

Self-implemented `linkasm` / `binasm` / `linker-striparmcom`

The pitch: own the entire toolchain instead of vendoring Linux binaries from bin_ref/. The cost: ~1k–1.5k LoC of reverse-engineering, with no formal spec for the assembler syntax. The doc docs/armcom.txt is 60 lines and covers only the binary .com format, not the input language to the assembler. Every ARM64 mnemonic the codegen emits would need a hand-rolled encoder, validated byte-for-byte against the reference.

The benefit: native macOS testing without colima. Real value, but not on the critical path for any user-visible improvement. Skipped.

Parse-table extraction (constexpr arrays)

src/parse_tables.h embeds the LR tables as giant raw string literals; wlp4parse re-tokenizes them at startup. Replacing with constexpr arrays would shave the parser binary by ~30 KB and save a few ms of startup. Skipped: zero impact on any benchmark, full impact on the risk of breaking the parser on a .wlp4i shape we don’t have a test for.

Further wlp4gen micro-optimizations (dead-branch elimination, register-resident `i = i + 1`)

The benchmark is already at −80%. The remaining headroom is in patterns that essentially don’t occur in real WLP4 programs:

if (1 == 1) — nobody writes this; the constant-folded test never fires.
while (0) { ... } — same.
i = i + 1 collapsed into a single add — only saves cycles, not bytes, and only when i is already in a register. Maybe 1–2% on tight loops if I’m careful about correctness.

Risk-adjusted, these are negative-EV. Calling them out as “deferred until there’s a real driver” rather than secretly skipping them.

The discipline

Karpathy’s behavioral guidelines say: don’t add abstractions for one-time operations; don’t refactor code that isn’t broken; every changed line should trace to the user’s request. Applied to compiler work: don’t add an optimization that won’t show up on a benchmark you’ve already built. The benchmark is the success criterion. If it doesn’t move, the optimization didn’t happen.

9. Lessons

A few generalizable things, in order of how often I had to re-learn them:

Build the measurement before the optimization. I had Phase 1’s CI + Phase 5’s benchmark before I touched any codegen this session. Every subsequent decision had a number attached. The −80% headline is only meaningful because I can point at the script and the corpus that produced it.
A safety net plus a noisy error message ≈ unlimited iteration budget. Phase 1 (regression tests) and Phase 2A (line:col diagnostics) combined cost about 2 hours and saved an unknowable but large amount of debugging time. The flaky-tests episode in §3 would have been hours of head-scratching without line:col confirming the scanner was producing identical output on retry.
Surgical > rewrite. The scanner diagnostics change is two state variables, one lambda. The type-pass change is one static string. The test runner /tmp race fix is TMP=$(mktemp -d). Each ships in a commit with a clear blast radius. Compare against the alternative of “while we’re in there, let’s refactor”.
Restrictive targets force good architecture. The ARM64 subset has no immediate-form arithmetic. That’s annoying for a one-shot translator but it forces the literal-pool design, which then gives you dedup almost for free, which then gives you most of the size win. The constraint was the optimization.
Know when to stop. Six commits in, three planned items remained, all with the same property: high effort, negligible benchmark impact, real regression risk. The right call is to write up the work and put down the keyboard, not to continue grinding for marginal numbers. That’s this blog post.

10. Appendix: full benchmark table

See docs/benchmark.csv for the raw 65-row table. The columns:

name — program identifier (test file basename)
our_bytes — wc -c of wlp4{scan|parse|type|gen} | linkasm output
ref_bytes — wc -c of wlp4c output
delta_bytes = our_bytes − ref_bytes (negative is smaller)
delta_pct = 100 × delta_bytes / ref_bytes
our_pool — bytes in our literal pool (8 × count of .8byte lines)
ref_pool — left at 0 (we don’t have the reference’s intermediate asm)
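
As a sanity check on the column definitions, the derived fields can be recomputed from the two byte counts. This is a Python sketch (the benchmark itself is a shell script) and bench_row is a hypothetical helper.

```python
def bench_row(name, our_bytes, ref_bytes):
    """Recompute the derived CSV columns from the two measured sizes."""
    delta_bytes = our_bytes - ref_bytes          # negative is smaller
    delta_pct = 100 * delta_bytes / ref_bytes
    return {
        "name": name,
        "our_bytes": our_bytes,
        "ref_bytes": ref_bytes,
        "delta_bytes": delta_bytes,
        "delta_pct": round(delta_pct, 2),
    }


print(bench_row("arith_chain_8", 124, 1472)["delta_pct"])   # -91.58
print(bench_row("hand_written_total", 7212, 36668)["delta_pct"])  # -80.33
```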

Top 5 wins (more negative is better):

name	our_bytes	ref_bytes	delta_pct
arith_chain_8	124	1,472	−91.58%
arith_chain_7	124	1,436	−91.36%
arith_chain_4	124	1,328	−90.66%
arith_chain_3	124	1,292	−90.40%
wain_ptr	140	1,240	−88.71%

Bottom 5 (smallest wins, where overhead matters most):

name	our_bytes	ref_bytes	delta_pct
alloc_loop	724	1,968	−63.21%
alloc_basic	548	1,572	−65.14%
nested_call	468	1,592	−70.60%
recursive	404	1,584	−74.49%
eight_args	396	1,640	−75.85%

The pattern is clean: arithmetic-heavy programs benefit most (constant folding + tiny pools), heap and many-arg programs benefit least (linker-pulled alloc.com, mandatory parameter spilling for 8+ args). Every test in between sits in a tight band around −80%.

Reproducing

git clone https://github.com/faketut/C-class-compiler.git
cd C-class-compiler
./build-toolchain.sh
bash scripts/run-tests.sh        # 25/25 should pass
bash tools/bench.sh > docs/benchmark.csv
# On macOS, install colima first: brew install colima && colima start --arch x86_64

The benchmark is deterministic (gen_random_wlp4.py is seeded). The CSV will reproduce byte-for-byte across runs.

Windows-only desktop app, macOS-friendly contributors

Jian Feng — Sun, 17 May 2026 13:00:00 GMT

GhostPilot is a Windows-only app. The stealth-overlay trick (click-through, capture-resistant) only works with SetWindowDisplayAffinity and friends. System-audio loopback uses WASAPI. The whole pitch is Windows-specific.

And yet I develop on macOS.

The repo runs end-to-end on my MacBook — minus the OS-specific overlay tricks — within seconds of git clone. The CI matrix is Windows-only. Nothing is faked. Nothing is mocked at the architecture level.

Here’s the discipline that makes that possible. It’s three rules.

The shape

flowchart TB
    subgraph Core[Cross-platform: 95% of the code]
        LLM[LLMEngine]
        RAG[RAGManager]
        ASR[ASR client]
        UI[Qt overlay logic]
        REC[Session recorder]
    end
    subgraph Platform[Platform-specific shims]
        Audio[audio_capture.py<br/>WASAPI / sounddevice]
        Win[windows_api.py<br/>overlay tricks]
        Path[Path helpers<br/>roaming vs Library vs .config]
    end
    Core --> Audio
    Core --> Win
    Core --> Path
    Audio -.via env.- Sounddev[sounddevice<br/>mac/Linux mic]
    Audio -.via env.- WASAPI[WASAPI loopback<br/>Windows system audio]

Cross-platform code doesn’t know which platform it’s on. Platform-specific code is small, named, and isolated.

Rule 1: one capability, one backend abstraction

The audio module is the canonical example:

# src/audio_capture.py
BACKEND = os.environ.get("AUDIO_BACKEND", "auto")

def make_capturer() -> AudioCapturer:
    backend = BACKEND
    if backend == "auto":
        backend = "wasapi" if sys.platform == "win32" else "sounddevice"
    if backend == "wasapi":
        from src._audio_wasapi import WasapiLoopbackCapturer
        return WasapiLoopbackCapturer()
    if backend == "sounddevice":
        from src._audio_sounddevice import SoundDeviceCapturer
        return SoundDeviceCapturer()
    raise ValueError(f"unknown AUDIO_BACKEND: {backend}")

Both WasapiLoopbackCapturer and SoundDeviceCapturer implement the same async interface:

class AudioCapturer(Protocol):
    async def start(self) -> AsyncIterator[bytes]: ...
    async def stop(self) -> None: ...

Calling code never branches on platform. It calls make_capturer() and iterates. The factory is the only place that knows.

flowchart LR
    Caller[ASR client] --> F[make_capturer]
    F -->|win32| W[WasapiLoopbackCapturer]
    F -->|darwin/linux| S[SoundDeviceCapturer]
    W --> I[bytes async iterator]
    S --> I
    I --> Caller

The win: when a macOS dev tests “does the ASR pipeline work end-to-end?”, they export AUDIO_BACKEND=sounddevice, speak into the mic, and the entire app runs. They lose system-audio loopback (a Windows-specific feature requiring BlackHole on Mac) but they can test 100% of the application logic. Iterations don’t require a Windows VM.

Rule 2: every “where do I store this?” question goes through one function

OS file conventions are different and silent. Get it wrong once, your app writes to ~/Documents/GhostPilot/ on Windows and $APPDATA/GhostPilot/ on macOS, and now you have a phantom data directory you’ll find six months later wondering “what is this.”

Centralize:

def _user_prompt_dir() -> Path:
    if sys.platform == "win32":
        base = os.environ.get("APPDATA") or str(Path.home() / "AppData" / "Roaming")
        return Path(base) / "GhostPilot" / "prompts"
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Application Support" / "GhostPilot" / "prompts"
    return Path.home() / ".config" / "GhostPilot" / "prompts"

Every “writable user data” lookup goes through a function like this. Three branches, one place. If I get the macOS convention wrong, I fix it in one file.

The recordings directory does the same. The keyring lookup does the same. The cache directory does the same.

Rule 3: paths inside files are always POSIX

This is the rule I learned the hard way, three Windows CI failures in a row:

# Wrong — works on Mac, breaks on Windows when the file is consumed cross-OS
{"path": str(out.relative_to(self._dir))}
# → "screenshots/0001.jpg" on Mac
# → "screenshots\\0001.jpg" on Windows  ← breaks readers

# Right — same string on every platform
{"path": out.relative_to(self._dir).as_posix()}

Path is platform-aware for interacting with the OS. For serializing (JSONL, config files, anything that might be read on a different machine), normalize to forward slashes. .as_posix() is the magic. Always use it before writing a path string to disk.
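
The difference is easy to reproduce on any OS with pathlib's pure path classes, which apply one platform's rules without touching the filesystem. A small sketch (the file name is just the example from above):

```python
from pathlib import PurePosixPath, PureWindowsPath

win_path = PureWindowsPath("screenshots") / "0001.jpg"

native = str(win_path)           # what str() would write on Windows
portable = win_path.as_posix()   # what .as_posix() writes everywhere

# The portable form splits correctly under both sets of rules...
read_on_windows = PureWindowsPath(portable).parts
read_on_posix = PurePosixPath(portable).parts
# ...while the backslash form is a single odd file name on macOS/Linux.
misread_on_posix = PurePosixPath(native).parts

print(native, portable)
print(read_on_windows, read_on_posix, misread_on_posix)
```

Windows accepts forward slashes as separators, which is why the POSIX form is the safe common denominator.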

graph LR
    A[Path object] -->|interact with OS<br/>open, exists, stat| B[Native separator]
    A -->|serialize<br/>JSON, YAML, db column| C[.as_posix → forward slash]
    C -->|read back on any OS| D[Path parses correctly]

CI matrix: test where you ship, dev where you’re fast

# .github/workflows/ci.yml
strategy:
  matrix:
    os: [windows-latest]
    python-version: ["3.9", "3.12"]

Note: Windows only. I don’t run CI on macOS even though I develop there. Here’s why:

The macOS dev experience is “import + run pytest + ruff.” That’s already verified by my local pre-commit muscle memory.
The thing that breaks on Windows is the platform-specific 5%. Adding macOS to CI would catch nothing the Windows runner doesn’t, and would double my CI minutes.
Python 3.9 + 3.12 matrix matters more than OS matrix. Half the bugs are newer syntax that 3.12 accepts and 3.9 rejects (PEP 604 unions, match statements, etc.).

The local guard:

# At the top of every module using new-style hints
from __future__ import annotations

This single line makes list[str] | None parse on 3.9 (as a string, deferred). Without it, the import crashes on the Windows runner with a TypeError that doesn’t fire on my Mac because I’m on 3.13.
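
The deferral is observable: with the future import in effect, annotations are stored as plain strings and never evaluated at definition time. The sketch below compiles a tiny module from a string so the future import sits at the top of its own source; the function name load is made up.

```python
SOURCE = '''
from __future__ import annotations

def load(names: list[str] | None = None) -> list[str]:
    return names or []
'''

namespace = {}
exec(compile(SOURCE, "<demo>", "exec"), namespace)
load = namespace["load"]

# The annotations were never evaluated, so `|` on types was never executed.
print(load.__annotations__)
print(load())
```

Note that this only covers annotations. An expression such as a type alias assigned at module level is still evaluated eagerly and would still fail on 3.9.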

What the macOS dev gets — and doesn’t

What works on macOS clone-to-running:

All tests (102/102 pass)
Lint, type check
ASR pipeline with AUDIO_BACKEND=sounddevice
LLM streaming, RAG retrieval, session record/replay
Settings UI, prompts editor, sessions viewer

What doesn’t work — and is fine:

The stealth overlay (Windows API)
System-audio loopback (needs BlackHole bridge)
Hotkey listener (Windows-specific implementation)
The build target (PyInstaller spec is Windows-only)

95% of the bugs live in the 95% of the code that’s cross-platform. That’s the part the macOS dev can iterate on. The 5% that’s Windows-only gets manual verification on a Windows VM at release time, not every commit.

The summary

Discipline	What it buys
Backend abstraction with env-var override	macOS dev can run the full app, not a stub
One function per “where does data live?”	No phantom directories, easy to fix when wrong
`.as_posix()` for serialized paths	Recordings replay cross-OS
Windows-only CI, multi-Python matrix	Catch the real bugs (3.9 vs 3.12), skip the fake ones
`from __future__ import annotations` at the top of every module	3.9 keeps parsing

You can build a Windows-only app from a Mac. You just have to draw the line cleanly between what your app does and how it does it on this specific OS. Cross-platform code is the asset. Platform-specific shims are the cost. Keep the ratio honest and the dev loop stays tight.

I made ruff CI-blocking. The whole repo changed 5 lines.

Jian Feng — Sun, 17 May 2026 07:00:00 GMT

Most lint cleanups happen the wrong way:

Run the linter on a mature repo.
See 400 errors.
Disable half the rules.
Lower the severity to “warning.”
Ship continue-on-error: true in CI.
Never look at the warnings again.

Six months later the lint config is a graveyard of disabled rules, the CI step is a placebo, and adding a new rule is a project. The repo never gets cleaner.

Here’s a better path: gradual strictness. Three stages, each cheap, each independently shippable.

The three stages

flowchart LR
    S0[No linter] --> S1[Stage 1<br/>continue-on-error<br/>see the truth]
    S1 --> S2[Stage 2<br/>fix in batches<br/>still non-blocking]
    S2 --> S3[Stage 3<br/>make blocking<br/>tree is clean]
    S3 --> S4[New rule?<br/>back to stage 1]

The key insight: the only stage where you tolerate noise is stage 1, and only briefly. The point of stage 1 is to see the real shape of the problem, not to live there.

Stage 1: install, run, collect — 5 minutes

# .github/workflows/ci.yml
- name: Lint
  run: ruff check .
  continue-on-error: true   # explicitly temporary

continue-on-error is doing one job: surfacing the violations in the CI log so you can see them. Not blocking the build. Not silencing them.

# pyproject.toml — start permissive
[tool.ruff]
line-length = 100
target-version = "py39"

[tool.ruff.lint]
select = ["E", "F", "W", "I"]  # errors, pyflakes, warnings, import sorting

That’s the minimum useful ruleset. Add more later.

Stage 2: batch-fix, don’t piecemeal-fix

ruff check . --output-format=concise
# Found 47 errors.

Resist the temptation to fix them one at a time across 47 PRs. Two passes:

ruff check . --fix   # safe autofixes
ruff check .         # what's left needs eyes

The remaining errors fall into three groups:

Group	Strategy
Real bugs (F841 unused var, F401 unused import)	Fix immediately, often with `--fix`.
Style (E501 line too long, I001 import order)	`--fix` handles 95%.
Disagreements (E741 ambiguous name `l`)	Either rename or add an inline `# noqa: E741` with a reason.

In this repo, stage 2 was one commit:

chore(d): remove dead self.text_client, fix lint, make ruff blocking
- llm_engine: remove dead self.text_client legacy raw-SDK assignment
- main.py: drop unused kb_watch_task local (F841)
- settings_ui.py: hoist pathlib.Path import to module top instead of __import__ trick
- tests: drop unused imports (AsyncMock, Path, pytest)
- .github/workflows/ci.yml: remove continue-on-error from ruff step

7 files, 8 insertions, 16 deletions. Now the lint passes.

Stage 3: flip the switch

# .github/workflows/ci.yml
- name: Lint
  run: ruff check .
  # continue-on-error removed — must pass

This is the entire stage 3 change. One line, deleted.

Now any future violation is a CI failure, surfaced on the PR, before review. The cost of fixing it is 10 seconds (ruff check . --fix). The cost of not enforcing it is unbounded.

Why this works where “enable everything at once” fails

graph TB
    A[Enable strict linter on mature repo] --> B[400 errors]
    B --> C{Choice}
    C -->|Fix all| D[Massive PR<br/>impossible to review]
    C -->|Disable rules| E[Lint config rot]
    C -->|Ignore| F[Useless CI step]
    A2[Gradual strictness] --> B2[Stage 1: see 47 errors]
    B2 --> C2[Stage 2: fix in 1 commit]
    C2 --> D2[Stage 3: enforce going forward]
    D2 --> E2[Adding a new rule? Repeat]

The mature-repo failure mode happens because the team conflates “the code needs to comply with this rule” with “the rule needs to be on right now.” Decoupling them buys you the breathing room to actually fix things.

What to do when adding a new rule

The exact same loop:

Add the rule with continue-on-error: true (or use # noqa on existing violations).
Run the linter, count the new violations.
Fix them in one batch — typically one commit per category.
Remove continue-on-error.

If the count of violations is too high to fix in one batch, the rule is too aggressive for this codebase. Either narrow its scope (per-file-ignores) or don’t add it.

ruff-specific tricks worth knowing

[tool.ruff.lint.per-file-ignores]
# Test files may use short, ambiguous names (E741)
"tests/*" = ["E741"]
# Generated migrations
"migrations/*" = ["E501", "I001"]

[tool.ruff.lint.isort]
known-first-party = ["src"]

# Preview what stricter rules would flag: everything except annotation (ANN) and docstring (D) rules
ruff check --select=ALL --ignore=ANN --ignore=D .

The lesson

A linter that doesn’t fail the build is a comment in a config file. It signals intent without enforcing it, and intent decays. The gradual path lets you go from zero to enforced in three small steps, each one shippable on its own day, without ever blocking the team on a single mega-cleanup.

In this repo: from “no linter” to “ruff CI-blocking on Windows × Py 3.9 + 3.12” was three commits over two days. Zero painful PRs. The repo has stayed clean since.

Hybrid RAG when your corpus has 50 chunks, not 5 million

Jian Feng — Sun, 17 May 2026 01:00:00 GMT

Most RAG content assumes you have a vector DB, an embeddings budget, and tens of thousands of documents. The reality for personal tools, internal wikis, and onboarding bots is different: you have a handful of markdown files and a handful of seconds to retrieve from them.

GhostPilot’s knowledge base is one file: knowledge/resume.md, ~50 chunks after splitting. Retrieval feeds question-answering during interviews. Latency budget: under 200ms. Embedding cost budget: ideally zero per query.

A dense-only retriever would be the wrong tool here. Here’s what actually works.

The retrieval pipeline

flowchart LR
    Q[Question] --> S1[BM25<br/>top-k1]
    Q --> S2[Dense<br/>top-k2]
    S1 --> M[RRF merge]
    S2 --> M
    M --> R[Re-rank by score]
    R --> T[Top-N chunks]
    T --> P[Inject into prompt]

Three observations drove this:

BM25 alone catches all the high-recall named-entity queries. “Did you work at Stripe?” — Stripe appears in exactly one chunk. Dense retrieval will return the right chunk plus three near-misses about other fintech experience. BM25 returns one chunk with a huge score gap.
Dense alone catches all the paraphrased queries. “Tell me about a time you mentored someone” — the word mentored may not appear in the resume, but helped junior engineers and led the intern program will embed close.
Either alone gets ~70% of queries right. Together they get ~95%.

Reciprocal Rank Fusion in 5 lines

Forget weighted score combinations: the scores are on different scales and weights are a nightmare to tune. RRF has a single constant that rarely needs tuning, and it works:

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based, as in the original RRF formulation.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

k=60 is the canonical choice from the original RRF paper. Larger k flattens the rank decay; smaller k makes top-1 matter more. 60 is fine. Stop tuning it.
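
A worked example helps show why agreement between retrievers beats a single first place. The chunk ids below are invented, and the function is restated with 1-based ranks so the block runs on its own.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over ranked lists of ids (1-based ranks)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])


bm25_ranking = ["c7", "c2", "c9"]    # hypothetical BM25 order
dense_ranking = ["c2", "c5", "c7"]   # hypothetical dense order

fused = rrf([bm25_ranking, dense_ranking])
order = [doc_id for doc_id, _ in fused]
print(order)  # ['c2', 'c7', 'c5', 'c9']
```

c2 wins with 1/62 + 1/61, edging out c7 with 1/61 + 1/63: being near the top of both lists counts for more than being first in one and third in the other.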

Why BM25 still wins on small corpora

graph TB
    subgraph Small[Corpus: ~50 chunks]
        S1[BM25: precise on named entities]
        S2[Dense: catches paraphrasing]
        S1 -.equal value.- S2
    end
    subgraph Large[Corpus: ~5M chunks]
        L1[BM25: query needs exact terms]
        L2[Dense: handles vocabulary mismatch at scale]
        L2 -.dominates.- L1
    end

The dense-retrieval orthodoxy assumes the corpus is large enough that vocabulary mismatch is the main failure mode. On a 50-chunk corpus:

Every named entity appears in 1-3 chunks. BM25’s IDF term gives them huge weight. Dense embeddings smear this signal across “semantically similar” chunks.
The query language is close to the document language (you’re paraphrasing your own resume). Dense retrieval’s vocabulary-bridging superpower is underused.
Most importantly: the failure modes are different. Dense fails by returning plausible-but-wrong neighbors. BM25 fails by returning nothing for paraphrased queries. Hybrid covers both.

Cost: ~zero per query

class RAGManager:
    def __init__(self, chunks):
        self.chunks = chunks  # kept so retrieve() can map ranked indices back to text
        self.bm25 = BM25Okapi([tokenize(c) for c in chunks])
        # Embeddings computed once at startup, cached on disk.
        self.embeddings = self._load_or_compute(chunks)

    def retrieve(self, query: str, top_n: int = 4):
        # Both rankings are lists of chunk indices, best first.
        bm25_ranking = self._bm25_topk(query, k=10)
        dense_ranking = self._dense_topk(query, k=10)
        merged = rrf([bm25_ranking, dense_ranking])
        return [self.chunks[i] for i, _ in merged[:top_n]]

Per-query cost:

Step	Cost
BM25	`O(
Dense	one embedding API call OR one local model forward (sentence-transformers all-MiniLM ~30ms on CPU)
RRF merge	`O(k)`, microseconds
Total	30-150ms, $0 if local embeddings

GhostPilot uses sentence-transformers/all-MiniLM-L6-v2 locally. 80MB download, no API keys, runs on CPU. For 50 chunks, the embedding computation at startup is the dominant cost (~3 seconds, once). Per-query embedding of the user’s question is ~30ms.

Hot reload because the corpus is small

A 50-chunk corpus rebuilds in <1 second. So instead of a separate ingestion pipeline:

watcher = QFileSystemWatcher([str(knowledge_dir)])
watcher.directoryChanged.connect(
    lambda: loop.create_task(rag_manager.rebuild_async())
)

Edit a markdown file, save, the index updates before you’ve Alt-Tab’d back to the overlay. This is impossible at 5M chunks; it’s trivial at 50.

When to switch to dense-only or a vector DB

flowchart TD
    A[Corpus size?] -->|< 1k chunks| B[Hybrid BM25 + dense<br/>in-memory, hot-reload]
    A -->|1k - 100k| C[Hybrid + on-disk dense index<br/>FAISS, sqlite-vss]
    A -->|> 100k| D[Dedicated vector DB<br/>Qdrant, Weaviate]
    A -->|> 10M| E[Sharded ANN + filtering pipeline]

The threshold for needing a vector DB is much higher than the marketing implies. Below ~10k chunks, hybrid retrieval in process beats anything else on latency, cost, and operational complexity combined.

Things that didn’t make the cut

Cross-encoder re-ranker. Tested, added ~150ms and ~3% recall. Not worth it at this scale.
Query expansion. The LLM does this implicitly — adding it in retrieval was redundant noise.
Chunk overlap. Tried 20% overlap, made BM25 fire twice on the same content, hurt RRF. Pure non-overlapping chunks at sentence boundaries won.

TL;DR

For corpora under ~10k chunks:

Use BM25 and dense embeddings. Each catches what the other misses.
Merge with RRF, not weighted scores. Stop tuning weights.
Keep everything in memory. Hot-reload on file change.
Run embeddings locally. The 80MB model is enough.
Resist the urge to add a vector DB. You don’t need it.

The whole rag_manager.py is ~200 lines including the watcher. Total p95 retrieval latency on my machine: 47ms.

One event loop to rule them all: PyQt6 + asyncio in production

Jian Feng — Sat, 16 May 2026 19:00:00 GMT

The Python desktop-app stack has a coordination problem.

Qt wants to own the main thread.
asyncio wants to own the event loop.
Your audio capture wants a background thread.
Your LLM stream wants to be an async generator.
Your hotkey listener wants OS-level callbacks.

GhostPilot runs all five of these in one process, on Windows, without a single deadlock or threading bug. The glue is qasync, plus a small set of discipline rules. Here’s what survived contact with production.

The architecture

flowchart TB
    subgraph Main[Qt main thread<br/>= asyncio event loop via qasync]
        UI[OverlayUI widgets]
        LLM[LLMEngine.generate_*_stream<br/>async generators]
        Replay[session_replay tasks]
        Watcher[QFileSystemWatcher<br/>RAG hot-reload]
    end
    subgraph Workers[Background threads]
        Audio[sounddevice / WASAPI capture]
        ASR[Azure Speech SDK callback]
        Hotkey[keyboard hook]
    end
    Audio -->|asyncio.run_coroutine_threadsafe| Q1
    ASR -->|asyncio.run_coroutine_threadsafe| Q1
    Hotkey -->|QMetaObject.invokeMethod| UI
    Q1[(asyncio.Queue)] --> LLM
    LLM --> UI
    Replay --> UI
    Watcher --> RAG
    UI -.uses.-> LLM

Rule 1: everything async runs on the Qt thread. Workers cross the boundary in exactly one of two ways.

qasync in 10 lines

import asyncio
import sys
from PyQt6.QtWidgets import QApplication
import qasync

app = QApplication(sys.argv)
loop = qasync.QEventLoop(app)
asyncio.set_event_loop(loop)

with loop:
    loop.run_until_complete(main())  # main() is an async coroutine

That’s the entire integration. qasync is a Qt event loop that also runs asyncio callbacks. await asyncio.sleep(0.1) works, loop.create_task(...) works, and Qt slots fire on the same thread.

The trap: it is one loop on one thread. Anything that tries to spawn a second loop (a library that calls asyncio.run(...) internally, a thread that calls asyncio.get_event_loop()) will explode.

Crossing the boundary, two patterns

Worker thread → asyncio task

def _on_audio_chunk(chunk: bytes) -> None:
    """Called from sounddevice's audio thread."""
    asyncio.run_coroutine_threadsafe(audio_queue.put(chunk), loop)

run_coroutine_threadsafe schedules a coroutine on the loop and returns a concurrent.futures.Future. Critically: it does not block the calling thread. The audio callback returns in microseconds.
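
The same hand-off can be tried without Qt or audio hardware. The sketch below uses a plain asyncio loop and a plain thread as stand-ins for the qasync loop and the sounddevice callback thread; the chunk contents are invented.

```python
import asyncio
import threading


async def main():
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def worker():
        # Runs on a foreign thread: never touch the queue directly from here.
        for i in range(3):
            chunk = f"chunk-{i}".encode()
            asyncio.run_coroutine_threadsafe(queue.put(chunk), loop)

    thread = threading.Thread(target=worker)
    thread.start()
    # The loop thread consumes as usual; scheduling order is preserved.
    received = [await queue.get() for _ in range(3)]
    thread.join()
    return received


received = asyncio.run(main())
print(received)
```

The worker only schedules work on the loop; the queue itself is touched exclusively from the loop thread, which is what keeps asyncio.Queue safe to use here.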

Worker thread → Qt slot

def _on_hotkey():
    """Called from the keyboard hook thread."""
    QMetaObject.invokeMethod(ui_asr, "toggle_visibility",
                             Qt.ConnectionType.QueuedConnection)

QueuedConnection posts the call to the Qt event loop. The target slot runs on the Qt thread on the next iteration.

Never touch a QWidget from a non-Qt thread. Even setting widget.setText(...) from a worker thread will crash, eventually, in a way that doesn’t reproduce.

Async generators are the right shape for LLM streams

class LLMEngine:
    async def generate_answer_stream(self, question, ui_queue, *, q_type=None):
        async for delta in self.text_provider.chat_stream(messages, model=self.model):
            await ui_queue.put({"type": "token", "text": delta.text})
        await ui_queue.put({"type": "usage", ...})

The UI consumer is dead simple:

async def ui_updater():
    while True:
        msg = await ui_queue.get()
        if msg["type"] == "token":
            overlay.append_token(msg["text"])
        elif msg["type"] == "usage":
            overlay.flush_footer(msg)

This shape gives you, for free:

Backpressure (a bounded queue fills if the UI is slow, and the producer waits on put).
Cancellation (cancel the producer task, the queue drains, the consumer keeps going).
Recording (a second consumer tees tokens to disk — that’s how the session recorder works).
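
A runnable sketch of the producer/consumer shape, using a fake provider stream in place of a real LLM call. `fake_chat_stream`, the `DONE` sentinel, and `maxsize=2` are assumptions made for the example; the real engine may signal end-of-stream differently.

```python
import asyncio
from typing import AsyncIterator

DONE = object()  # sentinel marking end of stream (hypothetical convention)


async def fake_chat_stream(text: str) -> AsyncIterator[str]:
    """Stand-in for a provider stream: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(0)  # pretend to wait on the network
        yield word


async def produce(ui_queue: asyncio.Queue, text: str) -> None:
    async for token in fake_chat_stream(text):
        # With maxsize set, put() suspends here while the queue is full.
        await ui_queue.put({"type": "token", "text": token})
    await ui_queue.put(DONE)


async def consume(ui_queue: asyncio.Queue) -> list[str]:
    shown = []
    while (msg := await ui_queue.get()) is not DONE:
        shown.append(msg["text"])
    return shown


async def demo() -> list[str]:
    ui_queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded => backpressure
    producer = asyncio.create_task(produce(ui_queue, "one loop one thread"))
    shown = await consume(ui_queue)
    await producer
    return shown


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that an unbounded `asyncio.Queue()` never makes `put()` wait, so the backpressure property depends on choosing a `maxsize`.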

Cancellation across the boundary

Long-running replays need a Cancel button. The pattern:

# UI thread (Sessions tab)
self._cancel_event = asyncio.Event()  # safe because qasync = same thread

def _on_cancel_clicked(self):
    self._cancel_event.set()

# Replay task
async def _run():
    for i, turn in enumerate(turns, 1):
        if cancel_event.is_set():
            break
        await session_replay.replay_turn(engine, turn)

Crucial detail: the in-flight turn runs to completion. Cancelling it mid-stream would leave a half-finished assistant row in the recording. The cancel is checked between turns, never within them.

This is a general principle: cancellation points are negotiated, not imposed. A “cancel everything right now” is almost always wrong; the producer needs to land on a clean state.
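
The between-turns check can be exercised in isolation. In this sketch `replay_turn` is a stand-in that logs a start and an end marker per turn, and the sleep durations are arbitrary; only the placement of the `is_set()` check mirrors the pattern above.

```python
import asyncio


async def replay_turn(turn: str, log: list[str]) -> None:
    """Stand-in for session_replay.replay_turn: two steps per turn."""
    log.append(f"{turn}:start")
    await asyncio.sleep(0.01)
    log.append(f"{turn}:end")


async def replay_all(turns: list[str], cancel_event: asyncio.Event,
                     log: list[str]) -> None:
    for turn in turns:
        if cancel_event.is_set():  # only checked at a turn boundary
            break
        await replay_turn(turn, log)


async def demo() -> list[str]:
    log: list[str] = []
    cancel_event = asyncio.Event()
    task = asyncio.create_task(replay_all(["t1", "t2", "t3"], cancel_event, log))
    await asyncio.sleep(0.005)  # the cancel arrives while t1 is in flight
    cancel_event.set()
    await task
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The first turn finishes even though the cancel landed in the middle of it, and no later turn starts. Calling `task.cancel()` instead would interrupt `replay_turn` at its `await`, which is exactly the half-finished state the pattern avoids.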

QFileSystemWatcher + asyncio: an unexpected pairing

The RAG knowledge base hot-reloads when a markdown file under knowledge/ changes. The watcher’s directoryChanged signal fires on the Qt thread, so the naive approach of rebuilding the BM25 index inside the slot (~500ms) would freeze the UI.

watcher.directoryChanged.connect(
    lambda path: loop.create_task(rebuild_kb_async())
)

loop.create_task(...) is the bridge. The Qt slot returns immediately; the rebuild runs as an asyncio task that yields back to the loop between chunks. UI stays responsive. No thread, no lock, no manual QThread boilerplate.

This is the unsung superpower of qasync: the Qt thread is the asyncio thread, so you never need a thread for “non-blocking but long-running” work. Make it an async task and it just yields.
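
What "it just yields" means in practice: the task has to hand control back between slices of work. The sketch below uses a toy word count in place of the BM25 rebuild and a `heartbeat` task in place of UI events; both are assumptions for illustration. Each chunk is still synchronous, so this only helps when individual chunks are short.

```python
import asyncio


async def rebuild_index(docs: list[str], chunk_size: int = 2) -> dict[str, int]:
    """Toy stand-in for an index rebuild: counts words, yielding between chunks."""
    counts: dict[str, int] = {}
    for start in range(0, len(docs), chunk_size):
        for doc in docs[start:start + chunk_size]:
            for word in doc.split():
                counts[word] = counts.get(word, 0) + 1
        await asyncio.sleep(0)  # hand control back to the loop between chunks
    return counts


async def heartbeat(ticks: list[int], stop: asyncio.Event) -> None:
    """Stands in for UI work that must keep running during the rebuild."""
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0)


async def demo() -> tuple[dict[str, int], int]:
    ticks: list[int] = []
    stop = asyncio.Event()
    ui = asyncio.create_task(heartbeat(ticks, stop))
    counts = await asyncio.create_task(
        rebuild_index(["a b", "b c", "c d", "d a"])
    )
    stop.set()
    await ui
    return counts, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

If a single chunk is itself slow, the loop still stalls for that chunk; smaller chunks or an executor are the usual remedies.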

What I will not do again

graph LR
    A[Bug] --> B[Thought:<br/>spawn a QThread]
    B --> C[Now you have<br/>two threads]
    C --> D[Race conditions]
    D --> A

I started this project with a QThread per worker. Every interesting bug was a race condition between Qt’s event loop and the worker thread. The rewrite to “one qasync loop, workers are thin shims that marshal back via run_coroutine_threadsafe” deleted the entire category.

If you reach for QThread, stop and ask: can this work be an async task instead? Almost always, yes.

Recap

Need	Tool
Streaming LLM output to UI	async generator → `asyncio.Queue` → consumer task
OS callback (audio, hotkey) → async logic	`asyncio.run_coroutine_threadsafe`
OS callback → Qt widget	`QMetaObject.invokeMethod(..., QueuedConnection)`
File system watcher → expensive work	qasync slot → `loop.create_task`
Long task cancellation	`asyncio.Event` checked at clean boundaries
Anything else that wants a thread	First try: make it an async task

One loop. One thread. One mental model. The whole app fits in your head.

Your LLM retry loop is probably wrong

Jian Feng — Sat, 16 May 2026 13:00:00 GMT

A bad retry loop is a strict upgrade over no retry loop, until it isn’t.

Here’s the failure mode I shipped, then fixed:

A user rotated their DeepSeek API key. They forgot to update .env. Every LLM call hit 401 Unauthorized. The retry loop dutifully tried each call three times, then failed over to Gemini, which also got 401 three times because the user had pasted the DeepSeek key into the Gemini field. Total: six network round-trips, six log lines, ~8 seconds of latency, identical result to giving up immediately.

The fix is to classify the error before deciding whether to retry, and before deciding whether to fail over.

Three buckets, not one

flowchart TD
    Err[Provider error] --> Q{Classify}
    Q -->|fatal<br/>400/401/403/404| F[Raise immediately<br/>no retry, no failover]
    Q -->|retryable<br/>408/425/429/5xx, network, timeout| R[Retry in place<br/>with backoff]
    R -->|retries exhausted| N[Move to next provider]
    Q -->|unknown| N

Three rules carry the whole design:

Bucket	Why this action
fatal — `400/401/403/404`, “Invalid API key”, “Bad request”	Switching providers can’t fix a missing key. Retrying can’t fix a malformed request. Both add latency and burn quota.
retryable — `408/425/429/5xx`, `TimeoutError`, `ConnectionError`	The same provider will probably succeed on retry. Switching providers loses session context (e.g. usage caches). Backoff first, then fail over if the provider is genuinely down.
unknown	Conservative: don’t retry in place (could be fatal), but do try the fallback (the fallback might work).

The classifier is the entire trick

import re

_FATAL_STATUS = {400, 401, 403, 404}
_RETRYABLE_STATUS = {408, 409, 425, 429, 500, 502, 503, 504}

def classify_error(err) -> str:
    # 1) Structured status code from SDK exception
    status = getattr(err, "status_code", None) or getattr(err, "status", None)
    if status is None:
        resp = getattr(err, "response", None)
        if resp is not None:
            status = getattr(resp, "status_code", None)
    if status in _FATAL_STATUS:     return "fatal"
    if status in _RETRYABLE_STATUS: return "retryable"

    # 2) Exception class (network layer)
    name = type(err).__name__.lower()
    if any(s in name for s in ("timeout", "connection", "network")):
        return "retryable"

    # 3) String regex on str(err) — last resort, but worth it
    msg = str(err).lower()
    if re.search(r"\b(invalid api key|unauthorized|bad request)\b", msg):
        return "fatal"

    return "unknown"

A few things to notice:

Status codes are checked first because they’re the most reliable signal.
Exception class is second because network errors don’t carry HTTP status codes.
Regex on the message is last and intentionally narrow. It catches the OpenAI-SDK case where a 401 was wrapped in a RuntimeError with the original message but no status.
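
To see each tier fire, the sketch below restates the classifier compactly (so the snippet runs on its own) and feeds it one exception per tier. `FakeHTTPError` is a hypothetical stand-in for an SDK exception that carries `status_code`; real SDKs name and shape their exceptions differently.

```python
import re

_FATAL_STATUS = {400, 401, 403, 404}
_RETRYABLE_STATUS = {408, 409, 425, 429, 500, 502, 503, 504}


def classify_error(err: BaseException) -> str:
    """Compact restatement of the classifier above."""
    status = getattr(err, "status_code", None) or getattr(err, "status", None)
    if status is None:
        status = getattr(getattr(err, "response", None), "status_code", None)
    if status in _FATAL_STATUS:
        return "fatal"
    if status in _RETRYABLE_STATUS:
        return "retryable"
    name = type(err).__name__.lower()
    if any(s in name for s in ("timeout", "connection", "network")):
        return "retryable"
    if re.search(r"\b(invalid api key|unauthorized|bad request)\b",
                 str(err).lower()):
        return "fatal"
    return "unknown"


class FakeHTTPError(Exception):
    """Hypothetical SDK-style exception carrying a status_code attribute."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


CASES = [
    (FakeHTTPError(401), "fatal"),                                # tier 1: status
    (FakeHTTPError(429), "retryable"),                            # tier 1: status
    (TimeoutError("read timed out"), "retryable"),                # tier 2: class name
    (RuntimeError("401 Unauthorized: invalid api key"), "fatal"), # tier 3: regex
    (ValueError("something odd"), "unknown"),                     # nothing matched
]

if __name__ == "__main__":
    for err, expected in CASES:
        print(f"{type(err).__name__:>14} -> {classify_error(err)} (expected {expected})")
```

The fourth case is the wrapped-401 situation: no status attribute survives, so only the message regex can rescue it.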

Retry with exponential backoff, bounded

async def chat_stream(self, ...):
    last_err = None
    for provider in [self.primary, *self.fallbacks]:
        for attempt in range(self.retries_per_provider + 1):
            try:
                async for delta in provider.chat_stream(...):
                    yield delta
                return
            except Exception as e:
                last_err = e
                kind = classify_error(e)
                if kind == "fatal":
                    raise
                if kind == "retryable" and attempt < self.retries_per_provider:
                    await asyncio.sleep(self.backoff_base * (2 ** attempt))
                    continue
                break  # move to next provider
    # All providers exhausted. A bare `raise` outside an except block fails
    # with "No active exception to re-raise", so re-raise the last error seen.
    raise last_err

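Defaults: retries_per_provider=1, backoff_base=0.5. So a single retryable error costs 0.5s of backoff before the one in-place retry; if that retry also fails, the loop moves on to the next provider. Because the attempt counter restarts for each provider, the worst case across two providers is 0.5 + 0.5 = 1.0s of backoff before raising, not counting the time spent in the requests themselves.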

One bounded retry per provider is almost always the right default. Two is paranoid, three is hostile.
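
The cost of each setting can be worked out directly from the loop's guard. This helper is a sketch, not part of the module: it replays the `attempt < retries_per_provider` condition and lists every sleep incurred when all attempts on all providers fail with a retryable error. Since the attempt counter restarts for each provider, the schedule repeats per provider.

```python
def backoff_schedule(providers: int, retries_per_provider: int = 1,
                     backoff_base: float = 0.5) -> list[float]:
    """Sleeps incurred when every attempt on every provider fails retryably."""
    sleeps = []
    for _ in range(providers):
        for attempt in range(retries_per_provider + 1):
            if attempt < retries_per_provider:  # same guard as the retry loop
                sleeps.append(backoff_base * (2 ** attempt))
    return sleeps


if __name__ == "__main__":
    for retries in (1, 2, 3):
        schedule = backoff_schedule(providers=2, retries_per_provider=retries)
        print(retries, schedule, sum(schedule))
```

With two providers, one retry each costs 1.0s of waiting in the worst case, two retries cost 3.0s, and three cost 7.0s, all before any request latency is added.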

Mid-stream errors are not retryable

sequenceDiagram
    participant Client
    participant Provider
    Client->>Provider: chat_stream(...)
    Provider-->>Client: token "Hello"
    Provider-->>Client: token " world"
    Provider--xClient: ConnectionError mid-stream
    Note over Client: Do NOT restart.<br/>The user already saw "Hello world".

If tokens have already been emitted to the UI, retrying the call would replay them — the user sees Hello worldHello world, my name is.... Worse, vision replays would re-bill the image. The right behavior is to let the error propagate to the UI as a stream interruption.

The implementation:

async def chat_stream(self, ...):
    emitted = False
    try:
        async for delta in self._with_failover(...):
            emitted = True
            yield delta
    except Exception:
        if emitted:
            raise  # do not restart; surface to UI
        # else: failover already tried in _with_failover
        raise
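
One way to make the rule enforceable is to track emission inside the failover loop itself, so a provider that has already yielded a token is never replaced. The sketch below is an assumption about how that could look, not the module's actual `_with_failover`; it omits retries and error classification, and the three fake providers exist only for the demonstration.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional

Stream = Callable[[], AsyncIterator[str]]


async def stream_with_failover(providers: List[Stream]) -> AsyncIterator[str]:
    last_err: Optional[Exception] = None
    for make_stream in providers:
        emitted = False
        try:
            async for token in make_stream():
                emitted = True
                yield token
            return
        except Exception as err:
            if emitted:
                raise  # tokens already shown: surface the interruption
            last_err = err  # nothing shown yet: safe to try the next provider
    raise last_err or RuntimeError("no providers configured")


async def fails_before_first_token() -> AsyncIterator[str]:
    raise ConnectionError("connect failed")
    yield ""  # unreachable; its presence makes this an async generator


async def fails_mid_stream() -> AsyncIterator[str]:
    yield "Hello"
    raise ConnectionError("dropped")


async def healthy() -> AsyncIterator[str]:
    yield "Hello"
    yield " world"


async def collect(providers: List[Stream]) -> List[str]:
    seen: List[str] = []
    try:
        async for token in stream_with_failover(providers):
            seen.append(token)
    except ConnectionError:
        seen.append("<interrupted>")
    return seen


if __name__ == "__main__":
    print(asyncio.run(collect([fails_before_first_token, healthy])))
    print(asyncio.run(collect([fails_mid_stream, healthy])))
```

The first run fails over silently because nothing had been shown; the second stops after "Hello" and reports the interruption instead of replaying the stream from the fallback.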

Observability is half the value

Every failover writes a single info event onto the UI queue:

{"type": "info", "kind": "vision", "provider": "gemini", "note": "failover from openai", "error": "openai: 503"}

The overlay footer shows it for one second. That’s enough for the user to know “DeepSeek is down, you’re on Gemini” without staring at logs.

Test matrix that earned its keep

graph LR
    T1[401 from primary] --> A1[no retry, no failover, raise]
    T2[429 from primary] --> A2[retry once in 0.5s, succeed]
    T3[503 from primary] --> A3[retry, still fail, try fallback, succeed]
    T4[ConnectionError mid-stream] --> A4[propagate, no restart]
    T5[Unknown exception] --> A5[fail over without retry]

Five tests. Five real production scenarios. Each one used to be a bug.

TL;DR

Classify errors into fatal | retryable | unknown before deciding.
Retry in place at most once with exponential backoff.
Fail over only after retries are exhausted or the error is unknown.
Never retry once tokens have been emitted to the user.
Surface every switch on the UI for one second.

The whole module is ~250 lines. It used to be ~80 and behaved much worse.
