<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>FengHub</title>
    <link>https://faketut.github.io/</link>
    <language>en</language>
    <copyright>All rights reserved 2026, Jian Feng</copyright>
    <lastBuildDate>Thu, 18 Jun 2026 04:39:53 GMT</lastBuildDate>
    <generator>Hexo</generator>
    <atom:link href="https://faketut.github.io/rss2.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>The Deployment: Splunk + Qwen on Alibaba Cloud in Three Commands</title>
      <link>https://faketut.github.io/2026/06/19/anchor-06-deploy-alibaba-cloud/</link>
      <description>
        <![CDATA[<p>The first five posts were about how Anchor <em>works</em>. This one is about
how to put it on a server that someone other than you can ta]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/anchor/">anchor</category>
      <category domain="https://faketut.github.io/tags/splunk/">splunk</category>
      <category domain="https://faketut.github.io/tags/sre/">sre</category>
      <category domain="https://faketut.github.io/tags/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/alibaba-cloud/">alibaba-cloud</category>
      <pubDate>Fri, 19 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The first five posts were about how Anchor <em>works</em>. This one is about how to put it on a server that someone other than you can talk to.</p><p>The hackathon target is <strong>Alibaba Cloud</strong>: ECS for compute, OSS for durable memory backups, DashScope for the LLM calls Anchor already makes. The walkthrough lives in <a href="https://github.com/faketut/Anchor/blob/main/deploy/alibaba-cloud.md"><code>deploy/alibaba-cloud.md</code></a>; this post explains <em>why</em> each piece is shaped the way it is.</p><h2 id="The-three-command-path"><a href="#The-three-command-path" class="headerlink" title="The three-command path"></a>The three-command path</h2><p>Once the console-only prerequisites are done (more on those below), the entire ECS install is:</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">ssh root@&lt;ecs-public-ip&gt;</span><br><span class="line">curl -fsSL https://raw.githubusercontent.com/faketut/Anchor/main/deploy/setup_ecs.sh | bash</span><br><span class="line">nano /opt/anchor/.env       <span class="comment"># fill in SPLUNK_PASSWORD, QWEN_API_KEY, OSS_* creds</span></span><br><span class="line">bash /opt/anchor/deploy/verify_setup.sh</span><br></pre></td></tr></table></figure><p>That’s the entire happy path. <code>setup_ecs.sh</code> is idempotent — safe to re-run after editing <code>.env</code>. 
<code>verify_setup.sh</code> is a pre-flight checker that exits non-zero on any failure, so it’s usable as a healthcheck.</p><h2 id="What-setup-ecs-sh-actually-does"><a href="#What-setup-ecs-sh-actually-does" class="headerlink" title="What setup_ecs.sh actually does"></a>What <code>setup_ecs.sh</code> actually does</h2><p>The script (<a href="https://github.com/faketut/Anchor/blob/main/deploy/setup_ecs.sh"><code>deploy/setup_ecs.sh</code></a>) consolidates seven steps:</p><ol><li><strong>OS sanity check</strong> — bail if not root, bail if not Ubuntu.</li><li><strong><code>apt-get install</code></strong> — Docker, Compose v2, git, Python venv.</li><li><strong><code>git clone || git pull</code></strong> — checks out Anchor to <code>/opt/anchor</code>.</li><li><strong><code>docker compose up -d</code></strong> with the Alibaba overlay (<a href="https://github.com/faketut/Anchor/blob/main/deploy/docker-compose.alibaba.yml"><code>deploy/docker-compose.alibaba.yml</code></a>) that:<ul><li>sets <code>restart: unless-stopped</code> so Splunk survives ECS reboots</li><li>binds Splunk Web (8000) to localhost only — accessed via SSH tunnel, never the public internet</li><li>leaves the mgmt API (8089) exposed for the Anchor CLI but restricted at the security-group layer to your laptop’s IP</li><li>removes HEC (8088) entirely (not needed for the demo flow)</li><li>caps Splunk at 2 vCPU &#x2F; 4 GB so a runaway query doesn’t OOM the box</li></ul></li><li><strong>Wait-for-Splunk loop</strong> — polls <code>https://localhost:8089/services/server/info</code> for up to 120 seconds before continuing. First-boot init is the slowest step; without this wait, the next command fails on a fresh install.</li><li><strong>KV Store schema install</strong> — copies <code>splunk/collections.conf</code> into the container, chowns it, and restarts Splunk. 
This is the only step that requires <code>docker exec</code> choreography; everything else is host-side.</li><li><strong>Anchor venv + <code>pip install -e &#39;.[alibaba]&#39;</code></strong> — sets up the Python environment for the nightly OSS backup cron.</li></ol><p>The script ends with a printed checklist of the <em>human</em> next steps: edit <code>.env</code>, run the verifier, schedule the cron.</p><h2 id="What-verify-setup-sh-checks"><a href="#What-verify-setup-sh-checks" class="headerlink" title="What verify_setup.sh checks"></a>What <code>verify_setup.sh</code> checks</h2><p>The verifier (<a href="https://github.com/faketut/Anchor/blob/main/deploy/verify_setup.sh"><code>deploy/verify_setup.sh</code></a>) runs six independent checks and reports each as PASS &#x2F; FAIL &#x2F; SKIP:</p><table><thead><tr><th>Check</th><th>What FAIL means</th></tr></thead><tbody><tr><td><code>.env</code> exists</td><td>You haven’t filled in credentials yet</td></tr><tr><td>Required env vars set</td><td><code>SPLUNK_PASSWORD</code> &#x2F; <code>QWEN_API_KEY</code> placeholders still present</td></tr><tr><td>Splunk mgmt API reachable</td><td>Container not running, or security group blocking</td></tr><tr><td>All 3 KV collections present</td><td><code>collections.conf</code> install failed; re-run <code>setup_ecs.sh</code></td></tr><tr><td>OSS bucket reachable (optional)</td><td>AK pair wrong, or wrong endpoint&#x2F;bucket</td></tr><tr><td><code>anchor list</code> succeeds</td><td>The end-to-end smoke test failed: the CLI ↔ KV path is broken</td></tr></tbody></table><p>It exits non-zero on any FAIL so you can chain it: e.g. <code>bash deploy/verify_setup.sh &amp;&amp; systemctl restart anchor-cron</code>.</p><p>The “optional” tag on OSS is deliberate. You can run Anchor without the OSS backup; you just lose the durability guarantee. 
The verifier says SKIP, not FAIL, when <code>OSS_*</code> env vars aren’t set.</p><h2 id="What-only-humans-can-do"><a href="#What-only-humans-can-do" class="headerlink" title="What only humans can do"></a>What only humans can do</h2><p>There are three things the script <em>can’t</em> automate, because they require the Alibaba Cloud console:</p><table><thead><tr><th>Step</th><th>Why manual</th></tr></thead><tbody><tr><td>Provision the ECS instance</td><td>Account-scoped action, billing implications</td></tr><tr><td>Open security-group ports 22 + 8089 to your laptop IP</td><td>Requires knowing your client IP — different for every dev</td></tr><tr><td>Create the OSS bucket + RAM user with <code>AliyunOSSFullAccess</code></td><td>Account-scoped; RAM user creation needs human review</td></tr></tbody></table><p>The walkthrough in <a href="https://github.com/faketut/Anchor/blob/main/deploy/alibaba-cloud.md"><code>deploy/alibaba-cloud.md</code></a> specifies exact values:</p><ul><li><code>ecs.g7.large</code> (2 vCPU, 8 GB), Singapore or Hong Kong</li><li>Ubuntu 24.04 LTS, 60 GB ESSD</li><li>Security group: TCP 22 + TCP 8089 from your laptop’s <code>/32</code>, that’s it</li><li>OSS bucket: private ACL, <strong>versioning enabled</strong> so overwritten or deleted backups stay recoverable</li></ul><p>The Singapore &#x2F; Hong Kong region choice matters: DashScope’s international endpoint has the lowest latency from those regions, and keeps the ECS-to-Qwen call sub-100 ms.</p><h2 id="OSS-as-the-durability-layer"><a href="#OSS-as-the-durability-layer" class="headerlink" title="OSS as the durability layer"></a>OSS as the durability layer</h2><p>KV Store data lives on the ECS instance disk. ECS disks are durable (triple-replicated), but the <em>blast radius</em> is one instance. 
If you accidentally run <code>docker compose down -v</code> (the <code>-v</code> removes volumes), your anchors and drift history are gone.</p><p><a href="https://github.com/faketut/Anchor/blob/main/deploy/backup_kv_to_oss.py"><code>deploy/backup_kv_to_oss.py</code></a> solves that with a 60-line script:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">dump_kv</span>() -&gt; <span class="built_in">dict</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;Snapshot all three collections to a single JSON dict.&quot;&quot;&quot;</span></span><br><span class="line">    svc = connect()</span><br><span class="line">    <span class="keyword">return</span> &#123;</span><br><span class="line">        <span class="string">&quot;anchors&quot;</span>:        <span class="built_in">list</span>(svc.kvstore[<span class="string">&quot;anchors&quot;</span>].data.query()),</span><br><span class="line">        <span class="string">&quot;drift_history&quot;</span>:  <span class="built_in">list</span>(svc.kvstore[<span class="string">&quot;drift_history&quot;</span>].data.query()),</span><br><span class="line">        <span class="string">&quot;signal_weights&quot;</span>: <span class="built_in">list</span>(svc.kvstore[<span 
class="string">&quot;signal_weights&quot;</span>].data.query()),</span><br><span class="line">        <span class="string">&quot;snapshot_at&quot;</span>:    datetime.now(timezone.utc).isoformat(),</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">upload_to_oss</span>(<span class="params">payload: <span class="built_in">dict</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    auth   = oss2.Auth(os.environ[<span class="string">&quot;OSS_ACCESS_KEY_ID&quot;</span>],</span><br><span class="line">                       os.environ[<span class="string">&quot;OSS_ACCESS_KEY_SECRET&quot;</span>])</span><br><span class="line">    bucket = oss2.Bucket(auth, os.environ[<span class="string">&quot;OSS_ENDPOINT&quot;</span>],</span><br><span class="line">                                os.environ[<span class="string">&quot;OSS_BUCKET&quot;</span>])</span><br><span class="line">    key = <span class="string">f&quot;anchor-memory/<span class="subst">&#123;datetime.now(timezone.utc).isoformat()&#125;</span>.json&quot;</span></span><br><span class="line">    bucket.put_object(key, json.dumps(payload).encode(<span class="string">&quot;utf-8&quot;</span>),</span><br><span class="line">                      headers=&#123;<span class="string">&quot;x-oss-server-side-encryption&quot;</span>: <span class="string">&quot;AES256&quot;</span>&#125;)</span><br><span class="line">    <span class="keyword">return</span> key</span><br></pre></td></tr></table></figure><p>The wiring: <code>oss2.Auth</code> → <code>oss2.Bucket</code> → <code>put_object</code> with server-side AES256 encryption. 
That’s the <em>Alibaba Cloud API usage</em> the hackathon rules want as proof — a single file that imports the Alibaba SDK, authenticates with RAM credentials, and pushes data into the platform.</p><p>Scheduled via cron (a crontab entry must fit on one line, so the command is not wrapped):</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">0 3 * * * cd /opt/anchor &amp;&amp; set -a; . ./.env; set +a &amp;&amp; .venv/bin/python deploy/backup_kv_to_oss.py &gt;&gt; /var/log/anchor-backup.log 2&gt;&amp;1</span><br></pre></td></tr></table></figure><p>With bucket versioning enabled, every nightly run is preserved. Restore is <code>ossutil cp oss://... ./</code> plus a small replay script that reads the JSON and writes each row back via <code>kv_insert</code>. The repo doesn’t ship the restore script because nobody needs it for the demo; it’s ~30 lines of Python when someone does.</p><h2 id="The-three-Qwen-Cloud-surfaces-on-the-same-backend"><a href="#The-three-Qwen-Cloud-surfaces-on-the-same-backend" class="headerlink" title="The three Qwen Cloud surfaces, on the same backend"></a>The three Qwen Cloud surfaces, on the same backend</h2><p>The ECS install gets you the CLI. 
The MCP server and Qwen Custom Skill share the <em>same</em> Splunk backend — they’re just different transports:</p><table><thead><tr><th>Surface</th><th>How to bring it up</th></tr></thead><tbody><tr><td><strong>CLI</strong></td><td>already installed by <code>setup_ecs.sh</code></td></tr><tr><td><strong>MCP server (stdio)</strong></td><td><code>pip install -e &#39;.[mcp]&#39; &amp;&amp; anchor-mcp</code> — plug into Claude Desktop or Cursor</td></tr><tr><td><strong>Custom Skill (HTTP)</strong></td><td><code>pip install -e &#39;.[skill]&#39; &amp;&amp; uvicorn anchor.skill_server:app</code> — register <a href="https://github.com/faketut/Anchor/blob/main/deploy/qwen_skill/anchor-skill.yaml"><code>deploy/qwen_skill/anchor-skill.yaml</code></a> in Qwen Cloud → Application Center</td></tr></tbody></table><p>That’s by design. The “application” is the SPL + KV layer (<a href="02-fingerprint.md">posts 2</a> and <a href="03-diff-and-weights.md">3</a>); the surfaces are interchangeable. Adding a Slack bot, a Discord bot, or a PagerDuty webhook is the same pattern: thin transport, call into <code>agent.compare()</code>, render the <code>CompareResult</code>.</p><h2 id="What’s-deliberately-not-in-the-demo-deploy"><a href="#What’s-deliberately-not-in-the-demo-deploy" class="headerlink" title="What’s deliberately not in the demo deploy"></a>What’s deliberately not in the demo deploy</h2><p>A short list of things that would be on the checklist for a real production deploy but were intentionally cut for the hackathon:</p><table><thead><tr><th>Skipped</th><th>Why</th></tr></thead><tbody><tr><td>TLS on the mgmt API (Caddy &#x2F; Let’s Encrypt)</td><td><code>SPLUNK_VERIFY_SSL=false</code> is fine for a 30-second judge demo. Documented in <code>alibaba-cloud.md</code> as a real-deploy requirement.</td></tr><tr><td>systemd unit for the cron</td><td>Crontab is fine for a once-a-day backup. 
Adding systemd adds a unit file with zero behavioral change.</td></tr><tr><td>RAM policy scoped to the single bucket</td><td><code>AliyunOSSFullAccess</code> is broader than necessary. Real deploy should scope to <code>oss:PutObject</code> on <code>anchor-memory/*</code>.</td></tr><tr><td>Multi-AZ Splunk replication</td><td>Single ECS instance is fine for a demo. Splunk SHC is several weeks of work.</td></tr><tr><td>Skill-server behind a reverse proxy</td><td>The skill server has bearer-token auth via <code>secrets.compare_digest</code>, which is the right primitive — but it’s running on <code>0.0.0.0:8000</code>. For real use, front it with Caddy + TLS.</td></tr></tbody></table><p>The principle: ship the simplest thing that demonstrates the capability. Document the production gaps honestly.</p><h2 id="Validating-the-proof"><a href="#Validating-the-proof" class="headerlink" title="Validating the proof"></a>Validating the proof</h2><p>The hackathon submission asks for two things:</p><ol><li><strong>A URL to a code file demonstrating Alibaba Cloud API usage.</strong> That file is <a href="https://github.com/faketut/Anchor/blob/main/deploy/backup_kv_to_oss.py"><code>deploy/backup_kv_to_oss.py</code></a>. 60 lines, imports <code>oss2</code>, authenticates with RAM AK, uploads with server-side encryption. 
Direct mapping from the rules to the code.</li><li><strong>Evidence the backend runs on Alibaba Cloud.</strong> The 30-second demo video walks: Alibaba Cloud console showing the ECS instance → SSH into it → <code>docker ps</code> showing Splunk → <code>curl https://localhost:8089/services/server/info</code> returning a 200 → OSS console showing the <code>anchor-memory/*.json</code> objects → laptop-side <code>anchor compare</code> against the ECS public IP.</li></ol><p>That second list is the checklist at the bottom of <a href="https://github.com/faketut/Anchor/blob/main/deploy/alibaba-cloud.md"><code>deploy/alibaba-cloud.md</code></a>.</p><h2 id="Wrapping-the-series"><a href="#Wrapping-the-series" class="headerlink" title="Wrapping the series"></a>Wrapping the series</h2><p>Six posts in:</p><ul><li><a href="01-why-memoryagent.md"><strong>Post 1</strong></a> — the on-call problem and the three-memory framing.</li><li><a href="02-fingerprint.md"><strong>Post 2</strong></a> — five SPL queries → one KV row.</li><li><a href="03-diff-and-weights.md"><strong>Post 3</strong></a> — the diff engine and decay toward 1.0.</li><li><a href="04-narrator-llm-at-edge.md"><strong>Post 4</strong></a> — the LLM only narrates.</li><li><a href="05-planner-react-loop.md"><strong>Post 5</strong></a> — the optional ReAct planner.</li><li><a href="06-deploy-alibaba-cloud.md"><strong>Post 6</strong></a> — the deploy you’re reading.</li></ul><p>The common thread across all six: <em>most of the agent’s value lives in the deterministic core; the LLM is an edge component</em>. That’s why ranking, recall, decay, and the planner’s tool restrictions all matter more than the prompt itself.</p><p>If you’d build this differently — or you’ve shipped something similar and want to compare notes — open an issue on <a href="https://github.com/faketut/Anchor/issues"><code>faketut/Anchor</code></a>.</p>]]>
      </content:encoded>
    </item>
    <item>
      <title>The Planner: Function-Calling for SRE Drill-Down</title>
      <link>https://faketut.github.io/2026/06/18/anchor-05-planner-react-loop/</link>
      <description>
        <![CDATA[<p><code>anchor compare</code> does one LLM round-trip and returns a narrative
(<a href="04-narrator-llm-at-edge.md">post 4</a>). That’s eno]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <category domain="https://faketut.github.io/tags/anchor/">anchor</category>
      <category domain="https://faketut.github.io/tags/splunk/">splunk</category>
      <category domain="https://faketut.github.io/tags/sre/">sre</category>
      <category domain="https://faketut.github.io/tags/observability/">observability</category>
      <pubDate>Thu, 18 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p><code>anchor compare</code> does one LLM round-trip and returns a narrative (<a href="04-narrator-llm-at-edge.md">post 4</a>). That’s enough for ~80% of drift investigations. For the remaining 20% — where the engineer wants the agent to <em>follow a thread</em> — there’s <code>anchor compare --deep</code>.</p><p><code>--deep</code> swaps the single Qwen call for a function-calling ReAct loop:</p><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">thought → tool_call → observation → thought → tool_call → observation → … → final JSON</span><br></pre></td></tr></table></figure><p>The planner has read-only access to four tools. It’s told to prefer depth over breadth and stop early. A hard step cap prevents runaway loops. This post walks through <a href="https://github.com/faketut/Anchor/blob/main/src/anchor/investigator.py"><code>investigator.py</code></a>.</p><h2 id="The-four-tools"><a href="#The-four-tools" class="headerlink" title="The four tools"></a>The four tools</h2><table><thead><tr><th>Tool</th><th>What it wraps</th><th>Why it exists</th></tr></thead><tbody><tr><td><code>recall_similar_drifts(signals, k, min_similarity)</code></td><td><code>memory.recall_similar_drifts</code></td><td>The default first move when a signal feels familiar</td></tr><tr><td><code>get_drift_details(drift_id)</code></td><td><code>memory.get_drift</code></td><td>After recall, read a full past record before relying on it</td></tr><tr><td><code>run_spl(spl, earliest, latest)</code></td><td><code>splunk_client.run_search</code> (capped at 50 rows)</td><td>Evidence-gathering: deploy logs, host breakdowns, audit</td></tr><tr><td><code>list_recent_drifts(limit, outcome)</code></td><td><code>memory.list_drifts</code></td><td>Situational awareness when nothing recalls</td></tr></tbody></table><p>A few principles in that list:</p><ol><li><strong>Every tool wraps deterministic 
code.</strong> The planner can’t make up SPL that we then execute blind — <code>run_spl</code> goes through the same <code>splunk_client.run_search</code> the diff engine uses, with the same <code>max_count=50</code> cap.</li><li><strong>Read-only.</strong> No tool mutates KV Store. The planner can’t accidentally apply feedback or delete an anchor.</li><li><strong>No “give the LLM Python”.</strong> Sandboxed shell tools are powerful and dangerous. Four narrow tools beat one wide one.</li></ol><h2 id="What-the-planner-sees"><a href="#What-the-planner-sees" class="headerlink" title="What the planner sees"></a>What the planner sees</h2><p>Initial user payload (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/investigator.py"><code>_initial_payload</code></a>):</p><figure class="highlight json"><table><tr><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;planner_version&quot;</span><span class="punctuation">:</span> <span class="number">1</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;anchor_name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Healthy Week&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;compare_window&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span><span class="attr">&quot;start&quot;</span><span class="punctuation">:</span> 
<span class="string">&quot;...&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;end&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;initial_summary&quot;</span><span class="punctuation">:</span> <span class="string">&quot;p95 latency tripled and a new PaymentGateway template appeared...&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;initial_hypothesis&quot;</span><span class="punctuation">:</span> <span class="string">&quot;downstream payment-svc degradation&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;top_diffs&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span><span class="attr">&quot;signal&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;severity&quot;</span><span class="punctuation">:</span> <span class="string">&quot;HIGH&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;delta_pct&quot;</span><span class="punctuation">:</span> <span class="number">299.4</span><span class="punctuation">,</span> <span class="attr">&quot;note&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">&#125;</span></span><br><span class="line">    <span class="comment">// up to 10</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;already_recalled&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span 
class="punctuation">&#123;</span><span class="attr">&quot;id&quot;</span><span class="punctuation">:</span> <span class="string">&quot;7db2d8aa&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;outcome&quot;</span><span class="punctuation">:</span> <span class="string">&quot;resolved&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;similarity&quot;</span><span class="punctuation">:</span> <span class="number">0.71</span><span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>The “initial” fields come from the regular <code>compare</code> that ran first. The planner builds on that — it doesn’t redo the diff. <code>already_recalled</code> tells it which past incidents the narrator already saw, so it can choose to dig deeper into one of them or look elsewhere.</p><h2 id="The-system-prompt"><a href="#The-system-prompt" class="headerlink" title="The system prompt"></a>The system prompt</h2><figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">You are Anchor&#x27;s deep-investigation planner.</span><br><span class="line"></span><br><span class="line">You receive an initial 
CompareResult: anchor name + top diffs + an</span><br><span class="line">initial narration. Your job is to deepen the investigation using the</span><br><span class="line">tools provided, then return a tighter root-cause hypothesis with an</span><br><span class="line">evidence chain.</span><br><span class="line"></span><br><span class="line">Strategy:</span><br><span class="line">1. If diffs contain a new template or a metric spike, call</span><br><span class="line">   recall_similar_drifts on those signals to find precedents.</span><br><span class="line">2. If a precedent has outcome=resolved with a confirmed_reason, call</span><br><span class="line">   get_drift_details to read its full record before relying on it.</span><br><span class="line">3. If you suspect a deploy/config change, call run_spl against relevant</span><br><span class="line">   indexes (e.g. deploy_log, config_change, audit) within the compare</span><br><span class="line">   window.</span><br><span class="line">4. Stop and finalize as soon as you have a defensible hypothesis. You</span><br><span class="line">   have up to 6 tool calls — prefer depth over breadth and stop early</span><br><span class="line">   when evidence converges.</span><br><span class="line"></span><br><span class="line">Tool observations are clipped at ~8 KB; if you need more, narrow the SPL.</span><br></pre></td></tr></table></figure><p>Four things worth noting:</p><ul><li><strong>A numbered strategy, not free-form</strong> — gives the model a default branching order. It deviates when warranted, but it has somewhere to start.</li><li><strong>Hard cap of 6 tool calls.</strong> Default <code>CONFIG.investigate_max_steps</code>. This is the difference between “agent” and “agent loop until your Qwen bill explodes”.</li><li><strong>Observations capped at 8 KB.</strong> If <code>run_spl</code> returns a giant rowset, it’s truncated and the planner is told to narrow the SPL. 
This prevents one fat tool call from blowing the whole context window.</li><li><strong>“Stop early when evidence converges.”</strong> Counter-instinct for an LLM trained on “be helpful”. Without this, the planner spends all 6 calls even when it had the answer after 2.</li></ul><h2 id="The-loop"><a href="#The-loop" class="headerlink" title="The loop"></a>The loop</h2><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">for</span> step_num <span class="keyword">in</span> <span class="built_in">range</span>(<span class="number">1</span>, max_steps + <span class="number">1</span>):</span><br><span class="line">    rsp = client.chat.completions.create(</span><br><span class="line">        model=model, messages=messages,</span><br><span class="line">        tools=TOOLS, tool_choice=<span class="string">&quot;auto&quot;</span>, temperature=<span class="number">0.1</span>,</span><br><span class="line">    )</span><br><span class="line">    msg = rsp.choices[<span class="number">0</span>].message</span><br><span class="line">    messages.append(_serialize_assistant(msg))</span><br><span class="line"></span><br><span class="line">    tool_calls = <span class="built_in">getattr</span>(msg, <span 
class="string">&quot;tool_calls&quot;</span>, <span class="literal">None</span>)</span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> tool_calls:</span><br><span class="line">        <span class="keyword">return</span> _parse_final(msg.content <span class="keyword">or</span> <span class="string">&quot;&quot;</span>, steps, truncated=<span class="literal">False</span>)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># dispatch each tool call, append observation as a `tool` message</span></span><br><span class="line">    <span class="keyword">for</span> tc <span class="keyword">in</span> tool_calls:</span><br><span class="line">        observation = _dispatch(tc.function.name, json.loads(tc.function.arguments))</span><br><span class="line">        messages.append(&#123;</span><br><span class="line">            <span class="string">&quot;role&quot;</span>: <span class="string">&quot;tool&quot;</span>, <span class="string">&quot;tool_call_id&quot;</span>: tc.<span class="built_in">id</span>,</span><br><span class="line">            <span class="string">&quot;name&quot;</span>: tc.function.name, <span class="string">&quot;content&quot;</span>: _truncate(observation),</span><br><span class="line">        &#125;)</span><br><span class="line">        steps.append(InvestigationStep(...))</span><br><span class="line">        <span class="keyword">if</span> step_callback:</span><br><span class="line">            step_callback(steps[-<span class="number">1</span>])</span><br></pre></td></tr></table></figure><p>That’s the entire loop. The <code>step_callback</code> is what enables the live trace in the CLI — each step prints as it lands, so the engineer sees the planner thinking in real time. 
Exceptions raised by the callback are deliberately <em>not</em> caught; they indicate a consumer bug, not a planner failure.</p><h2 id="A-real-abbreviated-transcript"><a href="#A-real-abbreviated-transcript" class="headerlink" title="A real (abbreviated) transcript"></a>A real (abbreviated) transcript</h2><p>Here’s what <code>anchor compare --deep</code> looks like on the demo dataset when the engineer is investigating checkout slowness:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br></pre></td><td class="code"><pre><span class="line">step 1</span><br><span class="line">  thought:   &quot;p95 latency tripled and a new PaymentGatewayTimeout</span><br><span class="line">              template 
appeared. Both look related to the payment</span><br><span class="line">              path. Let me recall past drifts with overlapping signals.&quot;</span><br><span class="line">  tool:      recall_similar_drifts(</span><br><span class="line">               signals=[&quot;template:appeared:PaymentGatewayTimeout #4a&quot;,</span><br><span class="line">                        &quot;metric:latency_ms:p95&quot;], k=5)</span><br><span class="line">  obs:       [&#123;&quot;id&quot;:&quot;7db2d8aa&quot;,&quot;outcome&quot;:&quot;resolved&quot;,&quot;similarity&quot;:0.71,</span><br><span class="line">               &quot;confirmed_reason&quot;:&quot;payment-svc 2.4.1 timeout regression&quot;&#125;]</span><br><span class="line"></span><br><span class="line">step 2</span><br><span class="line">  thought:   &quot;High overlap with a resolved deploy regression. Reading</span><br><span class="line">              the full record to see what evidence they used.&quot;</span><br><span class="line">  tool:      get_drift_details(drift_id=&quot;7db2d8aa&quot;)</span><br><span class="line">  obs:       &#123;&quot;top_diffs&quot;:[...], &quot;suggested_spl&quot;:</span><br><span class="line">              &quot;index=deploy_log service=payment-svc&quot;&#125;</span><br><span class="line"></span><br><span class="line">step 3</span><br><span class="line">  thought:   &quot;If this is the same class of incident I should see a</span><br><span class="line">              deploy event for payment-svc in the compare window.&quot;</span><br><span class="line">  tool:      run_spl(spl=&quot;index=deploy_log service=payment-svc&quot;,</span><br><span class="line">                     earliest=&quot;2026-06-02T00:00:00&quot;,</span><br><span class="line">                     latest=&quot;2026-06-03T00:00:00&quot;)</span><br><span class="line">  obs:       &#123;&quot;row_count&quot;: 1, &quot;rows&quot;:[&#123;&quot;_time&quot;:&quot;2026-06-02T17:42Z&quot;,</span><br><span class="line">       
       &quot;service&quot;:&quot;payment-svc&quot;,&quot;version&quot;:&quot;2.4.7&quot;,&quot;actor&quot;:&quot;ci&quot;&#125;]&#125;</span><br><span class="line"></span><br><span class="line">step 4  (final, no tool_calls)</span><br><span class="line">  &#123;</span><br><span class="line">    &quot;summary&quot;: &quot;p95 latency tripled at 17:42, coincident with payment-svc</span><br><span class="line">                deploy 2.4.7. New PaymentGatewayTimeout template matches</span><br><span class="line">                resolved incident 7db2d8aa (payment-svc 2.4.1 timeout</span><br><span class="line">                regression).&quot;,</span><br><span class="line">    &quot;hypothesis&quot;: &quot;deploy regression in payment-svc 2.4.7&quot;,</span><br><span class="line">    &quot;evidence&quot;: [</span><br><span class="line">      &quot;recall: incident 7db2d8aa had Jaccard 0.71 on payment timeout signals&quot;,</span><br><span class="line">      &quot;deploy_log: payment-svc 2.4.7 deployed at 2026-06-02T17:42Z within</span><br><span class="line">       compare window&quot;</span><br><span class="line">    ],</span><br><span class="line">    &quot;confidence&quot;: 0.8</span><br><span class="line">  &#125;</span><br></pre></td></tr></table></figure><p>The whole loop took 4 calls, well under the 6 cap. Two of those were memory recall, one was SPL evidence-gathering, one was the final synthesis. That distribution is typical — when the diff has obvious precedents, the planner spends most of its budget on grounding rather than exploration.</p><h2 id="Why-qwen-max-latest-specifically"><a href="#Why-qwen-max-latest-specifically" class="headerlink" title="Why qwen-max-latest specifically"></a>Why <code>qwen-max-latest</code> specifically</h2><p>The narrator runs on <code>qwen-plus</code>. The planner defaults to <code>qwen-max-latest</code> (or whatever the <code>QWEN_PLANNER_MODEL</code> env var is set to). 
The difference matters:</p><ul><li><strong><code>qwen-plus</code></strong> is fine for one-shot JSON narration. Cheap, fast.</li><li><strong><code>qwen-max-latest</code></strong> has noticeably better function-calling discipline — it stops earlier, picks tools more accurately, and fabricates fewer SPL arguments.</li></ul><p>The tier-up is justifiable because <code>--deep</code> is an opt-in command, typically run on a single drift the engineer cares about. If you ran it on every compare you’d burn through your Qwen budget — exactly the reason it’s <code>--deep</code> and not the default.</p><h2 id="Safety-robustness-details"><a href="#Safety-robustness-details" class="headerlink" title="Safety &#x2F; robustness details"></a>Safety &#x2F; robustness details</h2><ul><li><strong>Argument access uses <code>.get()</code> everywhere.</strong> A model that hallucinates a missing required argument returns <code>{&quot;error&quot;: &quot;...&quot;}</code> from <code>_dispatch</code> instead of crashing the loop.</li><li><strong><code>signal_embedding</code> is stripped from observations.</strong> The recalled drift records have an optional 1024-dim embedding. The planner can’t reason about raw float vectors and it would eat the 8 KB observation budget. Excluded explicitly.</li><li><strong>Truncation is loud, not silent.</strong> <code>_truncate</code> appends <code>…[truncated, N more chars]</code> so the planner knows it didn’t see the whole thing and can choose to narrow the SPL.</li><li><strong>The hard cap is honored</strong> — if we hit <code>max_steps</code> without a final answer, the result is returned with <code>truncated=True</code> and whatever observations we gathered. The CLI shows that flag in the rendered report.</li></ul><h2 id="Tests-for-an-LLM-loop"><a href="#Tests-for-an-LLM-loop" class="headerlink" title="Tests for an LLM loop"></a>Tests for an LLM loop</h2><p>Testing a function-calling agent is hard. 
We don’t try to test that “Qwen picks the right tool”; that’s not a property of our code. Instead, the tests (<a href="https://github.com/faketut/Anchor/blob/main/tests/test_investigator.py"><code>tests/test_investigator.py</code></a>) fake the OpenAI client and verify the <em>plumbing</em>:</p><ul><li>Tool dispatch routes each name to the right wrapper</li><li><code>_truncate</code> preserves the original length in its breadcrumb</li><li><code>step_callback</code> fires once per step</li><li>The hard cap is honored</li><li><code>_parse_final</code> survives malformed JSON</li></ul><p>That’s the layer worth testing. The LLM’s <em>judgment</em> is best tested by running it on real fixtures and reading the output — which is exactly what the demo script in <a href="https://github.com/faketut/Anchor/blob/main/examples/demo_script.md"><code>examples/demo_script.md</code></a> does.</p>]]>
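        <![CDATA[<p>To make the “loud truncation” breadcrumb and its plumbing test concrete, here is a minimal sketch. It is not Anchor’s actual <code>_truncate</code>: the name <code>truncate_observation</code>, the <code>limit</code> parameter, and the reading of the 8 KB budget as a character count are all assumptions made for illustration.</p>

```python
OBSERVATION_BUDGET = 8 * 1024  # assumed: the 8 KB budget counted in characters


def truncate_observation(text: str, limit: int = OBSERVATION_BUDGET) -> str:
    """Hypothetical stand-in for _truncate: cut loudly, never silently."""
    overflow = text[limit:]
    if not overflow:
        # The observation fits within the budget, so pass it through unchanged.
        return text
    # Report how much was dropped so the planner can choose to narrow its SPL.
    return f"{text[:limit]}…[truncated, {len(overflow)} more chars]"


if __name__ == "__main__":
    long_observation = "x" * (OBSERVATION_BUDGET + 120)
    # Print only the tail, which carries the breadcrumb.
    print(truncate_observation(long_observation)[-32:])
```

<p>A plumbing test for a helper like this only needs to check that the breadcrumb reports the dropped length and that short inputs pass through untouched. No model call is involved.</p>]]>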
      </content:encoded>
    </item>
    <item>
      <title>The Narrator: Putting the LLM Only at the Edge</title>
      <link>https://faketut.github.io/2026/06/17/anchor-04-narrator-llm-at-edge/</link>
      <description>
        <![CDATA[<p>By the time Qwen sees a request, the hard work is already done. The
diff engine (<a href="03-diff-and-weights.md">post 3</a>) has ranked]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <category domain="https://faketut.github.io/tags/anchor/">anchor</category>
      <category domain="https://faketut.github.io/tags/splunk/">splunk</category>
      <category domain="https://faketut.github.io/tags/sre/">sre</category>
      <category domain="https://faketut.github.io/tags/observability/">observability</category>
      <pubDate>Wed, 17 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>By the time Qwen sees a request, the hard work is already done. The diff engine (<a href="03-diff-and-weights.md">post 3</a>) has ranked the top diffs. The recall system has fetched the top-3 most similar past incidents. The LLM’s job is <em>narration</em>: turn structured rows into a 2-4 sentence summary, a hypothesis, and one drill-in SPL query.</p><p>This post walks through <a href="https://github.com/faketut/Anchor/blob/main/src/anchor/narrator.py"><code>narrator.py</code></a> and the design choices that keep it cheap, reproducible, and easy to audit.</p><h2 id="The-full-prompt-verbatim"><a href="#The-full-prompt-verbatim" class="headerlink" title="The full prompt, verbatim"></a>The full prompt, verbatim</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">SYSTEM_PROMPT = <span class="string">&quot;&quot;&quot;You are Anchor, an observability assistant for Splunk.</span></span><br><span class="line"><span class="string">You are given a set of statistical diffs between a HEALTHY baseline window</span></span><br><span class="line"><span class="string">(the &quot;anchor&quot;) and a CURRENT window being investigated. 
You may also be</span></span><br><span class="line"><span class="string">given PAST_INCIDENTS — previously-investigated drifts with confirmed</span></span><br><span class="line"><span class="string">outcomes whose signals overlap with the current one.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">Your job:</span></span><br><span class="line"><span class="string">1. Write a 2-4 sentence SUMMARY in plain English describing what changed.</span></span><br><span class="line"><span class="string">   Lead with the highest-severity diffs. Quantify deltas.</span></span><br><span class="line"><span class="string">2. Propose a single best HYPOTHESIS for the likely cause class</span></span><br><span class="line"><span class="string">   (e.g. &quot;downstream service degradation&quot;, &quot;new error class&quot;, &quot;traffic shift&quot;,</span></span><br><span class="line"><span class="string">    &quot;deploy regression&quot;). If a PAST_INCIDENT with outcome=resolved has high</span></span><br><span class="line"><span class="string">   signal overlap, you SHOULD reference it (by its short id) and lean on its</span></span><br><span class="line"><span class="string">   confirmed_reason. If the past incident was a false_positive, downweight</span></span><br><span class="line"><span class="string">   your concern accordingly.</span></span><br><span class="line"><span class="string">3. Suggest one DRILL_IN SPL query the engineer should run next to confirm.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">Be concise. Do NOT invent diffs not in the input. 
Do NOT claim root cause</span></span><br><span class="line"><span class="string">with certainty — use words like &quot;likely&quot;, &quot;suggests&quot;, &quot;consistent with&quot;.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">Respond as a JSON object with exactly these keys:</span></span><br><span class="line"><span class="string">  summary (string), hypothesis (string or null), drill_in_spl (string or null).</span></span><br><span class="line"><span class="string">&quot;&quot;&quot;</span></span><br></pre></td></tr></table></figure><p>A few things deliberately <em>not</em> in this prompt:</p><ul><li>No examples &#x2F; few-shot. The output schema is strict JSON; examples bloat the prompt without changing quality.</li><li>No “think step by step”. The deterministic core already did the thinking. We want narration, not chain-of-thought.</li><li>No persona (“You are an expert SRE…”). The role is <code>system</code>; that’s the persona. Verbose personas pull the model toward filler.</li><li>No claim of certainty. 
The “use words like ‘likely’, ‘suggests’” instruction is the cheapest hallucination-mitigation we have.</li></ul><h2 id="What-the-model-sees-as-input"><a href="#What-the-model-sees-as-input" class="headerlink" title="What the model sees as input"></a>What the model sees as input</h2><p>The user message is JSON, not prose (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/narrator.py"><code>_payload()</code></a>):</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;prompt_version&quot;</span><span class="punctuation">:</span> <span class="number">2</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;anchor_name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Healthy Week&quot;</span><span 
class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;diffs&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;signal&quot;</span><span class="punctuation">:</span> <span class="string">&quot;template:appeared:PaymentGatewayTimeout #4a&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;kind&quot;</span><span class="punctuation">:</span> <span class="string">&quot;template&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;severity&quot;</span><span class="punctuation">:</span> <span class="string">&quot;HIGH&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;anchor_val&quot;</span><span class="punctuation">:</span> <span class="number">0.0</span><span class="punctuation">,</span> <span class="attr">&quot;current_val&quot;</span><span class="punctuation">:</span> <span class="number">148</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;delta_pct&quot;</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">null</span></span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;note&quot;</span><span class="punctuation">:</span> <span class="string">&quot;new pattern (_json): timeout calling stripe.payment.charge&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;signal&quot;</span><span class="punctuation">:</span> <span class="string">&quot;metric:latency_ms:p95&quot;</span><span class="punctuation">,</span></span><br><span 
class="line">      <span class="attr">&quot;kind&quot;</span><span class="punctuation">:</span> <span class="string">&quot;metric&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;severity&quot;</span><span class="punctuation">:</span> <span class="string">&quot;HIGH&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;anchor_val&quot;</span><span class="punctuation">:</span> <span class="number">312.4</span><span class="punctuation">,</span> <span class="attr">&quot;current_val&quot;</span><span class="punctuation">:</span> <span class="number">1247.8</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;delta_pct&quot;</span><span class="punctuation">:</span> <span class="number">299.4</span><span class="punctuation">,</span> <span class="attr">&quot;note&quot;</span><span class="punctuation">:</span> <span class="string">&quot;&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="comment">// up to 15 diffs</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;past_incidents&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span></span><br><span class="line">    <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;id&quot;</span><span class="punctuation">:</span> <span class="string">&quot;7db2d8aa&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;when&quot;</span><span class="punctuation">:</span> <span class="string">&quot;2026-04-12T19:03Z&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;outcome&quot;</span><span class="punctuation">:</span> <span 
class="string">&quot;resolved&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;confirmed_reason&quot;</span><span class="punctuation">:</span> <span class="string">&quot;payment-svc 2.4.1 timeout regression, rolled back&quot;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;signal_overlap&quot;</span><span class="punctuation">:</span> <span class="number">0.71</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;signals&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="string">&quot;template:appeared:PaymentGatewayTimeout #4a&quot;</span><span class="punctuation">,</span> <span class="string">&quot;metric:latency_ms:p95&quot;</span><span class="punctuation">]</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">    <span class="comment">// up to 3 past incidents</span></span><br><span class="line">  <span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;focus&quot;</span><span class="punctuation">:</span> <span class="string">&quot;checkout slowness&quot;</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>Three small choices worth flagging:</p><ol><li><strong><code>prompt_version: 2</code></strong> in the payload. When the prompt or schema changes, the version bumps. Drift records store this implicitly via the response shape, so audits can answer <em>“which prompt produced this hypothesis?”</em>.</li><li><strong><code>anchor_val</code> &#x2F; <code>current_val</code> are raw numbers</strong>, not formatted strings. 
This lets the model quantify deltas without us pre-baking “3.0×” prose.</li><li><strong><code>past_incidents</code> is bounded at 3.</strong> Not 10, not “all relevant”. The Track-1 requirement is <em>recalling critical memories within limited context</em>. Three is enough for grounding without crowding out the diffs.</li></ol><h2 id="What-the-model-has-to-return"><a href="#What-the-model-has-to-return" class="headerlink" title="What the model has to return"></a>What the model has to return</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">rsp = client.chat.completions.create(</span><br><span class="line">    model=model,</span><br><span class="line">    response_format=&#123;<span class="string">&quot;type&quot;</span>: <span class="string">&quot;json_object&quot;</span>&#125;,</span><br><span class="line">    messages=[</span><br><span class="line">        &#123;<span class="string">&quot;role&quot;</span>: <span class="string">&quot;system&quot;</span>, <span class="string">&quot;content&quot;</span>: SYSTEM_PROMPT&#125;,</span><br><span class="line">        &#123;<span class="string">&quot;role&quot;</span>: <span class="string">&quot;user&quot;</span>,   <span class="string">&quot;content&quot;</span>: _payload(diffs, focus, anchor_name, past_incidents)&#125;,</span><br><span class="line">    ],</span><br><span class="line">    temperature=<span class="number">0.2</span>,</span><br><span class="line">)</span><br></pre></td></tr></table></figure><p>The combination of <code>response_format={&quot;type&quot;: &quot;json_object&quot;}</code> and <code>temperature=0.2</code> is the whole reliability 
story:</p><ul><li><strong>JSON mode</strong> means Qwen returns syntactically valid JSON every time. No retry loop, no markdown-fence stripping, no regex extraction.</li><li><strong>Low temperature</strong> keeps the narration boring in a good way. The same diffs produce the same summary across runs. SREs are not looking for creative writing.</li></ul><p>The parsing on the other side is correspondingly mundane:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">data = json.loads(raw)</span><br><span class="line"><span class="keyword">return</span> NarratorResponse(</span><br><span class="line">    summary=data.get(<span class="string">&quot;summary&quot;</span>, <span class="string">&quot;&quot;</span>).strip() <span class="keyword">or</span> <span class="string">&quot;(empty)&quot;</span>,</span><br><span class="line">    hypothesis=(data.get(<span class="string">&quot;hypothesis&quot;</span>) <span class="keyword">or</span> <span class="literal">None</span>),</span><br><span class="line">    drill_in_spl=(data.get(<span class="string">&quot;drill_in_spl&quot;</span>) <span class="keyword">or</span> <span class="literal">None</span>),</span><br><span class="line">)</span><br></pre></td></tr></table></figure><p>No <code>data[&quot;summary&quot;]</code> — every key uses <code>.get(..., default)</code>. If Qwen gets weird, we degrade to a sensible empty value instead of throwing.</p><h2 id="Provider-abstraction-the-small-one"><a href="#Provider-abstraction-the-small-one" class="headerlink" title="Provider abstraction (the small one)"></a>Provider abstraction (the small one)</h2><p>Qwen and Gemini both expose OpenAI-compatible chat completions endpoints. 
So Anchor’s “multi-provider” support is one function parameterized over base URL, API key, and model name:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">_openai_compat_narrate</span>(<span class="params">diffs, focus, anchor_name, *,</span></span><br><span class="line"><span class="params">                           api_key, base_url, model, ...</span>):</span><br><span class="line">    <span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI</span><br><span class="line">    client = OpenAI(api_key=api_key, base_url=base_url, timeout=LLM_TIMEOUT_S)</span><br><span class="line">    ...</span><br></pre></td></tr></table></figure><p><code>narrate()</code> is a five-line switch that picks which provider settings to pass to <code>_openai_compat_narrate</code>. There used to be a third branch for a hypothetical Splunk-hosted model; it was dead code and got deleted in code review. 
The principle: if no path through your code is exercised, the code is wrong.</p><h2 id="Where-the-LLM-is-in-the-bigger-picture"><a href="#Where-the-LLM-is-in-the-bigger-picture" class="headerlink" title="Where the LLM is in the bigger picture"></a>Where the LLM is in the bigger picture</h2><p>Looking at the system overview from the project README, the LLM sits at the <em>edge</em> of the data pipeline, never in the middle:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">Splunk → fingerprint → diff (weighted) → recall (Jaccard or cosine) → Qwen → user</span><br><span class="line">                       ^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^</span><br><span class="line">                       deterministic     deterministic                  narration</span><br></pre></td></tr></table></figure><p>That layering buys us four things, none of which a pure-LLM agent gets:</p><table><thead><tr><th>Property</th><th>Why we get it</th></tr></thead><tbody><tr><td>Reproducibility</td><td>Same window → same top diffs → same prompt input</td></tr><tr><td>Bounded cost</td><td>One LLM call per compare, fixed-size payload</td></tr><tr><td>Auditability</td><td><code>drift_history</code> stores both the structured diffs <em>and</em> the prose; if Qwen was wrong, the structured data is still there</td></tr><tr><td>Graceful degradation</td><td>If Qwen is down, you still see the ranked diffs in the rendered report; the narration just says “(empty)”</td></tr></tbody></table><p>The same principle applies to the optional <code>--deep</code> planner in <a href="05-planner-react-loop.md">post 5</a>: the model only gets to call <em>tools that wrap deterministic code</em>. 
It never gets to make up SPL that we then execute blind.</p><h2 id="What-we-deliberately-don’t-do"><a href="#What-we-deliberately-don’t-do" class="headerlink" title="What we deliberately don’t do"></a>What we deliberately don’t do</h2><ul><li><strong>No streaming.</strong> The CLI waits for the full JSON response. Streaming partial JSON is a parsing headache and the time saved is dwarfed by the SPL queries that ran before the LLM call anyway.</li><li><strong>No re-ranking by the model.</strong> We send the top-15 already ranked. We don’t ask the model to re-rank; we ask it to <em>narrate</em> the existing ranking. The diff engine is the source of truth, not Qwen.</li><li><strong>No tool calls in the basic narrator.</strong> That’s the planner’s job (<a href="05-planner-react-loop.md">post 5</a>). Keeping the basic narrator tool-free means <code>anchor compare</code> is always one LLM round-trip and the latency is predictable.</li><li><strong>No retries on JSON parse failure.</strong> With JSON mode + temperature 0.2 this hasn’t happened in months of testing. If it ever does, the fallback returns <code>(empty)</code> and the engineer sees the structured diffs. Better than a hidden retry loop adding latency.</li></ul><h2 id="The-cost-shape"><a href="#The-cost-shape" class="headerlink" title="The cost shape"></a>The cost shape</h2><p>For a normal <code>anchor compare</code> on the demo dataset:</p><table><thead><tr><th>Component</th><th>Approximate cost</th></tr></thead><tbody><tr><td>5 SPL queries</td><td>~250 ms total</td></tr><tr><td>Diff engine (pure Python)</td><td>&lt; 10 ms</td></tr><tr><td>Recall (Jaccard over ~500 rows)</td><td>&lt; 50 ms</td></tr><tr><td>One Qwen <code>qwen-plus</code> call</td><td>~1.5-3 s</td></tr><tr><td>KV write of new drift record</td><td>~30 ms</td></tr></tbody></table><p>The LLM is the dominant tail. Everything else is well below human perception. 
If you wanted to speed Anchor up, you’d move from <code>qwen-plus</code> to <code>qwen-turbo</code> — <em>not</em> refactor the pipeline.</p>]]>
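        <![CDATA[<p>Returning to the no-retry fallback described under “What we deliberately don’t do”: the sketch below shows one way a single parse attempt could degrade to <code>(empty)</code>. It is an illustration under stated assumptions, not the code in <code>narrator.py</code>. The <code>NarratorResponse</code> fields are inferred from the three keys in the system prompt, and <code>parse_narration</code> is a hypothetical name.</p>

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class NarratorResponse:
    # Assumed shape, mirroring the three keys the system prompt asks for.
    summary: str
    hypothesis: Optional[str] = None
    drill_in_spl: Optional[str] = None


def parse_narration(raw: str) -> NarratorResponse:
    """Hypothetical wrapper: one parse attempt, no retry, '(empty)' on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON: fall back so the structured diffs still render.
        return NarratorResponse(summary="(empty)")
    if not isinstance(data, dict):
        # Valid JSON of the wrong shape (a list, a bare string) is treated the same way.
        return NarratorResponse(summary="(empty)")
    summary = data.get("summary")
    return NarratorResponse(
        summary=(summary.strip() if isinstance(summary, str) else "") or "(empty)",
        hypothesis=data.get("hypothesis") or None,
        drill_in_spl=data.get("drill_in_spl") or None,
    )


if __name__ == "__main__":
    print(parse_narration("not json"))
    print(parse_narration('{"summary": " p95 latency up ", "hypothesis": ""}'))
```

<p>The fallback adds no latency: a bad response costs one failed <code>json.loads</code> call, and the engineer still sees the ranked diffs.</p>]]>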
      </content:encoded>
    </item>
    <item>
      <title>The Diff: Ranking Severity by What We've Learned Matters</title>
      <link>https://faketut.github.io/2026/06/16/anchor-03-diff-and-weights/</link>
      <description>
        <![CDATA[<p>The diff engine (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/diff.py"><code>diff.py</code></a>) is the most
boring fi]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/anchor/">anchor</category>
      <category domain="https://faketut.github.io/tags/splunk/">splunk</category>
      <category domain="https://faketut.github.io/tags/sre/">sre</category>
      <category domain="https://faketut.github.io/tags/observability/">observability</category>
      <pubDate>Tue, 16 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The diff engine (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/diff.py"><code>diff.py</code></a>) is the most boring file in the repo on purpose. It’s ~250 lines of pure functions with zero LLM calls, zero network I&#x2F;O, and zero hidden state. Given an anchor fingerprint and a current one, it returns a ranked list of <code>DiffEntry</code> rows. That’s it.</p><p>The interesting part is what gets multiplied on top of those rows right before ranking.</p><h2 id="Three-diffs-in-ranked-list-out"><a href="#Three-diffs-in-ranked-list-out" class="headerlink" title="Three diffs in, ranked list out"></a>Three diffs in, ranked list out</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">diff_all</span>(<span class="params">anchor, current, weights=<span class="literal">None</span>, *, limit=<span class="number">20</span></span>):</span><br><span class="line">    weights = weights <span class="keyword">or</span> &#123;&#125;</span><br><span class="line">    entries = (</span><br><span class="line">        volume_diff(anchor, current)</span><br><span class="line">        + template_diff(anchor, current)</span><br><span class="line">        + metric_diff(anchor, current)</span><br><span class="line">    )</span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">rank</span>(<span class="params">e</span>):</span><br><span class="line">        base = SEV_ORDER[e.severity]</span><br><span class="line">        w = weights.get(e.signal, SignalWeight(signal_name=e.signal)).weight</span><br><span class="line">        mag = <span class="built_in">abs</span>(e.delta_pct <span class="keyword">or</span> <span class="number">0.0</span>) / <span class="number">100.0</span></span><br><span class="line">        <span class="keyword">return</span> base * w + mag * <span class="number">0.01</span></span><br><span class="line">    entries.sort(key=rank, reverse=<span class="literal">True</span>)</span><br><span class="line">    <span class="keyword">return</span> entries[:limit]</span><br></pre></td></tr></table></figure><p>That <code>base * w + mag * 0.01</code> is the entire learned-ranking story:</p><ul><li><code>base</code> is a 1 &#x2F; 2 &#x2F; 3 score from <code>LOW</code> &#x2F; <code>MEDIUM</code> &#x2F; <code>HIGH</code>.</li><li><code>w</code> is the learned weight for this signal (default 1.0, floor 0.1, cap 3.0).</li><li><code>mag</code> is a tiny tiebreaker so a 500% change ranks above a 51% change at the same severity tier.</li></ul><p>The result: if <code>template:payment_4xx_upstream</code> has been <em>confirmed</em> five times in the last quarter, its <code>w</code> is around 1.5. When it shows up again at MEDIUM severity, it scores 2 × 1.5 = 3.0, level with a HIGH-severity <code>volume:foo</code> change at <code>w = 1.0</code> (3 × 1.0 = 3.0), so the magnitude tiebreaker decides the order. One more confirmation and it ranks ahead outright. That’s the system telling you: <em>“You’ve cared about this before. Look here first.”</em></p><h2 id="The-three-classes-of-diff"><a href="#The-three-classes-of-diff" class="headerlink" title="The three classes of diff"></a>The three classes of diff</h2><h3 id="Volume-diff"><a href="#Volume-diff" class="headerlink" title="Volume diff"></a>Volume diff</h3><p>Per-sourcetype event counts.
Two interesting edge cases:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> a == <span class="number">0</span> <span class="keyword">and</span> c == <span class="number">0</span>:</span><br><span class="line">    <span class="keyword">continue</span>                        <span class="comment"># both zero, not interesting</span></span><br><span class="line">delta = _pct_change(a, c)</span><br><span class="line"><span class="keyword">if</span> delta <span class="keyword">is</span> <span class="literal">None</span>:                   <span class="comment"># anchor was 0, current &gt; 0</span></span><br><span class="line">    out.append(DiffEntry(..., delta_pct=<span class="literal">None</span>, severity=<span class="string">&quot;HIGH&quot;</span>,</span><br><span class="line">                         note=<span class="string">&quot;new sourcetype&quot;</span>))</span><br></pre></td></tr></table></figure><p>The “anchor was 0, current is positive” case used to return some magic percent. That’s been wrong since the first review: there’s no honest percent change from zero.
The fix:</p><blockquote><p><strong>Return <code>None</code> for delta_pct and have the renderer print <code>new</code> instead of a fabricated number.</strong></p></blockquote><p>The diff engine’s only job is to surface signal; lying about divisions by zero adds noise.</p><h3 id="Template-diff"><a href="#Template-diff" class="headerlink" title="Template diff"></a>Template diff</h3><p>Three sub-cases against the <code>log_patterns</code> list from <a href="02-fingerprint.md">post 2</a>:</p><table><thead><tr><th>Set operation</th><th>Signal name</th><th>Severity</th></tr></thead><tbody><tr><td>in current, not in anchor</td><td><code>template:appeared:&lt;short&gt;</code></td><td>HIGH if <code>count &gt; 10</code>, else MEDIUM</td></tr><tr><td>in anchor, not in current</td><td><code>template:disappeared:&lt;short&gt;</code></td><td>MEDIUM</td></tr><tr><td>in both, frequency shifted ≥ 50%</td><td><code>template:shifted:&lt;short&gt;</code></td><td>derived from delta</td></tr></tbody></table><p>The <code>&lt;short&gt;</code> is a stable id.
It’s the first 32 chars of the template plus a 6-char MD5 suffix:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">_short</span>(<span class="params">template: <span class="built_in">str</span>, n: <span class="built_in">int</span> = <span class="number">32</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    suffix = hashlib.md5(</span><br><span class="line">        template.encode(<span class="string">&quot;utf-8&quot;</span>, errors=<span class="string">&quot;replace&quot;</span>),</span><br><span class="line">        usedforsecurity=<span class="literal">False</span>,</span><br><span class="line">    ).hexdigest()[:<span class="number">6</span>]</span><br><span class="line">    head = template[:n] <span class="keyword">if</span> <span class="built_in">len</span>(template) &lt;= n <span class="keyword">else</span> template[:n] + <span class="string">&quot;...&quot;</span></span><br><span class="line">    <span class="keyword">return</span> <span class="string">f&quot;<span class="subst">&#123;head&#125;</span>#<span class="subst">&#123;suffix&#125;</span>&quot;</span></span><br></pre></td></tr></table></figure><p>Two distinct templates that share a 32-char prefix used to collapse into the same signal name (and therefore the same learned weight). The MD5 suffix fixes that without losing the human-readable head. (<code>usedforsecurity=False</code> placates Bandit; MD5 here is a hash, not a crypto primitive.)</p><h3 id="Metric-diff"><a href="#Metric-diff" class="headerlink" title="Metric diff"></a>Metric diff</h3><p>For each metric named in <code>--metric latency_ms</code>, we already
captured p50&#x2F;p95&#x2F;p99 in the fingerprint. The diff compares each percentile individually:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> pct <span class="keyword">in</span> (<span class="string">&quot;p50&quot;</span>, <span class="string">&quot;p95&quot;</span>, <span class="string">&quot;p99&quot;</span>):</span><br><span class="line">    a_val = <span class="built_in">getattr</span>(a_stats, pct)</span><br><span class="line">    c_val = <span class="built_in">getattr</span>(c_stats, pct)</span><br><span class="line">    delta = _pct_change(a_val, c_val)</span><br><span class="line">    <span class="keyword">if</span> delta <span class="keyword">is</span> <span class="literal">None</span> <span class="keyword">or</span> <span class="built_in">abs</span>(delta) &lt; LOW_DELTA:</span><br><span class="line">        <span class="keyword">continue</span></span><br><span class="line">    out.append(DiffEntry(signal=<span class="string">f&quot;metric:<span class="subst">&#123;name&#125;</span>:<span class="subst">&#123;pct&#125;</span>&quot;</span>, ...))</span><br></pre></td></tr></table></figure><p>That <code>&lt; LOW_DELTA</code> (50%) filter is intentional.
A p95 that moved 12% is statistical noise on a one-day window; we don’t want to fill the top diffs with it.</p><h2 id="The-weights-how-Anchor-learns"><a href="#The-weights-how-Anchor-learns" class="headerlink" title="The weights: how Anchor learns"></a>The weights: how Anchor learns</h2><p>Three constants govern the entire feedback loop (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/memory.py"><code>memory.py</code></a>):</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">WEIGHT_DELTA = <span class="number">0.1</span>     <span class="comment"># +0.1 on confirmed, -0.2 on false_positive</span></span><br><span class="line">WEIGHT_MIN   = <span class="number">0.1</span>     <span class="comment"># never zero: a signal can always recover</span></span><br><span class="line">WEIGHT_MAX   = <span class="number">3.0</span>     <span class="comment"># never dominant: diversity matters</span></span><br></pre></td></tr></table></figure><p>When you run <code>anchor feedback &lt;id&gt; --outcome resolved</code>, every signal in that drift’s <code>top_diffs</code> gets <code>weight += 0.1</code>. On <code>--outcome false_positive</code>, every signal gets <code>weight -= 0.2</code> (the asymmetry is deliberate: false positives are more painful than missed catches, so the penalty bites harder).</p><p>That alone would be enough to <em>learn</em>. The harder problem is <strong>forgetting</strong>.</p><h2 id="Timely-forgetting-weights-decay-halfway-every-30-days"><a href="#Timely-forgetting-weights-decay-halfway-every-30-days" class="headerlink" title="Timely forgetting: weights decay halfway every 30 days"></a>Timely forgetting: weights decay halfway every 30 days</h2><p>The Track-1 hackathon requirement says <em>“timely forgetting of outdated information”</em>.
Anchor implements that as exponential decay toward the neutral value 1.0:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">DECAY_HALF_LIFE_DAYS = <span class="number">30.0</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">decay_weights</span>(<span class="params">now, half_life_days=DECAY_HALF_LIFE_DAYS</span>):</span><br><span class="line">    skip_cutoff = now - timedelta(hours=DECAY_SKIP_RECENT_HOURS)</span><br><span class="line">    <span class="keyword">for</span> d <span class="keyword">in</span> kv_all(<span class="string">&quot;signal_weights&quot;</span>):</span><br><span class="line">        w = SignalWeight.model_validate(d)</span><br><span class="line">        <span class="keyword">if</span> w.last_updated <span class="keyword">and</span> w.last_updated &gt; skip_cutoff:</span><br><span class="line">            <span class="keyword">continue</span>                                <span class="comment"># too fresh, don&#x27;t decay</span></span><br><span class="line">        age_days = (now - w.last_updated).total_seconds() / <span class="number">86400.0</span></span><br><span class="line">        factor = <span class="number">0.5</span> ** (age_days / half_life_days)</span><br><span class="line">        new_weight = <span class="number">1.0</span> + (w.weight - <span class="number">1.0</span>) * factor</span><br><span class="line">        ...</span><br></pre></td></tr></table></figure><p>Read that line by
line:</p><ul><li><strong><code>factor = 0.5 ** (age_days / half_life_days)</code></strong>: classic half-life. After 30 idle days, factor is 0.5. After 60 days, 0.25. After 90 days, 0.125.</li><li><strong><code>new_weight = 1.0 + (w.weight - 1.0) * factor</code></strong>: pulls the weight <em>toward</em> 1.0, never past it. A weight at 1.5 decays to 1.25 at 30 days, 1.125 at 60 days, etc.</li><li><strong><code>w.last_updated &gt; skip_cutoff</code></strong>: the 24-hour grace window prevents a freshly-confirmed signal from being immediately washed out by decay on the next compare.</li></ul><p>There’s a subtle invariant in the caller (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/agent.py"><code>agent.compare</code></a>) worth flagging:</p><blockquote><p><em>Always call <code>get_weights()</code> BEFORE <code>bump_appearance()</code>.</em></p></blockquote><p><code>get_weights()</code> triggers decay-and-write. <code>bump_appearance()</code> then writes appearance counters. If you reverse them, you’d overwrite the decayed <code>weight</code> value with a stale snapshot. The docstring on <code>bump_appearance</code> calls this out explicitly because it’s the kind of bug a future refactor would silently reintroduce.</p><h2 id="A-small-operational-detail"><a href="#A-small-operational-detail" class="headerlink" title="A small operational detail"></a>A small operational detail</h2><p>A previous schema didn’t have <code>last_updated</code>. Rows written under that schema can’t decay (we don’t know when they were last touched).
Rather than fabricate a date, <code>decay_weights</code> counts them and emits a one-shot breadcrumb to stderr:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">anchor: 4 signal_weights row(s) have no `last_updated` and will not</span><br><span class="line">decay; run `anchor feedback` on the corresponding signal once to backfill.</span><br></pre></td></tr></table></figure><p>The first <code>feedback</code> call on each backfills <code>last_updated</code>. After that they participate in decay like everyone else. No migration script needed: the system heals itself in normal use.</p><h2 id="What-this-looks-like-to-the-engineer"><a href="#What-this-looks-like-to-the-engineer" class="headerlink" title="What this looks like to the engineer"></a>What this looks like to the engineer</h2><p>Run <code>anchor learned</code> to see the current weight table sorted by deviation from 1.0:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">SIGNAL                                       WEIGHT  CONFIRMED  FALSE_POS  LAST_UPDATED</span><br><span class="line">template:appeared:PaymentGatewayTimeout #4a  2.10    9          0          2026-06-12 14:30Z</span><br><span class="line">metric:latency_ms:p95                        1.45    4          0          2026-06-14 09:11Z</span><br><span class="line">template:shifted:GC_pause_long #d2           0.62    0          3          2026-06-08 22:04Z</span><br><span class="line">template:appeared:DebugLogEntry #91          0.10    0          7          2026-05-29 17:55Z</span><br></pre></td></tr></table></figure><p>That table is the system’s memory in human-readable form.
The first two are <em>learned signal: pay attention</em>. The last two are <em>learned noise: please stop alerting on this</em>. Without decay, the noise rows would stay at 0.1 forever even after the underlying issue is fixed. With decay, they’ll drift back toward 1.0 over a few months, and the next time the same template legitimately appears, the engineer’s feedback re-biases it from scratch.</p><h2 id="Why-this-matters-for-the-LLM"><a href="#Why-this-matters-for-the-LLM" class="headerlink" title="Why this matters for the LLM"></a>Why this matters for the LLM</h2><p>The narrator in <a href="04-narrator-llm-at-edge.md">post 4</a> only sees the top 15 diffs (<code>diff_all(..., limit=15)</code>). So the <em>ranking</em> is the most consequential piece of state in the whole pipeline. Get the ranking right and the LLM has a fighting chance. Get it wrong (by leaving weights flat at 1.0 forever, say) and Qwen ends up narrating noise.</p><p>The weight table <em>is</em> the system getting smarter across sessions.</p>]]>
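<![CDATA[<p>The ranking, feedback, and decay arithmetic from this post can be checked with a minimal sketch. This is not the code in <code>diff.py</code> or <code>memory.py</code>: the names <code>rank_score</code>, <code>apply_feedback</code>, <code>decay</code>, and <code>FEEDBACK_STEP</code> are hypothetical, the constants are copied from the excerpts above, and clamping inside the feedback step is an assumption about where the floor and cap are applied.</p>
<pre>
```python
from typing import Optional

# Constants copied from the excerpts in this post.
SEV_ORDER = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}
WEIGHT_MIN, WEIGHT_MAX = 0.1, 3.0
DECAY_HALF_LIFE_DAYS = 30.0
# Hypothetical lookup table: +0.1 when confirmed, -0.2 on a false positive.
FEEDBACK_STEP = {"resolved": 0.1, "false_positive": -0.2}


def rank_score(severity: str, weight: float, delta_pct: Optional[float]) -> float:
    """Severity tier times learned weight, plus a small magnitude tiebreaker."""
    mag = abs(delta_pct or 0.0) / 100.0
    return SEV_ORDER[severity] * weight + mag * 0.01


def apply_feedback(weight: float, outcome: str) -> float:
    """Nudge a weight by the feedback step, then clamp to the floor and cap."""
    stepped = weight + FEEDBACK_STEP[outcome]  # KeyError on an unknown outcome
    return min(WEIGHT_MAX, max(WEIGHT_MIN, stepped))


def decay(weight: float, age_days: float,
          half_life_days: float = DECAY_HALF_LIFE_DAYS) -> float:
    """Pull a weight toward the neutral value 1.0 with a fixed half-life."""
    factor = 0.5 ** (age_days / half_life_days)
    return 1.0 + (weight - 1.0) * factor


if __name__ == "__main__":
    # MEDIUM at w=1.5 and HIGH at w=1.0 share base * w == 3.0,
    # so only the magnitude term separates them.
    print(rank_score("MEDIUM", 1.5, 80.0), rank_score("HIGH", 1.0, 60.0))
    # A boosted weight and a suppressed weight after 30, 60, and 90 idle days.
    for days in (30, 60, 90):
        print(days, decay(1.5, days), decay(0.1, days))
```
</pre>
<p>The sketch makes one property of the formula easy to see: a MEDIUM signal needs a weight above 1.5 to outrank an unweighted HIGH signal regardless of magnitude, because at exactly 1.5 the two tie on <code>base * w</code>.</p>]]>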
      </content:encoded>
    </item>
    <item>
      <title>The Fingerprint: Turning a Healthy Week into a Row in KV Store</title>
      <link>https://faketut.github.io/2026/06/15/anchor-02-fingerprint/</link>
      <description>
        <![CDATA[<p>When you run</p>
<figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</s]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/anchor/">anchor</category>
      <category domain="https://faketut.github.io/tags/splunk/">splunk</category>
      <category domain="https://faketut.github.io/tags/sre/">sre</category>
      <category domain="https://faketut.github.io/tags/observability/">observability</category>
      <pubDate>Mon, 15 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>When you run</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">anchor capture --name <span class="string">&quot;Healthy Week&quot;</span> \</span><br><span class="line">  --from 2026-05-20T00:00:00 --to 2026-05-27T00:00:00 \</span><br><span class="line">  --index main --metric latency_ms</span><br></pre></td></tr></table></figure><p>…the CLI does two things: it runs <strong>five SPL queries</strong> against Splunk to characterize the window, and it writes <strong>one document</strong> into the <code>anchors</code> KV Store collection. This post unpacks both halves.</p><h2 id="What’s-in-a-fingerprint"><a href="#What’s-in-a-fingerprint" class="headerlink" title="What’s in a fingerprint"></a>What’s in a fingerprint</h2><p>The <code>Fingerprint</code> model (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/models.py"><code>models.py</code></a>) carries five fields:</p><table><thead><tr><th>Field</th><th>What it captures</th><th>SPL flavor</th></tr></thead><tbody><tr><td><code>event_volume</code></td><td>per-sourcetype counts, total, hourly profile</td><td><code>stats count by sourcetype</code> + <code>bin _time span=1h</code></td></tr><tr><td><code>log_patterns</code></td><td>top-N “shape” buckets via Splunk’s built-in <code>punct</code> field</td><td><code>stats count, values(_raw) by punct | sort -count | head 50</code></td></tr><tr><td><code>error_rates</code></td><td>error &#x2F; warn &#x2F; info ratio per sourcetype</td><td><code>eval _lvl=case(...)
| stats sum(eval(...))</code></td></tr><tr><td><code>key_metrics</code></td><td>p50&#x2F;p95&#x2F;p99&#x2F;mean&#x2F;stddev for named numeric fields</td><td><code>stats perc50(x) as x_p50, perc95(x) as x_p95, ...</code></td></tr><tr><td><code>top_hosts</code></td><td>top-20 hosts by event count</td><td><code>top limit=20 host</code></td></tr></tbody></table><p>That’s deliberately a <em>small</em> feature set. Anchor isn’t trying to be an ML platform; it’s trying to capture the cheapest possible summary that still discriminates <em>“yesterday looked like the baseline”</em> from <em>“yesterday is different and here’s how”</em>.</p><h2 id="Why-punct-instead-of-clustering"><a href="#Why-punct-instead-of-clustering" class="headerlink" title="Why punct instead of clustering"></a>Why <code>punct</code> instead of clustering</h2><p>The cheapest log-template proxy in Splunk is the built-in <code>punct</code> field: it’s the punctuation skeleton of the event, computed at index time. <code>[ERROR] payment 4xx upstream svc=stripe id=...</code> and the same line with a different request id collapse to the same <code>punct</code>.
No clustering library, no Levenshtein, no LLM call.</p><p>That decision shows up directly in the SPL builder (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/fingerprint.py"><code>fingerprint.py</code></a>):</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">_spl_patterns</span>(<span class="params">scope: Scope</span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    base = _index_filter(scope)</span><br><span class="line">    <span class="keyword">return</span> (</span><br><span class="line">        <span class="string">f&quot;<span class="subst">&#123;base&#125;</span> | eval _punct=if(isnull(punct),\&quot;&lt;none&gt;\&quot;,punct) &quot;</span></span><br><span class="line">        <span class="string">f&quot;| stats count, values(sourcetype) as sourcetype, &quot;</span></span><br><span class="line">        <span class="string">f&quot;       values(_raw) as examples by _punct &quot;</span></span><br><span class="line">        <span class="string">f&quot;| sort -count | head 50 &quot;</span></span><br><span class="line">        <span class="string">f&quot;| eval example=mvindex(examples,0), &quot;</span></span><br><span class="line">        <span class="string">f&quot;       sourcetype=mvindex(sourcetype,0)&quot;</span></span><br><span class="line">    )</span><br></pre></td></tr></table></figure><p>The <code>head 50</code> cap is intentional: we want the top-N representative patterns, not every long-tail one-off.
If a new pattern enters the top-50 in a future window, that’s a <code>template:appeared:...</code> signal in <a href="03-diff-and-weights.md">post 3</a>’s diff engine. If a known pattern <em>falls out</em> of the top-50, that’s <code>template:disappeared:...</code>.</p><h2 id="The-trust-boundary-SPL-injection"><a href="#The-trust-boundary-SPL-injection" class="headerlink" title="The trust boundary: SPL injection"></a>The trust boundary: SPL injection</h2><p>The CLI accepts <code>--index foo --sourcetype bar --metric x</code>. Those tokens get spliced into SPL strings. That’s exactly the place a malicious value like <code>&#39;foo;|delete&#39;</code> could try to escape the search context.</p><p>Defense in depth: a whitelist of safe identifier characters, applied to every token before it touches SPL:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">_TOKEN_RE = re.<span class="built_in">compile</span>(<span class="string">r&quot;^[A-Za-z0-9_*\-]+$&quot;</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">_safe_token</span>(<span class="params">s: <span class="built_in">str</span>, kind: <span class="built_in">str</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> _TOKEN_RE.fullmatch(s):  <span class="comment"># fullmatch also rejects a trailing newline, which $ alone lets through</span></span><br><span class="line">        <span class="keyword">raise</span> ValueError(<span class="string">f&quot;unsafe <span class="subst">&#123;kind&#125;</span> token: <span class="subst">&#123;s!r&#125;</span>&quot;</span>)</span><br><span class="line">    <span class="keyword">return</span> s</span><br></pre></td></tr></table></figure><p>The CLI is the trust boundary, but a defense-in-depth whitelist costs two lines and closes a footgun.</p><h2 id="From-Fingerprint-to-KV-row"><a href="#From-Fingerprint-to-KV-row" class="headerlink" title="From Fingerprint to KV row"></a>From <code>Fingerprint</code> to KV row</h2><p>The persistence layer (<a href="https://github.com/faketut/Anchor/blob/main/src/anchor/memory.py"><code>memory.py</code></a>) wraps the fingerprint in an <code>Anchor</code> envelope, assigns a UUID, and writes one document:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">save_anchor</span>(<span class="params">name, start, end, scope, fp</span>) -&gt; Anchor:</span><br><span class="line">    ensure_collections()</span><br><span class="line">    anchor = Anchor(</span><br><span class="line">        <span class="built_in">id</span>=<span class="built_in">str</span>(uuid.uuid4()),</span><br><span class="line">        name=name,</span><br><span class="line">        created_at=datetime.now(timezone.utc),</span><br><span class="line">        created_by=getpass.getuser(),</span><br><span class="line">        time_range=TimeRange(start=start, end=end),</span><br><span class="line">        scope=scope,</span><br><span class="line">        fingerprint=fp,</span><br><span class="line">    )</span><br><span class="line">    doc = 
json.loads(anchor.model_dump_json())</span><br><span class="line">    doc[<span class="string">&quot;_key&quot;</span>] = anchor.<span class="built_in">id</span></span><br><span class="line">    kv_insert(<span class="string">&quot;anchors&quot;</span>, doc)</span><br><span class="line">    <span class="keyword">return</span> anchor</span><br></pre></td></tr></table></figure><p>Two small things worth noting:</p><ol><li><strong><code>ensure_collections()</code> is idempotent.</strong> The first <code>anchor capture</code> on a fresh Splunk creates the three collections; subsequent calls are a no-op. This is what makes the <code>setup_ecs.sh</code> install in <a href="06-deploy-alibaba-cloud.md">post 6</a> survive re-runs.</li><li><strong><code>_key = anchor.id</code>.</strong> KV Store auto-assigns a key if you don’t, but we want the UUID to <em>be</em> the key so <code>kv_get(&quot;anchors&quot;, id)</code> is a direct lookup rather than a query.</li></ol><h2 id="What-an-anchor-looks-like-in-JSON"><a href="#What-an-anchor-looks-like-in-JSON" class="headerlink" title="What an anchor looks like in JSON"></a>What an anchor looks like in JSON</h2><p>Roughly:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;_key&quot;</span><span class="punctuation">:</span> <span class="string">&quot;8d3a...&quot;</span><span 
class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;id&quot;</span><span class="punctuation">:</span>   <span class="string">&quot;8d3a...&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;name&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Healthy Week&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;created_at&quot;</span><span class="punctuation">:</span> <span class="string">&quot;2026-05-27T18:42:11Z&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;created_by&quot;</span><span class="punctuation">:</span> <span class="string">&quot;fenjian&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;time_range&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span><span class="attr">&quot;start&quot;</span><span class="punctuation">:</span> <span class="string">&quot;2026-05-20T00:00:00Z&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;end&quot;</span><span class="punctuation">:</span> <span class="string">&quot;2026-05-27T00:00:00Z&quot;</span><span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;scope&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span><span class="attr">&quot;indexes&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="string">&quot;main&quot;</span><span class="punctuation">]</span><span class="punctuation">,</span> <span class="attr">&quot;sourcetypes&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">]</span><span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">  <span 
class="attr">&quot;fingerprint&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;event_volume&quot;</span><span class="punctuation">:</span>  <span class="punctuation">&#123;</span><span class="attr">&quot;per_source&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span><span class="attr">&quot;_json&quot;</span><span class="punctuation">:</span> <span class="number">412380</span><span class="punctuation">&#125;</span><span class="punctuation">,</span> <span class="attr">&quot;total&quot;</span><span class="punctuation">:</span> <span class="number">412380</span><span class="punctuation">,</span> <span class="attr">&quot;hourly_profile&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span>...<span class="punctuation">]</span><span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;log_patterns&quot;</span><span class="punctuation">:</span>  <span class="punctuation">[</span><span class="punctuation">&#123;</span><span class="attr">&quot;template&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;frequency_pct&quot;</span><span class="punctuation">:</span> <span class="number">18.4</span><span class="punctuation">,</span> <span class="attr">&quot;count&quot;</span><span class="punctuation">:</span> <span class="number">75880</span><span class="punctuation">,</span> ...<span class="punctuation">&#125;</span><span class="punctuation">,</span> ...<span class="punctuation">]</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;error_rates&quot;</span><span class="punctuation">:</span>   <span class="punctuation">&#123;</span><span class="attr">&quot;_json&quot;</span><span class="punctuation">:</span> 
<span class="punctuation">&#123;</span><span class="attr">&quot;error_count&quot;</span><span class="punctuation">:</span> <span class="number">142</span><span class="punctuation">,</span> <span class="attr">&quot;warn_count&quot;</span><span class="punctuation">:</span> <span class="number">503</span><span class="punctuation">,</span> <span class="attr">&quot;total&quot;</span><span class="punctuation">:</span> <span class="number">412380</span><span class="punctuation">&#125;</span><span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;key_metrics&quot;</span><span class="punctuation">:</span>   <span class="punctuation">&#123;</span><span class="attr">&quot;latency_ms&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span><span class="attr">&quot;p50&quot;</span><span class="punctuation">:</span> <span class="number">78.1</span><span class="punctuation">,</span> <span class="attr">&quot;p95&quot;</span><span class="punctuation">:</span> <span class="number">312.4</span><span class="punctuation">,</span> ...<span class="punctuation">&#125;</span><span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">    <span class="attr">&quot;top_hosts&quot;</span><span class="punctuation">:</span>     <span class="punctuation">[</span><span class="punctuation">&#123;</span><span class="attr">&quot;host&quot;</span><span class="punctuation">:</span> <span class="string">&quot;checkout-7d4b...&quot;</span><span class="punctuation">,</span> <span class="attr">&quot;event_count&quot;</span><span class="punctuation">:</span> <span class="number">41280</span><span class="punctuation">&#125;</span><span class="punctuation">,</span> ...<span class="punctuation">]</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span 
class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>You can inspect this in the Splunk Web UI under <em>Settings → Lookups →KV store collections → <code>anchors</code></em>. Useful sanity check on firstinstall.</p><h2 id="What-this-buys-us"><a href="#What-this-buys-us" class="headerlink" title="What this buys us"></a>What this buys us</h2><p>Two superpowers, one each for the next two posts:</p><ul><li><strong>Post 3</strong> — every later <code>compare</code> re-runs the same five SPL querieson a <em>different</em> window, produces a second <code>Fingerprint</code>, and thediff engine subtracts the two. Pure functions, no LLM.</li><li><strong>Post 4</strong> — when the LLM eventually does see the data, it sees the<em>ranked diff</em>, not raw logs. That keeps the prompt small and thecost bounded.</li></ul><h2 id="What-we-didn’t-include-and-why"><a href="#What-we-didn’t-include-and-why" class="headerlink" title="What we didn’t include (and why)"></a>What we didn’t include (and why)</h2><ul><li><strong>No raw events.</strong> Anchor stores statistics, not log payloads. PIIstays in your indexers; the fingerprint is safe to ship anywhere.</li><li><strong>No embeddings on the anchor itself.</strong> We embed <em>signals</em> (post 3),not raw text. One embedding per drift, not per event.</li><li><strong>No “trends”.</strong> A baseline is a single window. If you want tocapture weekly seasonality, capture multiple baselines and pick theone whose <code>scope</code> matches the compare window. Simpler thangeneralizing.</li></ul>]]>
      </content:encoded>
    </item>
    <item>
      <title>Why a MemoryAgent for on-call</title>
      <link>https://faketut.github.io/2026/06/14/anchor-01-why-memoryagent/</link>
      <description>
        <![CDATA[<h2 id="The-2-a-m-problem"><a href="#The-2-a-m-problem" class="headerlink" title="The 2 a.m. problem"></a>The 2 a.m. problem</h2><p>Every on]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/observability/">observability</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <category domain="https://faketut.github.io/tags/anchor/">anchor</category>
      <category domain="https://faketut.github.io/tags/splunk/">splunk</category>
      <category domain="https://faketut.github.io/tags/sre/">sre</category>
      <category domain="https://faketut.github.io/tags/observability/">observability</category>
      <pubDate>Sun, 14 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<h2 id="The-2-a-m-problem"><a href="#The-2-a-m-problem" class="headerlink" title="The 2 a.m. problem"></a>The 2 a.m. problem</h2><p>Every on-call engineer has had this moment: pager goes off, you open Splunk, you stare at a wall of graphs, and the first 15–20 minutes evaporate into the same question:</p><blockquote><p><em>“Wait — what does normal even look like for this service?”</em></p></blockquote><p>You’d think tools would solve this by now. They don’t, because they solve adjacent problems:</p><ul><li><strong>Anomaly detection</strong> trains on a sliding window of recent history. If your service has been quietly degrading for a week, “recent” is already drifted; the model thinks today’s badness is normal.</li><li><strong>LLM chatbots</strong> answer <em>“is this weird?”</em> once, then forget. The next compare starts from zero.</li><li><strong>Static dashboards</strong> show you the numbers but don’t say what’s <em>different</em>. You’re still the one doing the diff in your head.</li></ul><p>Anchor’s bet is that what an SRE actually wants is closer to <code>git diff</code> than to <code>kibana --auto-detect</code>. Pick a reference state, compare a window against it, and get a <em>narrative</em> about the delta — not just the delta itself.</p><h2 id="The-three-memories"><a href="#The-three-memories" class="headerlink" title="The three memories"></a>The three memories</h2><p>For that narrative to get better over time, the agent has to remember three things. Each lives in a separate <a href="https://docs.splunk.com/Documentation/Splunk/latest/Admin/AboutKVstore">Splunk KV Store</a> collection:</p><table><thead><tr><th>Memory</th><th>Collection</th><th>What it does</th></tr></thead><tbody><tr><td>What “healthy” looked like</td><td><code>anchors</code></td><td>A human-curated baseline. Survives raw-log retention. 
Diff against this, not against yesterday.</td></tr><tr><td>Which signals actually matter</td><td><code>signal_weights</code></td><td>Re-ranks diffs by accumulated feedback. Confirmed signals weigh more; false positives weigh less.</td></tr><tr><td>What we did about it last time</td><td><code>drift_history</code></td><td>Every past compare, with engineer-confirmed reasons attached. Recall the most similar one on every new compare.</td></tr></tbody></table><p>This is the MemoryAgent loop in one sentence:</p><blockquote><p><em>Each <code>compare</code> reads <code>signal_weights</code> (learned ranking) and <code>drift_history</code> (recalled past incidents) before calling the LLM, then writes a new drift record. Each <code>feedback</code> updates <code>signal_weights</code>.</em></p></blockquote><h2 id="Where-the-LLM-fits"><a href="#Where-the-LLM-fits" class="headerlink" title="Where the LLM fits"></a>Where the LLM fits</h2><p>The LLM is <em>not</em> the decision layer. Look at the compare lifecycle:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">CLI → KV: load anchor + weights (apply decay)</span><br><span class="line">CLI → Splunk: SPL (fingerprint queries)</span><br><span class="line">CLI → CLI: diff + rank (severity × weight)         ← deterministic</span><br><span class="line">CLI → KV: recall top-3 similar past drifts          ← deterministic</span><br><span class="line">CLI → Qwen: ranked diffs + past incidents</span><br><span class="line">Qwen → CLI: summary + hypothesis + SPL             ← LLM only here</span><br><span class="line">CLI → KV: save new drift record</span><br></pre></td></tr></table></figure><p>By the time Qwen sees the request, the data is already 
structured, ranked, and accompanied by precedent. The LLM’s job is <em>narration</em>, not detection. That keeps the system:</p><ul><li><strong>Reproducible.</strong> The same window always produces the same top diffs.</li><li><strong>Cheap.</strong> One Qwen call per investigation, not per data point.</li><li><strong>Debuggable.</strong> When a hypothesis is wrong, you can inspect the ranked diffs and decide whether the diff engine or the LLM was the weak link.</li></ul><p>We’ll come back to this in <a href="04-narrator-llm-at-edge.md">post 4</a>.</p><h2 id="Why-on-top-of-Splunk"><a href="#Why-on-top-of-Splunk" class="headerlink" title="Why on top of Splunk?"></a>Why on top of Splunk?</h2><p>Three pragmatic reasons:</p><ol><li><strong>Most SREs already have it.</strong> Anchor doesn’t ship a new database; it uses KV Store, which ships with Splunk. No Lambda, no VPC, no extra monthly bill.</li><li><strong>SPL is already the lingua franca</strong> for “show me events with these shapes in this window”. The fingerprint extractor in <a href="https://github.com/faketut/Anchor/blob/main/src/anchor/fingerprint.py"><code>fingerprint.py</code></a> is, fundamentally, five SPL queries.</li><li><strong>KV Store survives log retention.</strong> Your raw logs roll off in 90 days; your healthy anchor doesn’t.</li></ol><p>We’ll look at how a single <code>anchor capture</code> call becomes one row in KV Store in <a href="02-fingerprint.md">post 2</a>.</p><h2 id="The-hackathon-framing-briefly"><a href="#The-hackathon-framing-briefly" class="headerlink" title="The hackathon framing (briefly)"></a>The hackathon framing (briefly)</h2><p>Anchor was built for the <em>Qwen Cloud × Splunk</em> hackathon’s MemoryAgent track. 
The track asks for these specific properties:</p><table><thead><tr><th>Track-1 requirement</th><th>Anchor implementation</th></tr></thead><tbody><tr><td>Persistent memory</td><td>three KV Store collections, nightly OSS backups</td></tr><tr><td>Accumulates experience</td><td><code>apply_feedback()</code> mutates <code>signal_weights</code></td></tr><tr><td>Better decisions across sessions</td><td><code>diff_all()</code> ranks by <code>severity × weight</code></td></tr><tr><td>Timely forgetting</td><td><code>decay_weights()</code> pulls weights halfway to 1.0 every 30 days</td></tr><tr><td>Bounded recall under context limit</td><td><code>recall_similar_drifts()</code> returns top-3</td></tr></tbody></table><p>Posts 3 and 5 dig into the math behind two of those — <em>timely forgetting</em> and <em>bounded recall</em>.</p><h2 id="What-you’ll-get-from-the-rest-of-the-series"><a href="#What-you’ll-get-from-the-rest-of-the-series" class="headerlink" title="What you’ll get from the rest of the series"></a>What you’ll get from the rest of the series</h2><ul><li><strong><a href="02-fingerprint.md">Post 2</a></strong> — how five SPL queries become a <code>Fingerprint</code> object and one KV row.</li><li><strong><a href="03-diff-and-weights.md">Post 3</a></strong> — the diff engine and the decay-toward-1.0 trick that lets the agent forget on a schedule.</li><li><strong><a href="04-narrator-llm-at-edge.md">Post 4</a></strong> — what we send to Qwen, what we get back, and why JSON-mode + low temperature beat free-text.</li><li><strong><a href="05-planner-react-loop.md">Post 5</a></strong> — the optional <code>--deep</code> function-calling planner, with a real transcript.</li><li><strong><a href="06-deploy-alibaba-cloud.md">Post 6</a></strong> — three commands to bring the backend up on Alibaba Cloud ECS, with OSS backups.</li></ul>]]>
      </content:encoded>
    </item>
    <item>
      <title>Why I Wrote My Paper in Typst Instead of LaTeX</title>
      <link>https://faketut.github.io/2026/06/08/qmj-06-typst-instead-of-latex/</link>
      <description>
        <![CDATA[<p>The QMJ-TSX working paper is written in <a href="https://typst.app/">Typst</a>,
not LaTeX. I expected this to be a minor technical choice]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/tooling/">tooling</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/typst/">typst</category>
      <category domain="https://faketut.github.io/tags/tooling/">tooling</category>
      <category domain="https://faketut.github.io/tags/latex/">latex</category>
      <category domain="https://faketut.github.io/tags/writing/">writing</category>
      <category domain="https://faketut.github.io/tags/academic-writing/">academic-writing</category>
      <pubDate>Mon, 08 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The QMJ-TSX working paper is written in <a href="https://typst.app/">Typst</a>, not LaTeX. I expected this to be a minor technical choice and got mildly surprised by how much it changed the writing experience. This is a short post on what I gained, what I gave up, and when I’d make the same choice again.</p><h2 id="The-short-version"><a href="#The-short-version" class="headerlink" title="The short version"></a>The short version</h2><p>Typst is a modern typesetting system in the LaTeX tradition: same problem (turn marked-up text into a beautiful PDF), same audience (scientific writing), incomparably better tooling underneath. It compiles in milliseconds instead of seconds, the error messages point at the line you actually wrote, and the source file looks like Markdown with a stricter dialect rather than like a 1980s macro language.</p><p>For a solo working paper with a handful of tables, figures, and references, the answer is: just use it.</p><h2 id="What-I-gained"><a href="#What-I-gained" class="headerlink" title="What I gained"></a>What I gained</h2><p><strong>Compile speed.</strong> <code>typst compile paper/main.typ</code> runs in roughly 100ms on this project. <code>typst watch</code> recompiles on save with no human-perceptible latency. Writing with a live preview pane next to the source file is the closest I have ever come to “writing a paper feels like writing code.” I edited a sentence, glanced right, kept typing. 
That feedback loop matters more than I’d have guessed.</p><p><strong>Readable source.</strong> The <code>paper/main.typ</code> driver fits on a screen. A section file looks like this:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">= Results &lt;sec:results&gt;</span><br><span class="line"></span><br><span class="line">== Replication baseline (AQR QMJ-Canada)</span><br><span class="line">...the Sharpe ratio is $0.64$. The maximum drawdown is $-37.0%$.</span><br><span class="line">A Carhart-style four-factor regression... yields a monthly alpha of</span><br><span class="line">$0.70%$ ($t = 4.46$).</span><br></pre></td></tr></table></figure><p>Compare to the same content in LaTeX, where the equivalent prose is interrupted by <code>\section{}</code>, <code>\label{}</code>, <code>\textbf{}</code>, <code>$\backslash$</code>, and <code>\%</code> everywhere. Typst’s <code>=</code> for headings, <code>*bold*</code>, and bare <code>$math$</code> get out of the way of the words.</p><p><strong>Error messages that point at the line you wrote.</strong> LaTeX errors are famously cryptic because the macro expansion has already happened by the time the compiler complains. Typst errors say “there is a problem on line 47 of <code>results.typ</code>, here’s the offending token.” This is not a small thing when you are debugging at 11pm the night before a deadline.</p><p><strong>A real module system.</strong> <code>paper/main.typ</code> just <code>#include</code>s <code>sections/*.typ</code> and <code>tables/*.typ</code>. Each table lives in its own file. There is no <code>\input</code> weirdness, no preamble bloat, no fragile <code>\newcommand</code> resolution order. 
The project structure mirrors what I would do in any other codebase.</p><p><strong>Native programmability without TeX-flavoured pain.</strong> Typst is a real expression language. Generating a parameterised table or a caption from a value is <code>#let x = 0.64 ... #x</code> rather than <code>\def\x{0.64} ... \x</code>. I don’t lean on this much in this paper, but the headroom is there if I want to wire numerical outputs from the Python pipeline straight into the paper later.</p><p><strong>Single binary, no distribution.</strong> <code>brew install typst</code> and you’re done. No 4 GB MacTeX install, no <code>tlmgr update --self --all</code>, no fighting with which TeX distribution shipped which version of which package. The <code>Makefile</code>’s <code>make paper</code> target is one line.</p><h2 id="What-I-gave-up"><a href="#What-I-gave-up" class="headerlink" title="What I gave up"></a>What I gave up</h2><p><strong>Citation styles.</strong> Typst’s bibliography system handles Chicago author-date out of the box (which is what I use), but if your journal requires an obscure custom <code>.bst</code> file, you may still want LaTeX. Less of an issue for working papers than for journal submissions.</p><p><strong>Journal templates.</strong> Many journals provide LaTeX templates and no Typst equivalent. Not a constraint for a working paper hosted on a personal site, but if you are submitting to JFE on day one, this matters.</p><p><strong>Ecosystem maturity.</strong> TikZ, pgfplots, and the long tail of LaTeX packages have no Typst equivalent yet. For a paper with simple tables and externally generated PDF figures (as mine is), this never bit me. 
For a paper that relies on intricate in-document diagrams, your mileage will vary.</p><h2 id="Should-you-switch"><a href="#Should-you-switch" class="headerlink" title="Should you switch?"></a>Should you switch?</h2><p>A short decision tree:</p><ul><li><strong>Working paper, preprint, personal site, blog series:</strong> switch. The compile speed alone changes how you write.</li><li><strong>Thesis, technical report, internal document:</strong> switch. Same reasons. The tooling pays back the migration cost within a week.</li><li><strong>Journal submission to a venue that mandates a LaTeX template:</strong> stay with LaTeX for the final submission. You can still draft in Typst and port at the end if the prose-to-typesetting ratio justifies it; for shorter papers it usually doesn’t.</li></ul><p>For QMJ-TSX, the calculus was easy: working paper, hosted on GitHub, regenerated by <code>make paper</code> as part of the pipeline, no journal constraints. Typst was strictly better on every axis I cared about.</p><h2 id="One-thing-that-surprised-me"><a href="#One-thing-that-surprised-me" class="headerlink" title="One thing that surprised me"></a>One thing that surprised me</h2><p>I write <em>more</em> in Typst than I did in LaTeX, because revising is cheaper. A revised sentence in LaTeX implies a 2–5 second compile and possibly a chain of <code>\ref</code> warnings to chase. A revised sentence in Typst is invisible — the preview updates as you type. The cost of editing collapsing toward zero changes how willing you are to rewrite a paragraph.</p><p>That alone is probably worth the switch.</p><hr><p><em>Source for the paper: <a href="https://github.com/faketut/qmj-tsx/blob/main/paper/main.typ">paper&#x2F;main.typ on GitHub</a>. Compile with <code>typst compile paper/main.typ</code> or <code>make paper</code>.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>What yfinance Survivorship Does to a TSX Small-Cap Backtest</title>
      <link>https://faketut.github.io/2026/06/08/qmj-07-yfinance-survivorship-tsx/</link>
      <description>
        <![CDATA[<p>In the <a href="https://github.com/faketut/qmj-tsx">QMJ-TSX paper</a> I flag
yfinance survivorship as a likely contaminant of the paper-Q]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/research/">research</category>
      <category domain="https://faketut.github.io/categories/research/data/">data</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/canadian-equities/">canadian-equities</category>
      <category domain="https://faketut.github.io/tags/data-quality/">data-quality</category>
      <category domain="https://faketut.github.io/tags/survivorship-bias/">survivorship-bias</category>
      <category domain="https://faketut.github.io/tags/yfinance/">yfinance</category>
      <category domain="https://faketut.github.io/tags/backtesting/">backtesting</category>
      <category domain="https://faketut.github.io/tags/tsx/">tsx</category>
      <pubDate>Mon, 08 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>In the <a href="https://github.com/faketut/qmj-tsx">QMJ-TSX paper</a> I flag yfinance survivorship as a likely contaminant of the paper-Q extension’s results. This post unpacks what that actually means, why it’s worse for small-caps than for large-caps, and what an honest researcher with a free-data constraint can do about it.</p><p>The short version: if you build a TSX small-cap universe from “tickers that still trade today” and then backtest over 2011–2025, your historical universe is silently missing every delisting, every reverse-merger reshuffle, and every junior-resource zero. Whatever strategy you test will look better than it would have on the real cross-section that was investable in 2011.</p><h2 id="What-yfinance-gives-you"><a href="#What-yfinance-gives-you" class="headerlink" title="What yfinance gives you"></a>What yfinance gives you</h2><p>yfinance is a Yahoo Finance scraper. For any ticker that currently has a Yahoo page (e.g. <code>XYZ.TO</code> for TSX, <code>XYZ.V</code> for TSX Venture), it returns historical daily OHLCV back to the listing date. That is genuinely useful and has the unbeatable property of being free.</p><p>What it does <em>not</em> give you:</p><ol><li><strong>Delisted tickers.</strong> A ticker that traded on TSX in 2014 and delisted in 2017 — for any reason — generally has no Yahoo page today. yfinance returns “no data” or silently skips.</li><li><strong>Reverse-merged or renamed tickers</strong> without a careful symbol trail. A junior that was acquired, reverse-merged, or rolled into a SPAC becomes effectively invisible.</li><li><strong>Point-in-time index membership.</strong> “The TSX small-cap universe in 2014” is not a thing yfinance can tell you. The best you can do is “tickers in the small-cap bucket <em>today</em> that have data going back to 2014,” which is exactly the survivorship trap.</li></ol><p>For US large-caps, the gap between (1)–(3) and reality is small enough that yfinance is a defensible free data source. 
For TSX small&#x2F;mid-caps it is structurally large.</p><h2 id="Why-small-caps-are-the-worst-case"><a href="#Why-small-caps-are-the-worst-case" class="headerlink" title="Why small-caps are the worst case"></a>Why small-caps are the worst case</h2><p>Three reasons compound:</p><p><strong>Base rate of delisting is high.</strong> Junior energy, mining, and biotech names — which dominate the TSX small&#x2F;mid-cap universe — fail at rates that have nothing in common with S&amp;P 500 attrition. A universe drawn from “what survived” is not a random subsample of “what was investable”; it is the right tail.</p><p><strong>Index reconstitutions are large and frequent.</strong> The TSX small&#x2F;mid-cap bucket has meaningful turnover every year. Even if you somehow had a perfect snapshot today, projecting it backward implies an unrealistic constancy of membership.</p><p><strong>The risk being measured is asymmetric.</strong> This is the one that matters most for a Quality &#x2F; Safety-style strategy. A “low volatility” name that quietly delisted in 2018 doesn’t show up in your backtest’s loss distribution. The survivors you do test on include disproportionately many names whose volatility <em>was</em> low <em>because</em> they didn’t blow up. Your backtest gets the reward of defensiveness without paying the tail cost. The whole construct’s premium comes from avoiding tail events, so this is exactly the worst place to have survivorship.</p><h2 id="What-it-does-to-paper-Q-specifically"><a href="#What-it-does-to-paper-Q-specifically" class="headerlink" title="What it does to paper-Q specifically"></a>What it does to paper-Q specifically</h2><p>The paper-Q long-short on TSX small&#x2F;mid-caps over 2011-12 to 2025-11 has a full-sample Sharpe near zero. 
My honest assessment is that the true number — on a survivorship-corrected universe — is probably <em>worse</em>, not better, for a structural reason:</p><ul><li>A defensiveness-tilted long leg benefits most from removing the worst-performing junior resource and biotech names.</li><li>yfinance survivorship removes exactly those names from the universe.</li><li>So the <em>long</em> leg of paper-Q is the part most contaminated by survivorship; the short leg less so.</li><li>Removing survivorship would tend to <em>hurt</em> the long leg’s measured Sharpe and improve the short leg’s.</li></ul><p>Net direction on a long-short: roughly negative. The null result likely understates how badly the strategy actually performs.</p><p>That is an uncomfortable thing to say in a paper, which is why I say it. The pre-COVID +0.47 Sharpe is the one I trust least on this account: the bull run of 2011–2019 generated a lot of names that quietly disappeared by 2025 and are missing from my universe today.</p><h2 id="What-you-can-do-with-a-free-data-constraint"><a href="#What-you-can-do-with-a-free-data-constraint" class="headerlink" title="What you can do with a free-data constraint"></a>What you can do with a free-data constraint</h2><p>Two practical interventions worth the effort, two not worth it:</p><p><strong>Worth it.</strong></p><ol><li><strong>Snapshot the universe at multiple historical dates if you can.</strong> Archive.org and historical TSX&#x2F;TMX bulletins sometimes preserve old constituent lists. Even three or four historical snapshots, used as additional “as-of” universes, expose how much the surviving-today list misses.</li><li><strong>Report the cross-section size over time.</strong> If your 2011 “universe” has the same 109 tickers as your 2025 universe, you are not running a 2011 backtest — you are running a 2025 backtest on 2011 prices. 
Putting the cross-section count on a chart per month makes this visible to readers.</li></ol><p><strong>Not worth it for a working paper.</strong></p><ol start="3"><li><strong>Don’t fake a survivorship correction.</strong> Without delisting prices and dates, you cannot impute returns for missing tickers honestly. A made-up −90% return on a delisted name is research fraud in a coat.</li><li><strong>Don’t buy CRSP &#x2F; Compustat just for this.</strong> If you can, great. But the right move for a free-data paper is to <em>flag the limitation honestly</em> and design the conclusions around it, not to pretend you have data you don’t.</li></ol><h2 id="The-honest-framing"><a href="#The-honest-framing" class="headerlink" title="The honest framing"></a>The honest framing</h2><p>The paper’s framing reflects the constraint. The headline claim is not “paper-Q doesn’t work on TSX small-caps.” It is “in a universe constructed from currently-listed TSX small-caps with free price data, a fundamentals-free price proxy fails to recover the AQR QMJ-Canada premium, and the failure is concentrated in the post-COVID low-volatility unwind.” Every clause in that sentence is true on the data I actually have.</p><p>If you are doing a free-data quant project: name your data limitations specifically, and design your claims so they survive the limitations being real. That is the cheap version of academic honesty, and it is much more useful than a confident claim built on data that can’t support it.</p><hr><p><em>Universe file: <a href="https://github.com/faketut/qmj-tsx/blob/main/data/raw/universe/tsx_smallcap.csv"><code>data/raw/universe/tsx_smallcap.csv</code></a>. A companion post on <a href="2026-07-25-building-a-free-data-universe.md">how that universe was built</a> covers the construction in more detail.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>Building a Free-Data Canadian Small-Cap Universe: 109 Tickers, Three Sources, Zero Subscriptions</title>
      <link>https://faketut.github.io/2026/06/08/qmj-08-building-a-free-data-universe/</link>
      <description>
        <![CDATA[<p>This is the last post in the <a href="README.md">QMJ-TSX series</a>. It’s the
most operational of the lot: how I assembled the data layer]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/data/">data</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/canadian-equities/">canadian-equities</category>
      <category domain="https://faketut.github.io/tags/yfinance/">yfinance</category>
      <category domain="https://faketut.github.io/tags/data-engineering/">data-engineering</category>
      <category domain="https://faketut.github.io/tags/aqr/">aqr</category>
      <category domain="https://faketut.github.io/tags/ken-french/">ken-french</category>
      <category domain="https://faketut.github.io/tags/universe-construction/">universe-construction</category>
      <pubDate>Mon, 08 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>This is the last post in the <a href="README.md">QMJ-TSX series</a>. It’s the most operational of the lot: how I assembled the data layer for a Canadian small-cap factor paper using only free, public sources, and what that constraint forced me to accept.</p><p>The pitch is simple. Three data sources, all free, all on the public internet, all cacheable as parquet:</p><ul><li><strong>Prices.</strong> Yahoo Finance via <code>yfinance</code> (<code>.TO</code> for TSX, <code>.V</code> for TSX Venture).</li><li><strong>AQR factor series.</strong> <a href="https://www.aqr.com/Insights/Datasets">AQR Datasets</a> — QMJ + BAB Equity monthly, all countries including Canada.</li><li><strong>Fama–French factors.</strong> <a href="https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html">Ken French data library</a> — Developed 5-factor + Developed momentum, monthly.</li></ul><p>And one hand-curated universe file: 109 TSX small&#x2F;mid-cap tickers in <code>data/raw/universe/tsx_smallcap.csv</code>.</p><p>That’s it. No Bloomberg, no Refinitiv, no CRSP, no Compustat, no S&amp;P Capital IQ. 
Whether that is sufficient depends on what you want to claim — which is the rest of the post.</p><h2 id="The-three-sources-briefly"><a href="#The-three-sources-briefly" class="headerlink" title="The three sources, briefly"></a>The three sources, briefly</h2><h3 id="yfinance-for-prices"><a href="#yfinance-for-prices" class="headerlink" title="yfinance for prices"></a>yfinance for prices</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> yfinance <span class="keyword">as</span> yf</span><br><span class="line">df = yf.download(<span class="string">&quot;SU.TO&quot;</span>, start=<span class="string">&quot;2011-01-01&quot;</span>, interval=<span class="string">&quot;1mo&quot;</span>)</span><br></pre></td></tr></table></figure><p>Cached to <code>data/raw/prices/{ticker}.parquet</code>. Monthly is sufficient for a factor paper at this horizon and keeps the cache tiny (~1 MB total for 109 tickers). The <code>.TO</code> suffix is mandatory for TSX names; <code>.V</code> for Venture. Without the suffix you’ll silently get the US listing of a same-symbol unrelated company, which is its own special failure mode.</p><p>Honest limitations:</p><ul><li>Survivorship (see <a href="2026-07-18-yfinance-survivorship-tsx.md">the previous post</a>).</li><li>Adjusted close handling is yfinance’s, not yours. For monthly rebalanced long-shorts this is fine; for intraday strategies it is not.</li><li>Some thinly-traded <code>.V</code> names have suspect prints. 
The 109-ticker universe was filtered partly on data sanity.</li></ul><h3 id="AQR-datasets-for-the-benchmark"><a href="#AQR-datasets-for-the-benchmark" class="headerlink" title="AQR datasets for the benchmark"></a>AQR datasets for the benchmark</h3><p>AQR publishes country-level QMJ and BAB series as monthly CSVs. The QMJ-Canada column is the benchmark for the entire paper: without it I have nothing to replicate against. The CSV layout is stable across releases — date column, country columns, one row per month — and the file is small enough to vendor under <code>data/raw/aqr/</code>.</p><p>The replication gate (Sharpe within 0.30 of AFP 2019 Table II) is defined against this series. The cross-check against Ken French exists precisely because <em>both</em> benchmark series are public and disagree slightly on what “the Canadian factor cross-section” is.</p><h3 id="Ken-French-for-the-cross-check-factor-library"><a href="#Ken-French-for-the-cross-check-factor-library" class="headerlink" title="Ken French for the cross-check factor library"></a>Ken French for the cross-check factor library</h3><p>The <a href="https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html">Developed 5-factor</a> plus Developed momentum, monthly. Canada is too small a market to have its own French-style factor library, so Developed-region is the standard substitute when cross-checking a Canadian series. The RMW (operating profitability) factor in this library is the one that QMJ-CAN should load on if the construct is intact; in the paper it loads at β &#x3D; +0.61 (<em>t</em> &#x3D; 4.16). That cross-check is what turns the replication from “the numbers approximately match” into “the construct approximately matches.”</p><h2 id="The-universe-file"><a href="#The-universe-file" class="headerlink" title="The universe file"></a>The universe file</h2><p><code>data/raw/universe/tsx_smallcap.csv</code> is a hand-curated list of 109 tickers spanning the TSX small&#x2F;mid-cap range circa late 2025. 
The construction principles, in rough order:</p><ol><li><strong>Currently listed on TSX or TSX Venture</strong> with a Yahoo ticker. This is the survivorship-introducing step. There is no free alternative.</li><li><strong>Market cap roughly in the small&#x2F;mid range.</strong> No hard cutoff — TSX small-cap definitions vary across providers and I wasn’t going to invent one. Names that were unambiguously large-cap (the big banks, the integrated energy majors) were excluded.</li><li><strong>At least ~10 years of monthly data</strong> in yfinance. This is what bounded the sample to 2011-12 onward. Newer listings were excluded so that the cross-section per month was reasonably stable.</li><li><strong>Sector diversity within what TSX actually is.</strong> Which is to say: a lot of energy and materials, some industrials, some tech and healthcare, very little consumer. The universe reflects the index’s actual sectoral skew rather than fighting it.</li><li><strong>Sanity-check on prices.</strong> Tickers with obvious data pathologies in yfinance (long flat stretches, single-print spikes, missing months) were dropped at curation time.</li></ol><p>None of those steps is forecast-aware — none of them required peeking at returns. But step 1 is the survivorship door, and I’m upfront about it in the paper.</p><h2 id="Why-no-fundamentals"><a href="#Why-no-fundamentals" class="headerlink" title="Why no fundamentals?"></a>Why no fundamentals?</h2><p>The AQR QMJ construction uses gross profitability, accruals, leverage, payout ratios — accounting fundamentals at point-in-time fidelity for the entire cross-section. 
The free-data options for Canadian small-caps are:</p><ul><li><strong>SEDAR+ filings.</strong> Available, but unparsed and inconsistent. Parsing PDF financial statements at production quality for 100+ small-cap tickers is a separate paper’s worth of engineering.</li><li><strong>Yahoo Finance fundamentals.</strong> Available via <code>yfinance</code>, but point-in-time-incorrect (the values are as-restated, not as-reported on the original filing date). Using them would introduce look-ahead bias.</li><li><strong>SimFin &#x2F; EOD historical data.</strong> Coverage of TSX small-caps is thin and gappy. I checked.</li></ul><p>The cost of getting fundamentals right for this universe is roughly an order of magnitude greater than the cost of doing everything else combined. That is what drove the entire paper-Q (“price-based proxy”) detour: the negative result on paper-Q is also a measurement of how far you can get <em>without</em> paying that cost. Answer: not as far as you’d hope.</p><h2 id="What-it-costs-to-do-this-right"><a href="#What-it-costs-to-do-this-right" class="headerlink" title="What it costs to do this right"></a>What it costs to do this right</h2><p>If I were redoing this with a budget, the priority order would be:</p><ol><li><strong>Point-in-time Canadian fundamentals</strong> (Compustat, FactSet, or equivalent). Unlocks the actual AFP construction. Highest marginal value.</li><li><strong>Delisting prices and dates.</strong> Kills the survivorship caveat from the <a href="2026-07-18-yfinance-survivorship-tsx.md">previous post</a>. Second-highest marginal value.</li><li><strong>Index constituent histories</strong> (S&amp;P&#x2F;TSX SmallCap or TMX equivalent, monthly). 
Lets the universe be reconstituted point-in-time rather than as a hand-curated snapshot.</li></ol><p>You can publish a credible free-data paper without (1)–(3), provided you scope the claims to what the data actually supports. That is what this project tried to do.</p><h2 id="Closing-the-series"><a href="#Closing-the-series" class="headerlink" title="Closing the series"></a>Closing the series</h2><p>If you’ve read all eight posts: thank you, that’s more attention than most academic papers get. The whole project — paper, code, data manifests, blog series — is at <a href="https://github.com/faketut/qmj-tsx">github.com&#x2F;faketut&#x2F;qmj-tsx</a>. <code>make all</code> regenerates the paper from a clean clone. The pull request template is open if you want to extend the universe, swap in a better data source, or rerun the per-component decomposition on a different market.</p><p>The single sentence I’d leave you with, across the whole series: <strong>a pre-registered null result on a free-data universe, with the decomposition that explains the null, is a more honest research contribution than a positive result you can’t reproduce.</strong></p><hr><p><em>Series index: <a href="README.md">README.md</a>.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>PCA as a Diagnostic, Not a Rescue</title>
      <link>https://faketut.github.io/2026/06/07/qmj-03-pca-as-diagnostic-not-rescue/</link>
      <description>
        <![CDATA[<p>This is the third post in the <a href="README.md">QMJ-TSX series</a>. The
<a href="2026-06-13-low-vol-unwind-hiding-in-a-composite.md">pr]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/research/">research</category>
      <category domain="https://faketut.github.io/categories/research/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/signal-design/">signal-design</category>
      <category domain="https://faketut.github.io/tags/pca/">pca</category>
      <category domain="https://faketut.github.io/tags/dimensionality-reduction/">dimensionality-reduction</category>
      <pubDate>Sun, 07 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>This is the third post in the <a href="README.md">QMJ-TSX series</a>. The <a href="2026-06-13-low-vol-unwind-hiding-in-a-composite.md">previous post</a> showed that four of the five components of my paper-Q composite are mechanically the same low-volatility signal, all flipping sign post-COVID; the fifth (rolling Sharpe) behaves differently. A natural follow-up is: just throw PCA at it.</p><p>This post is about what that actually does — and what it doesn’t do.</p><h2 id="The-premise"><a href="#The-premise" class="headerlink" title="The premise"></a>The premise</h2><p>If your components are near-collinear, equal-weighting them is wrong both in theory (it understates the effective number of bets) and in practice (it dilutes whichever component is actually unique). The disciplined fix is to project the components onto an orthogonal basis and price each axis separately.</p><p>Concretely: stack the five-component panel as a $(\text{date} \times \text{ticker}) \times 5$ matrix, take the principal components, and run each PC as its own long-short.</p><h2 id="What-PCA-finds"><a href="#What-PCA-finds" class="headerlink" title="What PCA finds"></a>What PCA finds</h2><table><thead><tr><th>PC</th><th align="right">Variance explained</th><th>Interpretation</th></tr></thead><tbody><tr><td>PC1</td><td align="right">60%</td><td>Roughly uniform positive loadings across all five components — a clean general-defensive axis.</td></tr><tr><td>PC2</td><td align="right">22%</td><td>A contrast between rolling Sharpe (−0.80) and the beta-flavoured direction (+0.53). The momentum-vs-low-vol split that the horse race already surfaced.</td></tr></tbody></table><p>This is the encouraging part. 
PCA recovers exactly the structure the per-component decomposition already suggested: one dominant low-vol factor, plus a second axis that is essentially “rolling Sharpe minus the rest.” The five-dimensional design space is really two-dimensional, and the two dimensions have economic interpretations.</p><h2 id="What-PCA-does-not-find"><a href="#What-PCA-does-not-find" class="headerlink" title="What PCA does not find"></a>What PCA does not find</h2><p>Run each PC as a standalone VW tercile long-short, same 10 bps cost model:</p><table><thead><tr><th>Signal</th><th align="right">Full Sharpe</th><th align="right">Pre-COVID</th><th align="right">Post-COVID</th></tr></thead><tbody><tr><td>PC1 (low-vol axis)</td><td align="right">−0.23</td><td align="right">+0.34</td><td align="right">−0.85</td></tr><tr><td>PC2 (momentum-vs-low-vol)</td><td align="right">−0.11</td><td align="right">−0.14</td><td align="right">−0.08</td></tr><tr><td>2-PC composite (EW)</td><td align="right">−0.16</td><td align="right">—</td><td align="right">—</td></tr><tr><td>Unorthogonalised paper-Q</td><td align="right">+0.03</td><td align="right">+0.47</td><td align="right">−0.60</td></tr></tbody></table><p>Three things to notice.</p><p><strong>One.</strong> <strong>PC1 reproduces the regime flip cleanly.</strong> Pre-COVID +0.34, post-COVID −0.85. This is the smoking gun for the <a href="2026-06-13-low-vol-unwind-hiding-in-a-composite.md">previous post’s</a> claim that the post-pandemic break is a one-factor phenomenon, not an artefact of equal-weighting correlated proxies. When you collapse the five proxies onto their dominant common axis, the regime story becomes <em>more</em> visible, not less.</p><p><strong>Two.</strong> <strong>PC2 has no pricing content.</strong> Sharpe is essentially zero in every subperiod. So the +0.32 full-sample Sharpe of standalone rolling Sharpe from the horse race was in fact largely riding its positive correlation with the PC1 low-vol axis. 
Once that correlation is purged, the residual rolling-Sharpe-minus-low-vol contrast does not price on its own in this universe.</p><p><strong>Three.</strong> <strong>Equal-weighting the two PCs underperforms the unorthogonalised composite.</strong> This is the expected consequence of (1) and (2): averaging a pricing axis with a non-pricing axis dilutes signal. The “clean” orthogonal composite is <em>worse</em> than the naive average it was supposed to fix.</p><h2 id="The-takeaway"><a href="#The-takeaway" class="headerlink" title="The takeaway"></a>The takeaway</h2><p>PCA didn’t rescue paper-Q. What it did was something more useful for an honest paper: it collapsed the post-COVID story into a single dimension. The TSX small&#x2F;mid-cap low-volatility long-short broke in 2020 and has not recovered. That is a cleaner, more falsifiable claim than “an equal-weighted composite of five price-derived Safety proxies has a null Sharpe.”</p><p>Two generalisable lessons:</p><ol><li><strong>PCA can explain a strategy without saving it.</strong> If your components are collinear, PCA tells you what the underlying dimensions are. Whether <em>those dimensions</em> price is an empirical question PCA cannot answer for you. Don’t confuse “I now understand my signal” with “my signal works.”</li><li><strong>Don’t equal-weight PCs either.</strong> Same trap as equal-weighting raw components, one level up. If PC1 prices and PC2 doesn’t, you want PC1, not their average. Variance-explained is not a proxy for pricing content.</li></ol><p>The honest version of “throw PCA at it” is: use PCA as a <em>diagnostic</em> for what the signal actually is, then make a separate, deliberate decision about which axes (if any) to trade.</p><hr><p><em>All PCA outputs, loading tables, and PC long-short Sharpes are in the paper’s robustness section and regenerate from <code>make robust</code>. Code: <a href="https://github.com/faketut/qmj-tsx">github.com&#x2F;faketut&#x2F;qmj-tsx</a>.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>How I Made a Quant Paper Reproducible in `make all` Under a Minute</title>
      <link>https://faketut.github.io/2026/06/07/qmj-04-make-all-under-a-minute/</link>
      <description>
        <![CDATA[<p>The QMJ-TSX project has a hard constraint baked into the design: a
fresh clone, on a normal laptop, with no subscriptions, should
regener]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/reproducibility/">reproducibility</category>
      <category domain="https://faketut.github.io/tags/python/">python</category>
      <category domain="https://faketut.github.io/tags/uv/">uv</category>
      <category domain="https://faketut.github.io/tags/typst/">typst</category>
      <category domain="https://faketut.github.io/tags/tooling/">tooling</category>
      <pubDate>Sun, 07 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The QMJ-TSX project has a hard constraint baked into the design: a fresh clone, on a normal laptop, with no subscriptions, should regenerate every number in the paper — and the paper PDF itself — in under a minute. This post is about how the project meets that constraint and why it was worth treating reproducibility as a <em>design parameter</em> rather than a politeness.</p><h2 id="The-acceptance-test"><a href="#The-acceptance-test" class="headerlink" title="The acceptance test"></a>The acceptance test</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/faketut/qmj-tsx</span><br><span class="line"><span class="built_in">cd</span> qmj-tsx</span><br><span class="line">uv <span class="built_in">sync</span></span><br><span class="line">make all</span><br></pre></td></tr></table></figure><p>If, at the end of <code>make all</code>, <code>paper/main.pdf</code> exists and its headline numbers match the version on GitHub, the project is working. That is the acceptance test, and CI enforces it. Everything below is in service of keeping that loop short and unambiguous.</p><h2 id="The-four-pieces"><a href="#The-four-pieces" class="headerlink" title="The four pieces"></a>The four pieces</h2><h3 id="1-uv-for-the-Python-environment"><a href="#1-uv-for-the-Python-environment" class="headerlink" title="1. uv for the Python environment"></a>1. <code>uv</code> for the Python environment</h3><p><code>uv</code> replaces <code>pip + venv + pip-tools</code> with a single fast resolver. <code>uv sync</code> reads <code>pyproject.toml</code> and <code>uv.lock</code>, builds a hermetic venv, and is done in seconds on a warm cache. There is no <code>requirements.txt</code>, no Conda, no Docker. 
Two reasons this matters:</p><ul><li>A reader who is bouncing off your repo will <em>not</em> install Conda or Docker to read your paper. They will close the tab.</li><li>A locked resolver means the numbers I report today will still resolve to the same library versions in two years. That is the whole point of a lock file.</li></ul><h3 id="2-make-as-the-command-surface"><a href="#2-make-as-the-command-surface" class="headerlink" title="2. make as the command surface"></a>2. <code>make</code> as the command surface</h3><p>The <code>Makefile</code> is the canonical entry point:</p><table><thead><tr><th>Target</th><th>Produces</th></tr></thead><tbody><tr><td><code>make data</code></td><td>Cached parquets: prices, AQR benchmarks, Ken French FF5+UMD</td></tr><tr><td><code>make signals</code></td><td>paper-Q monthly panel</td></tr><tr><td><code>make backtest</code></td><td>Long–short returns + summary</td></tr><tr><td><code>make robust</code></td><td>Headline sweep, sector-exclusion, per-component horse race</td></tr><tr><td><code>make figures</code></td><td><code>paper/figures/cumret.pdf</code></td></tr><tr><td><code>make paper</code></td><td><code>paper/main.pdf</code> (via Typst)</td></tr><tr><td><code>make all</code></td><td>All of the above</td></tr><tr><td><code>make test</code></td><td>Unit + invariant tests</td></tr></tbody></table><p><code>make</code> is not glamorous, but it is the lowest-common-denominator build tool. Everyone has it. Targets compose. Failed targets stop the pipeline at the failure site, which is exactly what you want for a research build.</p><h3 id="3-A-typed-CLI-surface-not-notebook-cells"><a href="#3-A-typed-CLI-surface-not-notebook-cells" class="headerlink" title="3. A typed CLI surface, not notebook cells"></a>3. 
A typed CLI surface, not notebook cells</h3><p>Underneath <code>make</code>, every step is a <code>qmj</code> subcommand:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">qmj data prices                   <span class="comment"># yfinance monthly parquets</span></span><br><span class="line">qmj data benchmarks               <span class="comment"># AQR QMJ/BAB-Canada</span></span><br><span class="line">qmj data ken-french               <span class="comment"># FF5-DEV + UMD-DEV</span></span><br><span class="line">qmj replicate                     <span class="comment"># AQR QMJ-CAN baseline + FF5 cross-check</span></span><br><span class="line">qmj signals paper-q               <span class="comment"># price-based Quality panel</span></span><br><span class="line">qmj backtest                      <span class="comment"># long–short portfolio + summary</span></span><br><span class="line">qmj robust                        <span class="comment"># weighting × buckets × subperiod × cost sweep</span></span><br><span class="line">qmj figure cumret                 <span class="comment"># cumulative-return figure</span></span><br></pre></td></tr></table></figure><p>The CLI exists so that <em>every</em> number in the paper has a deterministic, single-command provenance. The number for the post-COVID Sharpe came from <code>qmj robust</code>, not from a notebook cell I ran in some order I can no longer remember. Notebooks are great for exploration and terrible for archival. Promote anything you intend to <em>cite</em> into a CLI command.</p><h3 id="4-Parquet-caches-under-data"><a href="#4-Parquet-caches-under-data" class="headerlink" title="4. 
Parquet caches under data/"></a>4. Parquet caches under <code>data/</code></h3><p>Raw downloads (yfinance prices, AQR CSVs, Ken French ZIPs) land in <code>data/raw/</code> and are gitignored. Processed monthly panels are parquet under <code>data/processed/</code>. Steps downstream of <code>data</code> are fully offline. Two practical wins:</p><ul><li><code>make all</code> after <code>make data</code> runs in seconds because nothing re-hits the network.</li><li>A future reader whose internet is broken (or whose data source has rotted) can still reproduce everything from the released parquet bundle.</li></ul><h2 id="The-paper-compiles-too"><a href="#The-paper-compiles-too" class="headerlink" title="The paper compiles too"></a>The paper compiles too</h2><p>The paper is in Typst (<code>paper/main.typ</code> + <code>paper/sections/*.typ</code> + <code>paper/tables/*.typ</code>). <code>make paper</code> runs <code>typst compile</code> on it and produces <code>paper/main.pdf</code>. There is no separate “build the paper” ritual disconnected from “build the numbers.” The same <code>make all</code> that regenerates the backtest also re-compiles the paper that cites the backtest. (More on the Typst choice in a <a href="2026-07-11-typst-instead-of-latex.md">later post</a>.)</p><h2 id="What-this-buys-you"><a href="#What-this-buys-you" class="headerlink" title="What this buys you"></a>What this buys you</h2><p>Three things that compound:</p><ol><li><strong>Reviewers and readers can verify you.</strong> Anyone who suspects a number can reproduce it without asking me a single question. That is — and this is the dirty secret of empirical finance — far from the default.</li><li><strong>Future-you can extend without archaeology.</strong> Six months from now, when I want to add a new robustness cell, I add a CLI subcommand and a <code>make</code> target. 
I do not re-derive what <code>paper-Q</code> was.</li><li><strong>The repo is its own demo.</strong> A hiring manager reading the README sees the acceptance test and either runs it or doesn’t. Either way the bar is concrete.</li></ol><h2 id="What-I-would-skip-if-you’re-starting-from-scratch"><a href="#What-I-would-skip-if-you’re-starting-from-scratch" class="headerlink" title="What I would skip if you’re starting from scratch"></a>What I would skip if you’re starting from scratch</h2><ul><li><strong>Don’t bother with Docker</strong> for a project this size. <code>uv</code> plus pinned Python in <code>pyproject.toml</code> is enough.</li><li><strong>Don’t ship notebooks as primary deliverables.</strong> Ship a CLI and a <code>make</code> target. A notebook can be a <em>demo</em> of the CLI; it cannot be the canonical source of any number that ends up in your paper.</li><li><strong>Don’t over-engineer the data layer.</strong> Parquet files in a flat <code>data/processed/</code> directory, named after what produced them. No database. No DVC. You can graduate to those when the dataset outgrows a laptop.</li></ul><p>The whole pipeline is maybe 1,500 lines of Python plus a hundred lines of Typst. The point isn’t that the project is small — it’s that reproducibility doesn’t <em>require</em> it to be large.</p><hr><p><em>Repo: <a href="https://github.com/faketut/qmj-tsx">github.com&#x2F;faketut&#x2F;qmj-tsx</a>. The acceptance test is <code>make all</code> after <code>uv sync</code>.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>Pre-Registering Replication Gates for a Solo Quant Project</title>
      <link>https://faketut.github.io/2026/06/07/qmj-05-pre-registering-replication-gates/</link>
      <description>
        <![CDATA[<p>The QMJ-TSX paper has two findings: a successful replication of the
AQR QMJ-Canada series, and an <em>un</em>successful extension to a
pr]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/research/">research</category>
      <category domain="https://faketut.github.io/categories/research/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/reproducibility/">reproducibility</category>
      <category domain="https://faketut.github.io/tags/pre-registration/">pre-registration</category>
      <category domain="https://faketut.github.io/tags/statistics/">statistics</category>
      <category domain="https://faketut.github.io/tags/research-process/">research-process</category>
      <pubDate>Sun, 07 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The QMJ-TSX paper has two findings: a successful replication of the AQR QMJ-Canada series, and an <em>un</em>successful extension to a price-based proxy on TSX small-caps. The reason I’m comfortable publishing the negative result is that I wrote down the bar for “success” <em>before</em> I ran the test.</p><p>This is what people mean by pre-registration. In medicine and psychology it is a formal mechanism; in solo quant work it is usually just a habit, and a rare one. This post is the case for adopting it even when nobody is forcing you to.</p><h2 id="The-two-gates"><a href="#The-two-gates" class="headerlink" title="The two gates"></a>The two gates</h2><p>I committed to two numerical gates in writing before the analysis:</p><p><strong>Gate 1 — Replication tolerance.</strong> The Sharpe ratio of my recomputed AQR QMJ-Canada series, over the comparable sample, must fall within ±0.30 of the 0.65 figure reported in AFP (2019) Table II for Canada.</p><blockquote><p>Outcome: replicated Sharpe &#x3D; 0.64. Within tolerance. <strong>Pass.</strong></p></blockquote><p><strong>Gate 2 — Calibration of the extension.</strong> The Spearman rank correlation between my fundamentals-free paper-Q long-short and the AQR QMJ-Canada series, over the common sample, must be ≥ 0.3 for me to claim paper-Q “captures the same construct.”</p><blockquote><p>Outcome: contemporaneous correlation &#x3D; −0.03, regression β &#x3D; −0.08 (<em>t</em> &#x3D; −0.38), R² ≈ 0. <strong>Fail.</strong></p></blockquote><p>These are not p-values. They are pre-committed numerical bands on the actual quantities the paper is making claims about. 
Setting them in advance is the whole point.</p><h2 id="Why-this-matters-more-for-a-solo-project-not-less"><a href="#Why-this-matters-more-for-a-solo-project-not-less" class="headerlink" title="Why this matters more for a solo project, not less"></a>Why this matters more for a solo project, not less</h2><p>The standard argument for pre-registration is to defend against researcher degrees of freedom — the small choices (sample window, weighting, winsorisation, sector exclusions) that, taken together, let you nudge a borderline result into significance. In a <em>team</em> setting there is at least social friction against this. In a solo setting there is none. You can rerun any cell any number of times, and the only person who would notice is you.</p><p>A pre-registered gate creates artificial friction. Once it is written down, moving it requires you to <em>consciously</em> admit you are moving it. That is a low bar, but it turns out to be a meaningful one.</p><h2 id="The-asymmetry-that-makes-nulls-publishable"><a href="#The-asymmetry-that-makes-nulls-publishable" class="headerlink" title="The asymmetry that makes nulls publishable"></a>The asymmetry that makes nulls publishable</h2><p>There is a respectable version of “my strategy didn’t work” and a disrespectable one. The disrespectable version reads:</p><blockquote><p>I tried a bunch of variants. None of them were significant. I’m calling that a negative result.</p></blockquote><p>The respectable version reads:</p><blockquote><p>I committed in advance to the following falsifiable test. The test failed in this specific way. Here is what we learn from the failure.</p></blockquote><p>Only the second version contains information. 
The first is indistinguishable from a strategy that almost worked, dressed up as humility.</p><p>For the paper-Q work, gate 2 is what makes the null <em>informative</em>. The pre-committed claim — “if a price-based proxy captures the same underlying Quality construct, rank correlation with the fundamentals-based version should be at least 0.3” — is the thing being tested. The observed correlation of −0.03 is not “small and inconclusive.” It is “comprehensively below the bar I set.” That is publishable evidence about the limits of fundamentals-free proxies, not a strategy I am still fishing for.</p><h2 id="How-to-actually-do-it-lightly"><a href="#How-to-actually-do-it-lightly" class="headerlink" title="How to actually do it, lightly"></a>How to actually do it, lightly</h2><p>Solo pre-registration does not require ritual. A few things that worked for me:</p><ol><li><strong>Write the gates into the project plan, not just into your head.</strong> I keep them in <code>memories/session/plan.md</code> with timestamps. Anything not in writing didn’t happen.</li><li><strong>Make the gates numerical.</strong> “Reasonable replication” is not a gate; “Sharpe within ±0.30 of the published number” is.</li><li><strong>State the consequence in advance.</strong> “If gate 2 fails, the paper’s claim shifts from ‘paper-Q recovers QMJ’ to ‘paper-Q fails to recover QMJ, and here is the per-component decomposition that tells us why.’” The fallback analysis is part of the pre-registration, not a post-hoc rescue.</li><li><strong>Don’t tune the gate to the data.</strong> The temptation is real. If the observed Spearman is 0.18 and your gate was 0.3, the answer is “fail,” not “0.15 is fine, actually.”</li><li><strong>Report the result <em>against the pre-committed gate</em></strong> in the paper. Not just the number — the comparison to the bar.</li></ol><p>That is the whole methodology. 
Five sentences in a markdown file and a discipline about not editing them after the fact.</p><h2 id="A-second-order-benefit"><a href="#A-second-order-benefit" class="headerlink" title="A second-order benefit"></a>A second-order benefit</h2><p>A pleasant side effect of gate 2 failing is that the <a href="2026-06-13-low-vol-unwind-hiding-in-a-composite.md">per-component decomposition</a> and the <a href="2026-06-20-pca-as-diagnostic-not-rescue.md">PCA analysis</a> became the most interesting parts of the paper. If the gate had passed I would have written a competent replication-plus-extension paper that nobody would have cared about. Because it failed, I was forced to ask <em>why</em> it failed — and the answer (“a one-factor post-COVID low-vol unwind that the composite was masking”) is the generalisable finding.</p><p>Pre-registration didn’t just protect me from a soft positive result. It pointed me at the real one.</p><hr><p><em>Paper, gates, and the full results table: <a href="https://github.com/faketut/qmj-tsx">github.com&#x2F;faketut&#x2F;qmj-tsx</a>.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>When a Famous Anomaly Refuses to Travel: QMJ on TSX Small-Caps</title>
      <link>https://faketut.github.io/2026/06/06/qmj-01-null-result/</link>
      <description>
        <![CDATA[<p>The Quality Minus Junk (QMJ) factor of Asness, Frazzini, and Pedersen
(2019) is one of the better-documented anomalies of the past decade]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/research/">research</category>
      <category domain="https://faketut.github.io/categories/research/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/factor-investing/">factor-investing</category>
      <category domain="https://faketut.github.io/tags/replication/">replication</category>
      <category domain="https://faketut.github.io/tags/canadian-equities/">canadian-equities</category>
      <pubDate>Sat, 06 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The Quality Minus Junk (QMJ) factor of Asness, Frazzini, and Pedersen (2019) is one of the better-documented anomalies of the past decade: high-quality firms — profitable, growing, safe, well-managed — earn persistently higher risk-adjusted returns than low-quality firms across 24 developed markets. AQR even publishes the monthly QMJ-Canada series on its <a href="https://www.aqr.com/Insights/Datasets">datasets page</a>, so the headline is independently verifiable by anyone with a spreadsheet.</p><p>What AQR does <em>not</em> publish is the underlying long-short on TSX small-caps. That universe is where I wanted to deploy the strategy — and the fundamentals AQR uses (gross profitability, accruals, leverage, payout ratios) are not free at the coverage or point-in-time fidelity the construction requires.</p><p>So I asked a narrower question: <strong>can a price-derived proxy recover the QMJ premium on TSX small-caps?</strong> This post is the headline answer. Spoiler: no, and the <em>way</em> it fails turns out to be more interesting than a clean replication would have been.</p><h2 id="Step-1-replicate-what-we-can-replicate"><a href="#Step-1-replicate-what-we-can-replicate" class="headerlink" title="Step 1: replicate what we can replicate"></a>Step 1: replicate what we <em>can</em> replicate</h2><p>Before extending anything, the replication gate. 
Using the public AQR QMJ-Canada series (1989-07 to 2026-03, 441 monthly observations):</p><table><thead><tr><th>Statistic</th><th>Value</th></tr></thead><tbody><tr><td>Annualised excess return</td><td>8.6%</td></tr><tr><td>Annualised volatility</td><td>13.4%</td></tr><tr><td>Sharpe</td><td>0.64</td></tr><tr><td>Max drawdown</td><td>−37.0%</td></tr><tr><td>Carhart-CAN 4-factor monthly α</td><td>0.70% (<em>t</em> &#x3D; 4.46)</td></tr><tr><td>→ annualised α</td><td>≈ 8.8%</td></tr></tbody></table><p>The Sharpe of 0.64 is within 0.01 of the 0.65 reported in AFP 2019 Table II for Canada — comfortably inside my pre-registered ±0.30 tolerance. As an external cross-check, regressing the same series on Ken French’s Developed FF5 + momentum panel keeps α positive and significant (0.52%&#x2F;month, <em>t</em> &#x3D; 3.00) and produces the predicted loading on the profitability factor RMW (β &#x3D; +0.61, <em>t</em> &#x3D; 4.16). The construct is intact. The published premium is real. Replication gate passed.</p><h2 id="Step-2-the-extension-that-doesn’t-work"><a href="#Step-2-the-extension-that-doesn’t-work" class="headerlink" title="Step 2: the extension that doesn’t work"></a>Step 2: the extension that doesn’t work</h2><p>To deploy on TSX small-caps without fundamentals, I built <strong>paper-Q</strong> — a fundamentals-free quality proxy from five price- and return-derived components, sign-aligned to AFP’s Safety leg:</p><ol><li>idiosyncratic volatility,</li><li>market beta,</li><li>maximum drawdown,</li><li>rolling Sharpe,</li><li>downside semi-deviation.</li></ol><p>Cross-sectionally z-scored, equal-weight composited, value-weighted tercile long-short, monthly rebalance. 109-ticker hand-curated TSX small&#x2F;mid-cap universe. 
Sample 2011-12 to 2025-11 (168 months).</p><p>Headline:</p><table><thead><tr><th>Statistic</th><th>Value</th></tr></thead><tbody><tr><td>Annualised gross return (VW)</td><td>+1.0%</td></tr><tr><td>Annualised volatility</td><td>30.6%</td></tr><tr><td>Sharpe (VW)</td><td>0.03</td></tr><tr><td>Sharpe (EW)</td><td>−0.33</td></tr><tr><td>Avg. monthly leg turnover</td><td>7.4%</td></tr></tbody></table><p>The key diagnostic — does paper-Q capture the same construct as AQR QMJ? — is also clean and disappointing. Regressing paper-Q on QMJ-CAN gives β &#x3D; −0.08 (<em>t</em> &#x3D; −0.38), R² ≈ 0, contemporaneous correlation <strong>−0.03</strong>. My pre-registered calibration gate (Spearman ρ ≥ 0.3) is not met. A Carhart-CAN regression of paper-Q itself produces an insignificant α (<em>t</em> &#x3D; 0.26).</p><p>The price-derived proxy, in this universe, is essentially uncorrelated with fundamentals-based Quality. Falsification.</p><h2 id="Why-the-null-is-the-result"><a href="#Why-the-null-is-the-result" class="headerlink" title="Why the null is the result"></a>Why the null is the result</h2><p>A null that you <em>pre-registered against</em> is a different object from a null you stumbled into. I committed in advance to a tolerance band on the replication Sharpe and a calibration floor on the paper-Q-vs-QMJ-CAN correlation. The replication passed; the extension failed. That is publishable evidence about the limits of fundamentals-free proxies in resource-heavy small-cap universes, not a strategy I’m now going to fish for.</p><p>There are at least three plausible mechanisms behind the failure:</p><ol><li><strong>Sectoral contamination.</strong> Junior energy and mining names dominate the TSX small-cap universe. 
The “low-volatility” leg of any price-based Safety proxy ends up holding defensives whose risk is structurally distinct from operational Quality.</li><li><strong>Accounting inputs that don’t have price analogues.</strong> Accruals and payout ratios depend on balance-sheet flows whose price proxies are dominated by sector exposure.</li><li><strong>Survivorship in the free data.</strong> yfinance only shows me names that still trade — likely biasing toward winners and blunting any defensive premium. (Separate post coming on this.)</li></ol><h2 id="What’s-actually-interesting"><a href="#What’s-actually-interesting" class="headerlink" title="What’s actually interesting"></a>What’s actually interesting</h2><p>The full-sample null masks a clean <strong>regime break</strong> around COVID:</p><table><thead><tr><th>Period</th><th align="right">Annualised return</th><th align="right">Net Sharpe</th></tr></thead><tbody><tr><td>2011-12 → 2020-02</td><td align="right">+14.3%</td><td align="right">+0.47</td></tr><tr><td>2020-03 → 2025-11</td><td align="right">−18.1%</td><td align="right">−0.60</td></tr></tbody></table><p>That flip is what the next two posts in this series are about. A sector-exclusion cut (dropping Energy + Materials) only recovers about a third of the post-COVID damage — so this is not purely a resource-sector story. A per-component decomposition shows that four of the five paper-Q components are essentially the same low-volatility signal in different statistical clothing, and they all turned over together. That’s the real finding hiding inside the composite, and it is what I think generalises beyond this paper.</p><hr><p><em>Paper, code, and reproducible pipeline: <a href="https://github.com/faketut/qmj-tsx">github.com&#x2F;faketut&#x2F;qmj-tsx</a>. <code>make all</code> regenerates every number above in under a minute on a modern laptop.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>A Low-Vol Unwind Hiding Inside a Composite Signal</title>
      <link>https://faketut.github.io/2026/06/06/qmj-02-low-vol-unwind-hiding-in-a-composite/</link>
      <description>
        <![CDATA[<p>This is the second post in a series on a price-based Quality factor
(“paper-Q”) I built for TSX small-caps. The headline result — that
pa]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/research/">research</category>
      <category domain="https://faketut.github.io/categories/research/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/qmj-tsx/">qmj-tsx</category>
      <category domain="https://faketut.github.io/tags/quant/">quant</category>
      <category domain="https://faketut.github.io/tags/factor-investing/">factor-investing</category>
      <category domain="https://faketut.github.io/tags/low-volatility/">low-volatility</category>
      <category domain="https://faketut.github.io/tags/regime-change/">regime-change</category>
      <category domain="https://faketut.github.io/tags/signal-design/">signal-design</category>
      <pubDate>Sat, 06 Jun 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>This is the second post in a series on a price-based Quality factor (“paper-Q”) I built for TSX small-caps. The headline result — that paper-Q does not recover the AQR QMJ-Canada premium — is in <a href="2026-06-06-qmj-tsx-null-result.md">the previous post</a>. Here I want to talk about what I found when I cracked the composite open.</p><p>If you build any composite signal by averaging <em>z</em>-scored components, the most boring failure mode is also the one that’s easiest to overlook: your components secretly all measure the same thing. A null result on the composite then tells you nothing about whether the underlying construct works — it just tells you that the average of N copies of one signal is, surprise, that signal.</p><p>That is essentially what happened to paper-Q.</p><h2 id="The-setup"><a href="#The-setup" class="headerlink" title="The setup"></a>The setup</h2><p>paper-Q averages five sign-aligned, <em>z</em>-scored components:</p><ol><li>idiosyncratic volatility,</li><li>market beta,</li><li>maximum drawdown,</li><li>rolling Sharpe,</li><li>downside semi-deviation.</li></ol><p>Each is supposed to be a proxy for some part of AFP’s Safety leg. Four of them — idio vol, beta, max drawdown, downside semi-dev — are volatility-flavoured. 
The fifth, rolling Sharpe, is the only one with a price-momentum flavour.</p><p>To see which components were actually driving the composite, I ran a <strong>per-component horse race</strong>: each component as a standalone value-weighted tercile long-short, same 10 bps round-trip cost, three windows (full sample, pre-COVID, post-COVID).</p><h2 id="The-result"><a href="#The-result" class="headerlink" title="The result"></a>The result</h2><p>The pattern is sharp.</p><table><thead><tr><th>Component</th><th align="right">Full Sharpe</th><th align="right">Pre-COVID</th><th align="right">Post-COVID</th></tr></thead><tbody><tr><td>Idiosyncratic vol</td><td align="right">low &#x2F; negative</td><td align="right">+ (0.12 to 0.57 band)</td><td align="right">− (−0.53 to −0.92 band)</td></tr><tr><td>Market beta</td><td align="right">low &#x2F; negative</td><td align="right">+</td><td align="right">−</td></tr><tr><td>Max drawdown</td><td align="right">low &#x2F; negative</td><td align="right">+</td><td align="right">−</td></tr><tr><td>Downside semi-dev</td><td align="right">low &#x2F; negative</td><td align="right">+</td><td align="right">−</td></tr><tr><td><strong>Rolling Sharpe</strong></td><td align="right"><strong>+0.32</strong></td><td align="right"><strong>+0.66</strong></td><td align="right"><strong>−0.10</strong></td></tr></tbody></table><p>Two things jump out.</p><p><strong>One.</strong> Four of the five components are mechanically the same underlying signal — the cross-section of price volatility, viewed through slightly different statistics. All four post positive pre-COVID Sharpes between roughly +0.12 and +0.57. All four collapse to comparable <em>negative</em> numbers post-COVID, between −0.53 and −0.92. This is one factor turning over, not four independent signals coincidentally agreeing.</p><p><strong>Two.</strong> Rolling Sharpe — the only price-momentum-flavoured component — behaves qualitatively differently. 
It has the highest full-sample Sharpe in the set (+0.32), the highest pre-COVID number (+0.66), and the least negative post-COVID Sharpe (−0.10).</p><p>So the composite’s full-sample ≈ 0 Sharpe is, mechanically, the average of one positive signal and four highly correlated negative ones. Equal-weighting masked the heterogeneity entirely.</p><h2 id="Why-this-matters-beyond-paper-Q"><a href="#Why-this-matters-beyond-paper-Q" class="headerlink" title="Why this matters beyond paper-Q"></a>Why this matters beyond paper-Q</h2><p>The narrow conclusion is about this strategy: the post-COVID failure of paper-Q is <strong>specifically a low-volatility unwind</strong>, not a generic breakdown of price-based signals on TSX small-caps. Every volatility-flavoured price statistic in my set turned over together in March 2020 and has not recovered; the price-momentum component was comparatively unaffected.</p><p>The general conclusion is about signal construction. Two practical rules of thumb that I’d defend more strongly now than I would have before this exercise:</p><ol><li><strong>Always run components standalone before you composite them.</strong> The cost is N extra backtests. The benefit is that you find out before publication whether you have one factor or N factors. Equal-weighting near-collinear components is not “diversification” — it’s just a noisier version of the underlying signal, with a worse story attached.</li><li><strong>Decompose first, then design the weighting.</strong> If four of five components turn out to be one factor, the right composite weights them by <em>uniqueness</em>, not equally. PCA residuals are the obvious next move — and the next post in this series will work through what PCA actually does and does not buy you here. 
(Short version: it explains the regime break cleanly, but it doesn’t rescue the strategy.)</li></ol><h2 id="A-meta-lesson"><a href="#A-meta-lesson" class="headerlink" title="A meta-lesson"></a>A meta-lesson</h2><p>A composite signal with five inputs and a null full-sample result <em>looks</em> like a clean negative finding. It almost wasn’t. The clean negative finding is the per-component table above. Without it, I would have written a paper claiming “fundamentals-free Quality doesn’t work on TSX small-caps” when what I had actually shown was “a particular equal-weighted low-vol composite doesn’t work on TSX small-caps, in a way that says nothing about rolling Sharpe.”</p><p>Decomposition is cheap. Run it.</p><hr><p><em>Code and the full robustness battery (including the per-component table above): <a href="https://github.com/faketut/qmj-tsx">github.com&#x2F;faketut&#x2F;qmj-tsx</a>. Reproduce with <code>make robust</code>.</em></p>]]>
      </content:encoded>
    </item>
    <item>
      <title>Beating the reference compiler by 5×: a WLP4 → ARM64 optimization journey</title>
      <link>https://faketut.github.io/2026/05/17/cclass-blog-optimization-journey/</link>
      <description>
        <![CDATA[<blockquote>
<p><strong>TL;DR.</strong> A four-pass compiler for the CS241 teaching language WLP4, targeting a restricted ARM64 subset. Afte]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/tags/c-class-compiler/">c-class-compiler</category>
      <category domain="https://faketut.github.io/tags/compiler/">compiler</category>
      <category domain="https://faketut.github.io/tags/optimization/">optimization</category>
      <pubDate>Sun, 17 May 2026 19:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<blockquote><p><strong>TL;DR.</strong> A four-pass compiler for the CS241 teaching language WLP4, targeting a restricted ARM64 subset. After a handful of focused codegen tricks and a CI loop that measures every push, our <code>.com</code> outputs come out <strong>−79.6%</strong> smaller than the course’s reference compiler <code>wlp4c</code> across a 65-program benchmark — every single test is smaller, ranging from −63% (heap-intensive) to −92% (pure arithmetic).</p><p>What follows is the engineering log, not just the numbers: which optimizations actually mattered, which I deliberately <em>didn’t</em> do, and why the development loop is structured the way it is.</p></blockquote><hr><h2 id="Table-of-contents"><a href="#Table-of-contents" class="headerlink" title="Table of contents"></a>Table of contents</h2><ol><li><a href="#1-setting-and-constraints">Setting and constraints</a></li><li><a href="#2-the-starting-point">The starting point</a></li><li><a href="#3-phase-1--building-a-safety-net">Phase 1 — Building a safety net</a></li><li><a href="#4-phase-2a--diagnostics-that-survive-contact-with-reality">Phase 2A — Diagnostics that survive contact with reality</a></li><li><a href="#5-the-optimizations-that-earned-their-keep">The optimizations that earned their keep</a></li><li><a href="#6-phase-5--measuring-like-we-mean-it">Phase 5 — Measuring like we mean it</a></li><li><a href="#7-phase-4--cmake--a-real-driver">Phase 4 — CMake + a real driver</a></li><li><a href="#8-what-i-deliberately-did-not-do">What I deliberately did <em>not</em> do</a></li><li><a href="#9-lessons">Lessons</a></li><li><a href="#10-appendix-full-benchmark-table">Appendix: full benchmark table</a></li></ol><hr><h2 id="1-Setting-and-constraints"><a href="#1-Setting-and-constraints" class="headerlink" title="1. Setting and constraints"></a>1. 
Setting and constraints</h2><p>WLP4 is a <em>very</em> small C-flavored language used in a university compilers course:</p><ul><li>Two scalar types: <code>long</code> and <code>long*</code>.</li><li>Entry point is <code>wain(a, b)</code>, not <code>main</code>.</li><li>Procedures, locals, <code>if/else</code>, <code>while</code>, <code>*p</code> &#x2F; <code>&amp;x</code>, <code>new[]</code>, <code>delete[]</code>, <code>println</code>, <code>putchar</code>, <code>getchar</code>.</li><li>No early <code>return</code>, no <code>for</code>, no structs, no globals.</li></ul><p>The target is a <strong>restricted ARM64 subset</strong> — only a curated handful of instructions are accepted by the course emulator (<code>bin_ref/arm64emu</code>): essentially <code>add/sub/mul/smulh/umulh/sdiv/udiv</code>, <code>cmp</code>, <code>b/b.cond/br/blr</code>, <code>ldur/stur</code> with 9-bit signed immediates, and a PC-relative <code>ldr xN, imm</code>. <strong>No <code>mov</code>, no <code>movz/movk</code>, no <code>ldp/stp</code>, no register-immediate add</strong>. Constants come exclusively from a PC-relative literal pool. This sounds annoying but it’s actually the source of most of the size win — see §5.3.</p><p>The “oracle” we’re racing against is <code>bin_ref/wlp4c</code>, the canonical course compiler. Both produce the same <code>.com</code> file format (header + program + relocation&#x2F;import&#x2F;export footer), both go through the same assembler <code>bin_ref/linkasm</code>, both are run under the same <code>arm64emu</code>. So <code>wc -c program.com</code> is a clean apples-to-apples comparison.</p><h2 id="2-The-starting-point"><a href="#2-The-starting-point" class="headerlink" title="2. The starting point"></a>2. The starting point</h2><p>Before this session, the compiler was already past “naive”. 
The previous commit (<code>6a9ed5d wlp4gen: trivial-leaf frame elision + tail-call optimization</code>) had two big wins in place:</p><ul><li><strong>Trivial-leaf frame elision</strong> — procedures with no locals, no <code>&amp;param</code>, no calls, and whose body is a single <code>return expr;</code> skip prologue&#x2F;epilogue entirely. Just compute <code>expr</code>, <code>br x30</code>.</li><li><strong>Tail-call optimization</strong> — <code>return f(...)</code> reuses the caller’s frame.</li></ul><p>What was missing was:</p><ol><li>A regression net I could trust before changing codegen.</li><li>Hard numbers on whether the optimizations were actually paying off.</li><li>A way to develop on macOS without manually round-tripping every test through a Linux VM (the course tools are x86-64 ELF only).</li></ol><p>So the work split into two threads: <strong>infrastructure first, then talk about codegen</strong>.</p><h2 id="3-Phase-1-—-Building-a-safety-net"><a href="#3-Phase-1-—-Building-a-safety-net" class="headerlink" title="3. Phase 1 — Building a safety net"></a>3. Phase 1 — Building a safety net</h2><p>Commit <a href="https://github.com/faketut/C-class-compiler/commit/a6f4a75"><code>a6f4a75</code></a>. 12 new test programs, a portable test runner, a GitHub Actions workflow.</p><p>The expanded corpus was deliberately picked to cover the parts of <code>wlp4gen</code> most likely to silently regress:</p><ul><li><strong>Parameter-count boundary cases</strong> (<code>four_args</code>, <code>six_args</code>, <code>eight_args</code>) — the ARM64 calling convention uses x0–x7 for the first 8 args; the 9th lives on the stack. The code path that spills overflow params is exercised exactly once in normal usage; without an explicit test it’s easy to break.</li><li><strong>Pointer arithmetic</strong> (<code>ptr_arith_sub</code>, <code>ptr_ptr_sub</code>) — <code>long*</code> − <code>long*</code> returns the <strong>element distance</strong>, not the byte distance. 
Three different sites need to agree on that fact.</li><li><strong>Heap with loops</strong> (<code>alloc_loop</code>) — exercises the runtime <code>init</code>&#x2F;<code>new</code>&#x2F;<code>delete</code> imports plus the linker.</li><li><strong>Nested calls</strong> (<code>nested_call</code>) — call-saves around argument evaluation.</li></ul><h3 id="The-portable-runner"><a href="#The-portable-runner" class="headerlink" title="The portable runner"></a>The portable runner</h3><p><a href="../scripts/run-tests.sh"><code>scripts/run-tests.sh</code></a> does one branch at the top:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> [[ <span class="string">&quot;<span class="subst">$(uname -s)</span>&quot;</span> == <span class="string">&quot;Darwin&quot;</span> ]]; <span class="keyword">then</span></span><br><span class="line">  <span class="built_in">exec</span> colima ssh -- bash -lc <span class="string">&quot;cd &#x27;<span class="variable">$ROOT</span>&#x27; &amp;&amp; bash scripts/run-tests.sh&quot;</span></span><br><span class="line"><span class="keyword">fi</span></span><br></pre></td></tr></table></figure><p>On macOS, it re-exec’s itself inside <a href="https://github.com/abiosoft/colima">colima</a> (a small Lima VM running x86_64 Ubuntu) so the course’s Linux binaries Just Work. On Linux (CI), it runs directly. This costs ~5 seconds of VM ssh overhead per local run, and zero in CI. <em>No code duplication, no environment matrix, no flaky cross-compilation</em>.</p><h3 id="One-subtle-bug-I-hit-later"><a href="#One-subtle-bug-I-hit-later" class="headerlink" title="One subtle bug I hit later"></a>One subtle bug I hit later</h3><p>A few hours in I started seeing <strong>intermittent failures</strong> — different tests would fail on each invocation. Classic race. 
The fix was a one-line change in Phase 2A:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Per-invocation scratch dir so concurrent runs don&#x27;t clobber each other.</span></span><br><span class="line">TMP=$(<span class="built_in">mktemp</span> -d)</span><br><span class="line"><span class="built_in">trap</span> <span class="string">&#x27;rm -rf &quot;$TMP&quot;&#x27;</span> EXIT</span><br></pre></td></tr></table></figure><p>The original script used hardcoded <code>/tmp/got.wlp4ti</code>, <code>/tmp/our.com</code>, etc. Fine for serial runs, broken the moment two <code>colima ssh</code> sessions overlap (which happens whenever an editor agent kicks off a verification while a previous one is still draining). Lesson: even for a “one-developer test script”, <code>mktemp -d</code> is one line for an unbounded amount of debugging avoided.</p><h2 id="4-Phase-2A-—-Diagnostics-that-survive-contact-with-reality"><a href="#4-Phase-2A-—-Diagnostics-that-survive-contact-with-reality" class="headerlink" title="4. Phase 2A — Diagnostics that survive contact with reality"></a>4. Phase 2A — Diagnostics that survive contact with reality</h2><p>Commit <a href="https://github.com/faketut/C-class-compiler/commit/c254eac"><code>c254eac</code></a>.</p><p>The original scanner printed <code>ERROR: unexpected character</code>. Type errors printed <code>ERROR: type mismatch</code>. 
Useless once your program is more than 20 lines.</p><p>The fix in <a href="../src/wlp4scan.cc">src&#x2F;wlp4scan.cc</a> is mechanical but worth describing because the <strong>shape of the change matters</strong>:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">size_t</span> line_no = <span class="number">1</span>;</span><br><span class="line"><span class="type">size_t</span> lineStart = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">auto</span> advancePos = [&amp;](<span class="type">size_t</span> k) &#123;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">size_t</span> i = <span class="number">0</span>; i &lt; k; ++i) &#123;</span><br><span class="line">        <span class="keyword">if</span> (input[pos + i] == <span class="string">&#x27;\n&#x27;</span>) &#123;</span><br><span class="line">            ++line_no;</span><br><span class="line">            lineStart = pos + i + <span class="number">1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    pos += k;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>Two state variables, one lambda. Every <code>pos += longest</code> became <code>advancePos(longest)</code>; whitespace and comment skips update inline. Now errors say <code>ERROR scan at line 14 col 9: unexpected &#39;@&#39;</code>. The point is: this is a <strong>non-invasive</strong> instrumentation. 
The scanner’s hot path got one extra <code>if (input[pos+i] == &#39;\n&#39;)</code> per character — negligible — and the rest of the file is untouched.</p><p>For <a href="../src/wlp4type.cc"><code>wlp4type</code></a>, the trick was a single file-scoped <code>static string g_curProc;</code> that gets set at the top of each <code>for (Node* proc : procedureNodes)</code> iteration. Every existing <code>reportError(detail)</code> site now produces <code>ERROR type in foo(): &lt;detail&gt;</code> without touching the dozens of error sites individually. Surgical change &gt; rewrite.</p><h2 id="5-The-optimizations-that-earned-their-keep"><a href="#5-The-optimizations-that-earned-their-keep" class="headerlink" title="5. The optimizations that earned their keep"></a>5. The optimizations that earned their keep</h2><p>Now the meaty part. The codegen in <a href="../src/wlp4gen.cc">src&#x2F;wlp4gen.cc</a> is 1300 lines. Here are the ideas that actually moved the benchmark needle, in approximate order of contribution.</p><h3 id="5-1-Trivial-leaf-frame-elision-pre-existing"><a href="#5-1-Trivial-leaf-frame-elision-pre-existing" class="headerlink" title="5.1 Trivial-leaf frame elision (pre-existing)"></a>5.1 Trivial-leaf frame elision (pre-existing)</h3><p>For a procedure like:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">long</span> <span class="title function_">add</span><span class="params">(<span class="type">long</span> a, <span class="type">long</span> b)</span> &#123; <span class="keyword">return</span> a + b; &#125;</span><br></pre></td></tr></table></figure><p>Reference compiler emits ~30 instructions: prologue with <code>sub sp</code>, <code>stur x29</code>, <code>stur x30</code>, body, epilogue with restores, <code>br x30</code>. 
We emit:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">Padd:</span><br><span class="line">  add x0, x0, x1</span><br><span class="line">  br x30</span><br></pre></td></tr></table></figure><p>Two instructions. <strong>The frame setup is dead weight when the function has no locals, no <code>&amp;p</code>, no calls, and the body fits the pattern <code>return expr;</code></strong> Walking the AST once at the top of <code>emitProcedure</code> to check this is cheap and unlocks a 90%+ size win on any small leaf — which is the majority of WLP4 test programs.</p><p>This was inherited from the prior commit, but I want to flag it because the benchmark would be ~−40% instead of ~−80% without it. <strong>The biggest optimization is the one you can avoid emitting code for entirely.</strong></p><h3 id="5-2-Tail-call-optimization-pre-existing"><a href="#5-2-Tail-call-optimization-pre-existing" class="headerlink" title="5.2 Tail-call optimization (pre-existing)"></a>5.2 Tail-call optimization (pre-existing)</h3><p><code>return f(args)</code> reuses the caller’s frame: jump to <code>f</code> with <code>b Pf</code> instead of <code>blr Pf; br x30</code>. Combined with §5.1, recursive functions like <code>fact</code> end up tight loops.</p><h3 id="5-3-The-per-procedure-literal-pool-with-dedup"><a href="#5-3-The-per-procedure-literal-pool-with-dedup" class="headerlink" title="5.3 The per-procedure literal pool with dedup"></a>5.3 The per-procedure literal pool with dedup</h3><p>This is the most architecturally interesting piece, because the ARM64 subset <em>forces</em> it: there’s no immediate-form <code>add</code> or <code>mov</code>. You cannot say <code>add x0, x0, #4</code>. You can only <code>add</code> register to register. 
So every numeric constant has to come from memory, loaded via PC-relative <code>ldr xN, imm</code>.</p><p>The reference compiler’s approach: every time you need a constant, emit a 5-instruction sequence (<code>ldr xN, 8; b 12; .8byte K; ...</code>) that hops over an inline 8-byte literal. Three uses of <code>4</code> → three inline literals → 60 bytes.</p><p>Our approach in <a href="../src/wlp4gen.cc"><code>finalizeLiteralPool</code></a>:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">emitLoadLitPayload</span><span class="params">(<span class="type">int</span> reg, <span class="type">const</span> string&amp; payload)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">auto</span> [it, inserted] = payloadToId.<span class="built_in">try_emplace</span>(payload, idToPayload.<span class="built_in">size</span>());</span><br><span class="line">    <span class="keyword">if</span> (inserted) idToPayload.<span class="built_in">push_back</span>(payload);</span><br><span class="line">    string tag = <span class="built_in">fmt</span>(<span class="string">&quot;PFIX&quot;</span>, it-&gt;second, <span class="string">&quot;!&quot;</span>);      <span class="comment">// sentinel for patching</span></span><br><span class="line">    fixups.<span class="built_in">push_back</span>(&#123;tag, payload&#125;);</span><br><span class="line">    <span class="built_in">emit</span>(<span class="built_in">fmt</span>(<span class="string">&quot;  ldr x&quot;</span>, reg, <span class="string">&quot;, &quot;</span>, tag));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Each unique 
constant gets <strong>one</strong> <code>.8byte</code> slot at the end of the procedure, and every load is a single <code>ldr xN, &lt;pc_offset&gt;</code> whose offset gets patched in <code>finalizeLiteralPool</code> after we know the final layout. The patching is a simple two-pass linear scan: the first pass records the byte address of every emitted line, and the second pass replaces each <code>PFIXn!</code> tag with the computed signed offset.</p><p>Concrete impact: a 5-call program with <code>4</code> used 5 times costs us 8 bytes for the slot + 5 × 4 bytes for the <code>ldr</code> instructions &#x3D; 28 bytes. The reference: 5 × ~20 bytes &#x3D; 100 bytes. And literal pools amortize <em>across the whole procedure</em>, so the savings compound with size.</p><p>This single mechanism is, I’d estimate, half of the total benchmark win.</p><h3 id="5-4-Constant-folding-in-isConst"><a href="#5-4-Constant-folding-in-isConst" class="headerlink" title="5.4 Constant folding in isConst"></a>5.4 Constant folding in <code>isConst</code></h3><p>The arithmetic chain templates in the generated corpus show the most extreme ratio: <code>arith_chain_8</code> comes in at <strong>−91.6%</strong>. 
Why?</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">long</span> <span class="title function_">wain</span><span class="params">(<span class="type">long</span> a, <span class="type">long</span> b)</span> &#123;</span><br><span class="line">  <span class="keyword">return</span> (((((((<span class="number">28</span> * <span class="number">41</span>) - <span class="number">18</span>) + <span class="number">38</span>) * <span class="number">23</span>) + <span class="number">49</span>) * <span class="number">50</span>) + <span class="number">12</span>);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><a href="../src/wlp4gen.cc#L228"><code>isConst</code></a> walks the expression tree and folds <code>+ − × ÷ %</code> into a single literal. Combined with §5.1 (trivial-leaf elision), the entire procedure becomes:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">Pwain:</span><br><span class="line">  ldr x0, 8</span><br><span class="line">  br x30</span><br><span class="line">  .8byte &lt;folded value&gt;</span><br></pre></td></tr></table></figure><p>Sixteen bytes of useful code: two 4-byte instructions plus one 8-byte literal. The reference compiler evaluates the whole expression tree at runtime: load 28, load 41, mul, …, one constant-load + arithmetic chain per operator. For an 8-operand chain that’s ~50 instructions of dead work.</p><p>Note that <code>isConst</code> only folds when <em>both</em> operands are themselves constant: it doesn’t try partial evaluation, it doesn’t reorder for associativity, it doesn’t fold across pointer types. 
<strong>The simple cases handle 90% of opportunities.</strong></p><h3 id="5-5-Parameter-local-promotion-to-callee-saved-registers"><a href="#5-5-Parameter-local-promotion-to-callee-saved-registers" class="headerlink" title="5.5 Parameter &#x2F; local promotion to callee-saved registers"></a>5.5 Parameter &#x2F; local promotion to callee-saved registers</h3><p><code>emitPrologue</code> checks: if the procedure uses ≤ 9 named values (params + locals) and never takes <code>&amp;</code> of any of them, all of them get assigned to <code>x19..x27</code> instead of stack slots. The epilogue only saves&#x2F;restores the registers actually used. The frame, if no other reason exists for it, gets a smaller <code>belowFpBytes</code>.</p><p>This is <em>the</em> single optimization most likely to break things — register allocation has to stay consistent across calls (save before, restore after) and across <code>if/while</code> branches. The way I keep it tractable: a single <code>regTab: id → reg</code> map per procedure built during prologue, consulted everywhere a local is read&#x2F;written, and <em>no further changes</em> to the rest of the codegen. Either an id is in <code>regTab</code> (use the register) or it’s not (use frame offsets). One state machine, no per-statement bookkeeping.</p><h2 id="6-Phase-5-—-Measuring-like-we-mean-it"><a href="#6-Phase-5-—-Measuring-like-we-mean-it" class="headerlink" title="6. Phase 5 — Measuring like we mean it"></a>6. Phase 5 — Measuring like we mean it</h2><p>Commit <a href="https://github.com/faketut/C-class-compiler/commit/55a2576"><code>55a2576</code></a> added <a href="../tools/bench.sh"><code>tools/bench.sh</code></a>. Three things matter about how it’s structured:</p><ol><li><strong>It runs the full compile + link pipeline on both sides</strong>, then <code>wc -c</code> the resulting <code>.com</code> files. 
Both go through the same <code>linkasm</code>, so footer&#x2F;header overhead cancels out — the delta is the program section.</li><li><strong>It re-execs into colima on macOS automatically</strong>, same trick as the test runner. Zero friction to run locally.</li><li><strong>It generates CSV</strong> rather than a pretty table, so I can pipe it into anything (the GitHub Actions step posts a summary to <code>$GITHUB_STEP_SUMMARY</code> and uploads the raw CSV as a downloadable artifact).</li></ol><p>Commit <a href="https://github.com/faketut/C-class-compiler/commit/2350dc4"><code>2350dc4</code></a> added <a href="../tools/gen_random_wlp4.py"><code>tools/gen_random_wlp4.py</code></a>: 8 parametric templates (arithmetic chains, local sums, if-ladders, while-sums, multi-arg procs, nested calls, pointer walks, recursive fib) seeded deterministically. This grew the benchmark from 25 hand-written tests to 65 programs.</p><h3 id="The-data"><a href="#The-data" class="headerlink" title="The data"></a>The data</h3><table><thead><tr><th>Corpus</th><th align="right">Files</th><th align="right">Ours (bytes)</th><th align="right">Reference (bytes)</th><th align="right">Delta</th></tr></thead><tbody><tr><td>Hand-written</td><td align="right">25</td><td align="right">7,212</td><td align="right">36,668</td><td align="right"><strong>−80.33%</strong></td></tr><tr><td>Hand + generated</td><td align="right">65</td><td align="right">19,588</td><td align="right">95,976</td><td align="right"><strong>−79.59%</strong></td></tr></tbody></table><p>The −80% holds steady when corpus size and shape change. 
That’s the validation I wanted before claiming the result generalizes.</p><h3 id="Picking-apart-a-single-program"><a href="#Picking-apart-a-single-program" class="headerlink" title="Picking apart a single program"></a>Picking apart a single program</h3><p>For <code>arith_chain_4</code>:</p><table><thead><tr><th></th><th align="right">Ours</th><th align="right">Reference</th></tr></thead><tbody><tr><td><code>.com</code> total bytes</td><td align="right">124</td><td align="right">1,328</td></tr><tr><td>Literal pool bytes (our side)</td><td align="right">32</td><td align="right">—</td></tr><tr><td>Reduction</td><td align="right"><strong>−90.7%</strong></td><td align="right"></td></tr></tbody></table><p>124 bytes is essentially: ARMCOM header (20) + 6 instructions (24) + an aligned literal slot (8) + footer (~70). The compiler is at the floor; the remaining bytes are format overhead.</p><h3 id="One-amusing-failure-of-the-generator"><a href="#One-amusing-failure-of-the-generator" class="headerlink" title="One amusing failure of the generator"></a>One amusing failure of the generator</h3><p>My first cut of <code>t_recursive</code> produced:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">long</span> <span class="title function_">f</span><span class="params">(<span class="type">long</span> n)</span> &#123;</span><br><span class="line">    <span class="keyword">if</span> (n &lt;= <span class="number">0</span>) &#123; <span class="keyword">return</span> n; &#125;</span><br><span class="line">    <span class="keyword">else</span> &#123; <span class="keyword">return</span> f(n - <span class="number">1</span>) + f(n - <span class="number">2</span>); &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>…which is invalid WLP4. 
The grammar requires <strong>exactly one trailing <code>return</code> per procedure body, never inside <code>if/else</code></strong>. I caught it because the parser flagged 7 out of 40 generated programs as <code>unexpected token &#39;RETURN&#39; (#15) in state 131</code>. Three-line fix using a <code>result</code> variable; the generator now emits valid WLP4 100% of the time.</p><p>Lesson: <strong>a noisy parser is a feature, not a bug</strong>. If you can’t tell what’s wrong from the error, your generator&#x2F;optimizer&#x2F;refactor will silently swallow problems for hours. The Phase 2A <code>line:col</code> work paid for itself on the first non-trivial use.</p><h2 id="7-Phase-4-—-CMake-a-real-driver"><a href="#7-Phase-4-—-CMake-a-real-driver" class="headerlink" title="7. Phase 4 — CMake + a real driver"></a>7. Phase 4 — CMake + a real driver</h2><p>Commit <a href="https://github.com/faketut/C-class-compiler/commit/ebd2f47"><code>ebd2f47</code></a>. The repo had <code>build-toolchain.sh</code> (4 invocations of <code>g++</code>) which was fine — but for anyone running the project from an IDE that has CMake integration, it was friction. Adding a minimal <a href="../CMakeLists.txt"><code>CMakeLists.txt</code></a> was 30 lines of cmake + a <code>WLP4_WERROR</code> option for CI. The shell script stays as the zero-dependency fast path.</p><p>The <a href="../bin/wlp4"><code>bin/wlp4</code></a> driver is more interesting. 
It does:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">wlp4 [-S | -c] [-o OUT] SRC.wlp4</span><br></pre></td></tr></table></figure><p>with the macOS routing trick for <code>-c</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> [[ <span class="string">&quot;<span class="variable">$uname_s</span>&quot;</span> == <span class="string">&quot;Darwin&quot;</span> ]] &amp;&amp; <span class="built_in">command</span> -v colima &gt;/dev/null 2&gt;&amp;1; <span class="keyword">then</span></span><br><span class="line">    colima ssh -- bash -lc <span class="string">&quot;cat &gt; /tmp/.wlp4-in.asm &amp;&amp; \</span></span><br><span class="line"><span class="string">        &#x27;<span class="variable">$ROOT</span>/bin_ref/linkasm&#x27; &lt; /tmp/.wlp4-in.asm&quot;</span> &lt; <span class="string">&quot;<span class="variable">$asm_tmp</span>&quot;</span> &gt; <span class="string">&quot;<span class="variable">$out</span>&quot;</span></span><br><span class="line"><span class="keyword">else</span></span><br><span class="line">    <span class="comment"># native path</span></span><br><span class="line">    <span class="string">&quot;<span class="variable">$LINKASM</span>&quot;</span> &lt; <span class="string">&quot;<span class="variable">$asm_tmp</span>&quot;</span> &gt; <span class="string">&quot;<span class="variable">$out</span>&quot;</span></span><br><span class="line"><span class="keyword">fi</span></span><br></pre></td></tr></table></figure><p><code>bin/wlp4 -c test/procedures/proc.wlp4</code> now Just Works on either host. 
This is <em>not</em> a big feature, but it removed a per-test mental tax that was discouraging quick experimentation.</p><h2 id="8-What-I-deliberately-did-not-do"><a href="#8-What-I-deliberately-did-not-do" class="headerlink" title="8. What I deliberately did not do"></a>8. What I deliberately did <em>not</em> do</h2><p>Equally important. The original plan had nine work items; only six landed. Here’s what was cut and why.</p><h3 id="Self-implemented-linkasm-binasm-linker-striparmcom"><a href="#Self-implemented-linkasm-binasm-linker-striparmcom" class="headerlink" title="Self-implemented linkasm &#x2F; binasm &#x2F; linker-striparmcom"></a>Self-implemented <code>linkasm</code> &#x2F; <code>binasm</code> &#x2F; <code>linker-striparmcom</code></h3><p>The pitch: own the entire toolchain instead of vendoring Linux binaries from <code>bin_ref/</code>. The cost: ~1k–1.5k LoC of reverse-engineering, with <strong>no formal spec for the assembler syntax</strong>. The doc <code>docs/armcom.txt</code> is 60 lines and covers only the binary <code>.com</code> format, not the input language to the assembler. Every ARM64 mnemonic the codegen emits would need a hand-rolled encoder, validated byte-for-byte against the reference.</p><p>The benefit: native macOS testing without colima. <em>Real value, but not on the critical path for any user-visible improvement.</em> Skipped.</p><h3 id="Parse-table-extraction-constexpr-arrays"><a href="#Parse-table-extraction-constexpr-arrays" class="headerlink" title="Parse-table extraction (constexpr arrays)"></a>Parse-table extraction (constexpr arrays)</h3><p><a href="../src/parse_tables.h"><code>src/parse_tables.h</code></a> embeds the LR tables as giant raw string literals; <code>wlp4parse</code> re-tokenizes them at startup. Replacing with <code>constexpr</code> arrays would shave the parser binary by ~30 KB and save a few ms of startup. 
Skipped: zero impact on any benchmark, full impact on the risk of breaking the parser on a <code>.wlp4i</code> shape we don’t have a test for.</p><h3 id="Further-wlp4gen-micro-optimizations-dead-branch-elimination-register-resident-i-i-1"><a href="#Further-wlp4gen-micro-optimizations-dead-branch-elimination-register-resident-i-i-1" class="headerlink" title="Further wlp4gen micro-optimizations (dead-branch elimination, register-resident i = i + 1)"></a>Further wlp4gen micro-optimizations (dead-branch elimination, register-resident <code>i = i + 1</code>)</h3><p>The benchmark is already at −80%. The remaining headroom is in patterns that essentially don’t occur in real WLP4 programs:</p><ul><li><code>if (1 == 1)</code> — nobody writes this; the constant-folded test never fires.</li><li><code>while (0) { ... }</code> — same.</li><li><code>i = i + 1</code> collapsed into a single <code>add</code> — only saves cycles, not bytes, and only when <code>i</code> is already in a register. Maybe 1–2% on tight loops <em>if</em> I’m careful about correctness.</li></ul><p>Risk-adjusted, these are negative-EV. Calling them out as “deferred until there’s a real driver” rather than secretly skipping them.</p><h3 id="The-discipline"><a href="#The-discipline" class="headerlink" title="The discipline"></a>The discipline</h3><p>Karpathy’s <a href="https://github.com/karpathy/...">behavioral guidelines</a> say: <em>don’t add abstractions for one-time operations; don’t refactor code that isn’t broken; every changed line should trace to the user’s request</em>. Applied to compiler work: <strong>don’t add an optimization that won’t show up on a benchmark you’ve already built.</strong> The benchmark is the success criterion. If it doesn’t move, the optimization didn’t happen.</p><h2 id="9-Lessons"><a href="#9-Lessons" class="headerlink" title="9. Lessons"></a>9. 
Lessons</h2><p>A few generalizable things, in order of how often I had to re-learn them:</p><ol><li><p><strong>Build the measurement before the optimization.</strong> I had Phase 1’s CI + Phase 5’s benchmark before I touched any codegen this session. Every subsequent decision had a number attached. The −80% headline is only meaningful because I can point at the script and the corpus that produced it.</p></li><li><p><strong>A safety net plus a noisy error message ≈ unlimited iteration budget.</strong> Phase 1 (regression tests) and Phase 2A (line:col diagnostics) combined cost about 2 hours and saved an unknowable but large amount of debugging time. The flaky-tests episode in §3 would have been hours of head-scratching without <code>line:col</code> confirming the scanner was producing identical output on retry.</p></li><li><p><strong>Surgical &gt; rewrite.</strong> The scanner diagnostics change is two state variables, one lambda. The type-pass change is one static string. The test runner <code>/tmp</code> race fix is <code>TMP=$(mktemp -d)</code>. Each ships in a commit with a clear blast radius. Compare against the alternative of “while we’re in there, let’s refactor”.</p></li><li><p><strong>Restrictive targets force good architecture.</strong> The ARM64 subset has no immediate-form arithmetic. That’s annoying for a one-shot translator but it forces the literal-pool design, which then gives you dedup almost for free, which then gives you most of the size win. The constraint <em>was</em> the optimization.</p></li><li><p><strong>Know when to stop.</strong> Six commits in, three planned items remained, all with the same property: high effort, negligible benchmark impact, real regression risk. The right call is to <em>write up the work</em> and put down the keyboard, not to continue grinding for marginal numbers. That’s this blog post.</p></li></ol><h2 id="10-Appendix-full-benchmark-table"><a href="#10-Appendix-full-benchmark-table" class="headerlink" title="10. 
Appendix: full benchmark table"></a>10. Appendix: full benchmark table</h2><p>See <a href="benchmark.csv">docs&#x2F;benchmark.csv</a> for the raw 65-row table. The columns:</p><ul><li><code>name</code> — program identifier (test file basename)</li><li><code>our_bytes</code> — <code>wc -c</code> of <code>wlp4{scan|parse|type|gen} | linkasm</code> output</li><li><code>ref_bytes</code> — <code>wc -c</code> of <code>wlp4c</code> output</li><li><code>delta_bytes</code> &#x3D; <code>our_bytes − ref_bytes</code> (negative means our output is smaller)</li><li><code>delta_pct</code> &#x3D; <code>100 × delta_bytes / ref_bytes</code></li><li><code>our_pool</code> — bytes in our literal pool (8 × count of <code>.8byte</code> lines)</li><li><code>ref_pool</code> — left at 0 (we don’t have the reference’s intermediate asm)</li></ul><p>Top 5 wins (most negative <code>delta_pct</code> first):</p><table><thead><tr><th>name</th><th align="right">our_bytes</th><th align="right">ref_bytes</th><th align="right">delta_pct</th></tr></thead><tbody><tr><td>arith_chain_8</td><td align="right">124</td><td align="right">1,472</td><td align="right">−91.58%</td></tr><tr><td>arith_chain_7</td><td align="right">124</td><td align="right">1,436</td><td align="right">−91.36%</td></tr><tr><td>arith_chain_4</td><td align="right">124</td><td align="right">1,328</td><td align="right">−90.66%</td></tr><tr><td>arith_chain_3</td><td align="right">124</td><td align="right">1,292</td><td align="right">−90.40%</td></tr><tr><td>wain_ptr</td><td align="right">140</td><td align="right">1,240</td><td align="right">−88.71%</td></tr></tbody></table><p>Bottom 5 (smallest wins first, where overhead matters most):</p><table><thead><tr><th>name</th><th align="right">our_bytes</th><th align="right">ref_bytes</th><th align="right">delta_pct</th></tr></thead><tbody><tr><td>alloc_loop</td><td align="right">724</td><td align="right">1,968</td><td align="right">−63.21%</td></tr><tr><td>alloc_basic</td><td align="right">548</td><td align="right">1,572</td><td 
align="right">−65.14%</td></tr><tr><td>nested_call</td><td align="right">468</td><td align="right">1,592</td><td align="right">−70.60%</td></tr><tr><td>recursive</td><td align="right">404</td><td align="right">1,584</td><td align="right">−74.49%</td></tr><tr><td>eight_args</td><td align="right">396</td><td align="right">1,640</td><td align="right">−75.85%</td></tr></tbody></table><p>The pattern is clean: <strong>arithmetic-heavy</strong> programs benefit most (constant folding + tiny pools), <strong>heap and many-arg</strong> programs benefit least (linker-pulled <code>alloc.com</code>, mandatory parameter spilling for 8+ args). Every test in between sits in a tight band around −80%.</p><hr><h3 id="Reproducing"><a href="#Reproducing" class="headerlink" title="Reproducing"></a>Reproducing</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/faketut/C-class-compiler.git</span><br><span class="line"><span class="built_in">cd</span> C-class-compiler</span><br><span class="line">./build-toolchain.sh</span><br><span class="line">bash scripts/run-tests.sh        <span class="comment"># 25/25 should pass</span></span><br><span class="line">bash tools/bench.sh &gt; docs/benchmark.csv</span><br><span class="line"><span class="comment"># On macOS, install colima first: brew install colima &amp;&amp; colima start --arch x86_64</span></span><br></pre></td></tr></table></figure><p>The benchmark is deterministic (<code>gen_random_wlp4.py</code> is seeded). The CSV will reproduce byte-for-byte across runs.</p>]]>
      </content:encoded>
    </item>
    <item>
      <title>Windows-only desktop app, macOS-friendly contributors</title>
      <link>https://faketut.github.io/2026/05/17/ghostpilot-06-cross-os-dev/</link>
      <description>
        <![CDATA[<p>GhostPilot is a Windows-only app. The stealth-overlay trick (click-through, capture-resistant) only works with <code>SetWindowDisplayAffi]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/desktop/">desktop</category>
      <category domain="https://faketut.github.io/tags/ghostpilot/">ghostpilot</category>
      <category domain="https://faketut.github.io/tags/pyqt/">pyqt</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <pubDate>Sun, 17 May 2026 13:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>GhostPilot is a Windows-only app. The stealth-overlay trick (click-through, capture-resistant) only works with <code>SetWindowDisplayAffinity</code> and friends. System-audio loopback uses WASAPI. The whole pitch is Windows-specific.</p><p>And yet I develop on macOS.</p><p>The repo runs end-to-end on my MacBook — minus the OS-specific overlay tricks — within seconds of <code>git clone</code>. The CI matrix is Windows-only. Nothing is faked. Nothing is mocked at the architecture level.</p><p>Here’s the discipline that makes that possible. It’s three rules.</p><h2 id="The-shape"><a href="#The-shape" class="headerlink" title="The shape"></a>The shape</h2><pre class="mermaid">flowchart TB    subgraph Core[Cross-platform: 95% of the code]        LLM[LLMEngine]        RAG[RAGManager]        ASR[ASR client]        UI[Qt overlay logic]        REC[Session recorder]    end    subgraph Platform[Platform-specific shims]        Audio[audio_capture.py<br/>WASAPI / sounddevice]        Win[windows_api.py<br/>overlay tricks]        Path[Path helpers<br/>roaming vs Library vs .config]    end    Core --> Audio    Core --> Win    Core --> Path    Audio -.via env.- Sounddev[sounddevice<br/>mac/Linux mic]    Audio -.via env.- WASAPI[WASAPI loopback<br/>Windows system audio]</pre><p>Cross-platform code doesn’t know which platform it’s on. 
Platform-specific code is small, named, and isolated.</p><h2 id="Rule-1-one-capability-one-backend-abstraction"><a href="#Rule-1-one-capability-one-backend-abstraction" class="headerlink" title="Rule 1: one capability, one backend abstraction"></a>Rule 1: one capability, one backend abstraction</h2><p>The audio module is the canonical example:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># src/audio_capture.py</span></span><br><span class="line">BACKEND = os.environ.get(<span class="string">&quot;AUDIO_BACKEND&quot;</span>, <span class="string">&quot;auto&quot;</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">make_capturer</span>() -&gt; AudioCapturer:</span><br><span class="line">    backend = BACKEND</span><br><span class="line">    <span class="keyword">if</span> backend == <span class="string">&quot;auto&quot;</span>:</span><br><span class="line">        backend = <span class="string">&quot;wasapi&quot;</span> <span class="keyword">if</span> sys.platform == <span class="string">&quot;win32&quot;</span> <span class="keyword">else</span> <span class="string">&quot;sounddevice&quot;</span></span><br><span class="line">    <span class="keyword">if</span> backend == <span class="string">&quot;wasapi&quot;</span>:</span><br><span class="line">        <span class="keyword">from</span> src._audio_wasapi <span class="keyword">import</span> 
WasapiLoopbackCapturer</span><br><span class="line">        <span class="keyword">return</span> WasapiLoopbackCapturer()</span><br><span class="line">    <span class="keyword">if</span> backend == <span class="string">&quot;sounddevice&quot;</span>:</span><br><span class="line">        <span class="keyword">from</span> src._audio_sounddevice <span class="keyword">import</span> SoundDeviceCapturer</span><br><span class="line">        <span class="keyword">return</span> SoundDeviceCapturer()</span><br><span class="line">    <span class="keyword">raise</span> ValueError(<span class="string">f&quot;unknown AUDIO_BACKEND: <span class="subst">&#123;backend&#125;</span>&quot;</span>)</span><br></pre></td></tr></table></figure><p>Both <code>WasapiLoopbackCapturer</code> and <code>SoundDeviceCapturer</code> implement the same async interface:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">AudioCapturer</span>(<span class="title class_ inherited__">Protocol</span>):</span><br><span class="line">    <span class="keyword">async</span> <span class="keyword">def</span> <span class="title function_">start</span>(<span class="params">self</span>) -&gt; AsyncIterator[<span class="built_in">bytes</span>]: ...</span><br><span class="line">    <span class="keyword">async</span> <span class="keyword">def</span> <span class="title function_">stop</span>(<span class="params">self</span>) -&gt; <span class="literal">None</span>: ...</span><br></pre></td></tr></table></figure><p>Calling code never branches on platform. It calls <code>make_capturer()</code> and iterates. 
The factory is the only place that knows.</p><pre class="mermaid">flowchart LR    Caller[ASR client] --> F[make_capturer]    F -->|win32| W[WasapiLoopbackCapturer]    F -->|darwin/linux| S[SoundDeviceCapturer]    W --> I[bytes async iterator]    S --> I    I --> Caller</pre><p><strong>The win</strong>: when a macOS dev tests “does the ASR pipeline work end-to-end?”, they <code>export AUDIO_BACKEND=sounddevice</code>, speak into the mic, and the entire app runs. They lose system-audio loopback (a Windows-specific feature requiring BlackHole on Mac) but they can test 100% of the application logic. Iterations don’t require a Windows VM.</p><h2 id="Rule-2-every-“where-do-I-store-this-”-question-goes-through-one-function"><a href="#Rule-2-every-“where-do-I-store-this-”-question-goes-through-one-function" class="headerlink" title="Rule 2: every “where do I store this?” question goes through one function"></a>Rule 2: every “where do I store this?” question goes through one function</h2><p>OS file conventions are different and silent. 
Get it wrong once, your app writes to <code>~/Documents/GhostPilot/</code> on Windows and <code>$APPDATA/GhostPilot/</code> on macOS, and now you have a phantom data directory you’ll find six months later wondering “what is this.”</p><p>Centralize:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">_user_prompt_dir</span>() -&gt; Path:</span><br><span class="line">    <span class="keyword">if</span> sys.platform == <span class="string">&quot;win32&quot;</span>:</span><br><span class="line">        base = os.environ.get(<span class="string">&quot;APPDATA&quot;</span>) <span class="keyword">or</span> <span class="built_in">str</span>(Path.home() / <span class="string">&quot;AppData&quot;</span> / <span class="string">&quot;Roaming&quot;</span>)</span><br><span class="line">        <span class="keyword">return</span> Path(base) / <span class="string">&quot;GhostPilot&quot;</span> / <span class="string">&quot;prompts&quot;</span></span><br><span class="line">    <span class="keyword">if</span> sys.platform == <span class="string">&quot;darwin&quot;</span>:</span><br><span class="line">        <span class="keyword">return</span> Path.home() / <span class="string">&quot;Library&quot;</span> / <span class="string">&quot;Application Support&quot;</span> / <span class="string">&quot;GhostPilot&quot;</span> / <span class="string">&quot;prompts&quot;</span></span><br><span class="line">    <span class="keyword">return</span> Path.home() / <span class="string">&quot;.config&quot;</span> / <span class="string">&quot;GhostPilot&quot;</span> / <span 
class="string">&quot;prompts&quot;</span></span><br></pre></td></tr></table></figure><p>Every “writable user data” lookup goes through a function like this. Three branches, one place. If I get the macOS convention wrong, I fix it in one file.</p><p>The recordings directory does the same. The keyring lookup does the same. The cache directory does the same.</p><h2 id="Rule-3-paths-inside-files-are-always-POSIX"><a href="#Rule-3-paths-inside-files-are-always-POSIX" class="headerlink" title="Rule 3: paths inside files are always POSIX"></a>Rule 3: paths inside files are always POSIX</h2><p>This is the rule I learned the hard way, three Windows CI failures in a row:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Wrong — works on Mac, breaks on Windows when the file is consumed cross-OS</span></span><br><span class="line">&#123;<span class="string">&quot;path&quot;</span>: <span class="built_in">str</span>(out.relative_to(<span class="variable language_">self</span>._<span class="built_in">dir</span>))&#125;</span><br><span class="line"><span class="comment"># → &quot;screenshots/0001.jpg&quot; on Mac</span></span><br><span class="line"><span class="comment"># → &quot;screenshots\\0001.jpg&quot; on Windows  ← breaks readers</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Right — same string on every platform</span></span><br><span class="line">&#123;<span class="string">&quot;path&quot;</span>: out.relative_to(<span class="variable language_">self</span>._<span class="built_in">dir</span>).as_posix()&#125;</span><br></pre></td></tr></table></figure><p><code>Path</code> is platform-aware for <em>interacting with the 
OS</em>. For <em>serializing</em> (JSONL, config files, anything that might be read on a different machine), normalize to forward slashes. <code>.as_posix()</code> is the magic. Always use it before writing a path string to disk.</p><pre class="mermaid">graph LR    A[Path object] -->|interact with OS<br/>open, exists, stat| B[Native separator]    A -->|serialize<br/>JSON, YAML, db column| C[.as_posix → forward slash]    C -->|read back on any OS| D[Path parses correctly]</pre><h2 id="CI-matrix-test-where-you-ship-dev-where-you’re-fast"><a href="#CI-matrix-test-where-you-ship-dev-where-you’re-fast" class="headerlink" title="CI matrix: test where you ship, dev where you’re fast"></a>CI matrix: test where you ship, dev where you’re fast</h2><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># .github/workflows/ci.yml</span></span><br><span class="line"><span class="attr">strategy:</span></span><br><span class="line">  <span class="attr">matrix:</span></span><br><span class="line">    <span class="attr">os:</span> [<span class="string">windows-latest</span>]</span><br><span class="line">    <span class="attr">python-version:</span> [<span class="string">&quot;3.9&quot;</span>, <span class="string">&quot;3.12&quot;</span>]</span><br></pre></td></tr></table></figure><p>Note: <strong>Windows only.</strong> I don’t run CI on macOS even though I develop there. Here’s why:</p><ul><li>The macOS dev experience is “import + run pytest + ruff.” That’s already verified by my local pre-commit muscle memory.</li><li>The thing that breaks on Windows is <em>the platform-specific 5%</em>. 
Adding macOS to CI would catch nothing the Windows runner doesn’t, and would double my CI minutes.</li><li><strong>The Python 3.9 + 3.12 matrix matters more than the OS matrix.</strong> Half the bugs are newer syntax that 3.12 accepts and 3.9 rejects (PEP 604 unions such as <code>str | None</code>, for example).</li></ul><p>The local guard:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># At the top of every module using new-style hints</span></span><br><span class="line"><span class="keyword">from</span> __future__ <span class="keyword">import</span> annotations</span><br></pre></td></tr></table></figure><p>This single line lets a <code>list[str] | None</code> annotation load on 3.9: annotations are stored as strings and are not evaluated when the function is defined. Without it, the import crashes on the Windows runner’s 3.9 job with a <code>TypeError</code> that doesn’t fire on my Mac because I’m on 3.13. It only covers annotations; a union used in a runtime expression (a type alias, an <code>isinstance</code> check) still fails on 3.9.</p><h2 id="What-the-macOS-dev-gets-—-and-doesn’t"><a href="#What-the-macOS-dev-gets-—-and-doesn’t" class="headerlink" title="What the macOS dev gets — and doesn’t"></a>What the macOS dev gets — and doesn’t</h2><p>What works on macOS clone-to-running:</p><ul><li>All tests (102&#x2F;102 pass)</li><li>Lint, type check</li><li>ASR pipeline with <code>AUDIO_BACKEND=sounddevice</code></li><li>LLM streaming, RAG retrieval, session record&#x2F;replay</li><li>Settings UI, prompts editor, sessions viewer</li></ul><p>What doesn’t work — and is fine:</p><ul><li>The stealth overlay (Windows API)</li><li>System-audio loopback (needs BlackHole bridge)</li><li>Hotkey listener (Windows-specific implementation)</li><li>The build target (PyInstaller spec is Windows-only)</li></ul><p><strong>95% of the bugs live in the 95% of the code that’s cross-platform.</strong> That’s the part the macOS dev can iterate on. 
The 5% that’s Windows-only gets manual verification on a Windows VM at release time, not every commit.</p><h2 id="The-summary"><a href="#The-summary" class="headerlink" title="The summary"></a>The summary</h2><table><thead><tr><th>Discipline</th><th>What it buys</th></tr></thead><tbody><tr><td>Backend abstraction with env-var override</td><td>macOS dev can run the full app, not a stub</td></tr><tr><td>One function per “where does data live?”</td><td>No phantom directories, easy to fix when wrong</td></tr><tr><td><code>.as_posix()</code> for serialized paths</td><td>Recordings replay cross-OS</td></tr><tr><td>Windows-only CI, multi-Python matrix</td><td>Catch the real bugs (3.9 vs 3.12), skip the fake ones</td></tr><tr><td><code>from __future__ import annotations</code> at the top of every module</td><td>3.9 keeps parsing</td></tr></tbody></table><p>You can build a Windows-only app from a Mac. You just have to draw the line cleanly between <em>what your app does</em> and <em>how it does it on this specific OS</em>. Cross-platform code is the asset. Platform-specific shims are the cost. Keep the ratio honest and the dev loop stays tight.</p>]]>
      </content:encoded>
    </item>
    <item>
      <title>I made ruff CI-blocking. The whole repo changed 5 lines.</title>
      <link>https://faketut.github.io/2026/05/17/ghostpilot-05-ruff-blocking/</link>
      <description>
        <![CDATA[<p>Most lint cleanups happen the wrong way:</p>
<ol>
<li>Run the linter on a mature repo.</li>
<li>See 400 errors.</li>
<li>Disable half the]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/desktop/">desktop</category>
      <category domain="https://faketut.github.io/tags/ghostpilot/">ghostpilot</category>
      <category domain="https://faketut.github.io/tags/pyqt/">pyqt</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <pubDate>Sun, 17 May 2026 07:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>Most lint cleanups happen the wrong way:</p><ol><li>Run the linter on a mature repo.</li><li>See 400 errors.</li><li>Disable half the rules.</li><li>Lower the severity to “warning.”</li><li>Ship <code>continue-on-error: true</code> in CI.</li><li>Never look at the warnings again.</li></ol><p>Six months later the lint config is a graveyard of disabled rules, the CI step is a placebo, and adding a new rule is a project. The repo never gets cleaner.</p><p>Here’s a better path: <strong>gradual strictness</strong>. Three stages, each cheap, each independently shippable.</p><h2 id="The-three-stages"><a href="#The-three-stages" class="headerlink" title="The three stages"></a>The three stages</h2><pre class="mermaid">flowchart LR    S0[No linter] --> S1[Stage 1<br/>continue-on-error<br/>see the truth]    S1 --> S2[Stage 2<br/>fix in batches<br/>still non-blocking]    S2 --> S3[Stage 3<br/>make blocking<br/>tree is clean]    S3 --> S4[New rule?<br/>back to stage 1]</pre><p>The key insight: <strong>the only stage where you tolerate noise is stage 1, and only briefly.</strong> The point of stage 1 is to see the real shape of the problem, not to live there.</p><h2 id="Stage-1-install-run-collect-—-5-minutes"><a href="#Stage-1-install-run-collect-—-5-minutes" class="headerlink" title="Stage 1: install, run, collect — 5 minutes"></a>Stage 1: install, run, collect — 5 minutes</h2><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># .github/workflows/ci.yml</span></span><br><span class="line"><span class="bullet">-</span> <span class="attr">name:</span> <span class="string">Lint</span></span><br><span class="line">  <span class="attr">run:</span> <span class="string">ruff</span> <span class="string">check</span> <span 
class="string">.</span></span><br><span class="line">  <span class="attr">continue-on-error:</span> <span class="literal">true</span>   <span class="comment"># explicitly temporary</span></span><br></pre></td></tr></table></figure><p><code>continue-on-error</code> is doing one job: surfacing the violations in the CI log so you can see them. Not blocking the build. Not silencing them.</p><figure class="highlight toml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># pyproject.toml — start permissive</span></span><br><span class="line"><span class="section">[tool.ruff]</span></span><br><span class="line"><span class="attr">line-length</span> = <span class="number">100</span></span><br><span class="line"><span class="attr">target-version</span> = <span class="string">&quot;py39&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">[tool.ruff.lint]</span></span><br><span class="line"><span class="attr">select</span> = [<span class="string">&quot;E&quot;</span>, <span class="string">&quot;F&quot;</span>, <span class="string">&quot;W&quot;</span>, <span class="string">&quot;I&quot;</span>]  <span class="comment"># errors, pyflakes, warnings, import sorting</span></span><br></pre></td></tr></table></figure><p>That’s the minimum useful ruleset. Add more later.</p><h2 id="Stage-2-batch-fix-don’t-piecemeal-fix"><a href="#Stage-2-batch-fix-don’t-piecemeal-fix" class="headerlink" title="Stage 2: batch-fix, don’t piecemeal-fix"></a>Stage 2: batch-fix, don’t piecemeal-fix</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ruff check . 
--output-format=concise</span><br><span class="line"><span class="comment"># Found 47 errors.</span></span><br></pre></td></tr></table></figure><p>Resist the temptation to fix them one at a time across 47 PRs. Two passes:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ruff check . --fix       <span class="comment"># safe autofixes</span></span><br><span class="line">ruff check .             <span class="comment"># what&#x27;s left needs eyes</span></span><br></pre></td></tr></table></figure><p>The remaining errors fall into three groups:</p><table><thead><tr><th>Group</th><th>Strategy</th></tr></thead><tbody><tr><td>Real bugs (F841 unused var, F401 unused import)</td><td>Fix immediately, often with <code>--fix</code>.</td></tr><tr><td>Style (E501 line too long, I001 import order)</td><td><code>--fix</code> handles 95%.</td></tr><tr><td>Disagreements (E741 ambiguous name <code>l</code>)</td><td>Either rename or add an inline <code># noqa: E741</code> with a reason.</td></tr></tbody></table><p>In this repo, stage 2 was one commit:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">chore(d): remove dead self.text_client, fix lint, make ruff blocking</span><br><span class="line">- llm_engine: remove dead self.text_client legacy raw-SDK assignment</span><br><span class="line">- main.py: drop unused kb_watch_task local (F841)</span><br><span class="line">- settings_ui.py: hoist pathlib.Path import to module top instead of __import__ trick</span><br><span class="line">- tests: drop unused imports (AsyncMock, Path, pytest)</span><br><span class="line">- .github/workflows/ci.yml: 
remove continue-on-error from ruff step</span><br></pre></td></tr></table></figure><p>7 files, 8 insertions, 16 deletions. Now the lint passes.</p><h2 id="Stage-3-flip-the-switch"><a href="#Stage-3-flip-the-switch" class="headerlink" title="Stage 3: flip the switch"></a>Stage 3: flip the switch</h2><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># .github/workflows/ci.yml</span></span><br><span class="line"><span class="bullet">-</span> <span class="attr">name:</span> <span class="string">Lint</span></span><br><span class="line">  <span class="attr">run:</span> <span class="string">ruff</span> <span class="string">check</span> <span class="string">.</span></span><br><span class="line">  <span class="comment"># continue-on-error removed — must pass</span></span><br></pre></td></tr></table></figure><p>This is the entire stage 3 change. One line, deleted.</p><p>Now any future violation is a CI failure, surfaced on the PR, before review. The cost of fixing it is 10 seconds (<code>ruff check . --fix</code>). The cost of <em>not</em> enforcing it is unbounded.</p><h2 id="Why-this-works-where-“enable-everything-at-once”-fails"><a href="#Why-this-works-where-“enable-everything-at-once”-fails" class="headerlink" title="Why this works where “enable everything at once” fails"></a>Why this works where “enable everything at once” fails</h2><pre class="mermaid">graph TB    A[Enable strict linter on mature repo] --> B[400 errors]    B --> C{Choice}    C -->|Fix all| D[Massive PR<br/>impossible to review]    C -->|Disable rules| E[Lint config rot]    C -->|Ignore| F[Useless CI step]    A2[Gradual strictness] --> B2[Stage 1: see 47 errors]    B2 --> C2[Stage 2: fix in 1 commit]    C2 --> D2[Stage 3: enforce going forward]    D2 --> E2[Adding a new rule? 
Repeat]</pre><p>The mature-repo failure mode happens because the team conflates <strong>“the code needs to comply with this rule”</strong> with <strong>“the rule needs to be on right now.”</strong> Decoupling them buys you the breathing room to actually fix things.</p><h2 id="What-to-do-when-adding-a-new-rule"><a href="#What-to-do-when-adding-a-new-rule" class="headerlink" title="What to do when adding a new rule"></a>What to do when adding a new rule</h2><p>The exact same loop:</p><ol><li>Add the rule to <code>select</code> and put <code>continue-on-error: true</code> back on the lint step (or mark existing violations with <code># noqa</code>).</li><li>Run the linter, count the new violations.</li><li>Fix them in one batch — typically one commit per category.</li><li>Remove <code>continue-on-error</code>.</li></ol><p>If the count of violations is too high to fix in one batch, the rule is too aggressive for this codebase. Either narrow its scope (<code>per-file-ignores</code>) or don’t add it.</p><h2 id="ruff-specific-tricks-worth-knowing"><a href="#ruff-specific-tricks-worth-knowing" class="headerlink" title="ruff-specific tricks worth knowing"></a>ruff-specific tricks worth knowing</h2><figure class="highlight toml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">[tool.ruff.lint.per-file-ignores]</span></span><br><span class="line"><span class="comment"># Test files may use short, ambiguous names (E741)</span></span><br><span class="line"><span class="attr">&quot;tests/*&quot;</span> = [<span class="string">&quot;E741&quot;</span>]</span><br><span class="line"><span class="comment"># Generated migrations</span></span><br><span class="line"><span class="attr">&quot;migrations/*&quot;</span> = [<span class="string">&quot;E501&quot;</span>, <span 
class="string">&quot;I001&quot;</span>]</span><br></pre></td></tr></table></figure><figure class="highlight toml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">[tool.ruff.lint.isort]</span></span><br><span class="line"><span class="attr">known-first-party</span> = [<span class="string">&quot;src&quot;</span>]</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># One-off audit: every rule except annotations (ANN) and docstrings (D)</span></span><br><span class="line">ruff check --<span class="keyword">select</span>=ALL --ignore=ANN --ignore=D .</span><br></pre></td></tr></table></figure><h2 id="The-lesson"><a href="#The-lesson" class="headerlink" title="The lesson"></a>The lesson</h2><p><strong>A linter that doesn’t fail the build is a comment in a config file.</strong> It signals intent without enforcing it, and intent decays. The gradual path lets you go from zero to enforced in three small steps, each one shippable on its own day, without ever blocking the team on a single mega-cleanup.</p><p>In this repo: from “no linter” to “ruff CI-blocking on Windows × Py 3.9 + 3.12” was three commits over two days. Zero painful PRs. The repo has stayed clean since.</p>]]>
      </content:encoded>
    </item>
    <item>
      <title>Hybrid RAG when your corpus has 50 chunks, not 5 million</title>
      <link>https://faketut.github.io/2026/05/17/ghostpilot-04-tiny-rag/</link>
      <description>
        <![CDATA[<p>Most RAG content assumes you have a vector DB, an embeddings budget, and tens of thousands of documents. The reality for personal tools,]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/desktop/">desktop</category>
      <category domain="https://faketut.github.io/tags/ghostpilot/">ghostpilot</category>
      <category domain="https://faketut.github.io/tags/pyqt/">pyqt</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <pubDate>Sun, 17 May 2026 01:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>Most RAG content assumes you have a vector DB, an embeddings budget, and tens of thousands of documents. The reality for personal tools, internal wikis, and onboarding bots is different: <strong>you have a handful of markdown files and a handful of seconds to retrieve from them.</strong></p><p>GhostPilot’s knowledge base is one file: <code>knowledge/resume.md</code>, ~50 chunks after splitting. Retrieval feeds question-answering during interviews. Latency budget: under 200ms. Embedding cost budget: ideally zero per query.</p><p>A dense-only retriever would be the wrong tool here. Here’s what actually works.</p><h2 id="The-retrieval-pipeline"><a href="#The-retrieval-pipeline" class="headerlink" title="The retrieval pipeline"></a>The retrieval pipeline</h2><pre class="mermaid">flowchart LR    Q[Question] --> S1[BM25<br/>top-k1]    Q --> S2[Dense<br/>top-k2]    S1 --> M[RRF merge]    S2 --> M    M --> R[Re-rank by score]    R --> T[Top-N chunks]    T --> P[Inject into prompt]</pre><p>Three observations drove this:</p><ol><li><strong>BM25 alone catches all the high-recall named-entity queries.</strong> “Did you work at Stripe?” — <code>Stripe</code> appears in exactly one chunk. Dense retrieval will return the right chunk <em>plus three near-misses about other fintech experience</em>. BM25 returns one chunk with a huge score gap.</li><li><strong>Dense alone catches all the paraphrased queries.</strong> “Tell me about a time you mentored someone” — the word <code>mentored</code> may not appear in the resume, but <code>helped junior engineers</code> and <code>led the intern program</code> will embed close.</li><li><strong>Either alone gets ~70% of queries right. 
Together they get ~95%.</strong></li></ol><h2 id="Reciprocal-Rank-Fusion-in-5-lines"><a href="#Reciprocal-Rank-Fusion-in-5-lines" class="headerlink" title="Reciprocal Rank Fusion in 5 lines"></a>Reciprocal Rank Fusion in 5 lines</h2><p>Forget weighted score combinations — the scores are on different scales and weights are a nightmare to tune. RRF uses only ranks, has a single constant you rarely need to touch, and works:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">rrf</span>(<span class="params">rankings: <span class="built_in">list</span>[<span class="built_in">list</span>[<span class="built_in">str</span>]], k: <span class="built_in">int</span> = <span class="number">60</span></span>) -&gt; <span class="built_in">list</span>[<span class="built_in">tuple</span>[<span class="built_in">str</span>, <span class="built_in">float</span>]]:</span><br><span class="line">    scores: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">float</span>] = &#123;&#125;</span><br><span class="line">    <span class="keyword">for</span> ranking <span class="keyword">in</span> rankings:</span><br><span class="line">        <span class="keyword">for</span> rank, doc_id <span class="keyword">in</span> <span class="built_in">enumerate</span>(ranking, start=<span class="number">1</span>):</span><br><span class="line">            scores[doc_id] = scores.get(doc_id, <span class="number">0</span>) + <span class="number">1.0</span> / (k + rank)</span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">sorted</span>(scores.items(), key=<span class="keyword">lambda</span> x: -x[<span class="number">1</span>])</span><br></pre></td></tr></table></figure><p><code>k=60</code> is the 
canonical choice from the original RRF paper. Larger <code>k</code> flattens the rank decay; smaller <code>k</code> makes top-1 matter more. 60 is fine. Stop tuning it.</p><h2 id="Why-BM25-still-wins-on-small-corpora"><a href="#Why-BM25-still-wins-on-small-corpora" class="headerlink" title="Why BM25 still wins on small corpora"></a>Why BM25 still wins on small corpora</h2><pre class="mermaid">graph TB    subgraph Small[Corpus: ~50 chunks]        S1[BM25: precise on named entities]        S2[Dense: catches paraphrasing]        S1 -.equal value.- S2    end    subgraph Large[Corpus: ~5M chunks]        L1[BM25: query needs exact terms]        L2[Dense: handles vocabulary mismatch at scale]        L2 -.dominates.- L1    end</pre><p>The dense-retrieval orthodoxy assumes the corpus is large enough that vocabulary mismatch is the main failure mode. On a 50-chunk corpus:</p><ul><li>Every named entity appears in 1-3 chunks. BM25’s IDF term gives them huge weight. Dense embeddings smear this signal across “semantically similar” chunks.</li><li>The query language is close to the document language (you’re paraphrasing your own resume). Dense retrieval’s vocabulary-bridging superpower is underused.</li><li>Most importantly: <strong>the failure modes are different.</strong> Dense fails by returning plausible-but-wrong neighbors. BM25 fails by returning nothing for paraphrased queries. 
Hybrid covers both.</li></ul><h2 id="Cost-zero-per-query"><a href="#Cost-zero-per-query" class="headerlink" title="Cost: ~zero per query"></a>Cost: ~zero per query</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">RAGManager</span>:</span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, chunks</span>):</span><br><span class="line">        <span class="variable language_">self</span>.chunks = chunks  <span class="comment"># kept so retrieve() can map indices back to text</span></span><br><span class="line">        <span class="variable language_">self</span>.bm25 = BM25Okapi([tokenize(c) <span class="keyword">for</span> c <span class="keyword">in</span> chunks])</span><br><span class="line">        <span class="comment"># Embeddings computed once at startup, cached on disk.</span></span><br><span class="line">        <span class="variable language_">self</span>.embeddings = <span class="variable language_">self</span>._load_or_compute(chunks)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">retrieve</span>(<span class="params">self, query: <span class="built_in">str</span>, top_n: <span class="built_in">int</span> = <span class="number">4</span></span>):</span><br><span class="line">        bm25_ranking = <span class="variable language_">self</span>._bm25_topk(query, k=<span class="number">10</span>)</span><br><span class="line">        dense_ranking = <span class="variable language_">self</span>._dense_topk(query, k=<span class="number">10</span>)</span><br><span class="line">        merged = rrf([bm25_ranking, 
dense_ranking])</span><br><span class="line">        <span class="keyword">return</span> [<span class="variable language_">self</span>.chunks[i] <span class="keyword">for</span> i, _ <span class="keyword">in</span> merged[:top_n]]</span><br></pre></td></tr></table></figure><p>Per-query cost:</p><table><thead><tr><th>Step</th><th>Cost</th></tr></thead><tbody><tr><td>BM25</td><td><code>O(n)</code> in the number of chunks</td></tr><tr><td>Dense</td><td>one embedding API call OR one local model forward (sentence-transformers all-MiniLM ~30ms on CPU)</td></tr><tr><td>RRF merge</td><td><code>O(k)</code>, microseconds</td></tr><tr><td><strong>Total</strong></td><td><strong>30-150ms, $0 if local embeddings</strong></td></tr></tbody></table><p>GhostPilot uses <code>sentence-transformers/all-MiniLM-L6-v2</code> locally. 80MB download, no API keys, runs on CPU. For 50 chunks, the embedding <em>computation at startup</em> is the dominant cost (~3 seconds, once). Per-query embedding of the user’s question is ~30ms.</p><h2 id="Hot-reload-because-the-corpus-is-small"><a href="#Hot-reload-because-the-corpus-is-small" class="headerlink" title="Hot reload because the corpus is small"></a>Hot reload because the corpus is small</h2><p>A 50-chunk corpus rebuilds in &lt;1 second. So instead of a separate ingestion pipeline:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">watcher = QFileSystemWatcher([<span class="built_in">str</span>(knowledge_dir)])</span><br><span class="line">watcher.directoryChanged.connect(</span><br><span class="line">    <span class="keyword">lambda</span>: loop.create_task(rag_manager.rebuild_async())</span><br><span class="line">)</span><br></pre></td></tr></table></figure><p>Edit a markdown file, save, the index updates before you’ve Alt-Tab’d back to the overlay. 
This is impossible at 5M chunks; it’s trivial at 50.</p><h2 id="When-to-switch-to-dense-only-or-a-vector-DB"><a href="#When-to-switch-to-dense-only-or-a-vector-DB" class="headerlink" title="When to switch to dense-only or a vector DB"></a>When to switch to dense-only or a vector DB</h2><pre class="mermaid">flowchart TD    A[Corpus size?] -->|< 1k chunks| B[Hybrid BM25 + dense<br/>in-memory, hot-reload]    A -->|1k - 100k| C[Hybrid + on-disk dense index<br/>FAISS, sqlite-vss]    A -->|> 100k| D[Dedicated vector DB<br/>Qdrant, Weaviate]    A -->|> 10M| E[Sharded ANN + filtering pipeline]</pre><p>The threshold for needing a vector DB is much higher than the marketing implies. Below ~10k chunks, hybrid retrieval in process beats anything else on latency, cost, and operational complexity combined.</p><h2 id="Things-that-didn’t-make-the-cut"><a href="#Things-that-didn’t-make-the-cut" class="headerlink" title="Things that didn’t make the cut"></a>Things that didn’t make the cut</h2><ul><li><strong>Cross-encoder re-ranker.</strong> Tested, added ~150ms and ~3% recall. Not worth it at this scale.</li><li><strong>Query expansion.</strong> The LLM does this implicitly — adding it in retrieval was redundant noise.</li><li><strong>Chunk overlap.</strong> Tried 20% overlap, made BM25 fire twice on the same content, hurt RRF. Pure non-overlapping chunks at sentence boundaries won.</li></ul><h2 id="TL-DR"><a href="#TL-DR" class="headerlink" title="TL;DR"></a>TL;DR</h2><p>For corpora under ~10k chunks:</p><ol><li>Use BM25 <em>and</em> dense embeddings. Each catches what the other misses.</li><li>Merge with RRF, not weighted scores. Stop tuning weights.</li><li>Keep everything in memory. Hot-reload on file change.</li><li>Run embeddings locally. The 80MB model is enough.</li><li>Resist the urge to add a vector DB. You don’t need it.</li></ol><p>The whole <code>rag_manager.py</code> is ~200 lines including the watcher. Total p95 retrieval latency on my machine: 47ms.</p>]]>
      </content:encoded>
    </item>
    <item>
      <title>One event loop to rule them all: PyQt6 + asyncio in production</title>
      <link>https://faketut.github.io/2026/05/16/ghostpilot-03-qasync-prod/</link>
      <description>
        <![CDATA[<p>The Python desktop-app stack has a coordination problem.</p>
<ul>
<li>Qt wants to own the main thread.</li>
<li>asyncio wants to own the]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/desktop/">desktop</category>
      <category domain="https://faketut.github.io/tags/ghostpilot/">ghostpilot</category>
      <category domain="https://faketut.github.io/tags/pyqt/">pyqt</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <pubDate>Sat, 16 May 2026 19:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>The Python desktop-app stack has a coordination problem.</p><ul><li>Qt wants to own the main thread.</li><li>asyncio wants to own the event loop.</li><li>Your audio capture wants a background thread.</li><li>Your LLM stream wants to be an async generator.</li><li>Your hotkey listener wants OS-level callbacks.</li></ul><p>GhostPilot runs all five of these in one process, on Windows, without a single deadlock or threading bug. The glue is <code>qasync</code>, plus a small set of discipline rules. Here’s what survived contact with production.</p><h2 id="The-architecture"><a href="#The-architecture" class="headerlink" title="The architecture"></a>The architecture</h2><pre class="mermaid">flowchart TB    subgraph Main[Qt main thread<br/>= asyncio event loop via qasync]        UI[OverlayUI widgets]        LLM[LLMEngine.generate_*_stream<br/>async generators]        Replay[session_replay tasks]        Watcher[QFileSystemWatcher<br/>RAG hot-reload]    end    subgraph Workers[Background threads]        Audio[sounddevice / WASAPI capture]        ASR[Azure Speech SDK callback]        Hotkey[keyboard hook]    end    Audio -->|asyncio.run_coroutine_threadsafe| Q1    ASR -->|asyncio.run_coroutine_threadsafe| Q1    Hotkey -->|QMetaObject.invokeMethod| UI    Q1[(asyncio.Queue)] --> LLM    LLM --> UI    Replay --> UI    Watcher --> RAG    UI -.uses.-> LLM</pre><p>Rule 1: <strong>everything async runs on the Qt thread.</strong> Workers cross the boundary in exactly one of two ways.</p><h2 id="qasync-in-10-lines"><a href="#qasync-in-10-lines" class="headerlink" title="qasync in 10 lines"></a>qasync in 10 lines</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> asyncio</span><br><span class="line"><span class="keyword">import</span> sys</span><br><span class="line"><span class="keyword">from</span> PyQt6.QtWidgets <span class="keyword">import</span> QApplication</span><br><span class="line"><span class="keyword">import</span> qasync</span><br><span class="line"></span><br><span class="line">app = QApplication(sys.argv)</span><br><span class="line">loop = qasync.QEventLoop(app)</span><br><span class="line">asyncio.set_event_loop(loop)</span><br><span class="line"><span class="keyword">with</span> loop:</span><br><span class="line">    loop.run_until_complete(main())  <span class="comment"># main() is an async coroutine</span></span><br></pre></td></tr></table></figure><p>That’s the entire integration. <code>qasync.QEventLoop</code> is an asyncio event loop that runs on top of Qt’s own event loop. <code>await asyncio.sleep(0.1)</code> works, <code>loop.create_task(...)</code> works, and Qt slots fire on the same thread.</p><p>The trap: it is <strong>one loop on one thread</strong>. 
Anything that tries to spawn a second loop (a library that calls <code>asyncio.run(...)</code> internally, a thread that calls <code>asyncio.get_event_loop()</code>) will explode.</p><h2 id="Crossing-the-boundary-two-patterns"><a href="#Crossing-the-boundary-two-patterns" class="headerlink" title="Crossing the boundary, two patterns"></a>Crossing the boundary, two patterns</h2><h3 id="Worker-thread-→-asyncio-task"><a href="#Worker-thread-→-asyncio-task" class="headerlink" title="Worker thread → asyncio task"></a>Worker thread → asyncio task</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">_on_audio_chunk</span>(<span class="params">chunk: <span class="built_in">bytes</span></span>) -&gt; <span class="literal">None</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;Called from sounddevice&#x27;s audio thread.&quot;&quot;&quot;</span></span><br><span class="line">    asyncio.run_coroutine_threadsafe(audio_queue.put(chunk), loop)</span><br></pre></td></tr></table></figure><p><code>run_coroutine_threadsafe</code> schedules a coroutine on the loop and returns a <code>concurrent.futures.Future</code>. Critically: it does <em>not</em> block the calling thread. 
The audio callback returns in microseconds.</p><h3 id="Worker-thread-→-Qt-widget"><a href="#Worker-thread-→-Qt-widget" class="headerlink" title="Worker thread → Qt widget"></a>Worker thread → Qt widget</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">_on_hotkey</span>():</span><br><span class="line">    <span class="string">&quot;&quot;&quot;Called from the keyboard hook thread.&quot;&quot;&quot;</span></span><br><span class="line">    QMetaObject.invokeMethod(ui_asr, <span class="string">&quot;toggle_visibility&quot;</span>,</span><br><span class="line">                             Qt.ConnectionType.QueuedConnection)</span><br></pre></td></tr></table></figure><p><code>QueuedConnection</code> posts the call to the Qt event loop. The target slot runs on the Qt thread on the next iteration.</p><p><strong>Never touch a QWidget from a non-Qt thread.</strong> Even setting <code>widget.setText(...)</code> from a worker thread will crash, eventually, in a way that doesn’t reproduce.</p><h2 id="Async-generators-are-the-right-shape-for-LLM-streams"><a href="#Async-generators-are-the-right-shape-for-LLM-streams" class="headerlink" title="Async generators are the right shape for LLM streams"></a>Async generators are the right shape for LLM streams</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">LLMEngine</span>:</span><br><span class="line">    <span class="keyword">async</span> <span class="keyword">def</span> <span class="title 
function_">generate_answer_stream</span>(<span class="params">self, question, ui_queue, *, q_type=<span class="literal">None</span></span>):</span><br><span class="line">        <span class="keyword">async</span> <span class="keyword">for</span> delta <span class="keyword">in</span> <span class="variable language_">self</span>.text_provider.chat_stream(messages, model=<span class="variable language_">self</span>.model):</span><br><span class="line">            <span class="keyword">await</span> ui_queue.put(&#123;<span class="string">&quot;type&quot;</span>: <span class="string">&quot;token&quot;</span>, <span class="string">&quot;text&quot;</span>: delta.text&#125;)</span><br><span class="line">        <span class="keyword">await</span> ui_queue.put(&#123;<span class="string">&quot;type&quot;</span>: <span class="string">&quot;usage&quot;</span>, ...&#125;)</span><br></pre></td></tr></table></figure><p>The UI consumer is dead simple:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">async</span> <span class="keyword">def</span> <span class="title function_">ui_updater</span>():</span><br><span class="line">    <span class="keyword">while</span> <span class="literal">True</span>:</span><br><span class="line">        msg = <span class="keyword">await</span> ui_queue.get()</span><br><span class="line">        <span class="keyword">if</span> msg[<span class="string">&quot;type&quot;</span>] == <span class="string">&quot;token&quot;</span>:</span><br><span class="line">            overlay.append_token(msg[<span class="string">&quot;text&quot;</span>])</span><br><span class="line">        <span class="keyword">elif</span> msg[<span 
class="string">&quot;type&quot;</span>] == <span class="string">&quot;usage&quot;</span>:</span><br><span class="line">            overlay.flush_footer(msg)</span><br></pre></td></tr></table></figure><p>This shape gives you, for free:</p><ul><li>Backpressure (with a <code>maxsize</code> set, the queue fills if the UI is slow and the producer waits on <code>put</code>).</li><li>Cancellation (cancel the producer task, the queue drains, the consumer keeps going).</li><li>Recording (a second consumer tees tokens to disk — that’s how the session recorder works).</li></ul><h2 id="Cancellation-across-the-boundary"><a href="#Cancellation-across-the-boundary" class="headerlink" title="Cancellation across the boundary"></a>Cancellation across the boundary</h2><p>Long-running replays need a Cancel button. The pattern:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># UI thread (Sessions tab)</span></span><br><span class="line"><span class="variable language_">self</span>._cancel_event = asyncio.Event()  <span class="comment"># safe because qasync = same thread</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">_on_cancel_clicked</span>(<span class="params">self</span>):</span><br><span class="line">    <span class="variable language_">self</span>._cancel_event.<span class="built_in">set</span>()</span><br><span class="line"></span><br><span class="line"><span class="comment"># Replay task</span></span><br><span class="line"><span class="keyword">async</span> <span class="keyword">def</span> <span class="title 
function_">_run</span>():</span><br><span class="line">    <span class="keyword">for</span> i, turn <span class="keyword">in</span> <span class="built_in">enumerate</span>(turns, <span class="number">1</span>):</span><br><span class="line">        <span class="keyword">if</span> cancel_event.is_set():</span><br><span class="line">            <span class="keyword">break</span></span><br><span class="line">        <span class="keyword">await</span> session_replay.replay_turn(engine, turn)</span><br></pre></td></tr></table></figure><p>Crucial detail: <strong>the in-flight turn runs to completion.</strong> Cancelling it mid-stream would leave a half-finished assistant row in the recording. The cancel is checked <strong>between</strong> turns, never within them.</p><p>This is a general principle: <strong>cancellation points are negotiated, not imposed.</strong> A “cancel everything right now” is almost always wrong; the producer needs to land on a clean state.</p><h2 id="QFileSystemWatcher-asyncio-an-unexpected-pairing"><a href="#QFileSystemWatcher-asyncio-an-unexpected-pairing" class="headerlink" title="QFileSystemWatcher + asyncio: an unexpected pairing"></a>QFileSystemWatcher + asyncio: an unexpected pairing</h2><p>The RAG knowledge base hot-reloads when a markdown file under <code>knowledge/</code> changes. Naively, the watcher’s <code>directoryChanged</code> signal fires on the Qt thread — but rebuilding the BM25 index takes ~500ms and would freeze the UI.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">watcher.directoryChanged.connect(</span><br><span class="line">    <span class="keyword">lambda</span> path: loop.create_task(rebuild_kb_async())</span><br><span class="line">)</span><br></pre></td></tr></table></figure><p><code>loop.create_task(...)</code> is the bridge. 
The Qt slot returns immediately; the rebuild runs as an asyncio task that yields back to the loop between chunks. The UI stays responsive. No thread, no lock, no manual <code>QThread</code> boilerplate.</p><p>This is the unsung superpower of qasync: <strong>the Qt thread <em>is</em> the asyncio thread, so you never need a thread for “non-blocking but long-running” work.</strong> Make it an async task that awaits between chunks and the UI keeps running.</p><h2 id="What-I-will-not-do-again"><a href="#What-I-will-not-do-again" class="headerlink" title="What I will not do again"></a>What I will not do again</h2><pre class="mermaid">graph LR    A[Bug] --> B[Thought:<br/>spawn a QThread]    B --> C[Now you have<br/>two threads]    C --> D[Race conditions]    D --> A</pre><p>I started this project with a <code>QThread</code> per worker. Every interesting bug was a race condition between Qt’s event loop and the worker thread. The rewrite to “one qasync loop, workers are thin shims that marshal back via <code>run_coroutine_threadsafe</code>” deleted the entire category.</p><p>If you reach for <code>QThread</code>, stop and ask: can this work be an async task instead? Almost always, yes.</p><h2 id="Recap"><a href="#Recap" class="headerlink" title="Recap"></a>Recap</h2><table><thead><tr><th>Need</th><th>Tool</th></tr></thead><tbody><tr><td>Streaming LLM output to UI</td><td>async generator → <code>asyncio.Queue</code> → consumer task</td></tr><tr><td>OS callback (audio, hotkey) → async logic</td><td><code>asyncio.run_coroutine_threadsafe</code></td></tr><tr><td>OS callback → Qt widget</td><td><code>QMetaObject.invokeMethod(..., QueuedConnection)</code></td></tr><tr><td>File system watcher → expensive work</td><td>qasync slot → <code>loop.create_task</code></td></tr><tr><td>Long task cancellation</td><td><code>asyncio.Event</code> checked at clean boundaries</td></tr><tr><td><strong>Anything else that wants a thread</strong></td><td>First try: make it an async task</td></tr></tbody></table><p>One loop. 
One thread. One mental model. The whole app fits in your head.</p>]]>
      </content:encoded>
    </item>
    <item>
      <title>Your LLM retry loop is probably wrong</title>
      <link>https://faketut.github.io/2026/05/16/ghostpilot-02-failover-classify/</link>
      <description>
        <![CDATA[<p>A bad retry loop is a strict upgrade over no retry loop, until it isn’t.</p>
<p>Here’s the failure mode I shipped, then fixed:</p>
<block]]>
      </description>
      <author>Jian Feng</author>
      <category domain="https://faketut.github.io/categories/engineering/">engineering</category>
      <category domain="https://faketut.github.io/categories/engineering/desktop/">desktop</category>
      <category domain="https://faketut.github.io/tags/ghostpilot/">ghostpilot</category>
      <category domain="https://faketut.github.io/tags/pyqt/">pyqt</category>
      <category domain="https://faketut.github.io/tags/llm/">llm</category>
      <pubDate>Sat, 16 May 2026 13:00:00 GMT</pubDate>
      <content:encoded>
        <![CDATA[<p>A bad retry loop is a strict upgrade over no retry loop, until it isn’t.</p><p>Here’s the failure mode I shipped, then fixed:</p><blockquote><p>A user rotated their DeepSeek API key. They forgot to update <code>.env</code>. Every LLM call hit 401 Unauthorized. The retry loop dutifully retried each call three times, then failed over to Gemini — which also got 401 because the user had pasted the DeepSeek key into the Gemini field. Total: <strong>six network round-trips, six log lines, ~8 seconds of latency, identical result to giving up immediately.</strong></p></blockquote><p>The fix is to classify the error <em>before</em> deciding whether to retry, and <em>before</em> deciding whether to fail over.</p><h2 id="Three-buckets-not-one"><a href="#Three-buckets-not-one" class="headerlink" title="Three buckets, not one"></a>Three buckets, not one</h2><pre class="mermaid">flowchart TD    Err[Provider error] --> Q{Classify}    Q -->|fatal<br/>400/401/403/404| F[Raise immediately<br/>no retry, no failover]    Q -->|retryable<br/>408/425/429/5xx, network, timeout| R[Retry in place<br/>with backoff]    R -->|retries exhausted| N[Move to next provider]    Q -->|unknown| N</pre><p>Three rules carry the whole design:</p><table><thead><tr><th>Bucket</th><th>Why this action</th></tr></thead><tbody><tr><td><strong>fatal</strong> — <code>400/401/403/404</code>, “Invalid API key”, “Bad request”</td><td>Switching providers can’t fix a missing key. Retrying can’t fix a malformed request. Both add latency and burn quota.</td></tr><tr><td><strong>retryable</strong> — <code>408/425/429/5xx</code>, <code>TimeoutError</code>, <code>ConnectionError</code></td><td>The same provider will probably succeed on retry. Switching providers loses session context (e.g. usage caches). 
Backoff first, then fail over if the provider is genuinely down.</td></tr><tr><td><strong>unknown</strong></td><td>Conservative: don’t retry in place (could be fatal), but do try the fallback (the fallback might work).</td></tr></tbody></table><h2 id="The-classifier-is-the-entire-trick"><a href="#The-classifier-is-the-entire-trick" class="headerlink" title="The classifier is the entire trick"></a>The classifier is the entire trick</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line">_FATAL_STATUS = &#123;<span class="number">400</span>, <span class="number">401</span>, <span class="number">403</span>, <span class="number">404</span>&#125;</span><br><span class="line">_RETRYABLE_STATUS = &#123;<span class="number">408</span>, <span class="number">409</span>, <span class="number">425</span>, <span class="number">429</span>, <span class="number">500</span>, <span class="number">502</span>, <span class="number">503</span>, <span class="number">504</span>&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">classify_error</span>(<span class="params">err</span>) -&gt; <span 
class="built_in">str</span>:</span><br><span class="line">    <span class="comment"># 1) Structured status code from SDK exception</span></span><br><span class="line">    status = <span class="built_in">getattr</span>(err, <span class="string">&quot;status_code&quot;</span>, <span class="literal">None</span>) <span class="keyword">or</span> <span class="built_in">getattr</span>(err, <span class="string">&quot;status&quot;</span>, <span class="literal">None</span>)</span><br><span class="line">    <span class="keyword">if</span> status <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">        resp = <span class="built_in">getattr</span>(err, <span class="string">&quot;response&quot;</span>, <span class="literal">None</span>)</span><br><span class="line">        <span class="keyword">if</span> resp <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">            status = <span class="built_in">getattr</span>(resp, <span class="string">&quot;status_code&quot;</span>, <span class="literal">None</span>)</span><br><span class="line">    <span class="keyword">if</span> status <span class="keyword">in</span> _FATAL_STATUS:     <span class="keyword">return</span> <span class="string">&quot;fatal&quot;</span></span><br><span class="line">    <span class="keyword">if</span> status <span class="keyword">in</span> _RETRYABLE_STATUS: <span class="keyword">return</span> <span class="string">&quot;retryable&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># 2) Exception class (network layer)</span></span><br><span class="line">    name = <span class="built_in">type</span>(err).__name__.lower()</span><br><span class="line">    <span class="keyword">if</span> <span class="built_in">any</span>(s <span class="keyword">in</span> name <span class="keyword">for</span> s <span class="keyword">in</span> (<span 
class="string">&quot;timeout&quot;</span>, <span class="string">&quot;connection&quot;</span>, <span class="string">&quot;network&quot;</span>)):</span><br><span class="line">        <span class="keyword">return</span> <span class="string">&quot;retryable&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># 3) String regex on str(err) — last resort, but worth it</span></span><br><span class="line">    msg = <span class="built_in">str</span>(err).lower()</span><br><span class="line">    <span class="keyword">if</span> re.search(<span class="string">r&quot;\b(invalid api key|unauthorized|bad request)\b&quot;</span>, msg):</span><br><span class="line">        <span class="keyword">return</span> <span class="string">&quot;fatal&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="string">&quot;unknown&quot;</span></span><br></pre></td></tr></table></figure><p>A few things to notice:</p><ul><li><strong>Status codes are checked first</strong> because they’re the most reliable signal.</li><li><strong>Exception class is second</strong> because network errors don’t carry HTTP status codes.</li><li><strong>Regex on the message is last</strong> and intentionally narrow. 
It catches the OpenAI-SDK case where a 401 was wrapped in a <code>RuntimeError</code> with the original message but no status.</li></ul><h2 id="Retry-with-exponential-backoff-bounded"><a href="#Retry-with-exponential-backoff-bounded" class="headerlink" title="Retry with exponential backoff, bounded"></a>Retry with exponential backoff, <em>bounded</em></h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">async</span> <span class="keyword">def</span> <span class="title function_">chat_stream</span>(<span class="params">self, ...</span>):</span><br><span class="line">    <span class="keyword">for</span> provider <span class="keyword">in</span> [<span class="variable language_">self</span>.primary, *<span class="variable language_">self</span>.fallbacks]:</span><br><span class="line">        <span class="keyword">for</span> attempt <span class="keyword">in</span> <span class="built_in">range</span>(<span class="variable language_">self</span>.retries_per_provider + <span class="number">1</span>):</span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                <span class="keyword">async</span> <span class="keyword">for</span> delta <span class="keyword">in</span> provider.chat_stream(...):</span><br><span class="line">                    <span class="keyword">yield</span> delta</span><br><span class="line">                <span 
class="keyword">return</span></span><br><span class="line">            <span class="keyword">except</span> Exception <span class="keyword">as</span> e:</span><br><span class="line">                kind = classify_error(e)</span><br><span class="line">                <span class="keyword">if</span> kind == <span class="string">&quot;fatal&quot;</span>:</span><br><span class="line">                    <span class="keyword">raise</span></span><br><span class="line">                <span class="keyword">if</span> kind == <span class="string">&quot;retryable&quot;</span> <span class="keyword">and</span> attempt &lt; <span class="variable language_">self</span>.retries_per_provider:</span><br><span class="line">                    <span class="keyword">await</span> asyncio.sleep(<span class="variable language_">self</span>.backoff_base * (<span class="number">2</span> ** attempt))</span><br><span class="line">                    <span class="keyword">continue</span></span><br><span class="line">                <span class="keyword">break</span>  <span class="comment"># move to next provider</span></span><br><span class="line">    <span class="keyword">raise</span> RuntimeError(<span class="string">&quot;all providers exhausted&quot;</span>)  <span class="comment"># a bare raise is invalid outside except</span></span><br></pre></td></tr></table></figure><p>Defaults: <code>retries_per_provider=1</code>, <code>backoff_base=0.5</code>. A retryable error costs one 0.5s backoff and one retry; if the retry also fails, the loop moves to the next provider. Because <code>attempt</code> resets for each provider, the worst case across two providers is <code>0.5 + 0.5 = 1.0s</code> of backoff before raising.</p><p><strong>One bounded retry per provider is almost always the right default.</strong> Two is paranoid, three is hostile.</p><h2 id="Mid-stream-errors-are-not-retryable"><a href="#Mid-stream-errors-are-not-retryable" class="headerlink" title="Mid-stream errors are not retryable"></a>Mid-stream errors are not retryable</h2><pre class="mermaid">sequenceDiagram    participant Client    participant Provider    Client->>Provider: chat_stream(...)    
Provider-->>Client: token "Hello"    Provider-->>Client: token " world"    Provider--xClient: ConnectionError mid-stream    Note over Client: Do NOT restart.<br/>The user already saw "Hello world".</pre><p>If tokens have already been emitted to the UI, retrying the call would replay them — the user sees <code>Hello worldHello world, my name is...</code>. Worse, vision replays would re-bill the image. The right behavior is to let the error propagate to the UI as a stream interruption.</p><p>The implementation:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">async</span> <span class="keyword">def</span> <span class="title function_">chat_stream</span>(<span class="params">self, ...</span>):</span><br><span class="line">    emitted = <span class="literal">False</span></span><br><span class="line">    <span class="keyword">try</span>:</span><br><span class="line">        <span class="keyword">async</span> <span class="keyword">for</span> delta <span class="keyword">in</span> <span class="variable language_">self</span>._with_failover(...):</span><br><span class="line">            emitted = <span class="literal">True</span></span><br><span class="line">            <span class="keyword">yield</span> delta</span><br><span class="line">    <span class="keyword">except</span> Exception:</span><br><span class="line">        <span class="keyword">if</span> emitted:</span><br><span class="line">            <span class="keyword">raise</span>  <span class="comment"># do not restart; surface to UI</span></span><br><span class="line">        <span 
class="comment"># else: failover already tried in _with_failover</span></span><br><span class="line">        <span class="keyword">raise</span></span><br></pre></td></tr></table></figure><h2 id="Observability-is-half-the-value"><a href="#Observability-is-half-the-value" class="headerlink" title="Observability is half the value"></a>Observability is half the value</h2><p>Every failover writes a single info event onto the UI queue:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">&#123;<span class="string">&quot;type&quot;</span>: <span class="string">&quot;info&quot;</span>, <span class="string">&quot;kind&quot;</span>: <span class="string">&quot;vision&quot;</span>, <span class="string">&quot;provider&quot;</span>: <span class="string">&quot;gemini&quot;</span>,</span><br><span class="line"> <span class="string">&quot;note&quot;</span>: <span class="string">&quot;failover from openai&quot;</span>, <span class="string">&quot;error&quot;</span>: <span class="string">&quot;openai: 503&quot;</span>&#125;</span><br></pre></td></tr></table></figure><p>The overlay footer shows it for one second. That’s enough for the user to know “OpenAI is down, you’re on Gemini” without staring at logs.</p><h2 id="Test-matrix-that-earned-its-keep"><a href="#Test-matrix-that-earned-its-keep" class="headerlink" title="Test matrix that earned its keep"></a>Test matrix that earned its keep</h2><pre class="mermaid">graph LR    T1[401 from primary] --> A1[no retry, no failover, raise]    T2[429 from primary] --> A2[retry once in 0.5s, succeed]    T3[503 from primary] --> A3[retry, still fail, try fallback, succeed]    T4[ConnectionError mid-stream] --> A4[propagate, no restart]    T5[Unknown exception] --> A5[fail over without retry]</pre><p>Five tests. Five real production scenarios. 
Each one used to be a bug.</p><h2 id="TL-DR"><a href="#TL-DR" class="headerlink" title="TL;DR"></a>TL;DR</h2><ul><li>Classify errors into <code>fatal | retryable | unknown</code> before deciding.</li><li>Retry in place at most once with exponential backoff.</li><li>Fail over only after retries are exhausted <em>or</em> the error is unknown.</li><li>Never retry once tokens have been emitted to the user.</li><li>Surface every switch on the UI for one second.</li></ul><p>The whole module is ~250 lines. It used to be ~80 and behaved much worse.</p>]]>
      </content:encoded>
    </item>
  </channel>
</rss>
