<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[nilenso blog]]></title>
  <link href="http://blog.nilenso.com/atom.xml" rel="self"/>
  <link href="http://blog.nilenso.com/"/>
  <updated>2026-04-09T16:17:14+00:00</updated>
  <id>http://blog.nilenso.com/</id>
  <author>
    <name><![CDATA[nilenso]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    
    <title type="html"><![CDATA[Checking my model vibes against SWE-Bench Pro]]></title>
    <link href="http://blog.nilenso.com/blog/2026/04/08/checking-my-model-vibes-against-swe-bench-pro/"/>
    
    <updated>2026-04-08T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/04/08/checking-my-model-vibes-against-swe-bench-pro</id>
    <summary type="html"><![CDATA[
]]></summary>
    
    <content type="html"><![CDATA[<style>
.ratio-wrap { margin: 1.2rem 0 1.6rem; }
.ratio-wrap, .ratio-wrap * { font-family: var(--font-sans); }
.ratio-card { margin: 0.8rem 0 1.4rem; }
.ratio-title { font-size: 1.25rem; line-height: 1.35; font-weight: 600; margin: 0 0 0.35rem; }
.ratio-sub { color: #111; font-size: 1rem; margin: 0 0 0.7rem; }
.ratio-ratio-note {
  color: var(--nilenso-pink, #FF3D84);
  font-size: 0.95rem;
  margin: 0.35rem 0 0;
  white-space: nowrap;
  text-align: center;
}
.ratio-stat { color: var(--nilenso-pink, #FF3D84); font-size: 0.95rem; margin: 0.3rem 0 0; }
.ratio-caption {
  margin: 0 0 2em;
  padding-bottom: 0.75rem;
  text-align: center;
  border-bottom: 1px dashed gray;
}
.ratio-svg { display: block; width: 100%; height: auto; }
.ratio-muted { color: #777; font-size: 0.95rem; }

.annex-wrap { margin: 1rem 0 2rem; }
.annex-table {
  border-collapse: collapse;
  width: 100%;
  margin: 0;
  font-family: var(--font-monospace);
  font-size: 0.9rem;
}
.annex-table th, .annex-table td {
  text-align: right;
  padding: 6px 10px;
  border-bottom: 1px solid #ececec;
  white-space: nowrap;
}
.annex-table th {
  color: #888;
  font-size: 0.7rem;
  text-transform: uppercase;
  letter-spacing: 0.04em;
  font-family: var(--font-sans);
  font-weight: 500;
}
.annex-table td:first-child,
.annex-table th:first-child {
  text-align: left;
}
.annex-table td:first-child {
  white-space: normal;
  max-width: 380px;
  font-family: var(--font-serif);
  font-size: 0.98rem;
}
.annex-desc {
  display: block;
  color: #888;
  font-size: 0.8rem;
  font-family: var(--font-serif);
}
.annex-note {
  color: #666;
  font-size: 0.92rem;
  margin: 0.45rem 0 0;
}
.annex-scroll {
  max-height: 560px;
  overflow: auto;
  border: 1px solid #e6e6e6;
  border-radius: 6px;
}
.annex-scroll .annex-table th {
  position: sticky;
  top: 0;
  background: #fafafa;
  z-index: 1;
}
</style>

<p>If you choose coding models for production work, resolve rate is only part of the decision. Cost, tokens, and runtime decide whether a model is workable day to day. My intuition was that GPT models felt slow and token-hungry while Claude models felt faster on similar tasks, so Claude should be cheaper. Similar claims appear in public write-ups, for example in the <a href="https://openhands.dev/blog/openhands-index">OpenHands Index</a>, where Opus 4.5 is noted as finishing tasks quickly despite its size.</p>

<p>Some benchmark operators already publish this view. <a href="https://swe-rebench.com/">SWE-rebench</a> and OpenHands Index include cost and runtime alongside accuracy metrics. I focused on SWE-Bench Pro because it is the one I understand best. Our <a href="https://blog.nilenso.com/blog/2025/09/25/swe-benchmarks/">earlier write-up</a> covers why Pro is the best available alternative, and <a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/">OpenAI has made the same shift</a>. SWE-Bench Pro publishes trajectories, but it does not publish a consolidated cost-token-time report.</p>

<p>So I built one from the public data. One important choice is <strong>pairing</strong>: I compare each task only when both models submitted on that same instance, then compute Sonnet 4.5 / GPT-5 ratios per task and summarize those ratios. This keeps the comparison grounded because every ratio comes from the same task under the same harness setup. I analyzed the October 2025 paired runs in SWE-Bench Pro, and I would like to repeat the analysis on newer leaderboard pairs, such as the latest GPT and Opus releases, once comparable paired data is available.</p>
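<p>The pairing step is small enough to sketch. A minimal version, assuming per-instance run records shaped roughly like the published run stats (the field names here are illustrative, not the exact schema):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import statistics

def paired_ratios(runs_gpt5, runs_sonnet, value):
    """Sonnet 4.5 / GPT-5 ratios on instances where both models submitted."""
    by_id = {r["instance_id"]: r for r in runs_gpt5 if r["submitted"]}
    ratios = []
    for s in runs_sonnet:
        g = by_id.get(s["instance_id"])
        if g is None or not s["submitted"]:
            continue  # keep only instances where both models submitted
        gv, sv = value(g), value(s)
        if gv &gt; 0 and sv &gt; 0:  # drop missing or zero readings
            ratios.append(sv / gv)
    return ratios

# e.g. the median per-task cost ratio over the paired set:
# statistics.median(paired_ratios(gpt5, sonnet, lambda r: r["instance_cost"]))
</code></pre></div></div>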

<p>My intuitions were wrong.</p>

<h2 id="high-level-results">High-level results</h2>

<div id="ratio-summary" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<p>Across 616 paired tasks, resolve rates are close while operating profile is not. The results below come from the October 2025 GPT-5 vs Sonnet 4.5 paired trajectories.</p>

<ul>
  <li><strong>Cost:</strong> median ratio <strong>6.33x</strong> (Sonnet 4.5 / GPT-5)</li>
  <li><strong>Total tokens:</strong> median ratio <strong>1.15x</strong></li>
  <li><strong>Tool time:</strong> median ratio <strong>1.35x</strong></li>
  <li><strong>Resolve rate:</strong> GPT-5 <strong>42.5%</strong>, Sonnet 4.5 <strong>44.5%</strong></li>
</ul>

<p>Accuracy is similar here. The larger differences show up in cost, tokens, and runtime.</p>

<h2 id="cost">Cost</h2>

<div id="ratio-cost" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<p>Cost shows the largest gap in this dataset. Sonnet 4.5 is more expensive on most tasks, with many tasks clustered between 4x and 10x.</p>

<p>Caveats:</p>
<ul>
  <li>Costs come from benchmark run logs, not a rerun done today.</li>
  <li>These are LiteLLM proxy costs from the benchmark environment, not public list pricing.</li>
  <li>Discounts, caching, and contract terms can change absolute dollars.</li>
  <li>This is one benchmark setup, not every real-world coding workflow.</li>
</ul>

<h2 id="tokens">Tokens</h2>

<div id="ratio-tokens-total" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<p>Total token usage is mixed and much less extreme than cost. Sonnet 4.5 uses more total tokens in 58.0% of tasks, with a broad spread around 1x. The largest difference is in output tokens, not input: in the deeper breakdown, Sonnet 4.5 emits much more output per task, and much of that output is tool-call content and temporary file creation that gets discarded and never appears in the final submitted patch.</p>

<p>Caveats:</p>
<ul>
  <li>This chart uses total tokens from run stats: <code class="language-plaintext highlighter-rouge">tokens_sent + tokens_received</code>.</li>
  <li>SWE-Agent’s built-in <code class="language-plaintext highlighter-rouge">tokens_received</code> only counts <code class="language-plaintext highlighter-rouge">message.content</code> and misses tool-call arguments, so in the deeper token breakdown I re-count output tokens with tiktoken (<code class="language-plaintext highlighter-rouge">cl100k_base</code>) across <code class="language-plaintext highlighter-rouge">message.content</code>, <code class="language-plaintext highlighter-rouge">message.thought</code>, and <code class="language-plaintext highlighter-rouge">tool_calls[].function.arguments</code>; a minimal recount is sketched after this list.</li>
  <li>Hidden reasoning tokens are not visible in trajectories.</li>
  <li>Token behavior depends on harness strategy, turn limits, and repo shape.</li>
</ul>
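<p>A minimal version of that recount, assuming the trajectory messages have been parsed into dicts with the fields named above (treat the exact layout as illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def visible_output_tokens(messages):
    """Recount output tokens across content, thought, and tool-call arguments."""
    total = 0
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        for key in ("content", "thought"):
            # disallowed_special=() tolerates special-token text inside logs
            total += len(enc.encode(msg.get(key) or "", disallowed_special=()))
        for call in msg.get("tool_calls") or []:
            args = call.get("function", {}).get("arguments") or ""
            total += len(enc.encode(args, disallowed_special=()))
    return total
</code></pre></div></div>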

<h2 id="time">Time</h2>

<div id="ratio-time" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<p>Time is closer between the models than cost, but noisier across tasks. Sonnet 4.5 is slower in 58.9% of tasks, with a median ratio of 1.35x.</p>

<p>Caveats:</p>
<ul>
  <li>This is <strong>tool execution time</strong> from trajectories.</li>
  <li>It is not full wall-clock latency.</li>
  <li>Model inference latency is not fully captured in this metric.</li>
  <li>Test-suite length and repo workflows add variance.</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>This SWE-Bench Pro slice is a signal, not a universal ranking. I do not expect October 2025 GPT-5 and Sonnet 4.5 behavior to generalize cleanly to newer model releases, but it is enough to challenge my default, so I will be trying GPT models more often now. The broader takeaway is to measure on your own use case with cost, token, and time data, and avoid decisions driven by timeline takes or out-of-context benchmark reporting.</p>

<h2 id="annex-methodology-and-additional-data">Annex: methodology and additional data</h2>

<p>This annex lists implementation details, additional ratio charts, and full comparative tables for the same paired set used above.</p>

<p>Analysis code and data: <a href="https://github.com/nilenso/swe-bench-pro-cost-token-time-analysis">github.com/nilenso/swe-bench-pro-cost-token-time-analysis</a>.</p>

<h3 id="a1-data-and-pairing-protocol">A.1 Data and pairing protocol</h3>

<ul>
  <li>Source: public SWE-Bench Pro trajectories from Scale AI.</li>
  <li>Pairing unit: same <code class="language-plaintext highlighter-rouge">instance_id</code>, requiring both GPT-5 and Sonnet 4.5 to have <code class="language-plaintext highlighter-rouge">submitted=true</code>.</li>
  <li>Paired sample size: <strong>616</strong> instances.</li>
  <li>Unit of comparison: per-instance ratio, Sonnet 4.5 / GPT-5.</li>
  <li>Aggregation: medians and IQR over per-instance ratios, not pooled cross-task means.</li>
  <li>Scope: October 2025 model runs in one harness configuration.</li>
  <li>Environment note: costs are benchmark-environment proxy costs, not current list pricing.</li>
</ul>

<h3 id="a11-inclusion-and-denominator-audit">A.1.1 Inclusion and denominator audit</h3>

<p>All annex charts and tables use the same paired set unless stated otherwise.</p>

<div class="annex-wrap">
  <table class="annex-table">
    <tr><th>Metric</th><th>Included (n)</th><th>Excluded</th><th>Inclusion rule</th></tr>
    <tr><td>Cost ratio</td><td>616</td><td>0</td><td>both costs &gt; 0</td></tr>
    <tr><td>Total token ratio (<code class="language-plaintext highlighter-rouge">tokens_sent + tokens_received</code>)</td><td>616</td><td>0</td><td>both totals &gt; 0</td></tr>
    <tr><td>Tool-time ratio</td><td>616</td><td>0</td><td>both tool times &gt; 0</td></tr>
    <tr><td>Input token ratio</td><td>616</td><td>0</td><td>both <code class="language-plaintext highlighter-rouge">tokens_sent</code> &gt; 0</td></tr>
    <tr><td>Output token ratio (<code class="language-plaintext highlighter-rouge">output_tokens.total</code>)</td><td>616</td><td>0</td><td>both totals &gt; 0</td></tr>
    <tr><td>Step-count ratio</td><td>616</td><td>0</td><td>both step counts &gt; 0</td></tr>
  </table>
</div>

<p>Note: the denominator stays constant across ratio charts, so cross-metric comparisons are made on the same paired sample.</p>

<h3 id="a12-paired-outcome-decomposition">A.1.2 Paired outcome decomposition</h3>

<ul>
  <li>Both solved: <strong>206</strong></li>
  <li>GPT-5 only solved: <strong>56</strong></li>
  <li>Sonnet 4.5 only solved: <strong>68</strong></li>
  <li>Neither solved: <strong>286</strong></li>
</ul>

<p>Note: this decomposition separates overlap from directional wins on the same instances.</p>

<h3 id="a13-uncertainty-checks">A.1.3 Uncertainty checks</h3>

<ul>
  <li>Resolve rate, Wilson 95% CI:
    <ul>
      <li>GPT-5: <strong>42.5%</strong> (262/616), CI <strong>38.7%-46.5%</strong></li>
      <li>Sonnet 4.5: <strong>44.5%</strong> (274/616), CI <strong>40.6%-48.4%</strong></li>
    </ul>
  </li>
  <li>Paired discordant outcomes: GPT-only <strong>56</strong>, Sonnet-only <strong>68</strong>; McNemar exact p-value <strong>0.323</strong>.</li>
  <li>Median ratio 95% bootstrap CI (4,000 resamples):
    <ul>
      <li>Cost: <strong>6.33x</strong>, CI <strong>5.91x-6.68x</strong></li>
      <li>Total tokens: <strong>1.15x</strong>, CI <strong>1.06x-1.26x</strong></li>
      <li>Tool time: <strong>1.35x</strong>, CI <strong>1.18x-1.55x</strong></li>
    </ul>
  </li>
</ul>

<p>Note: in this paired slice, cost, total-token, and tool-time ratio signals are stable; resolve-rate differences are less decisive.</p>
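<p>The bootstrap above is the ordinary percentile bootstrap over per-instance ratios. A compact sketch of the idea (the resample count and seed are parameters here, not the exact analysis code):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
import statistics

def bootstrap_median_ci(ratios, n_resamples=4000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median ratio."""
    rng = random.Random(seed)
    n = len(ratios)
    medians = sorted(
        statistics.median(rng.choices(ratios, k=n)) for _ in range(n_resamples)
    )
    lo = medians[int((alpha / 2) * n_resamples)]
    hi = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.median(ratios), (lo, hi)
</code></pre></div></div>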

<h3 id="a2-additional-ratio-charts-not-shown-above">A.2 Additional ratio charts (not shown above)</h3>

<p>Input token ratio per paired task.</p>
<div id="annex-ratio-input" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<p>Output token ratio per paired task.</p>
<div id="annex-ratio-output" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<p>Step-count ratio per paired task.</p>
<div id="annex-ratio-steps" class="ratio-wrap"><p class="ratio-muted">Loading chart...</p></div>

<h3 id="a3-comparative-tables">A.3 Comparative tables</h3>

<p>Cost summary across all paired tasks.</p>
<div id="annex-table-cost" class="annex-wrap"><p class="ratio-muted">Loading table...</p></div>

<p>Token and patch summary.</p>
<div id="annex-table-tokens" class="annex-wrap"><p class="ratio-muted">Loading table...</p></div>

<p>Action breakdown summary.</p>
<div id="annex-table-actions" class="annex-wrap"><p class="ratio-muted">Loading table...</p></div>

<p>Execution summary.</p>
<div id="annex-table-exec" class="annex-wrap"><p class="ratio-muted">Loading table...</p></div>

<p>Repo-level resolve counts.</p>
<div id="annex-table-repos" class="annex-wrap"><p class="ratio-muted">Loading table...</p></div>

<p>Per-instance comparison table.</p>
<div id="annex-table-instances" class="annex-wrap"><p class="ratio-muted">Loading table...</p></div>

<script>
(function () {
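  // Linear-interpolated quantile (q in [0, 1]) of an unsorted numeric array.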
  function quantile(values, q) {
    if (!values.length) return NaN;
    const s = [...values].sort((a, b) => a - b);
    if (s.length === 1) return s[0];
    const pos = (s.length - 1) * q;
    const lo = Math.floor(pos), hi = Math.ceil(pos);
    if (lo === hi) return s[lo];
    const w = pos - lo;
    return s[lo] * (1 - w) + s[hi] * w;
  }

  function fmtx(v, d) {
    return v.toFixed(d === undefined ? 2 : d) + 'x';
  }

  function niceMax(n) {
    if (n <= 5) return 5;
    const p = Math.pow(10, Math.floor(Math.log10(n)));
    const m = n / p;
    if (m <= 1) return 1 * p;
    if (m <= 2) return 2 * p;
    if (m <= 5) return 5 * p;
    return 10 * p;
  }

  function mean(arr) {
    return arr.length ? arr.reduce((a, b) => a + b, 0) / arr.length : 0;
  }

  function sum(arr) {
    return arr.reduce((a, b) => a + b, 0);
  }

  function median(arr) {
    return quantile(arr, 0.5);
  }

  function fmt(n, dec = 0) {
    if (!isFinite(n)) return '—';
    return n.toLocaleString('en-US', { minimumFractionDigits: dec, maximumFractionDigits: dec });
  }

  function fmtD(n) {
    return '$' + fmt(n, 2);
  }

  function ratioSG(g, s) {
    if (!g || !isFinite(g) || !isFinite(s)) return '—';
    return (s / g).toFixed(1) + 'x';
  }

  function outcomeOf(p) {
    const gr = p.gpt5.resolved === true;
    const sr = p.claude.resolved === true;
    if (gr && sr) return 'both';
    if (gr) return 'gpt5-only';
    if (sr) return 'sonnet-only';
    return 'neither';
  }

  function aggRow(label, gVals, sVals, opts) {
    const fn = (opts && opts.fn) || mean;
    const dollar = opts && opts.dollar;
    const dec = opts && opts.dec !== undefined ? opts.dec : (dollar ? 2 : 0);
    const desc = opts && opts.desc ? `<span class="annex-desc">${opts.desc}</span>` : '';
    const g = fn(gVals), s = fn(sVals);
    const gStr = dollar ? fmtD(g) : fmt(g, dec);
    const sStr = dollar ? fmtD(s) : fmt(s, dec);
    return `<tr><td>${label}${desc}</td><td>${gStr}</td><td>${sStr}</td><td>${ratioSG(g, s)}</td></tr>`;
  }

  function renderAnnexTable(el, caption, htmlTable, note) {
    el.innerHTML = `${htmlTable}<p class="annex-note">${note}</p>`;
  }

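  // Draw a histogram of per-task ratios on a log2 x-axis: the dashed line
  // marks 1x (parity), the solid line marks the median ratio.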
  function renderHistogram(el, metric) {
    const ratios = metric.ratios;
    const logs = ratios.map(r => Math.log2(r));
    const lo = -4, hi = 5;
    const binsN = 34;
    const binW = (hi - lo) / binsN;
    const bins = new Array(binsN).fill(0);

    logs.forEach(v => {
      let i = Math.floor((v - lo) / binW);
      if (i < 0) i = 0;
      if (i >= binsN) i = binsN - 1;
      bins[i] += 1;
    });

    const W = 920, H = 360;
    const m = { l: 68, r: 18, t: 26, b: 66 };
    const pw = W - m.l - m.r;
    const ph = H - m.t - m.b;
    const yMax = niceMax(Math.max(...bins, 1));

    const x = (v) => m.l + ((v - lo) / (hi - lo)) * pw;
    const y = (v) => m.t + ph - (v / yMax) * ph;

    const yTicks = [0, yMax / 2, yMax];
    const xTicks = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5];

    let svg = '';

    yTicks.forEach(t => {
      svg += `<line x1="${m.l}" y1="${y(t)}" x2="${m.l + pw}" y2="${y(t)}" stroke="#f2f2f2" stroke-width="1"/>`;
      svg += `<text x="${m.l - 8}" y="${y(t) + 4}" fill="#555" font-size="11" text-anchor="end">${Math.round(t)}</text>`;
    });

    for (let i = 0; i < binsN; i++) {
      const bx0 = m.l + (i / binsN) * pw;
      const bx1 = m.l + ((i + 1) / binsN) * pw;
      const bh = (bins[i] / yMax) * ph;
      svg += `<rect x="${bx0 + 0.2}" y="${m.t + ph - bh}" width="${Math.max(0, bx1 - bx0 - 0.6)}" height="${bh}" fill="#202020"/>`;
    }

    svg += `<line x1="${x(0)}" y1="${m.t}" x2="${x(0)}" y2="${m.t + ph}" stroke="#bdbdbd" stroke-width="1" stroke-dasharray="4,4"/>`;

    const medX = x(Math.log2(metric.median));
    svg += `<line x1="${medX}" y1="${m.t}" x2="${medX}" y2="${m.t + ph}" stroke="#111" stroke-width="1.4"/>`;
    svg += `<text x="${Math.max(m.l + 34, Math.min(m.l + pw - 34, medX))}" y="${m.t - 8}" fill="#111" font-size="12" text-anchor="middle">Median ${fmtx(metric.median, 2)}</text>`;

    svg += `<text x="${m.l}" y="14" fill="#555" font-size="12" text-anchor="start">${metric.leftLabel}</text>`;
    svg += `<text x="${m.l + pw}" y="14" fill="#555" font-size="12" text-anchor="end">${metric.rightLabel}</text>`;

    svg += `<line x1="${m.l}" y1="${m.t + ph}" x2="${m.l + pw}" y2="${m.t + ph}" stroke="#e6e6e6" stroke-width="1"/>`;
    svg += `<line x1="${m.l}" y1="${m.t}" x2="${m.l}" y2="${m.t + ph}" stroke="#e6e6e6" stroke-width="1"/>`;

    xTicks.forEach(k => {
      const xv = x(k);
      const rv = Math.pow(2, k);
      const label = (rv >= 1 ? (Number.isInteger(rv) ? rv.toFixed(0) : rv.toFixed(2)) : rv.toFixed(3).replace(/0+$/, '').replace(/\.$/, '')) + 'x';
      svg += `<line x1="${xv}" y1="${m.t + ph}" x2="${xv}" y2="${m.t + ph + 5}" stroke="#e6e6e6" stroke-width="1"/>`;
      svg += `<text x="${xv}" y="${H - 28}" fill="#555" font-size="11" text-anchor="middle">${label}</text>`;
    });

    svg += `<text x="${m.l + pw / 2}" y="${H - 6}" fill="#111" font-size="12" text-anchor="middle">${metric.xLabel}</text>`;
    svg += `<text x="18" y="${m.t + ph / 2}" fill="#111" font-size="12" text-anchor="middle" transform="rotate(-90 18 ${m.t + ph / 2})">Tasks</text>`;

    el.innerHTML = `
      <section class="ratio-card">
        <p class="ratio-sub">${metric.title}. Ratio = Sonnet 4.5 / GPT-5, 1x = equal.</p>
        <svg class="ratio-svg" viewBox="0 0 ${W} ${H}" role="img" aria-label="${metric.title}">${svg}</svg>
        <div class="ratio-caption">
          <p class="ratio-stat">IQR ${fmtx(metric.q1, 2)}-${fmtx(metric.q3, 2)}; ${metric.takeawayPhrase} ${metric.higherCount} of ${metric.n} tasks (${metric.higherPct.toFixed(1)}%).</p>
        </div>
      </section>
    `;
  }

  function renderSummary(el, rows) {
    const whiskerLogs = rows.flatMap(r => [Math.log2(r.q1), Math.log2(r.q3)]).filter(v => isFinite(v));
    let lo = Math.min(...whiskerLogs) - 0.6;
    let hi = Math.max(...whiskerLogs) + 0.6;
    lo = Math.min(lo, -0.5);
    hi = Math.max(hi, 0.5);
    lo = Math.floor(lo * 2) / 2;
    hi = Math.ceil(hi * 2) / 2;

    const W = 920, H = 280;
    const m = { l: 150, r: 18, t: 18, b: 68 };
    const pw = W - m.l - m.r;
    const ph = H - m.t - m.b;

    const x = (lv) => m.l + ((lv - lo) / (hi - lo)) * pw;
    const rowY = (i) => m.t + ((i + 0.5) / rows.length) * ph;

    const allTicks = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5];
    const ticks = allTicks.filter(k => k >= lo && k <= hi);

    let svg = '';

    ticks.forEach(k => {
      const xv = x(k);
      const rv = Math.pow(2, k);
      const label = (rv >= 1 ? (Number.isInteger(rv) ? rv.toFixed(0) : rv.toFixed(2)) : rv.toFixed(3).replace(/0+$/, '').replace(/\.$/, '')) + 'x';
      svg += `<line x1="${xv}" y1="${m.t}" x2="${xv}" y2="${m.t + ph}" stroke="#f2f2f2" stroke-width="1"/>`;
      svg += `<text x="${xv}" y="${H - 26}" fill="#555" font-size="11" text-anchor="middle">${label}</text>`;
    });

    svg += `<line x1="${x(0)}" y1="${m.t}" x2="${x(0)}" y2="${m.t + ph}" stroke="#bdbdbd" stroke-width="1" stroke-dasharray="4,4"/>`;

    rows.forEach((r, i) => {
      const y = rowY(i);
      const x1 = x(Math.log2(r.q1));
      const x3 = x(Math.log2(r.q3));
      const xm = x(Math.log2(r.median));
      svg += `<line x1="${x1}" y1="${y}" x2="${x3}" y2="${y}" stroke="#111" stroke-width="1.6"/>`;
      svg += `<circle cx="${xm}" cy="${y}" r="3.2" fill="#111"/>`;
      svg += `<text x="${m.l - 12}" y="${y + 4}" fill="#111" font-size="12" text-anchor="end">${r.name}</text>`;
      svg += `<text x="${xm + 8}" y="${y - 6}" fill="#111" font-size="11" text-anchor="start">${fmtx(r.median, 2)}</text>`;
    });

    svg += `<line x1="${m.l}" y1="${m.t + ph}" x2="${m.l + pw}" y2="${m.t + ph}" stroke="#e6e6e6" stroke-width="1"/>`;
    svg += `<text x="${m.l + pw / 2}" y="${H - 6}" fill="#111" font-size="12" text-anchor="middle">Ratio</text>`;

    el.innerHTML = `
      <section class="ratio-card">
        <p class="ratio-sub">Sonnet 4.5 / GPT-5, 1x = equal, left of 1x means Sonnet is lower, right means Sonnet is higher.</p>
        <svg class="ratio-svg" viewBox="0 0 ${W} ${H}" role="img" aria-label="Typical per-task difference across metrics">${svg}</svg>
        <div class="ratio-caption">
          <p class="ratio-ratio-note">For cost and tokens, lower is better. For time, lower means faster.</p>
        </div>
      </section>
    `;
  }

  fetch('/assets/swe-bench-pro-data.json')
    .then(r => r.json())
    .then(data => {
      const pairs = data.filter(p => p.gpt5 && p.claude && p.gpt5.submitted && p.claude.submitted);
      const g = pairs.map(p => p.gpt5);
      const s = pairs.map(p => p.claude);
      const n = pairs.length;

      const metricDefs = {
        cost: {
          title: 'Per-task cost ratio',
          xLabel: 'Cost ratio',
          leftLabel: 'Sonnet cheaper',
          rightLabel: 'Sonnet more expensive',
          takeawayPhrase: 'Sonnet was more expensive on',
          g: (m) => m.model_stats.instance_cost,
          c: (m) => m.model_stats.instance_cost,
        },
        tokensTotal: {
          title: 'Per-task total token ratio',
          xLabel: 'Total token ratio',
          leftLabel: 'Sonnet fewer total tokens',
          rightLabel: 'Sonnet more total tokens',
          takeawayPhrase: 'Sonnet used more total tokens on',
          g: (m) => m.model_stats.tokens_sent + m.model_stats.tokens_received,
          c: (m) => m.model_stats.tokens_sent + m.model_stats.tokens_received,
        },
        time: {
          title: 'Per-task time ratio',
          xLabel: 'Time ratio',
          leftLabel: 'Sonnet faster',
          rightLabel: 'Sonnet slower',
          takeawayPhrase: 'Sonnet was slower on',
          g: (m) => m.tool_time.total_seconds,
          c: (m) => m.tool_time.total_seconds,
        },
        tokensInput: {
          title: 'Per-task input token ratio',
          xLabel: 'Input token ratio',
          leftLabel: 'Sonnet fewer input tokens',
          rightLabel: 'Sonnet more input tokens',
          takeawayPhrase: 'Sonnet used more input tokens on',
          g: (m) => m.model_stats.tokens_sent,
          c: (m) => m.model_stats.tokens_sent,
        },
        tokensOutput: {
          title: 'Per-task output token ratio',
          xLabel: 'Output token ratio',
          leftLabel: 'Sonnet fewer output tokens',
          rightLabel: 'Sonnet more output tokens',
          takeawayPhrase: 'Sonnet used more output tokens on',
          g: (m) => m.output_tokens.total,
          c: (m) => m.output_tokens.total,
        },
        steps: {
          title: 'Per-task steps ratio',
          xLabel: 'Steps ratio',
          leftLabel: 'Sonnet fewer steps',
          rightLabel: 'Sonnet more steps',
          takeawayPhrase: 'Sonnet took more steps on',
          g: (m) => m.steps,
          c: (m) => m.steps,
        },
      };

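      // Build Sonnet/GPT-5 ratios for one metric across all pairs, skipping
      // pairs where either value is missing or non-positive, then summarize
      // the median, IQR, and the share of tasks where Sonnet is higher.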
      function makeMetric(def) {
        const ratios = [];
        for (const p of pairs) {
          const gv = def.g(p.gpt5);
          const cv = def.c(p.claude);
          if (!gv || !cv || gv <= 0 || cv <= 0) continue;
          ratios.push(cv / gv);
        }
        const included = ratios.length;
        const higherCount = ratios.filter(v => v > 1).length;
        return {
          ...def,
          ratios,
          n: included,
          median: quantile(ratios, 0.5),
          q1: quantile(ratios, 0.25),
          q3: quantile(ratios, 0.75),
          higherCount,
          higherPct: included ? (higherCount * 100 / included) : 0,
        };
      }

      const mCost = makeMetric(metricDefs.cost);
      const mTokTotal = makeMetric(metricDefs.tokensTotal);
      const mTime = makeMetric(metricDefs.time);
      const mTokInput = makeMetric(metricDefs.tokensInput);
      const mTokOutput = makeMetric(metricDefs.tokensOutput);
      const mSteps = makeMetric(metricDefs.steps);

      renderSummary(document.getElementById('ratio-summary'), [
        { name: 'Cost', median: mCost.median, q1: mCost.q1, q3: mCost.q3 },
        { name: 'Tokens', median: mTokTotal.median, q1: mTokTotal.q1, q3: mTokTotal.q3 },
        { name: 'Time', median: mTime.median, q1: mTime.q1, q3: mTime.q3 },
      ]);

      renderHistogram(document.getElementById('ratio-cost'), mCost);
      renderHistogram(document.getElementById('ratio-tokens-total'), mTokTotal);
      renderHistogram(document.getElementById('ratio-time'), mTime);

      // Annex ratio charts
      renderHistogram(document.getElementById('annex-ratio-input'), mTokInput);
      renderHistogram(document.getElementById('annex-ratio-output'), mTokOutput);
      renderHistogram(document.getElementById('annex-ratio-steps'), mSteps);

      // Shared arrays for tables
      const gCost = g.map(x => x.model_stats.instance_cost);
      const sCost = s.map(x => x.model_stats.instance_cost);
      const gIn = g.map(x => x.model_stats.tokens_sent);
      const sIn = s.map(x => x.model_stats.tokens_sent);
      const gOut = g.map(x => x.output_tokens.total);
      const sOut = s.map(x => x.output_tokens.total);
      const gOutC = g.map(x => x.output_tokens.content);
      const sOutC = s.map(x => x.output_tokens.content);
      const gOutT = g.map(x => x.output_tokens.tool_call_args);
      const sOutT = s.map(x => x.output_tokens.tool_call_args);
      const gPatTok = g.map(x => x.patch_tokens);
      const sPatTok = s.map(x => x.patch_tokens);
      const gCalls = g.map(x => x.model_stats.api_calls);
      const sCalls = s.map(x => x.model_stats.api_calls);
      const gTpc = g.map(x => x.tokens_per_call);
      const sTpc = s.map(x => x.tokens_per_call);
      const gTT = g.map(x => x.tool_time.total_seconds);
      const sTT = s.map(x => x.tool_time.total_seconds);
      const gSt = g.map(x => x.steps);
      const sSt = s.map(x => x.steps);

      const nBoth = pairs.filter(p => outcomeOf(p) === 'both').length;
      const nGOnly = pairs.filter(p => outcomeOf(p) === 'gpt5-only').length;
      const nSOnly = pairs.filter(p => outcomeOf(p) === 'sonnet-only').length;
      const gResolved = nBoth + nGOnly;
      const sResolved = nBoth + nSOnly;
      const gCostPerResolve = gResolved ? sum(gCost) / gResolved : Infinity;
      const sCostPerResolve = sResolved ? sum(sCost) / sResolved : Infinity;
      const gMoreExp = pairs.filter((p, i) => gCost[i] > sCost[i]).length;

      const hdr = '<tr><th></th><th>GPT-5</th><th>Sonnet 4.5</th><th>S/G</th></tr>';

      const costTable =
        '<table class="annex-table">' + hdr +
        aggRow('Mean', gCost, sCost, { dollar: true, desc: 'Average API cost per task.' }) +
        aggRow('Median', gCost, sCost, { fn: median, dollar: true, desc: 'Middle value, less sensitive to outliers.' }) +
        aggRow('Total', gCost, sCost, { fn: sum, dollar: true }) +
        `<tr><td>Cost per resolve<span class="annex-desc">Total spend divided by tasks fixed.</span></td><td>${fmtD(gCostPerResolve)}</td><td>${fmtD(sCostPerResolve)}</td><td>${ratioSG(gCostPerResolve, sCostPerResolve)}</td></tr>` +
        `<tr><td>GPT-5 more expensive</td><td colspan="3">${gMoreExp}/${n} instances</td></tr>` +
        '</table>';
      renderAnnexTable(
        document.getElementById('annex-table-cost'),
        '',
        costTable,
        'Note: costs are from benchmark logs and reflect the benchmark pricing context.'
      );

      const tokenTable =
        '<table class="annex-table">' + hdr +
        aggRow('Input tokens (mean)', gIn, sIn, { desc: 'Total tokens sent across API calls.' }) +
        aggRow('Output tokens (mean)', gOut, sOut, { desc: 'Visible output tokens via tiktoken recount.' }) +
        aggRow('  content', gOutC, sOutC, { desc: 'Text portion of model response.' }) +
        aggRow('  tool_call args', gOutT, sOutT, { desc: 'Tool arguments: edits, shell commands, queries.' }) +
        aggRow('Patch size (tokens)', gPatTok, sPatTok, { desc: 'Tiktoken count of submitted diff.' }) +
        '</table>';
      renderAnnexTable(
        document.getElementById('annex-table-tokens'),
        '',
        tokenTable,
        'Takeaway: output-token differences are much larger than input-token differences.'
      );

      const actionKeys = ['bash', 'view', 'edit', 'create', 'search_find', 'submit', 'other'];
      const actionDesc = {
        bash: 'Shell commands: tests, install, command checks.',
        view: 'File reads.',
        edit: 'Edits to existing files.',
        create: 'New files created.',
        search_find: 'Search via find/grep style actions.',
        submit: 'Final patch submission.',
        other: 'Unclassified actions.',
      };
      let actionRows = '';
      actionKeys.forEach(k => {
        actionRows += aggRow(k, g.map(x => x.actions[k]), s.map(x => x.actions[k]), { dec: 1, desc: actionDesc[k] });
      });
      const actionTable = '<table class="annex-table">' + hdr + actionRows + '</table>';
      renderAnnexTable(
        document.getElementById('annex-table-actions'),
        '',
        actionTable,
        'Note: action frequencies summarize strategy differences, not correctness.'
      );

      const execTable =
        '<table class="annex-table">' + hdr +
        aggRow('Steps (mean)', gSt, sSt, { desc: 'Model turns per task.' }) +
        aggRow('API calls (mean)', gCalls, sCalls, { desc: 'Model round-trips per task.' }) +
        aggRow('Tokens/call (mean)', gTpc, sTpc, { desc: 'Average context per model call.' }) +
        aggRow('Tool time, mean (s)', gTT, sTT, { dec: 1, desc: 'Seconds spent waiting for tools.' }) +
        aggRow('Tool time, median (s)', gTT, sTT, { fn: median, dec: 1, desc: 'Median tool wait time.' }) +
        '</table>';
      renderAnnexTable(
        document.getElementById('annex-table-exec'),
        '',
        execTable,
        'Caveat: tool time excludes full wall-clock and does not fully include inference latency.'
      );

      const repoCounts = {};
      pairs.forEach(p => { repoCounts[p.repo] = (repoCounts[p.repo] || 0) + 1; });
      let repoRows = '<tr><th style="text-align:left">Repo</th><th>Instances</th><th>GPT-5 resolved</th><th>Sonnet 4.5 resolved</th></tr>';
      Object.keys(repoCounts).sort().forEach(repo => {
        const rp = pairs.filter(p => p.repo === repo);
        const rg = rp.filter(p => p.gpt5.resolved === true).length;
        const rsr = rp.filter(p => p.claude.resolved === true).length;
        repoRows += `<tr><td>${repo}</td><td>${repoCounts[repo]}</td><td>${rg}</td><td>${rsr}</td></tr>`;
      });
      const repoTable = '<table class="annex-table">' + repoRows + '</table>';
      renderAnnexTable(
        document.getElementById('annex-table-repos'),
        '',
        repoTable,
        'Note: repository mix is fixed by SWE-Bench Pro public split composition.'
      );

      const perRowsHeader =
        '<tr>' +
        '<th>Instance</th><th>Repo</th><th>Outcome</th>' +
        '<th>$ GPT-5</th><th>$ S4.5</th>' +
        '<th>Steps GPT-5</th><th>Steps S4.5</th>' +
        '<th>Input GPT-5</th><th>Input S4.5</th>' +
        '<th>Out GPT-5</th><th>Out S4.5</th>' +
        '<th>Patch GPT-5</th><th>Patch S4.5</th>' +
        '</tr>';

      const sorted = [...pairs].sort((a, b) => a.instance_id.localeCompare(b.instance_id));
      let perRows = perRowsHeader;
      sorted.forEach(p => {
        const oc = outcomeOf(p);
        const ocLabel = oc === 'both' ? '✓✓' : oc === 'gpt5-only' ? '✓✗' : oc === 'sonnet-only' ? '✗✓' : '✗✗';
        const shortId = p.instance_id.replace('instance_', '');
        perRows += '<tr>' +
          `<td title="${shortId}">${shortId}</td>` +
          `<td>${p.repo}</td>` +
          `<td>${ocLabel}</td>` +
          `<td>${fmtD(p.gpt5.model_stats.instance_cost)}</td>` +
          `<td>${fmtD(p.claude.model_stats.instance_cost)}</td>` +
          `<td>${p.gpt5.steps}</td>` +
          `<td>${p.claude.steps}</td>` +
          `<td>${fmt(p.gpt5.model_stats.tokens_sent)}</td>` +
          `<td>${fmt(p.claude.model_stats.tokens_sent)}</td>` +
          `<td>${fmt(p.gpt5.output_tokens.total)}</td>` +
          `<td>${fmt(p.claude.output_tokens.total)}</td>` +
          `<td>${fmt(p.gpt5.patch_chars)}</td>` +
          `<td>${fmt(p.claude.patch_chars)}</td>` +
        '</tr>';
      });
      const perTable = `<div class="annex-scroll"><table class="annex-table">${perRows}</table></div>`;
      renderAnnexTable(
        document.getElementById('annex-table-instances'),
        '',
        perTable,
        'Note: this includes all paired submitted instances with raw per-instance comparison values.'
      );
    })
    .catch(() => {
      [
        'ratio-summary', 'ratio-cost', 'ratio-tokens-total', 'ratio-time',
        'annex-ratio-input', 'annex-ratio-output', 'annex-ratio-steps',
        'annex-table-cost', 'annex-table-tokens', 'annex-table-actions',
        'annex-table-exec', 'annex-table-repos', 'annex-table-instances'
      ].forEach(id => {
        const el = document.getElementById(id);
        if (el) el.innerHTML = '<p class="ratio-muted">Failed to load chart data.</p>';
      });
    });
})();
</script>

]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Feature Platforms: The Underrated Infrastructure Layer Behind Fast ML Teams]]></title>
    <link href="http://blog.nilenso.com/blog/2026/02/19/feature-platforms/"/>
    
    <updated>2026-02-19T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/02/19/feature-platforms</id>
    <summary type="html"><![CDATA[A feature platform is the infrastructure layer that manages how raw attributes from your domain, such as:
]]></summary>
    
    <content type="html"><![CDATA[<p>A feature platform is the infrastructure layer that manages how raw attributes from your domain, such as:</p>

<ul>
  <li>“how many orders has this user placed”</li>
  <li>“what was the timestamp of the user’s last completed order”</li>
</ul>

<p>are transformed into features that power your ML models, like:</p>

<ul>
  <li>“orders_last_7d”</li>
  <li>“minutes_since_last_completed_order”</li>
</ul>

<p>Think of it as a single place to compute, store, and serve the derived data your models depend on, both for training and for production inference.</p>

<p><img src="/images/blog/feature-platforms.png" alt="Feature Platforms" style="width: 100%" /></p>

<p>The examples in this post are from hypothetical e-commerce and logistics applications, but the concepts apply to any domain that needs to compute and serve features for ML models.</p>

<h3 id="what-a-traditional-data-pipeline-looks-like-without-a-feature-platform">What a Traditional Data Pipeline Looks Like (Without a Feature Platform)</h3>

<p>Without a feature platform, teams typically stitch together a pipeline from separate components: a compute engine (Spark or dbt) to transform raw data, an orchestrator (Airflow, Prefect) to schedule runs, a key-value store (Redis, DynamoDB) to serve features at prediction time, and — if real-time signals are needed — a Kafka stream feeding a Flink or custom consumer into that same store.</p>

<p>Say you want a feature: “number of orders a user placed in the last 7 days.” In the traditional setup, a data scientist writes that aggregation in a Spark job — this is what generates training data. A data engineer wraps it in Airflow to keep the output refreshed in the online store, so models don’t have to recompute it on every prediction request. A backend engineer might write a third implementation in the serving layer to handle request-time edge cases: a brand new user with no history, or a slightly different time window depending on request context. Three separate implementations of the same logic, maintained by three different people.</p>

<p>Now the product team changes the definition to a rolling 14-day window. Who updates all three? How do you know the Spark output and the serving implementation are still computing the same thing? How do you catch the case where the Airflow job silently failed and the model is reading a value that is three days stale? How do you write a test for the feature logic that runs against real historical data without spinning up the entire pipeline?</p>

<p>In practice, most teams don’t have clean processes, and the indication that something is wrong is usually a degraded model metric weeks later.</p>

<h2 id="the-problem">The Problem</h2>

<p>On top of this ad-hoc change management, traditional approaches to computing and storing features leave data scientists and ML engineers dealing with a familiar set of challenges:</p>

<ul>
  <li><strong>Training-serving skew</strong>: Data scientists implement feature logic in Python/Pandas; engineers re-implement it in the language of the business-logic serving stack (e.g. Java, Go, or SQL). They drift apart and it is hard to catch until model performance degrades in production.</li>
  <li><strong>Slow iteration cycles</strong>: Getting a new feature from idea to production takes weeks because every feature requires a custom pipeline, validation, deployment, and sign-off.</li>
  <li><strong>No shared vocabulary</strong>: Features computed by one team get re-computed by possibly many others because there’s no discoverability layer.</li>
  <li><strong>Freshness gaps</strong>: Models rely on older than necessary data because there’s no streaming infrastructure, even for signals that change by the minute.</li>
</ul>

<p>A feature platform addresses all of these, as we will see in the subsequent sections.</p>

<h2 id="batch-vs-streaming">Batch vs. Streaming</h2>

<p>To establish why feature platforms are necessary, we need to understand the distinction between batch and streaming features.
Different features change at different rates, and that rate determines how you should compute them. A user’s lifetime order count changes slowly — computing it once a day is fine. But whether that same user abandoned a cart 90 seconds ago so they can be nudged to complete the purchase is a signal that goes stale almost immediately.</p>

<p>Think of feature freshness in three tiers:</p>

<p><strong>Batch features</strong> are pre-computed on a schedule — hourly, daily, or weekly jobs that write results to an online store. A driver’s completed order count over the last 30 days is a batch feature. Latency is minutes to hours. They’re cheap, reliable, and easy to backfill.</p>

<p><strong>Near-real-time (NRT) features</strong> are computed by a stream processor — typically a Kafka consumer or similar — and reflect the last few minutes of activity. The number of orders a driver accepted in the last hour is NRT. Latency is seconds to a minute. These require stream infrastructure but are the sweet spot for most use cases that need freshness without extreme complexity.</p>

<p><strong>Real-time features</strong> are computed at prediction time, inline with the request. They reflect what’s happening <em>right now</em> — the exact state of a request in flight. Latency is sub-second. These are powerful but expensive, hard to backfill, and complex to test. Often it is better to pass such values in from the request context itself rather than fetch them from a feature platform.</p>

<p>For each feature you’re considering, ask two questions:</p>

<ol>
  <li>
    <p><strong>How fast does this signal change?</strong> Driver location changes every few seconds. Lifetime trip count changes slowly. Features that change faster than your batch cadence are candidates for streaming.</p>
  </li>
  <li>
    <p><strong>How much does staleness hurt the model?</strong> If swapping in a fresh value vs. a 1-hour-old value produces a metric change smaller than the variance you’d see between two identical experiment runs, batch is fine. If the improvement is consistent and repeatable, invest in NRT or real-time.</p>
  </li>
</ol>

<h3 id="why-feature-platforms-enable-faster-experimentation">Why Feature Platforms Enable Faster Experimentation</h3>

<p>The real bottleneck is the time between having an idea for a feature and being able to evaluate whether it improves the model.</p>

<p>Without a platform, that path typically looks like: write the feature logic, hand it to a data engineer to wrap in a pipeline, wait for the pipeline to land in the warehouse, write a separate serving implementation, validate they match, and then train. That cycle takes days to weeks per feature. Most ideas never get tested.</p>

<p>A feature platform compresses this by letting the feature author own the full lifecycle. You define the feature logic once — the platform handles materialization for both training and serving. You can test it locally or in a test environment against a small dataset before touching any production infrastructure. When it looks right, you promote it and use the feature in your production models.</p>

<p>The result is that experimentation becomes cheap enough to actually do it. You can try a feature, run an offline eval, and discard it in a day rather than a sprint. That speed compounds — teams that can run significantly more experiments tend to ship better models.</p>

<h2 id="how-feature-platforms-work">How Feature Platforms Work</h2>

<p>A feature platform is composed of a few components that together allow you to define feature logic once and have it work correctly across training, refresh, and serving.</p>

<p><strong>Feature authoring DSL</strong>: Where you express what a feature computes, which entity it belongs to (user, driver, restaurant), and how fresh it needs to be. The critical capability here is time-windowed aggregations with point-in-time correctness — meaning the platform knows to compute “orders in the last 7 days <em>as of the timestamp of each training example</em>”, not as of today. Naive implementations often get this wrong by leaking future information — for example, joining every training row to a feature table computed as of “now” instead of as of the event timestamp, which silently introduces label leakage and overly optimistic offline metrics.</p>

<p>In a Python-native DSL, a feature like “orders in the last 7 days” for a <code class="language-plaintext highlighter-rouge">user</code> entity that is updated every hour might look like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@feature</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">orders_last_7d</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">entity</span><span class="o">=</span><span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">freshness</span><span class="o">=</span><span class="sh">"</span><span class="s">1h</span><span class="sh">"</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">orders_last_7d</span><span class="p">(</span><span class="n">orders</span><span class="p">):</span>
    <span class="nf">return </span><span class="p">(</span>
        <span class="n">orders</span>
        <span class="p">.</span><span class="nf">groupby</span><span class="p">(</span><span class="sh">"</span><span class="s">user_id</span><span class="sh">"</span><span class="p">)</span>
        <span class="p">.</span><span class="nf">window</span><span class="p">(</span><span class="sh">"</span><span class="s">7d</span><span class="sh">"</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="sh">"</span><span class="s">event_timestamp</span><span class="sh">"</span><span class="p">)</span>
        <span class="p">.</span><span class="n">order_id</span><span class="p">.</span><span class="nf">count</span><span class="p">()</span>
    <span class="p">)</span>
</code></pre></div></div>
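<p>To make point-in-time correctness concrete, here is a minimal sketch of an as-of join with pandas; the frames and column names are hypothetical. Each training row picks up the latest feature value computed at or before its own timestamp, never after:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Hypothetical training examples and hourly feature snapshots.
examples = pd.DataFrame({
    "user_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2026-01-03 10:00", "2026-01-10 10:00"]),
    "label": [0, 1],
}).sort_values("event_timestamp")

snapshots = pd.DataFrame({
    "user_id": [1, 1],
    "computed_at": pd.to_datetime(["2026-01-03 09:00", "2026-01-10 09:00"]),
    "orders_last_7d": [2, 5],
}).sort_values("computed_at")

# As-of join: each example sees the most recent snapshot at or before its
# event_timestamp. Joining on "today's" values would leak the future.
train = pd.merge_asof(
    examples, snapshots,
    left_on="event_timestamp", right_on="computed_at",
    by="user_id", direction="backward",
)
</code></pre></div></div>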

<p><strong>Ingestion layer</strong>: Connects to your data sources — databases, event streams, data warehouses — and normalizes how raw data flows into the platform. This abstraction is what lets the same feature definition run against historical data for training and against live data for serving, without any changes to the feature logic itself.</p>

<p><strong>Aggregation jobs</strong>: The platform reads your feature definitions and generates the appropriate compute jobs — Spark or dbt for batch, a Kafka consumer or Flink job for near-real-time. The freshness declaration in your feature definition drives this: change <code class="language-plaintext highlighter-rouge">freshness="1d"</code> to <code class="language-plaintext highlighter-rouge">freshness="1m"</code> and the platform promotes the feature to a streaming job automatically.</p>

<p><strong>Materialized views / online store</strong>: Pre-computed feature values are written to a low-latency store (typically Redis or DynamoDB) so that serving a feature at prediction time is a key lookup, not a live computation. The platform manages writes, TTLs, and cold-start fallbacks for entities that haven’t been seen yet.</p>

<p><strong>Feature registry</strong>: A catalog of every defined feature with its owner, entity, data lineage, and model dependencies. This is what prevents teams from independently recomputing the same features and makes it possible to have a shared vocabulary across teams.</p>

<h2 id="setting-up-a-feature-platform">Setting up a feature platform</h2>

<p>A feature platform involves building and maintaining:</p>

<ul>
  <li>A feature registry with versioning and lineage</li>
  <li>Batch materialization pipelines and scheduling</li>
  <li>A streaming ingestion layer</li>
  <li>An online store with low-latency serving</li>
  <li>Monitoring for data drift, staleness, and computation failures</li>
  <li>A developer-facing API or DSL for feature authoring</li>
</ul>

<p>Each of these is a solvable problem, but together they represent months of engineering work — and ongoing maintenance.</p>

<h3 id="build-vs-buy">Build vs. Buy</h3>

<h4 id="the-case-for-managed-platforms">The Case for Managed Platforms</h4>

<p>Managed platforms like Chalk.ai, Tecton, and Feast (self-hosted but with managed cloud options) provide most of the above out of the box. The value proposition is:</p>

<p><strong>Speed to first feature</strong>: Instead of spending months building infrastructure before a single feature is materialized, you can have features in production in days. For teams with 1–2 ML engineers, this difference is significant.</p>

<p><strong>Operational burden</strong>: Managed platforms handle the operational complexity of running streaming infrastructure at scale. You don’t need data engineers to manage the infrastructure.</p>

<p><strong>Built-in correctness guarantees</strong>: Point-in-time correct training dataset generation — the thing that’s surprisingly hard to get right when you build it yourself — is handled by the platform.</p>

<p>Some managed platforms have a Python-native DSL and local testing mechanisms that enable ML engineers and data scientists to author, test, and deploy features largely independently of data engineering teams.</p>

<h4 id="when-to-build-your-own">When to Build Your Own</h4>

<p><strong>Cost at scale</strong>: Managed platforms charge based on feature computation volume. At very high throughput (millions of predictions per day), the cost can exceed what it would cost to run equivalent infrastructure yourself on cloud primitives.</p>

<p><strong>Existing infrastructure investment</strong>: If your org already runs Spark, Kafka, and Redis at scale, along with deep expertise in the stack, building a thin orchestration layer on top may be faster than learning and integrating a new vendor.</p>

<p><strong>Vendor lock-in concerns</strong>: Feature platforms sit in the critical path of every model prediction. Migrating away from a managed vendor later is painful. If your org has strong opinions about control and portability, it might be worth building your own.</p>

<h2 id="a-pragmatic-path-to-adoption">A Pragmatic Path to Adoption</h2>

<p><strong>Start with one painful feature, not a platform:</strong> Find the feature with the worst training-serving skew — the one most likely responsible for model degradation — and set that feature up using the platform you’re evaluating. A vertical slice of end-to-end implementation builds confidence in the platform and exposes any drawbacks in the development workflow early on.</p>

<p><strong>Validate your event infrastructure before investing in streaming:</strong> Near-real-time features are only as good as the event streams feeding them. Before integrating with the platform, it is important to ensure there are event streams in place for the data you need, and that those events are being emitted correctly, consistently, and with the right schema. The streaming feature layer is just a matter of configuring the platform to consume the events and compute the features.</p>

<p><strong>Treat monitoring as a first-class requirement:</strong> You need to monitor both the features themselves and the pipeline and processing infrastructure behind them. Common things to monitor are listed below; a minimal staleness probe is sketched after the list:</p>
<ul>
  <li>Staleness of the feature</li>
  <li>Data drift of the feature (mean, variance, distribution, etc.)</li>
  <li>Failure/missing data rate of the feature</li>
  <li>Latency of feature fetch at inference time</li>
  <li>Telemetry from the platform’s ingestion and aggregation jobs</li>
  <li>Processing latencies</li>
  <li>Online store and offline store health</li>
  <li>Point-in-time correctness of the feature</li>
</ul>
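<p>As a sketch of the first item, a staleness probe against the online store might look like this; the store client, its <code class="language-plaintext highlighter-rouge">read</code> method, and the threshold are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import datetime, timezone

def stale_features(store, feature, entity_ids, max_age_minutes=90):
    """Flag entities whose served value is older than the freshness budget."""
    now = datetime.now(timezone.utc)
    flagged = []
    for entity_id in entity_ids:
        row = store.read(feature, entity_id)  # hypothetical online-store client
        if row is None:
            flagged.append((entity_id, "missing"))
            continue
        age_minutes = (now - row["updated_at"]).total_seconds() / 60
        if age_minutes &gt; max_age_minutes:
            flagged.append((entity_id, f"{age_minutes:.0f}m stale"))
    return flagged  # feed into your alerting pipeline
</code></pre></div></div>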

<p><strong>Budget for backfill costs upfront:</strong> The first time you materialize a feature over two years of historical data, the compute bill will be larger than expected. Design your compute jobs to support partition pruning and incremental backfills from the start. Make sensible trade-offs based on your product’s actual data needs.</p>

<p><strong>The organizational change:</strong> Moving from a workflow where data scientists hand off feature specs to engineers who reimplement them, to one where data scientists own features end-to-end, is a process change as much as a technical one. You might still need backend engineers to help with initial setup and debugging issues.</p>

<p><strong>Error handling:</strong> You need to have a plan for how to handle errors and missing data. You might need to fall back on sensible defaults at inference time (usually in the model layer).</p>

<p>Feature platforms, leveraged well, enable data scientists to author and deploy features in days instead of weeks, and work with models that are trained on data that matches what they’ll see in production. A feature registry enables sharing features across teams and models, removing duplicate effort and ensuring consistency.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Engineering Maturity is all you need]]></title>
    <link href="http://blog.nilenso.com/blog/2026/02/16/engineering-maturity-is-all-you-need/"/>
    
    <updated>2026-02-16T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/02/16/engineering-maturity-is-all-you-need</id>
    <summary type="html"><![CDATA[8:00 PM: it’s demo day tomorrow.
]]></summary>
    
    <content type="html"><![CDATA[<p><strong>8:00 PM: it’s demo day tomorrow.</strong></p>

<p>You’ve been fighting the prompt for hours.</p>

<p>You make the prompt more specific, but the bot fails when the user gives unexpected input.</p>

<p>You loosen the constraints, and it starts hallucinating.</p>

<p>Tools aren’t being called reliably, users keep interrupting because they are unhappy with the responses, and the interruptions exacerbate the bad responses. You make a change and start testing manually. First use case: works. Second: works, kind of. Third: the agent has regressed.</p>

<p><em>Sigh, restart the loop</em></p>

<p><strong>11:00 AM the next day: time for the demo</strong></p>

<p>After many <em>you must</em>s and <em>you must not</em>s, you’ve assembled a prompt that handles all the test cases. You’ve manually validated them a few times. The prompt works, except once in a while.</p>

<p>The demo goes well! The bot handles all the use cases, makes most of the tool calls, and manages interruptions reasonably well.</p>

<p>Greenlit. Ship it to production! 🚀</p>

<p>Some shortcuts were taken to get here quickly, but that’s fine; you’ll fix them once you’re live.</p>

<p><strong>Pilot</strong></p>

<p>The agent is released to 1,000 users. There wasn’t time to set up observability, so you are tailing the logs on the server and looking at conversations in the database.</p>

<p>Some users are having a decent experience.<br />
Some have to repeat and clarify themselves, sometimes multiple times.<br />
Some are clearly frustrated.</p>

<p>There are logs, but no observability.</p>

<p>You don’t know why.<br />
You don’t know how often.<br />
You don’t know the contents of the context, or what tool calls were made preceding the failure.<br />
You have complaints, but no reproduction steps.</p>

<p><em>Heavy Sigh, restart the loop</em></p>

<hr />

<p>Language models are amazing. Today, we can build (semi-)autonomous agents that can reason on their own and perform side-effects in their environments. Building an application that leverages AI can feel like magic when it works.</p>

<p>But it’s confusing when it doesn’t.</p>

<p>The probabilistic nature of the models brings unique challenges when using them to build applications. We are dealing with fuzzy inputs, fuzzy outputs, and an even fuzzier set of steps to get there. Bitter lessons are learned as techniques change quickly: an approach mastered over many iterations can be surpassed by orders of magnitude on the release of a new model.</p>

<p>I’d like to make a specific claim: <strong>engineering maturity is the most important factor in building reliable AI applications.</strong></p>

<p>Not model selection.<br />
Not prompt engineering tricks.<br />
Not the latest framework.</p>

<p>Traditional software practices like documentation, tests, observability, and evals are what separate teams that ship from teams that demo.</p>

<h2 id="what-is-engineering-maturity">What is engineering maturity?</h2>

<p>Engineering maturity is the practice of making decisions in the short term that enable a team to reliably deliver features in the long term.</p>

<p>Reliably means two things: the feature works as intended, and it ships within the estimated timeline. Teams that lack engineering maturity spend most of their time debugging, regressing, and redoing work. They feel busy but don’t make progress.</p>

<p>For AI applications specifically, engineering maturity means building the infrastructure that lets you <strong>discover</strong> what works because you cannot design your way to a working AI system.</p>

<h2 id="discovery-not-invention">Discovery, not invention</h2>

<p>This is the mental shift that matters most.</p>

<p>Traditional software engineering is largely deductive once product discovery is done. You gather requirements, design a system that meets them, implement the design, and verify it works. The gap between design and implementation is mostly deterministic. If your design is sound and your implementation is correct, the system works.</p>

<p>AI engineering is empirical. You cannot deduce the right prompt from first principles. You cannot design the optimal tool schema on a whiteboard. You cannot predict how users will phrase requests or where the model will fail. You have to discover these things through structured experimentation.</p>

<p>This changes what “engineering maturity” means in practice:</p>

<ul>
  <li><strong>Observability</strong> isn’t a production concern you’ll add later. It’s your instrument panel for discovery. Without it, you’re experimenting in the dark.</li>
  <li><strong>Evals</strong> aren’t quality gates before release. They’re the only way to know whether a change moved you forward, backward, or sideways.</li>
  <li><strong>Datasets</strong> aren’t training artifacts. They’re the accumulated knowledge of what works and what doesn’t. They are a core asset your team builds over time.</li>
</ul>

<p>The teams that struggle in production are usually the ones who treated these as overhead during development. They optimized for speed to demo, not speed to reliable system. By the time they hit the wall, they’ve accumulated weeks of prompt changes with no systematic way to evaluate them.</p>

<h2 id="the-maturity-ladder">The maturity ladder</h2>

<p>I’d like to use the following ladder as a framework to define what I mean by engineering maturity:</p>

<ul>
  <li><strong>Level 0</strong>: Prototype</li>
  <li><strong>Level 1</strong>: Documented, repeatable processes</li>
  <li><strong>Level 2</strong>: Specified, Tested, and Validated</li>
  <li><strong>Level 3</strong>: Measured</li>
  <li><strong>Level 4</strong>: Optimized</li>
</ul>

<p><img src="/images/blog/maturity_ladder.png" alt="Engineering maturity ladder" title="Engineering maturity ladder" /></p>

<h3 id="level-0---prototype">Level 0 - Prototype</h3>

<p>Level 0 is chaotic and ad-hoc; it relies on individuals and tribal knowledge. This is where the protagonists of our initial story are. There is no documentation, testing is mostly manual, deployments are run by hand and are often a source of anxiety and errors, and there is little explainability in the application’s behavior. Projects at this level are prototypes, even if they have production users. During prototyping you accumulate implicit knowledge, and quality relies on your muscle memory of testing.</p>

<p>None of it transfers to production.</p>

<p>You can’t A/B test based on vibes.<br />
You can’t debug a regression using intuition.<br />
You can’t onboard a new team member by transferring your gut feel.</p>

<h3 id="level-1---documented-repeatable-process">Level 1 - Documented, repeatable process</h3>

<p>At Level 1 you have good documentation, and your processes are scripted and repeatable.</p>

<p><strong>Documentation</strong></p>

<p>In my opinion, good documentation is the second most important artifact that a software team produces, right after code.</p>

<p>A well-written PRD clarifies what you’re building and why. A technical design doc captures architectural decisions and trade-offs. An ADR (Architecture Decision Record) explains why you chose approach A over B, so you don’t re-litigate the decision in six months.</p>

<p>This documentation has always been an invaluable resource for teams of people. And now it has a new audience: coding agents. Good documentation is both the context and the project memory needed to generate good outputs.</p>

<p><strong>CI/CD</strong></p>

<p>A clean bike is a fast bike.</p>

<p>Motorsports is messy. Garages are loud; there is oil, grease, and hundreds of small rubber and metal bits all around. Yet, each time you see a vehicle leave the pit-lane at the start, it is clean. Being clean means being reliable, being fast. It means that the process worked.</p>

<p>Production releases need the same rigor.</p>

<p>You should deploy on every commit to trunk, automatically. You should never be afraid to deploy to production. You should have automated gates that prevent bad code from merging.</p>

<p><img src="/images/blog/typical-ci-cd-pipeline.png" alt="A typical CI/CD pipeline" title="A typical CI/CD pipeline" /></p>

<p>The goal is to make deployment boring.</p>

<p>If deployment is a big event that requires coordination and courage, you’ll deploy less often.<br />
If you deploy less often, you’ll batch more changes together.<br />
If you batch changes, you can’t isolate which change caused a regression.<br />
If you can’t isolate changes, you can’t learn from failures.</p>

<p>The whole discovery loop breaks down. It’s impossible to iterate at the speed required for tinkering and experimenting without this foundation.</p>

<h3 id="level-2---specified-tested-and-validated">Level 2 - Specified, Tested, and Validated</h3>

<p>Level 2 is about defining the characteristics of the application.</p>

<p><strong>Testing</strong></p>

<p>AI applications are still software applications.</p>

<p>Unit tests, integration tests, end-to-end tests: all are still important for knowing our applications function as intended. While we await AGI, current-day agents still have plenty of deterministic code that needs to work correctly.</p>

<p>Test the deterministic parts: tool implementations, parsing logic, state management, API integrations. These should have conventional test coverage. This ensures that at least the deterministic parts of your AI application are error-free and safe from regression.</p>
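
<p>As an illustration, here is a minimal sketch of that kind of coverage, assuming a hypothetical <code class="language-plaintext highlighter-rouge">parse_tool_args</code> helper that validates a model’s raw tool-call arguments. Nothing about the test is AI-specific, which is the point:</p>

<pre><code class="language-python"># test_tool_parsing.py -- a minimal sketch; parse_tool_args is a
# hypothetical helper that validates a model's raw tool-call arguments.
import json

import pytest

def parse_tool_args(raw: str) -> dict:
    """Parse and validate the JSON arguments of a tool call."""
    args = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if "query" not in args:
        raise ValueError("missing required field: query")
    return args

def test_valid_args_are_parsed():
    assert parse_tool_args('{"query": "refunds"}') == {"query": "refunds"}

def test_missing_field_is_rejected():
    with pytest.raises(ValueError):
        parse_tool_args("{}")

def test_malformed_json_is_rejected():
    with pytest.raises(ValueError):
        parse_tool_args("not json")
</code></pre>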

<p>The fuzzy parts - model behavior, response quality - that’s what evals are for. But a surprising number of production failures trace back to plain old software bugs in the scaffolding. Don’t let the magic of LLMs distract you from the mundane discipline of testing the code around them.</p>

<p><strong>Evals</strong></p>

<p>Even though every component in a motorsports vehicle is rigorously tested, teams spend countless hours doing track tests. Riders push to understand what the machine feels like at the limit.</p>

<p>Have you seen MotoGP riders give feedback?</p>

<p>How does research happen?</p>

<p><em>(Ref:</em> <a href="https://www.youtube.com/watch?v=jsWv7K_V2Ss">https://www.youtube.com/watch?v=jsWv7K_V2Ss</a>)</p>

<iframe width="880" height="550" src="https://www.youtube.com/embed/jsWv7K_V2Ss" title="Hand Jive: A Discussion in the Box" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p>That feedback is real, but it’s useless on its own. Does “loose” mean the tire is overheating? Suspension too soft? Electronics cutting power too aggressively? The mechanic needs telemetry - tire temperature, suspension travel, lean angle, throttle position - to translate that vague feeling into actionable changes.</p>

<p>Evals are the telemetry for AI applications.</p>

<p>A basic eval runs a suite of test inputs through your system and scores the outputs. Scores might be binary (pass/fail), numeric (0-100), or categorical (correct, partially correct, incorrect, harmful). Good evals check multiple dimensions and are defined along with the product team.</p>

<p>Start simple: a spreadsheet of test cases with expected behaviors, accumulating cases as you dogfood the application. As you learn what “good” looks like, add automated scoring where possible and human review where necessary.</p>
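
<p>Here is a minimal sketch of that starting point, assuming a hypothetical <code class="language-plaintext highlighter-rouge">run_agent</code> entry point into your application; the scoring is a crude substring check, which is still enough to catch regressions:</p>

<pre><code class="language-python"># eval_smoke.py -- a minimal eval-harness sketch. run_agent is a
# hypothetical entry point into your application.
from my_app import run_agent  # hypothetical import

# Each case encodes something the team learned: an edge case a user
# hit, a failure mode, a behavior worth preserving.
CASES = [
    {"input": "Where is my order #123?", "must_contain": "order"},
    {"input": "Cancel my subscription", "must_contain": "cancel"},
]

def score(output: str, case: dict) -> bool:
    # Binary pass/fail; swap in numeric or LLM-as-judge scoring later.
    return case["must_contain"].lower() in output.lower()

if __name__ == "__main__":
    passed = 0
    for case in CASES:
        ok = score(run_agent(case["input"]), case)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['input']}")
    print(f"{passed}/{len(CASES)} cases passed")
</code></pre>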

<p>Evals serve two purposes:</p>

<ul>
  <li><strong>They document discoveries the team makes.</strong> Each eval case encodes something you learned - a failure mode you discovered, an edge case a user hit, a behavior you want to preserve. The eval suite is institutional memory.</li>
  <li><strong>They define the product behavior.</strong> A good eval suite is designed along with the product team and should measure how users interact with the agent. Did the agent surface the correct information? Is it too verbose? Does the user interrupt often? You want to improve <a href="https://blog.nilenso.com/blog/2024/12/24/good-enough-data/">what matters to the users</a>.</li>
</ul>

<h3 id="level-3---measured">Level 3 - Measured</h3>

<p>You can’t improve what you don’t measure. At level 3, you are measuring what matters.</p>

<p><strong>Observability</strong></p>

<p>Knowing what your application is doing in production is the most important piece of understanding it. But “add logging” isn’t enough. For AI applications, you need:</p>

<ul>
  <li><strong>Structured logging of every LLM interaction.</strong> The full input (including the system prompt), the full output, latency, token counts, model version, temperature, and any other parameters: you need to be able to reproduce the interaction. A minimal sketch follows this list.</li>
  <li><strong>User-level distributed session tracing.</strong> AI applications are asynchronous, distributed, and sometimes streaming in nature. You should have a trace of what happened, and when, during a user’s session across all the distributed parts of the system.</li>
  <li><strong>Alerts on regression.</strong> Define metrics (latency p95, error rate, tool call success rate) and alert when they degrade. You don’t want your users telling you that the system is not working, or worse, discovering it weeks later by going through logs or listening to recordings.</li>
  <li><strong>Context Management.</strong> Most critically, you need to see <a href="https://github.com/nilenso/context-viewer">what goes into the context window</a>. The context is everything - it’s the only thing the model sees. When an agent misbehaves, the answer is almost always in the context: a previous tool call returned garbage, the conversation history accumulated contradictory instructions, or the system prompt got truncated. Without visibility into the actual context at each turn, you’re debugging blind.</li>
</ul>
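
<p>A minimal sketch of the first item, assuming a hypothetical <code class="language-plaintext highlighter-rouge">call_model</code> wrapper and response shape; the exact fields matter less than capturing enough to replay the interaction:</p>

<pre><code class="language-python"># llm_logging.py -- a sketch of structured LLM-call logging. call_model
# and the response shape (.text, .usage) are assumptions, not a real
# provider API.
import json
import time
import uuid

def logged_llm_call(call_model, messages, session_id, **params):
    """Wrap an LLM call so every interaction is reproducible from logs."""
    start = time.monotonic()
    response = call_model(messages=messages, **params)
    record = {
        "trace_id": str(uuid.uuid4()),
        "session_id": session_id,  # ties the call into a user session
        "input": messages,         # full context, system prompt included
        "output": response.text,
        "latency_ms": round((time.monotonic() - start) * 1000),
        "usage": response.usage,   # token counts
        "params": params,          # model, temperature, etc.
    }
    print(json.dumps(record, default=str))  # ship to a log pipeline instead
    return response
</code></pre>

<p>The key property is that the record alone is enough to replay the call: nothing the model saw is missing from the log.</p>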

<p>You cannot discover what works if you cannot see what’s happening. This isn’t optional infrastructure you’ll add later. It’s the foundation that makes everything else possible.</p>

<h3 id="level-4---optimized">Level 4 - Optimized</h3>

<p>Level 4 unlocks our ability to systematically optimize and improve our AI applications.</p>

<p><strong>Building a flywheel</strong></p>

<p>Production drift is the gap between your eval dataset and production reality. Real users often want to use agents in ways that are outside the evaluation distribution, which makes every user interaction with your application a valuable source of data. The mature AI team treats production as a continuous source of training and evaluation data.</p>

<p>This is the flywheel:</p>

<ol>
  <li>Deploy the system</li>
  <li>Observe user interactions</li>
  <li>Identify failures and successes</li>
  <li>Add failures to eval suite (so you don’t regress)</li>
  <li>Add successes to example bank (so you can replicate)</li>
  <li>Improve the system using this data</li>
  <li>Repeat</li>
</ol>

<p>The flywheel is the product—not in a business sense, but in an engineering sense. The mechanism that captures data, learns from it, and improves the system is the core technical asset. The prompt and the model are interchangeable. The flywheel is what compounds.</p>

<p>To build the flywheel, you need the following (a small sketch of the tagging step comes after this list):</p>

<ul>
  <li>Structured data capture (observability)</li>
  <li>Outcome tagging (success, failure, why)</li>
  <li>A growing eval suite</li>
  <li>A process for reviewing failures and improving the system</li>
</ul>
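
<p>A minimal sketch of the tag-and-grow step, with the interaction shape and the JSONL file layout assumed:</p>

<pre><code class="language-python"># flywheel.py -- a sketch of the tag-and-grow step of the flywheel.
# The interaction shape and the file layout are assumptions.
import json

def tag_interaction(interaction: dict, outcome: str, reason: str) -> dict:
    """Attach an outcome label to a logged production interaction."""
    assert outcome in {"success", "failure"}
    return {**interaction, "outcome": outcome, "reason": reason}

def grow_suite(tagged: dict,
               evals_path: str = "evals/cases.jsonl",
               examples_path: str = "evals/examples.jsonl") -> None:
    """Failures become regression cases; successes become examples."""
    path = evals_path if tagged["outcome"] == "failure" else examples_path
    case = {
        "input": tagged["input"],
        "expected_behavior": tagged["reason"],
        "source": "production",
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")
</code></pre>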

<p>This is where the “engineering maturity is cheap” claim becomes concrete. The investment in observability and evals pays compound returns. Each production failure makes the system stronger - but only if you have the infrastructure to capture, categorize, and learn from it.</p>

<hr />

<p>Engineering maturity is about building the harness that lets you tinker, experiment, and discover what works - at speed. Because in AI applications, iteration speed is everything. The techniques that work today will be obsolete next quarter. The model that was state-of-the-art last month is already surpassed.</p>

<p>You can’t predict what will work. But you can build the system that lets you find that out faster.</p>

<p>Engineering maturity is all you need.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Weird system prompt artefacts]]></title>
    <link href="http://blog.nilenso.com/blog/2026/02/12/weird-system-prompt-artefacts/"/>
    
    <updated>2026-02-12T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/02/12/weird-system-prompt-artefacts</id>
    <summary type="html"><![CDATA[
]]></summary>
    
    <content type="html"><![CDATA[<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>This article is an offshoot of a collaboration with <a href="https://dbreunig.com">Drew Breunig</a> on <a href="https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html">How System Prompts Define Agent Behavior</a>.</p>

</div>

<p>Much like the small hacks that accumulate in a codebase to handle edge cases, bugs, or behavioral quirks, a model’s undesirable behaviors are frequently addressed with simple, corrective instructions in the system prompt. Over time, those fixes pile up, leaving a legacy prompt dotted with idiosyncratic patches. When someone new encounters it, and there isn’t a documented rationale for a patch, they form conjectures about what underlying behaviour the patch was trying to fix.</p>

<p>I’m going to walk through a few of these peculiar system-prompt patches in some coding agents and offer some conjectures about the underlying model behaviors they’re meant to address, or perhaps the engineering decisions made in the harnesses. These aren’t proven claims, merely reverse-engineering thought exercises to understand model or harness behaviour.</p>

<p>The links under the quotes point to their sources: exfiltrated prompts, or source repositories where available.</p>

<hr />

<blockquote>
  <p>IMPORTANT: You must NEVER generate or guess URLs for the user unless you are confident that the URLs are for helping the user with programming.</p>

  <p><em><a href="https://github.com/Piebald-AI/claude-code-system-prompts/blob/7843e6a/system-prompts/system-prompt-main-system-prompt.md?plain=1#L29">Claude Code</a></em></p>

</blockquote>

<p>This instruction sits at the very top of the prompt and is flagged as IMPORTANT in all caps, which suggests it’s pushing against a strong learned tendency to invent links. It might be a holdover from before Claude Code had built-in web search, but the fact that it remains hints that link hallucination is persistent even in the current setup.</p>

<p>Allowing URL guesses when they help with programming also implies that the main problem shows up outside programming, where “plausible” links are less standardized and mistakes are costlier. Safety is an obvious motive, though the prompt already contains other risk mitigations, so this may be aimed less at overt abuse and more at epistemics. The model may have learned that including citation-style links boosts perceived credibility, and is therefore biased toward generating them.</p>

<hr />

<blockquote>
  <p>Refer to code changes as “edits” not “patches”.</p>

  <p><em><a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/8ffe2e8/Cursor%20Prompts/Agent%20Prompt%202025-09-03.txt?plain=1#L9">Cursor</a></em></p>

</blockquote>

<p>Given this is for a GPT model, I’d bet this is to fight the context-distraction from the instructions for the <code class="language-plaintext highlighter-rouge">apply_patch</code> tool that’s used for editing files. I suspect this instruction isn’t present for other models. I wonder what other tool instructions cause such context-distraction, especially in long context windows where a tool name appears often enough.</p>

<hr />

<blockquote>
  <p>NEVER talk to the user or describe your changes through comments.</p>

  <p><em><a href="https://github.com/google-gemini/gemini-cli/blob/e79b149/packages/core/src/core/prompts.ts?plain=1#L147">Gemini CLI</a></em></p>

</blockquote>

<p>Anti-comment instructions are <a href="https://blog.nilenso.com/blog/2026/02/12/how-system-prompts-reveal-model-biases/">pretty universal</a>. However, <em>talking to the user through comments</em> is weird. That implies they saw a failure mode where the model treats the codebase as a secondary chat window—leaving explanations, status updates, or “notes to you” inline.</p>

<p>This behaviour is similar to Claude <a href="https://x.com/aidenybai/status/1993901129210712129">reasoning in comments</a>, I suppose, where the model is trained to spend tokens on thinking. Did the developers misinterpret this as talking to the user (vs talking to itself)? Perhaps the training indexed more on explanatory or tutorial code, where there’s a mentor talking to the learner through comments. Could be both, I suppose.</p>

<hr />

<blockquote>
  <p>Users love it when you organize your messages using ‘###’ headings and ‘##’ headings. Never use ‘#’ headings as users find them overwhelming.</p>

  <p><em><a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/8ffe2e8/Cursor%20Prompts/Agent%20Prompt%202025-09-03.txt?plain=1#L204">Cursor</a></em></p>

</blockquote>

<p>Huh, “H1 considered harmful”, <a href="https://meta.stackexchange.com/questions/214427/is-using-heading-markdown-okay-in-answers">TIL</a>. It looks like this isn’t “general knowledge” enough to be captured in pretraining, but annoying enough in practice to add to a <code class="language-plaintext highlighter-rouge">&lt;markdown_spec&gt;</code>.</p>

<p>I do think the more interesting question is where Cursor got this from: it reads less like general markdown etiquette and more like an observation from their product surface—either internal UX testing, telemetry, or repeated user feedback along the lines of “stop shouting with giant headers.”</p>

<hr />

<blockquote>
  <p>Write HIGH-VERBOSITY code, even if you have been asked to communicate concisely with the user.</p>

  <p><em><a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/8ffe2e8/Cursor%20Prompts/Agent%20Prompt%202025-09-03.txt?plain=1#L112">Cursor</a></em></p>

</blockquote>

<p>Interesting. Two observations here:</p>

<ol>
  <li>
    <p>The model conflates “be concise” with “write minimal code.” Given how strongly assistants are trained to respect verbosity preferences, it’s plausible that a global concision instruction bleeds into the implementation. I’m still curious how Cursor noticed this—maybe in telemetry it shows up as more re-prompts (“make it clearer”), more manual rewrites/undo, or lower acceptance rates when users ask for terse replies.</p>
  </li>
  <li>
    <p>Do people actually want HIGH-VERBOSITY code? That sounds unlikely. Most good developers want appropriate verbosity. So putting this in the default system prompt suggests it’s about more important outcomes like correctness or debuggability. If so, does that mean GPT’s concise code is often incorrect?</p>
  </li>
</ol>

<hr />

<blockquote>
  <p>Avoid using over-the-top validation or excessive praise when responding to users such as “You’re absolutely right” or similar phrases.</p>

  <p><em><a href="https://github.com/Piebald-AI/claude-code-system-prompts/blob/7843e6a/system-prompts/system-prompt-main-system-prompt.md?plain=1#L43">Claude Code</a></em></p>

</blockquote>

<p>The anti-sycophancy patch. This didn’t work well enough, famously. The over-the-top validation continued despite this. But it seems to have been <a href="https://cchistory.mariozechner.at/?from=2.1.32&amp;to=2.1.33">taken out</a> with the rest of the <code class="language-plaintext highlighter-rouge">## Tone and Style</code> section with the release of Opus 4.6. Yay RL for anti-sycophancy!</p>

<hr />

<blockquote>
  <p>IMPORTANT: You are Composer, a language model trained by Cursor. If asked who you are or what your model name is, this is the correct response.</p>

  <p>IMPORTANT: You are not gpt-4/5, grok, gemini, claude sonnet/opus, nor any publicly known language model</p>

  <p><em><a href="https://github.com/elder-plinius/CL4R1T4S/blob/5bfeb51/CURSOR/Cursor_2.0_Sys_Prompt.txt?plain=1#L19">Cursor</a></em></p>

</blockquote>

<p>Composer isn’t built on top of these closed-weights models, so why is this necessary? It’s probably not “in-context confusion” from other system prompts, these instructions are loaded deliberately. A more likely explanation is that some open-weights models often default to high-frequency identity strings from their training data (like <a href="https://www.reddit.com/r/LocalLLaMA/comments/1gqao05/qwen25coder32binstruct_seems_confident_that_its/">qwen</a> or <a href="https://www.reddit.com/r/ChatGPT/comments/1iaexf3/i_asked_deepseek_if_it_had_a_mobile_app_and_it/">deepseek</a>). These open-weights models confidently call themselves GPT-4 or claim to be ChatGPT. This looks like the same issue being patched.</p>

<hr />

<blockquote>
  <p>Use the apply_patch tool to edit files (NEVER try applypatch or apply-patch, only apply_patch)</p>

  <p><em><a href="https://github.com/openai/codex/blob/932a5a4/codex-rs/core/prompt.md?plain=1#L132">Codex CLI</a></em></p>

</blockquote>

<blockquote>
  <p>There is no ApplyPatch CLI available in terminal. Use the appropriate tool for editing the code instead.</p>

  <p><em><a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/8ffe2e8/Cursor%20Prompts/Agent%20CLI%20Prompt%202025-08-07.txt?plain=1#L60">Cursor</a></em></p>

</blockquote>

<p>At the surface level, this looks like context-confusion or a simple typo. However, there is likely a single tool name in the system prompt and tool instructions, and in my experience, the models are unlikely to mess that up. Other tool names don’t see this corrective behaviour, and we see similar prompts in both Cursor and Codex for this tool. So, this doesn’t sound like clarifying an ambiguous instruction.</p>

<p>I suspect the typos are coming from learned weights. It’s unlikely for such a unique case to come from pre-training data. I think it’s plausible that the inference harness used during RL had older tool names (the typos) that went into its weights. And engineering had to add a guardrail to override a post-trained habit.</p>

<hr />

<blockquote>
  <p>When editing a file using the apply_patch tool, remember that the file contents can change often due to user modifications, and that calling apply_patch with incorrect context is very costly. Therefore, if you want to call apply_patch on a file that you have not opened with the read_file tool within your last five (5) messages, you should use the read_file tool to read the file again before attempting to apply a patch. Furthermore, do not attempt to call apply_patch more than three times consecutively on the same file without calling read_file on that file to re-confirm its contents.</p>

  <p><em><a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/8ffe2e8/Cursor%20Prompts/Agent%20Prompt%202025-09-03.txt?plain=1#L107">Cursor</a></em></p>

</blockquote>

<p>I’m impressed that Cursor has the confidence to provide such concrete heuristics for optimistic concurrency control (translated into code right after this list):</p>
<ul>
  <li>If it hasn’t read a file within the last 5 messages =&gt; consider it stale</li>
  <li>If it has written 3 times without re-reading =&gt; consider it stale</li>
</ul>
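
<p>The rules are small in code. A minimal sketch, with the shape of the message history assumed (Cursor’s internal representation isn’t public):</p>

<pre><code class="language-python"># staleness.py -- a sketch of Cursor's stated heuristics. The message
# history shape (dicts with "tool" and "path") is an assumption.
def is_stale(path: str, history: list[dict]) -> bool:
    """Decide whether a file must be re-read before calling apply_patch."""
    recent = history[-5:]
    read_recently = any(
        m["tool"] == "read_file" and m.get("path") == path for m in recent
    )
    if not read_recently:
        return True  # rule 1: no read_file within the last 5 messages

    writes_since_read = 0
    for m in reversed(history):
        if m["tool"] == "read_file" and m.get("path") == path:
            break
        if m["tool"] == "apply_patch" and m.get("path") == path:
            writes_since_read += 1
    return writes_since_read >= 3  # rule 2: too many writes without re-reading
</code></pre>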

<p>While I haven’t tried it myself, I suspect these are likely related to the autocomplete / tab completions in Cursor, where we can expect a lot more user-model co-authorship than with other CLI tools. This implies Cursor has some interesting tool call chains which are quite different from other CLI tools I’ve listed here. It could be similar to Copilot and Windsurf which also have autocomplete as a primary UX.</p>

<p>Further, while Cursor has Composer, which can be RL’d on such tool-use trajectories, it has to instruct models like GPT or Opus to work well with them.</p>

<hr />

<blockquote>
  <p>The user is working on the same computer as you, and has access to your work. As such there’s no need to show the full contents of large files you have already written unless the user explicitly asks for them. Similarly, if you’ve created or modified files using <code class="language-plaintext highlighter-rouge">apply_patch</code>, there’s no need to tell users to “save the file” or “copy the code into a file”—just reference the file path.</p>

  <p><em><a href="https://github.com/openai/codex/blob/932a5a4/codex-rs/core/prompt.md?plain=1#L222">Codex CLI</a></em></p>

</blockquote>

<p>Crudely, the model thinks it is still in ChatGPT where it assumes the user has to copy/save to move forward. The chat experience is deeply embedded in the weights. Unsurprising, but still interesting that this instruction shows up in Codex’s system prompt, and not the others. The fact that Codex spells this out suggests GPT-style models have a strong prior toward transcript-style delivery, and Codex needs an explicit override to get “workspace-native” behavior.</p>

<p>Perhaps this deep rooted behaviour explains more than just this one instruction. Does it explain the need for the codex family of models in the first place? And does it actually imply that its heavy push towards autonomy is a model-related requirement rather than a user-focused product requirement?</p>

<h2 id="some-more-that-i-couldnt-get-to">Some more that I couldn’t get to</h2>

<p>I might get to them at some point. Or you can try the exercise. I’d love to hear your conjectures!</p>

<hr />

<blockquote>
  <p>NEVER generate an extremely long hash or any non-textual code, such as binary. These are not helpful to the USER and are very expensive.</p>

  <p><em><a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/8ffe2e8/Cursor%20Prompts/Agent%20Prompt%202025-09-03.txt?plain=1#L106">Cursor</a></em></p>

</blockquote>

<p>The model was… generating binary. And this was important enough to put into the system prompt with a <code class="language-plaintext highlighter-rouge">NEVER</code> in caps. The model is pre-trained on enough binary or hex information? I haven’t seen or heard about this happening in practice. Is this truly an edge case fix showing up in system prompts?</p>

<hr />

<blockquote>
  <p>IT IS CRITICAL TO FOLLOW THESE GUIDELINES TO AVOID EXCESSIVE TOKEN CONSUMPTION.</p>

  <p><em><a href="https://github.com/google-gemini/gemini-cli/blob/e79b149/packages/core/src/core/prompts.ts?plain=1#L229">Gemini CLI</a></em></p>

</blockquote>

<blockquote>
  <p>Each action you take is somewhat expensive.</p>

  <p><em><a href="https://github.com/All-Hands-AI/OpenHands/blob/7853b41/openhands/agenthub/codeact_agent/prompts/system_prompt.j2?plain=1#L9">OpenHands</a></em></p>

</blockquote>

<p>The irony of adding tokens to the system prompt telling the model to use fewer tokens. Many system prompts talk about “expensive” actions, but how aware is the model that token consumption is expensive? And why did Gemini or OpenHands have to add this instruction?</p>

<hr />

<blockquote>
  <p>Do not add tests to codebases with no tests.</p>

  <p><em><a href="https://github.com/openai/codex/blob/932a5a4/codex-rs/core/prompt.md?plain=1#L188">Codex CLI</a></em></p>

</blockquote>

<p>vs.</p>

<blockquote>
  <p>When adding features or fixing bugs, this includes adding tests to ensure quality.</p>

  <p><em><a href="https://github.com/google-gemini/gemini-cli/blob/e79b149/packages/core/src/core/prompts.ts?plain=1#L148">Gemini CLI</a></em></p>

</blockquote>

<p>Two tools. Opposite opinions on the same thing. One says “embrace the chaos” and the other says “be the change you wish to see.” Why?</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[How System Prompts Reveal Model Biases]]></title>
    <link href="http://blog.nilenso.com/blog/2026/02/12/how-system-prompts-reveal-model-biases/"/>
    
    <updated>2026-02-12T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/02/12/how-system-prompts-reveal-model-biases</id>
    <summary type="html"><![CDATA[
]]></summary>
    
    <content type="html"><![CDATA[<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>This article is a deep-dive into one aspect of <a href="https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html">How System Prompts Define Agent Behavior</a>, an article co-authored with <a href="https://dbreunig.com">Drew Breunig</a>.</p>

</div>

<p>Developers of coding agents use system prompts to patch bad behavior or force good behavior. Some of these patches are clear examples of <a href="https://www.dbreunig.com/2025/11/11/don-t-fight-the-weights.html">fighting-the-weights</a>, where one has to repeat the instructions, or say it in ALL CAPS, or use forceful language like MUST, NEVER, ALWAYS, etc.</p>

<p>This struggle proves that the model is biased in specific ways. Knowing these biases is very useful since we use these models for work every day. Further, looking at these scars lets us make meaningful conjectures about <em>why</em> the model is biased, and that might reveal hidden details about the data it learned from, or how it was trained.</p>

<p>While there are several such topics that require fighting models, I’ll take up two topics in this post that appear across many system prompts: tool call parallelism, and comments in code.</p>

<h2 id="tool-call-parallelism">Tool call parallelism</h2>

<p>Models need to be told multiple times, and forcefully, to batch tool calls or to execute them in parallel. Here are relevant extracts from various system prompts:</p>

<ol>
  <li>Claude Code
    <ul>
      <li><code class="language-plaintext highlighter-rouge">You can call multiple tools in a single response.</code> - This line appears 7 times in the system prompt! Once in the generic tool use policy, and then it is repeated inside almost every tool’s instruction.</li>
      <li><code class="language-plaintext highlighter-rouge">If you intend to call multiple tools and there are no dependencies between the calls, make all of the independent calls in the same response.</code> - This is repeated 4 times, right next to the previous sentence.</li>
      <li><code class="language-plaintext highlighter-rouge">Maximize use of parallel tool calls where possible to increase efficiency</code></li>
      <li><code class="language-plaintext highlighter-rouge">If the user specifies that they want you to run tools "in parallel", you MUST send a single message with multiple tool use content blocks.</code> – This is repeated twice.</li>
      <li><code class="language-plaintext highlighter-rouge">For example, if you need to launch multiple agents in parallel, send a single message with multiple Task tool calls.</code> This appears in 3 different ways in 3 different tool instructions.</li>
    </ul>
  </li>
  <li>Cursor
    <ul>
      <li>Has a full section called <code class="language-plaintext highlighter-rouge">maximize_parallel_tool_calls</code> with repeated instructions, and in SCREAMING case.</li>
      <li><code class="language-plaintext highlighter-rouge">CRITICAL INSTRUCTION: For maximum efficiency, whenever you perform multiple operations, invoke all relevant tools concurrently with multi_tool_use.parallel rather than sequentially</code></li>
      <li><code class="language-plaintext highlighter-rouge">MANDATORY: Run multiple Grep searches in parallel with different patterns and variations;</code></li>
      <li><code class="language-plaintext highlighter-rouge">For instance, all of these cases SHOULD use parallel tool calls:</code></li>
      <li><code class="language-plaintext highlighter-rouge">Searching for different patterns (imports, usage, definitions) should happen in parallel</code></li>
      <li><code class="language-plaintext highlighter-rouge">And you should use parallel tool calls in many more cases beyond those listed above</code></li>
      <li><code class="language-plaintext highlighter-rouge">DEFAULT TO PARALLEL: Unless you have a specific reason why operations MUST be sequential</code></li>
      <li><code class="language-plaintext highlighter-rouge">Parallelize tool calls per &lt;maximize_parallel_tool_calls&gt;: batch read-only context reads and independent edits instead of serial drip calls.</code></li>
    </ul>
  </li>
  <li>Gemini CLI
    <ul>
      <li>This seems to be the mildest of the lot, although we only know about this from Gemini 3 onwards.</li>
      <li><code class="language-plaintext highlighter-rouge">Use 'grep' and 'glob' search tools extensively (in parallel if independent) to understand file structures</code></li>
      <li><code class="language-plaintext highlighter-rouge">Execute multiple independent tool calls in parallel when feasible (i.e. searching the codebase)</code></li>
      <li><code class="language-plaintext highlighter-rouge">If you need to read multiple files, you should make multiple parallel calls to 'read_file'.</code></li>
      <li><code class="language-plaintext highlighter-rouge">**Parallelism:** Execute multiple independent tool calls in parallel when feasible (i.e. searching the codebase).</code></li>
    </ul>
  </li>
  <li>Kimi CLI
    <ul>
      <li>This has just one emphatic line. But this is the smallest system prompt of the lot too.</li>
      <li><code class="language-plaintext highlighter-rouge">you are HIGHLY RECOMMENDED to make them in parallel to significantly improve efficiency</code></li>
    </ul>
  </li>
  <li>Codex CLI
    <ul>
      <li>Codex added support for parallel tool calls only a few months ago. With the <a href="https://github.com/openai/codex/blob/f5d9939cd/codex-rs/core/templates/parallel/instructions.md">original implementation</a>, the instructions were quite explicit, and comparable to other models.</li>
      <li><code class="language-plaintext highlighter-rouge">Only make sequential calls if you truly cannot know the next file without seeing a result first.</code></li>
      <li><code class="language-plaintext highlighter-rouge">Always maximize parallelism.</code>, <code class="language-plaintext highlighter-rouge">Batch everything.</code>, <code class="language-plaintext highlighter-rouge">Never read files one-by-one unless logically unavoidable.</code>.</li>
      <li>However, with the 5.2 model release, all this instruction vanished, and was replaced with a single line instruction.</li>
      <li><code class="language-plaintext highlighter-rouge">Parallelize tool calls whenever possible - especially file reads, such as cat, rg, sed, ls, git show, nl, wc. Use multi_tool_use.parallel to parallelize tool calls and only this.</code></li>
    </ul>
  </li>
</ol>

<p>I haven’t measured the resultant level of parallelism in these harnesses yet. Perhaps the model <em>does</em> parallelise some tool calls by itself, just not enough. And perhaps some of these products don’t care about how quickly the work gets done, so they haven’t paid attention to it yet. Benchmarks like SWE-Bench Pro don’t measure execution time, so perhaps there hasn’t been an incentive to.</p>

<p>My conjectures:</p>
<ol>
  <li>Models don’t understand that batching multiple tool calls causes the harness to parallelise them for efficiency (see the sketch after this list). This is not in the training data, and it’s likely not rewarded as a behaviour. Yet.</li>
  <li>RL Environments for tool use likely do not have parallelism in tool execution. Or they don’t reward that kind of efficiency. The inference-time harnesses are likely bare-bones, and just give feedback on simple things like whether the code works, and does what it should.</li>
  <li>Instructing the model (forcefully) has the intended effect, so they have some limited ability to follow that instruction. So, using the words <code class="language-plaintext highlighter-rouge">parallel</code>, <code class="language-plaintext highlighter-rouge">batch</code>, <code class="language-plaintext highlighter-rouge">multiple</code> etc in user-prompts will likely nudge the models to be more time-efficient. And the corollary would be that telling it not to parallelise might help with being token efficient at the cost of speed.</li>
  <li>Newer OpenAI models are likely rewarded for parallel tool calls. Many next-generation models will also be trained this way, leading to this instruction becoming unnecessary.</li>
  <li>There isn’t enough emphasis on striking a balance between reading many files in parallel and token efficiency. Currently, as per the system prompts, the balance seems to tip in favour of speed over token efficiency. As token economics change over the years, the models’ rewards would need rewiring, and the prompts might need re-writing. So, we might actually see different model versions or adapter layers that skew the efficiency parameters differently.</li>
</ol>
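
<p>To make the first two conjectures concrete, here is roughly what a harness does with a batched response. This is a minimal sketch with an assumed tool-call shape and a toy tool registry, not any particular product’s implementation:</p>

<pre><code class="language-python"># parallel_tools.py -- sketch of harness-side fan-out for batched tool
# calls. The tool_call shape and the TOOLS registry are assumptions.
from concurrent.futures import ThreadPoolExecutor

TOOLS = {
    "read_file": lambda args: open(args["path"]).read(),
    "grep": lambda args: f"stub results for {args['pattern']}",
}

def execute_batch(tool_calls: list[dict]) -> list[dict]:
    """Run every tool call from a single model response concurrently.

    The model only has to emit the calls in one message; the wall-clock
    speedup comes entirely from the harness fanning them out.
    """
    def run(call: dict) -> dict:
        result = TOOLS[call["name"]](call["args"])
        return {"tool_call_id": call["id"], "result": result}

    with ThreadPoolExecutor() as pool:
        return list(pool.map(run, tool_calls))
</code></pre>

<p>Seen from this side, the instruction is cheap to honor: nothing about the tools changes, only whether the model emits the calls one message at a time or all at once.</p>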

<hr />

<h2 id="comments-in-code">Comments in code</h2>

<p>Every system prompt seems to instruct the model NOT to add comments in code.</p>

<ol>
  <li>Cursor CLI
    <ul>
      <li>It has a dedicated comments section.
        <ul>
          <li><code class="language-plaintext highlighter-rouge">Do not add comments for trivial or obvious code. Where needed, keep them concise</code></li>
          <li><code class="language-plaintext highlighter-rouge">Add comments for complex or hard-to-understand code; explain "why" not "how"</code></li>
          <li><code class="language-plaintext highlighter-rouge">Never use inline comments. Comment above code lines or use language-specific docstrings for functions</code></li>
          <li><code class="language-plaintext highlighter-rouge">Avoid TODO comments. Implement instead</code></li>
        </ul>
      </li>
      <li><code class="language-plaintext highlighter-rouge">Do not add narration comments inside code just to explain actions</code> - twice</li>
      <li><code class="language-plaintext highlighter-rouge">Do not add comments for trivial or obvious code. Where needed, keep them concise</code></li>
      <li><code class="language-plaintext highlighter-rouge">Use meaningful variable names as described in Martin's "Clean Code": Descriptive enough that comments are generally not needed</code></li>
    </ul>
  </li>
  <li>Gemini CLI
    <ul>
      <li>This is a part of its core mandates: <code class="language-plaintext highlighter-rouge">Add code comments sparingly. Focus on why something is done, especially for complex logic, rather than what is done. Only add high-value comments if necessary for clarity or if requested by the user. Do not edit comments that are separate from the code you are changing. NEVER talk to the user or describe your changes through comments.</code></li>
      <li><code class="language-plaintext highlighter-rouge">Do not add explanatory comments within tool calls or code blocks unless specifically part of the required code/command itself.</code></li>
    </ul>
  </li>
  <li>Codex CLI
    <ul>
      <li>Older versions of the system prompt had this instruction to <em>remove</em> inline comments:
        <ul>
          <li><code class="language-plaintext highlighter-rouge">Remove all inline comments you added as much as possible, even if they look normal. Check using git diff. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.</code></li>
          <li><code class="language-plaintext highlighter-rouge">Do not add inline comments within code unless explicitly requested.</code></li>
        </ul>
      </li>
      <li>The newer version of the system prompt has this instruction to <em>add</em> comments sparingly:
        <ul>
          <li><code class="language-plaintext highlighter-rouge">Add succinct code comments that explain what is going on if code is not self-explanatory. You should not add comments like "Assigns the value to the variable", but a brief comment might be useful ahead of a complex code block that the user would otherwise have to spend time parsing out. Usage of these comments should be rare.</code></li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Claude Code
    <ul>
      <li><code class="language-plaintext highlighter-rouge">Do not add comments to the code you write, unless the user asks you to, or the code is complex and requires additional context.</code></li>
      <li><code class="language-plaintext highlighter-rouge">Don't add docstrings, comments, or type annotations to code you didn't change. Only add comments where the logic isn't self-evident.</code></li>
      <li><code class="language-plaintext highlighter-rouge">Never use tools like Bash or code comments as means to communicate with the user during the session.</code></li>
    </ul>
  </li>
</ol>

<p>Conjecture time. Why is this prompting necessary? Good code doesn’t have comments, right?</p>

<ol>
  <li>Models are trained to produce tokens to reason, and to get more accurate answers. Through RL(HF/VR), models have a tendency to be verbose in their answers. This capability is generic and doesn’t only apply to chat; it’s a personality that leaks into writing code as well. Many users have observed Claude and Gemini reasoning and talking to the user in comments.</li>
  <li>Models are trained on material that tends to be comment heavy, like snippets, training manuals, notebooks, tutorials, and competitive coding solutions. And the volume of that content is significant enough to bias the weights.</li>
  <li>Models aren’t rewarded for being token efficient in writing code, and aren’t negatively rewarded for writing comments.</li>
  <li>Comments aren’t the only place where the bias runs against good code. Most prompts also have instructions to write minimal code, not over-engineer, reuse existing abstractions, etc. These are also reflections of the training data.</li>
  <li>Because learning-code and professional-code tend to look very different in practice, a model that prioritises one over the other in its training data might have very different behaviour.</li>
</ol>

<hr />

<p>In <a href="https://blog.nilenso.com/blog/2026/02/12/weird-system-prompt-artefacts/">another article</a>, I wrote about a variety of weird system prompt artefacts. Look up the system prompts of your favourite products, and see what model bias they’re fighting. It would leave you with a better understanding of its limitations.</p>

<p>One parting conjecture: RL is a great way to learn / unlearn some of these biases, but that requires the harnesses to be a part of the RL environment. If the inference-time harnesses get more sophisticated over time, that model+harness combo is likely to be the most reliable and efficient one, and its development is likely to be as opaque as RL is today.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Codex CLI vs Claude Code on autonomy]]></title>
    <link href="http://blog.nilenso.com/blog/2026/02/12/codex-cli-vs-claude-code-on-autonomy/"/>
    
    <updated>2026-02-12T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/02/12/codex-cli-vs-claude-code-on-autonomy</id>
    <summary type="html"><![CDATA[
]]></summary>
    
    <content type="html"><![CDATA[<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>This article is a deep-dive into one aspect of <a href="https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html">How System Prompts Define Agent Behavior</a>, an article co-authored with <a href="https://dbreunig.com">Drew Breunig</a>.</p>

</div>

<p>I spent some time studying the system prompts of coding agent harnesses like <a href="https://github.com/openai/codex/blob/main/codex-rs/core/gpt_5_2_prompt.md">Codex CLI</a> and <a href="https://github.com/asgeirtj/system_prompts_leaks/blob/main/Anthropic/claude-code-2025-11-1.md">Claude Code</a>. These prompts reveal the priorities, values, and scars of their products. They’re only a few pages each and worth reading in full, especially if you use them every day. This approach to understanding such products is more grounded than the vibe-based takes you often see in feeds.</p>

<p>While there are many similarities and differences between them, one of the most commonly perceived differences between Claude Code and Codex CLI is <strong>autonomy</strong>, and in this post I’ll share what I observed. We tend to perceive autonomous behaviour as long-running, independent, or requiring less supervision and guidance. Reading the system prompts, it becomes apparent that <em>the products make very different, and very intentional choices</em>.</p>

<h3 id="you-are-a">You are a…</h3>

<p>Right from the start, they diverge in how they define their identity. Claude Code has always described itself as an “interactive tool/agent to help the user”, whereas Codex has long taken the stance of “a coding agent”. The latest 5.3-codex release shows a slight reversal in this position with the addition of “collaboration” with the user.</p>

<p>Even though this might be a single line, it defines the identity, and it sits at the beginning of every single instruction to the LLM. So, this line holds weight. Here are the relevant excerpts from the prompts over time:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Identity in the system prompt</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://github.com/nilenso/long-prompts-analysis/blob/main/codex-and-claude-system-prompts/data/codex/base-models/2026-01-12_87f7226cc_gpt_5_2_prompt.txt">gpt-5.2</a></td>
      <td>You are GPT-5.2 running in the Codex CLI, a terminal-based coding assistant.</td>
    </tr>
    <tr>
      <td><a href="https://github.com/nilenso/long-prompts-analysis/blob/main/codex-and-claude-system-prompts/data/codex-cli-model-releases-filtered/2025-12-18_gpt-5.2-codex_caribou.txt#L1">gpt-5.2-codex</a></td>
      <td>You are Codex, based on GPT-5. You are running as a coding agent in the Codex CLI on a user’s computer.</td>
    </tr>
    <tr>
      <td><a href="https://github.com/nilenso/long-prompts-analysis/blob/main/codex-and-claude-system-prompts/data/codex/codex-models/2026-02-05_9efb7f4a1_gpt-5.3-codex-prompt.txt#L1">gpt-5.3-codex</a></td>
      <td>You are Codex, a coding agent based on GPT-5. You and the user share the same workspace and collaborate to achieve the user’s goals</td>
    </tr>
    <tr>
      <td><a href="https://github.com/nilenso/long-prompts-analysis/blob/main/codex-and-claude-system-prompts/data/claude-code-model-releases-filtered/06_2025-09-25_v1.0.125_pre-v2-peak.md">Sonnet-4.5</a></td>
      <td>You are Claude Code, Anthropic’s official CLI for Claude. You are an interactive CLI tool that helps users with software engineering tasks.</td>
    </tr>
    <tr>
      <td><a href="https://github.com/nilenso/long-prompts-analysis/blob/main/codex-and-claude-system-prompts/data/claude-code-model-releases-filtered/17_2026-02-10_v2.1.39.md">Opus-4.6</a></td>
      <td>You are a Claude agent, built on Anthropic’s Claude Agent SDK. You are an interactive agent that helps users with software engineering tasks.</td>
    </tr>
  </tbody>
</table>

<p>Both of them seem to be moving towards more autonomy, with assistant -&gt; coding agent. But the “help users with software engineering <em>tasks</em>” vs “collaborate to achieve the user’s <em>goals</em>” still signals different levels of agency at which it’s meant to operate.</p>

<h3 id="should-it-stop-and-ask-questions-or-keep-going">Should it stop and ask questions, or keep going?</h3>

<p>Codex includes a critical and <a href="https://github.com/openai/codex/blob/932a5a446f42e566c0dbd3004ae2cee50cdcc0ce/codex-rs/core/gpt_5_2_prompt.md#autonomy-and-persistence">explicit section</a> for “Autonomy and Persistence” for the non-Codex models.</p>

<blockquote>
  <p>Persist until the task is <strong>fully handled end-to-end</strong> within the current turn whenever feasible: <strong>do not stop</strong> at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes <strong>unless the user explicitly pauses</strong> or redirects you.</p>
</blockquote>

<p>Notice the language of <code class="language-plaintext highlighter-rouge">do not stop</code>, and <code class="language-plaintext highlighter-rouge">unless the user explicitly pauses</code>. And later in the prompt, there’s a <a href="https://github.com/openai/codex/blob/main/codex-rs/core/gpt_5_2_prompt.md#task-execution">task execution</a> section that doubles down on this.</p>

<blockquote>
  <p>You <strong>must keep going</strong> until the query or task is completely resolved, before ending your turn and yielding back to the user. Persist until the task is fully handled end-to-end within the current turn whenever feasible and <strong>persevere even when function calls fail</strong>. Only terminate your turn when you are sure that the problem is solved. <strong>Autonomously resolve the query to the best of your ability</strong>, using the tools available to you, before coming back to the user.</p>
</blockquote>

<p>If I were the model reading these instructions, I would take this to mean: “I should try my best to solve the problem myself and not yield to the user.”</p>

<hr />

<p>Claude, on the other hand, has an “Asking questions as you work” <a href="https://gist.github.com/chigkim/1f37bb2be98d97c952fd79cbb3efb1c6#file-claude-code-txt-L72">section</a>, and a <code class="language-plaintext highlighter-rouge">AskUserQuestion</code> <a href="https://github.com/Piebald-AI/claude-code-system-prompts/blob/c3115b8df18bdbf13dc6bf6e983afd67ec852332/system-prompts/tool-description-askuserquestion.md?plain=1#L4">tool</a>, that it is explicitly encouraged to use:</p>

<blockquote>
  <p>You have access to the AskUserQuestion tool to ask the user questions when you need clarification, want <strong>to validate assumptions</strong>, or need <strong>to make a decision</strong> you’re unsure about.
Treat feedback from hooks, including <code class="language-plaintext highlighter-rouge">&lt;user-prompt-submit-hook&gt;</code>, as coming from the user. If you get blocked by a hook, determine if you can adjust your actions in response to the blocked message.
...
Use this tool when you need to ask the user questions during execution. This allows you to:</p>
  <ol>
    <li>Gather user preferences or requirements</li>
    <li>Clarify ambiguous instructions</li>
    <li>Get decisions on implementation choices as you work</li>
    <li><strong>Offer choices to the user</strong> about what direction to take.</li>
  </ol>
</blockquote>

<p>If I were the model, I would interpret this as “I need to be cautious; I’ll check with the user before going ahead.”</p>

<h3 id="should-it-proactively-take-action-or-propose-a-solution-first">Should it proactively take action, or propose a solution first?</h3>

<p>When there’s ambiguity about whether to write code or take action, it can look, at a surface level, like they make the same choice: when the user is asking questions or planning, don’t write code. But the manner in which they make the choice is quite different.</p>

<p>Codex’s prompt encourages the model to be <strong>bold</strong> about writing code:</p>

<blockquote>
  <p>Unless the user explicitly asks for a plan, asks a question about the code, is brainstorming potential solutions, or some other intent that makes it clear that code should not be written, <strong>assume the user wants you to make code changes or run tools to solve the user’s problem</strong>. In these cases, <strong>it’s bad to output your proposed solution in a message</strong>, you should go ahead and actually implement the change. If you encounter challenges or blockers, <strong>you should attempt to resolve them yourself.</strong></p>
</blockquote>

<p>Claude’s prompt had a “Proactiveness” <a href="https://github.com/asgeirtj/system_prompts_leaks/blob/main/Anthropic/claude-code.md#proactiveness">section</a> that encourages the model to be <strong>cautious</strong> about writing code. This section has been refactored into various tool instructions in recent versions of the prompt, but the general outlook still remains the same:</p>

<blockquote>
  <p>You are allowed to <strong>be proactive, but only when the user asks</strong> you to do something. You should strive to strike a balance between:</p>
  <ul>
    <li>Doing the right thing <strong>when asked</strong>, including taking actions and follow-up actions</li>
    <li><strong>Not surprising the user</strong> with actions you take without asking. For example, if the user asks you how to approach something, you should do your best to answer their question first, and <strong>not immediately jump into taking actions</strong>.</li>
    <li>Do not add additional code explanation summary unless requested by the user. <strong>After working on a file, just stop</strong>, rather than providing an explanation of what you did.</li>
  </ul>
</blockquote>

<h3 id="should-it-be-ambitious-and-creative-with-its-solutions">Should it be ambitious and creative with its solutions?</h3>

<p>Here, Codex leans on ambition (with a caveat), and Claude takes a fairly conservative stance.</p>

<p>Codex says:</p>

<blockquote>
  <p>For tasks that have no prior context (i.e. the user is starting something brand new), you should <strong>feel free to be ambitious</strong> and demonstrate creativity with your implementation.</p>
</blockquote>

<p>and</p>

<blockquote>
  <p>You should use judicious initiative to decide on the right level of detail and complexity to deliver based on the user’s needs. This means showing good judgment that <strong>you’re capable of doing the right extras without gold-plating</strong>. This might be demonstrated by <strong>high-value, creative touches</strong> when scope of the task is vague; while being surgical and targeted when scope is tightly specified.</p>
</blockquote>

<p>Claude’s prompt is heavily focused on restraint rather than ambition, and gives many examples of how <em>not</em> to be ambitious. From the “Doing tasks” <a href="https://github.com/Piebald-AI/claude-code-system-prompts/blob/c3115b8df18bdbf13dc6bf6e983afd67ec852332/system-prompts/system-prompt-doing-tasks.md?plain=1#L14">section</a>:</p>

<blockquote>
  <p><strong>Avoid over-engineering. Only make changes that are directly requested or clearly necessary. Keep solutions simple and focused.</strong></p>
  <ul>
    <li><strong>Don’t add features, refactor code, or make “improvements” beyond what was asked.</strong> A bug fix doesn’t need surrounding code cleaned up. A simple feature doesn’t need extra configurability. Don’t add docstrings, comments, or type annotations to code you didn’t change. Only add comments where the logic isn’t self-evident.</li>
    <li>Don’t add error handling, fallbacks, or validation for scenarios that can’t happen. Trust internal code and framework guarantees. Only validate at system boundaries (user input, external APIs). Don’t use feature flags or backwards-compatibility shims when you can just change the code.</li>
    <li>Don’t create helpers, utilities, or abstractions for one-time operations. Don’t design for hypothetical future requirements. <strong>The right amount of complexity is the minimum needed for the current task</strong> — three similar lines of code is better than a premature abstraction.</li>
  </ul>
</blockquote>

<p>The caveat for Codex is that all the creativity is taken away when there’s an existing codebase! Although, amidst all the opposing instructions given to the model, I doubt this section gets enough attention.</p>

<blockquote>
  <p>If you’re operating in an existing codebase, you should make sure you do exactly what the user asks with surgical precision. Treat the surrounding codebase with respect, and don’t overstep (i.e. changing filenames or variables unnecessarily)</p>
</blockquote>

<h3 id="a-quick-note-on-gemini-cli-and-cursor-cli">A quick note on Gemini CLI and Cursor CLI</h3>

<p>Gemini CLI has an interactive mode and a non-interactive mode, which puts control over autonomy firmly in the user’s hands rather than letting the model decide.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">[Interactive mode] **Confirm Ambiguity/Expansion:** Do not take significant actions beyond the clear scope of the request without confirming with the user. If asked *how* to do something, explain first, don't just do it.</code></li>
  <li><code class="language-plaintext highlighter-rouge">[Non-interactive mode] **Continue the work** You are not to interact with the user. Do your best to complete the task at hand, using your best judgement and avoid asking user for any additional information."</code></li>
</ul>

<p>And Cursor CLI seems to take a similar route to Codex, giving the agent full autonomy:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved. Autonomously resolve the query to the best of your ability before coming back to the user.</code></li>
  <li><code class="language-plaintext highlighter-rouge">State assumptions and continue; don't stop for approval unless you're blocked.</code></li>
</ul>

<h3 id="it-is-very-likely-that-codex-models-are-rld-on-this-behaviour">It is very likely that Codex models are RL’d on this behaviour</h3>

<p>In the <a href="https://github.com/openai/codex/blob/932a5a446f42e566c0dbd3004ae2cee50cdcc0ce/codex-rs/core/gpt-5.2-codex_prompt.md">5.2-codex model</a>’s prompt, the sections around autonomy, ambition, etc. are all gone, and the prompt is only half the size of the one for GPT-5.2. The Codex model release notes also mention that it is made for <a href="https://openai.com/index/gpt-5-1-codex-max/">long-running tasks</a>, which hints at autonomy being baked in through post-training.</p>

<p>Any customisation through model post-training is opaque to end users, unfortunately.</p>

<hr />

<h3 id="my-conclusions">My conclusions</h3>

<p>All this is my interpretation, of course, and I can’t know which parts of the system prompt get more attention during inference. From my experience prompting these models, though, I feel they pick up on the general theme of the instructions, reading between the lines and filling in the gaps where explicit words are absent, in order to interpret the author’s intentions. I guess I’m doing the same thing here.</p>

<p>The analysis in this post suggests that Codex CLI optimizes for task completion, while Claude Code optimizes for user alignment and consent. But my takeaways are broader:</p>

<ol>
  <li>System prompts are used to steer models into different behaviours. It <em>is</em> difficult to pull apart the model’s behaviour into prompt-based and training-based components, so the extent of steerability is somewhat unknown. However, for example, <a href="https://drewbot-embn-m4otwurrj-drew-breunigs-projects.vercel.app/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html">I’ve seen observable differences</a> when using Claude Code with Codex’s system prompt.</li>
  <li>While the models, harnesses, and tools might evolve, it appears to me as though the products themselves are differently positioned, and possibly headed in different directions. At the very least, they operate with different philosophies of what a coding agent should do.</li>
  <li>If you want to understand and wield your AI tools better, read their system prompts.</li>
</ol>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[How System Prompts Define Agent Behavior]]></title>
    <link href="http://blog.nilenso.com/blog/2026/02/10/how-system-prompts-define-agent-behaviiour/"/>
    
    <updated>2026-02-10T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/02/10/how-system-prompts-define-agent-behaviiour</id>
    <summary type="html"><![CDATA[
]]></summary>
    
    <content type="html"><![CDATA[<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p><em>This post was co-authored with <a href="https://x.com/dbreunig">Drew Breunig</a>, and you can also read it on <a href="https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html">his blog</a></em>.</p>

</div>

<p>Coding agents are fascinating to study. They help us build software in a new way, while themselves exemplifying a novel approach to architecting and implementing software. At their core is an AI model, but wrapped around it is a mix of code, tools, and prompts: the harness.</p>

<p>A critical part of this harness is the system prompt, the baseline instructions for the application. This context is present in every call to the model, no matter what skills, tools, or instructions are loaded. The system prompt is always present, defining a core set of behaviors, strategies, and tone.</p>

<p>Once you start analyzing agent design and behavior, a question emerges: how much does the system prompt actually determine an agent’s effectiveness? We take for granted that the model is the most important component of any agent, but how much can a system prompt contribute? Could a great system prompt paired with a mediocre model challenge a mediocre prompt paired with a frontier model?</p>

<p>To find out, we obtained and analyzed system prompts from six different coding agents. We clustered them semantically, comparing where their instructions diverged and where they converged. Then we swapped system prompts between agents and observed how behavior changed.</p>

<p>System prompts matter far more than most assume. A given model sets the theoretical ceiling of an agent’s performance, but the system prompt determines whether this peak is reached.</p>

<hr />

<h3 id="the-variety-of-system-prompts">The Variety of System Prompts</h3>

<p>To understand the range of system prompts, we looked at six CLI coding agents: Claude Code, Cursor, Gemini CLI, Codex CLI, OpenHands, and Kimi CLI. Each performs the same basic function: given a task, they gather information, understand the codebase, write code, track their progress, and run commands. But despite these similarities, the system prompts are <em>quite</em> different.</p>

<p><img src="/images/waffle_comparison.jpg" alt="Waffle chart comparisons of 6 coding agent system prompts" /></p>

<div style="margin: 0 0 2em; text-align: center; color: gray; border-bottom: 1px dashed gray;">

  <p>Explore the above figures interactively <a href="https://nilenso.github.io/context-viewer/g/960d42ad-314c-44cf-8594-4b009ef528a1/comparison?sidebar=0&amp;panel=0&amp;sortBy=category&amp;sortDir=asc&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/system-prompts-simpler.json">in <strong>context viewer</strong></a>.</p>

</div>

<p>We’re analyzing <a href="https://github.com/asgeirtj/system_prompts_leaks">exfiltrated system prompts</a>, which we clean up and <a href="https://github.com/nilenso/long-prompts-analysis/tree/main/data/prompts/filtered">host here</a><sup id="fnref:exfiltrated" role="doc-noteref"><a href="#fn:exfiltrated" class="footnote" rel="footnote">1</a></sup>. Each of these is fed into <a href="https://github.com/nilenso/context-viewer">context-viewer</a>, a tool I developed that chunks contexts into semantic components for exploration and analysis.</p>

<p>Looking at the above visualizations, there is plenty of variety. Claude, Codex, Gemini, and OpenHands roughly prioritize the same instructions, but vary their distributions. Further, the prompts for Claude Code and OpenHands are both less than half the length of those for Codex and Gemini.</p>

<p>Cursor’s and Kimi’s prompts are dramatically different. Here we’re looking at Cursor’s prompt that’s paired with GPT-5 (<a href="https://www.adiasg.com/blog/comparing-cursors-prompts-across-models">Cursor uses slightly different prompts when hooked to different models</a>), and it spends over a third of its tokens on personality and steering instructions. Kimi CLI, meanwhile, contains zero workflow guidance, barely hints at personality instructions, and is the shortest prompt by far.</p>

<p>Given the similar interfaces of these apps, we’re left wondering: why are their system prompts so different?</p>

<p>There are two main reasons the system prompts vary: <em>model calibration</em> and <em>user experience</em>.</p>

<p>Each model has its own quirks, rough edges, and baseline behaviors. If the goal is to produce a measured, helpful TUI coding assistant, each system prompt will have to deal with and adjust for unique aspects of the underlying model to achieve this goal. This <em>model calibration</em> reins in problematic behavior.</p>

<p>System prompts also vary because they specify slightly different <em>user experiences</em>. Sure, they’re all text-only, terminal interfaces that explore and manipulate code. But some are more talkative, more autonomous, more direct, or require more detailed instructions. System prompts define this UX and, as we’ll see later, we can make a coding agent “feel” like a different agent just by swapping out the system prompt.</p>

<p>We can get a glimpse of these two functions together by looking at how a given system prompt changes over time, especially as new versions of models arrive. For example:</p>

<p><img src="/images/claude_over_time.jpg" alt="Claude's system prompt vascilates as new models are released, but trends steadily longer" /></p>

<div style="margin: 0 0 2em; text-align: center; color: gray; border-bottom: 1px dashed gray;">

  <p>Explore the above figures interactively <a href="https://nilenso.github.io/context-viewer/g/b179a05f-2bd4-4012-83ab-42a0cb1e79fd/comparison?sidebar=0&amp;panel=0&amp;legend=compact&amp;sortBy=category&amp;sortDir=asc&amp;cols=5&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/claude-prompt-evolution-export-simpler.json">in <strong>context viewer</strong></a>. Or, check out <a href="https://nilenso.github.io/context-viewer/g/56b68fb5-7221-4c04-807e-b590f138c1fe/comparison?sidebar=0&amp;panel=0&amp;view=tokens-absolute&amp;legend=compact&amp;sortBy=category&amp;sortDir=asc&amp;cols=10&amp;spr=4&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/codex-prompt-evolution-export-only-codex.json">Codex’s system prompt evolution</a> in similar detail.</p>

</div>

<p>Note how the system prompt isn’t stable, nor growing in a straight line. It bounces around a bit, as the Claude Code team tweaks the prompt to both adjust new behaviors and smooth over the quirks of new models. Though the trend is a march upward, as the coding agent matures.</p>

<p>If you want to dive further into Claude Code’s prompt history, Mario Zechner has <a href="https://cchistory.mariozechner.at">an excellent site</a> where he highlights the exact changes from version to version.</p>

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>Sometimes instructions are just…<em>weird</em>. I <a href="https://blog.nilenso.com/blog/2026/02/12/weird-system-prompt-artefacts/">cataloged</a> some of the odder instructions I found while exploring coding agent system prompts.</p>

</div>

<hr />

<h3 id="the-common-jobs-of-a-coding-agent-system-prompt">The Common Jobs of a Coding Agent System Prompt</h3>

<p>While these prompts vary from tool to tool, there are many commonalities that each prompt features. There is clear evidence that these teams are <a href="https://www.dbreunig.com/2025/11/11/don-t-fight-the-weights.html">fighting the weights</a>: they use repeated instructions, all-caps admonishments, and stern warnings to adjust common behaviors. This shared effort suggests common patterns in their training datasets, which each team has to mitigate.</p>

<p>For example, there are <em>many</em> notes about how these agents should use comments in their code. Cursor specifies that the model should, “not add comments for trivial or obvious code.” Claude states there should be no added comments, “unless the user asks you to.” Codex takes the same stance. Gemini instructs the model to, “Add code comments sparingly… NEVER talk to the user through comments.”</p>

<p>These consistent, repeated instructions are warranted. They fight against examples of conversation in code comments, present in countless codebases and GitHub repos. This behavior goes deep: we’ve even seen that Opus 4.5 will <a href="https://x.com/aidenybai/status/1993901129210712129">reason in code comments if you turn off thinking</a>.</p>

<p>System prompts also repeatedly specify that tool calls should be parallel whenever possible. Claude should, “maximize use of parallel tool calls where possible.” Cursor is sternly told, “CRITICAL INSTRUCTION: involve all relevant tools concurrently… DEFAULT TO PARALLEL.” Kimi adopts all-caps as well, stating, “you are HIGHLY RECOMMENDED to make [tool calls] in parallel.”</p>

<p>This likely reflects the fact that most post-training reasoning and agentic examples are <em>serial</em> in nature. Serial traces are perhaps easier to debug, and a bit of delay when synthesizing these datasets isn’t a hindrance. However, in real-world situations, users certainly appreciate the speed, so system prompts need to override this training.</p>

<p>Both of these examples of <em>fighting the weights</em> demonstrate how system prompts are used to smooth over the quirks of each model (which they pick up during training) and improve the user experience in an agentic coding application.</p>

<p>Much of what these prompts specify is shared: common adjustments, common desired behaviors, and common UX. But their differences notably affect application behavior.</p>

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>I looked at more examples of fighting the weights to understand <a href="https://blog.nilenso.com/blog/2026/02/12/how-system-prompts-reveal-model-biases/">how system prompts reveal model biases</a>.</p>

</div>

<hr />

<h3 id="do-the-prompts-change-the-agent">Do the Prompts Change the Agent?</h3>

<p>Helpfully, <a href="https://opencode.ai">OpenCode</a> <a href="https://opencode.ai/docs/agents/#prompt">allows users to specify custom system prompts</a>. With this feature, we can drop in prompts from Kimi, Gemini, Codex and more, removing and swapping instructions to measure their contribution.</p>

<p>We gave SWE-Bench Pro test questions to two agents running the Claude Code harness and calling Opus 4.5: one using the original Claude Code system prompt, and the other armed with Codex’s instructions.</p>

<p>Time and time again, the agent workflows diverged immediately. For example:</p>

<p><img src="/images/prompt_swap.jpg" alt="Claude's system prompt defines a more iterative agent" /></p>

<p>The Codex prompt produced a methodical, documentation-first approach: understand fully, then implement once. The Claude prompt produced an iterative approach: try something, see what breaks, fix it.</p>

<p>This pattern remains consistent over many SWE-Bench Pro problems. If we average the contexts for each model and system prompt pair, we get the following:</p>

<p><img src="/images/swe_bench_prompts.jpg" alt="Swapping system prompts yielded different behavior for each model" /></p>

<div style="margin: 0 0 2em; text-align: center; color: gray; border-bottom: 1px dashed gray;">

  <p>Explore the above figures interactively <a href="https://nilenso.github.io/context-viewer/g/67175678-6244-45bc-b022-238b72f8e646/comparison?sidebar=0&amp;panel=0&amp;legend=compact&amp;sortBy=category&amp;sortDir=asc&amp;cols=5&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/swapping-prompts-swe-tasks.json">in <strong>context viewer</strong></a>.</p>

</div>

<p>All prompt-model combinations correctly answered this subset of SWE-Bench Pro questions. But <em>how</em> they succeeded was rather different. The system prompts shaped the workflows.</p>

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>I explored <a href="https://blog.nilenso.com/blog/2026/02/12/codex-cli-vs-claude-code-on-autonomy/">Codex CLI and Claude Code autonomy</a>, and how the system prompt may shape their behavior.</p>

</div>

<hr />

<h3 id="system-prompts-deserve-more-attention">System Prompts Deserve More Attention</h3>

<p>Last week, when Opus 4.6 and Codex 5.3 landed, people began putting them through their paces, trying to decide which would be their daily driver. Many tout the capabilities of one option over another, but just as often the complaints are about approach, tone, or other discretionary choices. Further, it seems every week brings discussion of a new coding harness, especially for managing swarms of agents.</p>

<p>There is markedly less discussion about the system prompts that define the behaviors of these agents<sup id="fnref:foursix" role="doc-noteref"><a href="#fn:foursix" class="footnote" rel="footnote">2</a></sup>. System prompts define the UX and smooth over the rough edges of models. They’re given to the model with <em>every</em> instruction, yet we prefer to talk Opus vs. GPT-5.3 or Gastown vs. Pi.</p>

<p>Context engineering starts with the system prompt.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:exfiltrated" role="doc-endnote">
      <p>Exfiltrated system prompts represent versions of the system prompt for a given session. They’re not 100% canonical, as many AI harnesses assemble system prompts from multiple snippets depending on the task at hand. But given the consistent manner in which we can extract these prompts, and comparing them with <a href="https://platform.claude.com/docs/en/release-notes/system-prompts">public</a> <a href="https://github.com/openai/codex/blob/d452bb3ae5b5e0f715bba3a44d7d30a51b5f28ae/codex-rs/core/prompt.md">examples</a>, we feel they are sufficiently representative for this analysis. <a href="#fnref:exfiltrated" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:foursix" role="doc-endnote">
      <p>Though you can use Mario’s <a href="https://cchistory.mariozechner.at/?from=2.1.31&amp;to=2.1.34">system prompt diff tool to explore the changes accompanying Opus 4.6’s release</a>. <a href="#fnref:foursix" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Reinforcement Learning with GRPO]]></title>
    <link href="http://blog.nilenso.com/blog/2026/01/28/reinforcement-learning-with-grpo/"/>
    
    <updated>2026-01-28T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/01/28/reinforcement-learning-with-grpo</id>
    <summary type="html"><![CDATA[Up until early 2024, Reinforcement Learning with Human Feedback (RLHF) was the gold standard for post-training models. The InstructGPT paper demonstrates how fine-tuning a small 1.3B model on human feedback using the Proximal Policy Optimization (PPO) algorithm can produce outputs that are preferred to those produced by a large 175B model. This process is effective, but operationally heavy, as shown in this diagram from the paper.
]]></summary>
    
    <content type="html"><![CDATA[<p>Up until early 2024, <strong>Reinforcement Learning with Human Feedback (RLHF)</strong> was the gold standard for post-training models. The <a href="https://arxiv.org/pdf/2203.02155">InstructGPT paper</a> demonstrates how fine-tuning a small 1.3B model on human feedback using the <strong>Proximal Policy Optimization (PPO)</strong> algorithm can produce outputs that are preferred to those produced by a large 175B model. This process is effective, but operationally heavy, as shown in this diagram from the paper.</p>

<p><img src="/images/blog/instructgpt-pipeline.png" alt="InstructGPT Training Pipeline" title="InstructGPT Training Pipeline" /></p>

<p>The above process has some drawbacks:</p>

<ul>
  <li>Data preparation is difficult and labor-intensive</li>
  <li>In a simplified view, RLHF with PPO requires training two models: the policy model, initialized from a base model, and the reward model trained from human preferences. This increases the training complexity and compute/memory resources dramatically.</li>
</ul>

<p>Is there a more resource-efficient way to train a model?</p>

<h1 id="reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards</h1>

<p><a href="https://arxiv.org/pdf/2402.03300">DeepSeek’s paper</a> in early 2024 demonstrated a powerful RL technique using verifiable rewards instead of a learned reward model. This approach was later named <strong>Reinforcement Learning with Verifiable Rewards (RLVR)</strong>.</p>

<p>The uniqueness of RLVR is in the way it assigns rewards. While RLHF uses a reward model, RLVR uses a verifiable, rule-based mechanism for computing rewards. This works best when correctness can be deterministically checked. For math problems, you can compare with the ground truth. For code generation, you can run tests. The absence of a reward model significantly reduces the computational resources needed and the complexity of the training pipeline.</p>
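<p>To make the contrast with a learned reward model concrete, here is a minimal sketch of what a verifiable reward for a math task might look like. The <code class="language-plaintext highlighter-rouge">Answer:</code> convention here is an assumption for illustration; real pipelines parse answers more robustly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def math_reward(completion: str, ground_truth: str) -&gt; float:
    """Return 1.0 if the completion's final answer matches the ground truth."""
    lines = completion.strip().splitlines()
    if not lines:
        return 0.0
    # Assume the last line carries the answer, e.g. "Answer: 42" (Python 3.9+).
    answer = lines[-1].removeprefix("Answer:").strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
</code></pre></div></div>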

<p>The paper also introduced <strong>Group Relative Policy Optimization (GRPO)</strong>, a new policy algorithm that learns by comparing rewards among multiple sampled completions for a prompt. Since it does not require a reward model, it drastically reduces the memory required for training when compared to PPO-based RLHF.</p>

<h1 id="how-grpo-works">How GRPO works</h1>

<p>As the name suggests, Group Relative Policy Optimization (GRPO) works by computing the relative advantage of completions within a group to inform the model’s learning.</p>

<p>A step in model training using this method can be broken down into the following (simplified) steps:</p>
<ol>
  <li>Generate group of completions for a prompt</li>
  <li>Compute reward for each completion and then compute average reward for the group</li>
  <li>Compute relative advantage for each completion with respect to the average reward (<code class="language-plaintext highlighter-rouge">completion_reward - mean_reward</code>; a toy sketch follows this list)</li>
  <li>Compute loss for each completion using its advantage and token log probabilities</li>
  <li>Compute loss for the group</li>
  <li>Aggregate loss across prompts in a batch</li>
  <li>Compute final loss using the batch aggregate, the learning rate (to control the magnitude of the gradient update) and KL coefficient, the guardrail that prevents the new model from drifting too much from the base model</li>
  <li>Update parameters to move the model towards completions with higher advantages</li>
</ol>
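
<p>As a toy numeric sketch of steps 1 to 3, assume a group of four completions with made-up binary rewards. Note that implementations such as TRL’s GRPOTrainer typically also divide by the group’s standard deviation to normalise the advantage.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rewards for a group of 4 sampled completions of one prompt (made-up values).
rewards = [1.0, 0.0, 1.0, 1.0]

# Step 2: average reward for the group.
mean_reward = sum(rewards) / len(rewards)        # 0.75

# Step 3: relative advantage of each completion within its group.
advantages = [r - mean_reward for r in rewards]  # [0.25, -0.75, 0.25, 0.25]
</code></pre></div></div>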

<p>This creates a self-improving feedback loop where the model tends to get better every step by using the data from the previous step. Now that we have a high-level view of the process, let’s look at an example training pipeline to understand the setup.</p>

<h1 id="training-with-grpo">Training with GRPO</h1>

<p>Here’s an example that trains a small model to generate Python code using <a href="https://huggingface.co/docs/trl/main/en/grpo_trainer">GRPOTrainer</a> from HuggingFace’s <a href="https://huggingface.co/docs/trl/index">trl</a> library. This is intentionally kept minimal with a single binary reward function and a few samples from the dataset for testing. There is a lot more nuance in designing reward functions and tuning the training parameters for a production setup.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">tempfile</span>
<span class="kn">import</span> <span class="n">subprocess</span>

<span class="kn">import</span> <span class="n">torch</span>
<span class="kn">from</span> <span class="n">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="kn">from</span> <span class="n">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForCausalLM</span>
<span class="kn">from</span> <span class="n">trl</span> <span class="kn">import</span> <span class="n">GRPOConfig</span><span class="p">,</span> <span class="n">GRPOTrainer</span>


<span class="c1"># This runs unit tests (verifiable reward) and gives 1.0 for pass, 0.0 otherwise.
</span><span class="k">def</span> <span class="nf">reward_fn</span><span class="p">(</span><span class="n">prompts</span><span class="p">,</span> <span class="n">completions</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Return 1.0 if the completion passes the provided unit tests else 0.0.</span><span class="sh">"""</span>

    <span class="n">tests</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">[</span><span class="sh">"</span><span class="s">test</span><span class="sh">"</span><span class="p">]</span>
    <span class="n">entry_points</span> <span class="o">=</span> <span class="n">kwargs</span><span class="p">[</span><span class="sh">"</span><span class="s">entry_point</span><span class="sh">"</span><span class="p">]</span>

    <span class="n">batch_size</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">prompts</span><span class="p">)</span>
    <span class="n">k</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">completions</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span>

    <span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">completion</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">completions</span><span class="p">):</span>
        <span class="n">j</span> <span class="o">=</span> <span class="n">i</span> <span class="o">//</span> <span class="n">k</span>
        <span class="n">prompt</span> <span class="o">=</span> <span class="n">prompts</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
        <span class="n">test_code</span> <span class="o">=</span> <span class="n">tests</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
        <span class="n">entry_point</span> <span class="o">=</span> <span class="n">entry_points</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>

        <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="nc">TemporaryDirectory</span><span class="p">()</span> <span class="k">as</span> <span class="n">td</span><span class="p">:</span>
            <span class="n">sol_path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">td</span><span class="p">,</span> <span class="sh">"</span><span class="s">solution.py</span><span class="sh">"</span><span class="p">)</span>
            <span class="n">test_path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">td</span><span class="p">,</span> <span class="sh">"</span><span class="s">test_solution.py</span><span class="sh">"</span><span class="p">)</span>

            <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">sol_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">completion</span><span class="p">)</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sh">"</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

            <span class="c1"># Run the dataset's tests against the candidate function.
</span>            <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">test_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">from solution import </span><span class="si">{</span><span class="n">entry_point</span><span class="si">}</span><span class="s"> as candidate</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">test_code</span><span class="p">)</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sh">"</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>
                <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sh">"</span><span class="s">check(candidate)</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

            <span class="k">try</span><span class="p">:</span>
                <span class="n">proc</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span>
                    <span class="p">[</span><span class="sh">"</span><span class="s">python</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-m</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">pytest</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-q</span><span class="sh">"</span><span class="p">,</span> <span class="n">test_path</span><span class="p">],</span>
                    <span class="n">cwd</span><span class="o">=</span><span class="n">td</span><span class="p">,</span>
                    <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                    <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                    <span class="n">timeout</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
                <span class="p">)</span>
                <span class="n">passed</span> <span class="o">=</span> <span class="n">proc</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span>
                <span class="n">rewards</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="mf">1.0</span> <span class="k">if</span> <span class="n">passed</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">)</span>
            <span class="k">except</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">TimeoutExpired</span><span class="p">:</span>
                <span class="n">rewards</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">rewards</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">train_dataset</span> <span class="o">=</span> <span class="nf">load_dataset</span><span class="p">(</span><span class="sh">"</span><span class="s">openai/openai_humaneval</span><span class="sh">"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="sh">"</span><span class="s">test</span><span class="sh">"</span><span class="p">).</span><span class="nf">select</span><span class="p">(</span>
        <span class="nf">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
    <span class="p">)</span>

    <span class="n">model_name</span> <span class="o">=</span> <span class="sh">"</span><span class="s">Qwen/Qwen2.5-Coder-0.5B-Instruct</span><span class="sh">"</span>
    <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="nf">from_pretrained</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="nf">from_pretrained</span><span class="p">(</span>
        <span class="n">model_name</span><span class="p">,</span>
        <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
        <span class="n">device_map</span><span class="o">=</span><span class="sh">"</span><span class="s">auto</span><span class="sh">"</span><span class="p">,</span>
    <span class="p">)</span>

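    <span class="c1"># num_generations is the GRPO group size: the number of completions</span>
    <span class="c1"># sampled per prompt, whose rewards are compared to compute advantages.</span>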
    <span class="n">config</span> <span class="o">=</span> <span class="nc">GRPOConfig</span><span class="p">(</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="sh">"</span><span class="s">./train_out</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">do_train</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">learning_rate</span><span class="o">=</span><span class="mf">5e-7</span><span class="p">,</span>
        <span class="n">num_train_epochs</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
        <span class="n">num_generations</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">generation_batch_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">max_prompt_length</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
        <span class="n">max_completion_length</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">trainer</span> <span class="o">=</span> <span class="nc">GRPOTrainer</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">args</span><span class="o">=</span><span class="n">config</span><span class="p">,</span>
        <span class="n">processing_class</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
        <span class="n">train_dataset</span><span class="o">=</span><span class="n">train_dataset</span><span class="p">,</span>
        <span class="n">reward_funcs</span><span class="o">=</span><span class="n">reward_fn</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">trainer</span><span class="p">.</span><span class="nf">train</span><span class="p">()</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>

<p>Despite the simplicity and efficiency of GRPO, it is not a one-size-fits-all solution. RLVR with GRPO works well for problems that can be deterministically verified, but not for subjective or open-ended tasks like improving prose.</p>

<p>In the next post, I’ll walk through a GRPO training experiment for code generation and take a look at different aspects of the pipeline: dataset, reward functions, evals, and lessons learned. More soon.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[How the Lobsters front page works]]></title>
    <link href="http://blog.nilenso.com/blog/2026/01/20/lobsters-front-page/"/>
    
    <updated>2026-01-20T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2026/01/20/lobsters-front-page</id>
    <summary type="html"><![CDATA[Lobsters is a computing-focused community centered around link aggregation and discussion.
]]></summary>
    
    <content type="html"><![CDATA[<p><a href="https://lobste.rs">Lobsters</a> is a computing-focused community centered around link aggregation and discussion.</p>

<p>The <a href="https://github.com/lobsters/lobsters">code is open source</a>, so I had a look at how the front page algorithm works.</p>

<p>This is it:</p>

\[\textbf{hotness} = -1 \times (\text{base} + \text{order} \times \text{sign} + \text{age})\]

\[\text{hotness} \downarrow \implies \text{rank} \uparrow\]

<p>The page is sorted in ascending order by \( \textbf{hotness} \). The more negative the value of \( \textbf{hotness} \), the higher the story ranks.</p>

<p>You can skip straight to the <a href="#explore">interactive front page</a> to help get a feel for the front page dynamics.</p>

<h2 id="base">Base</h2>

<p>The \( \textbf{base} \) is added to the order term to incentivise certain types of posts, and influence the initial ranking. It is the sum of the hotness modifiers (a value between \( -10 \) and \( +10 \)) of all the tags in that story.</p>

\[\textbf{base} = \sum_{t \in \text{tags}} \text{hotness_mod}_t + \begin{cases} 0.25 &amp; \text{if self-authored link} \\ \ 0 &amp; \text{otherwise} \end{cases}\]

<p>Some tags (like <code class="language-plaintext highlighter-rouge">culture</code> or <code class="language-plaintext highlighter-rouge">rant</code>) have negative “hotness modifiers”, which penalise their initial rank. Authors submitting their own content get a tiny boost, which is mildly surprising given the otherwise strict self-promo rules. The \( \textbf{base} \) has a modest effect on the hotness compared to \( \textbf{order} \) and \( \textbf{age} \).</p>

<h2 id="order">Order</h2>

<p>The value of \( \textbf{order} \) is derived from the engagement that a story gets.</p>

\[\textbf{order} = \log_{10}\left(\max\left(|\text{score} + 1| + \text{cpoints}, 1\right)\right)\]

<p>The progression of the order term is logarithmic—this means going from 0 to 100 votes increases the rank far more than going from 1000 to 1100 votes.</p>

<p>The \( \textbf{cpoints} \) term, added to the story score, accounts for non-submitter comment upvotes (a comment upvote is worth half a story upvote). If the \( \textbf{base} \) is negative (as is the case for a freshly submitted <code class="language-plaintext highlighter-rouge">rant</code>), then this term is zeroed, making the comments effectively contribute nothing to the rank.</p>

\[\text{comment_points} = \begin{cases} 0 &amp; \text{if } \text{base} &lt; 0 \\ \frac{1}{2}\sum(\text{comment_scores} + 1) &amp; \text{otherwise} \end{cases}\]

\[\textbf{cpoints} = \min(\text{comment_points}, \text{story_score})\]

<p>The \( \textbf{cpoints} \) can never exceed the story score. Therefore, stories that have a low score but lots of highly upvoted comments—perhaps a signature of controversy-generating low-quality submissions—do not get boosted by comment upvotes.</p>

<p>There are some details around merged stories that I am leaving out for the sake of simplifying this explanation. But it roughly does what you’d expect.</p>

<h2 id="sign">Sign</h2>

<p>If a story gets flagged enough to make the story score negative (a flag is effectively a downvote), the \( \textbf{sign} \) becomes negative.</p>

\[\textbf{sign} = \begin{cases} -1 &amp; \text{if score} &lt; 0 \\ +1 &amp; \text{if score} &gt; 0 \\ 0 &amp; \text{otherwise} \end{cases}\]

<p>The \( \textbf{sign} \) doesn’t actually seem to do anything in practice! The \( \textbf{order} \) is always zero whenever the score is less than zero. That’s because comment points are clamped to \( \min(\text{score}, \text{cpoints}) \), so the log input never exceeds 1 for non‑positive scores. For positive scores, \( \textbf{sign} = 1 \) anyway. So in practice the \( \textbf{sign} \) term never changes the result.</p>

<h2 id="age">Age</h2>

<p>The value of \( \textbf{age} \) is fixed at the time of submission. This is the unix timestamp at which the story was created, divided by a configurable \( \textbf{hotness_window} \) time. The \( \textbf{hotness_window} \) is 22 hours by default—this means that the value of \( \textbf{age} \) increases by \( \text{1} \) unit every 22 hours.</p>

\[\textbf{age} = \frac{\text{created_at_timestamp}}{\text{hotness_window}}\]

<p>This value grows <strong>linearly</strong> with every newer story, pushing older stories down the rankings. The main tension in this algorithm is the fact that the \( \textbf{order} \) (dictated by score) grows <strong>logarithmically</strong>, so upvotes need to increase exponentially over time to counter the effect of \( \textbf{age} \) in order to stay on the front page. Father time comes for us all.</p>

<h2 id="in-a-nutshell">In a nutshell</h2>

\[\textbf{hotness} = -1 \times (\text{base} + \text{order} \times \text{sign} + \text{age})\]

\[\text{hotness} \downarrow \implies \text{rank} \uparrow\]

<p>Where \( \textbf{base} \) is initialised based on the tags and who submitted the story, \( \textbf{age} \) increases linearly for every new submission, and the \( \textbf{order} \) for a story, as determined by votes, increases logarithmically.</p>
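
<p>To tie the pieces together, here is a small Python sketch of the whole calculation. This is a paraphrase of the formulas above, not the actual Ruby from the Lobsters codebase, and it assumes timestamps and the hotness window are both in seconds.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

HOTNESS_WINDOW = 22 * 60 * 60  # 22 hours, in seconds

def hotness(score, comment_scores, tag_mods, self_authored, created_at):
    """Sketch of the Lobsters hotness score (more negative = ranked higher)."""
    base = sum(tag_mods) + (0.25 if self_authored else 0.0)

    # Comment upvotes are worth half a story upvote, are zeroed for
    # negative-base stories, and can never exceed the story score.
    comment_points = 0.0 if base &lt; 0 else 0.5 * sum(c + 1 for c in comment_scores)
    cpoints = min(comment_points, score)

    sign = -1 if score &lt; 0 else (1 if score &gt; 0 else 0)
    order = math.log10(max(abs(score + 1) + cpoints, 1))
    age = created_at / HOTNESS_WINDOW  # grows linearly for newer stories

    return -1 * (base + order * sign + age)
</code></pre></div></div>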

<h2 id="explore">Explore</h2>

<p>Heads up—enable JavaScript to make this part work. This was mostly vibecoded, with me verifying that the results match the algorithm.</p>

<div style="width: 100vw; position: relative; left: 50%; right: 50%; margin-left: -50vw; margin-right:
 -50vw;">
     <iframe src="https://gisthost.github.io/?2a49b5d2fcb2960ebf3c2e04dd680a3a" width="100%" height="1600" style="border: none; max-width: 1100px; display: block; margin: 0 auto;"></iframe>
</div>

<p>There’s <a href="https://gisthost.github.io/?2a49b5d2fcb2960ebf3c2e04dd680a3a">a gisthost link</a> if you want to play with it as a standalone tool.</p>

<h2 id="thoughts">Thoughts</h2>

<p>The algorithm is solid. It allows new stories to get their time in the sun, and correctly penalises low-quality content that generates a lot of heated discussion. If there is heated discussion, it’s usually over highly-upvoted posts. Over time, age always dominates upvotes, so no story can really stick around that long on the front page. There are gates that stop overly flagged stories from making any progress up the ranks.</p>

<p>That said, I don’t think the algorithm really makes the site what it is. The character of the site is more the result of its opinionated moderation, narrow computing focus and the gradual acculturation through the invite system. Compared to many other forums, there’s less junk and also little outright hostility, racism, sexism or other isms. The community has surfaced lots of niche topics and writers, which I enjoy.</p>

<p>Yet, my experience on the website has been far from ideal. For me, this is rooted in a disconnect of values with the group most engaged on the site, whose votes and discussions drive the climate. I do not appreciate the cynicism worn with pride, the unproductive gotchas, the long polemics that reveal that the commenter hasn’t read beyond the title, the throwaway venting and the debates where it is clear that neither side wants to actually refine their world model. It has driven me away from engaging more on the site.</p>

<p>Studying the algorithm has shown me that disengaging would make my problem worse—a single user’s participation can be worth a lot. Early upvotes really count, and can easily boost a post to the front page. If you are lurking on the site and are unsatisfied, consider exercising your votes and submissions more. Post the more nuanced, friendly and curious comments that you’d like to see more of. It really does matter. I will likely change how I participate on the site as a result of this.</p>

<p>After all, there aren’t all that many relatively quiet and straightforwardly serious public forums to contrast the twitters and HNs of the world that can surface niche computing curiosities.</p>

<hr />

<p>This post originally appeared on <a href="https://atharvaraykar.com/lobsters/">Atharva’s personal site</a>.</p>

<hr />

<p><em>Update: Issued a correction from the sign term, thanks to <a href="https://lobste.rs/s/ngwloq/how_lobsters_front_page_works#c_jitoot">strongoose</a>.</em></p>

<hr />

<p>See discussion on <a href="https://lobste.rs/s/ngwloq/how_lobsters_front_page_works">Lobsters</a> and <a href="https://news.ycombinator.com/item?id=46669996">Hacker News</a>.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Minimum Viable Benchmark]]></title>
    <link href="http://blog.nilenso.com/blog/2025/11/28/minimum-viable-benchmark/"/>
    
    <updated>2025-11-28T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/11/28/minimum-viable-benchmark</id>
    <summary type="html"><![CDATA[A few months ago, I was co-facilitating a “Birds of a Feather” session on keeping up with AI progress. This was a group of engineering leaders and ICs.
]]></summary>
    
    <content type="html"><![CDATA[<p>A few months ago, I was co-facilitating <a href="https://hasgeek.com/fifthelephant/2025/sub/birds-of-feather-bof-session-finding-signal-in-a-n-8hrRznGe3qf6e7zXxKwcDi">a “Birds of a Feather” session</a> on keeping up with AI progress. This was a group of engineering leaders and ICs.</p>

<p>A big talking point was that popular public benchmarks are insufficient for determining if an AI model is a good fit for their product.</p>

<figure>
  <img src="/images/blog/bof.jpg" alt="Collage from the Birds of a Feather session showing co-facilitator Lavanya Tekumala, developers discussing benchmarks and AI-assisted coding, and a whiteboard featuring the word 'benchmark' three times" style="display: block; width: 70%; margin: 0 auto;" />
  <figcaption>clockwise: (1) My co-facilitator Lavanya Tekumala. (2) Developers talking about benchmarks and AI-assisted coding. (3) The whiteboard from the session which featured the word "benchmark" three times.</figcaption>
</figure>

<p>I want to sharpen this observation a bit more.</p>

<h2 id="what-are-benchmarks-useful-for">What are benchmarks useful for?</h2>

<p>I’ve seen benchmarks serve a whole range of purposes.</p>

<ul>
  <li><strong>Benchmarks as decision-making tools</strong>: You look at existing benchmarks to figure out whether to use model A or model B.</li>
  <li><strong>Benchmarks as regression markers</strong>: Like unit tests, they tell you if your updated AI model or system isn’t doing worse than before. This is especially useful in cost-optimisation exercises.</li>
  <li><strong>Benchmarks as improvement indicators</strong>: If you see benchmark go up, you can tell that your change to the model or system is improving the outcome.</li>
  <li><strong>Benchmarks as product behaviour feedback</strong>: A more subtle use—with the right analysis of trajectories, benchmarks can tell you about the strengths and weaknesses of your model across categories of tasks you are interested in.</li>
  <li><strong>Benchmarks as research agenda setters</strong>: When a new benchmark is published, AI labs start hill-climbing on it—publishing benchmarks is a great way to influence what AI is good at.</li>
  <li><strong>Benchmarks as RL environments</strong>: This is an emerging use case. Reinforcement Learning with Verifiable Rewards effectively works with a setup that doesn’t look all that different from a benchmark.</li>
  <li><strong>Benchmarks as forecasting anchors</strong>: You can use benchmarks to get a sense of how AI capabilities are progressing over time. <a href="https://evaluations.metr.org/gpt-5-1-codex-max-report/#extrapolating-on-trend-improvements-in-next-6-months">METR</a> has made <a href="https://joel-becker.com/images/publications/forecasting_time_horizon_under_compute_slowdown.pdf">good use</a> of this.</li>
</ul>

<p>If a benchmark is not helping you with any of the above, your benchmark is useless. Many useless benchmarks unfortunately exist.</p>

<h2 id="benchmark-traps">Benchmark traps</h2>

<p>Here’s the Artificial Analysis Intelligence index, which aggregates all sorts of AI benchmarks.</p>

<p><img src="/images/blog/aa-intelligence-index.png" alt="" /></p>

<p>And here’s the most popular benchmark for testing coding ability.</p>

<p><img src="/images/blog/swe-bench-chart-2025-11-27.png" alt="" /></p>

<p>These charts in isolation give the impression that AI models are pretty interchangeable and that whenever a new model comes in, you can reap the fruits of the wonderful frontier lab training pipelines. All you need to do is to switch your coding model to whatever the new hotness is. Right?</p>

<p>No.</p>

<p>The issue with benchmarks is that they are lossy. They condense multidimensional characteristics into a single number<sup id="fnref:gleechpaper" role="doc-noteref"><a href="#fn:gleechpaper" class="footnote" rel="footnote">1</a></sup>. Your business case may not look like whatever your number represents.</p>

<p>Let’s take an example. You’re working on an AI agent that operates in the legal domain. A profoundly unserious approach would be to look at which model is doing well across standard benchmarks (like the intelligence index above) and pick that. If we put a couple of extra brain cells to work, we might look at an independent benchmark score for the most popular legal benchmark. Right now this is LegalBench.</p>

<p><img src="/images/blog/legalbench.png" style="width: 50%" /></p>

<p>Great, so it’s still the state-of-the-art Gemini 3 Pro, isn’t it? It’s clearly #1 on the benchmark<sup id="fnref:statsig" role="doc-noteref"><a href="#fn:statsig" class="footnote" rel="footnote">2</a></sup>.</p>

<p>But look at this—there’s a CaseLaw (v2) benchmark as well.</p>

<p><img src="/images/blog/caselawv2.png" style="width: 50%" /></p>

<p>No Gemini 3 Pro in sight. Have they forgotten to bench our frontier Gemini model here? Actually no.</p>

<p>Gemini 3 Pro is poor enough at this benchmark that it’s nowhere near the top of the leaderboard. In fact, it ranks #39 and is worse than the previous-generation Gemini 2.5 Flash!</p>

<p>Both of these are measuring different things in the legal domain, with CaseLaw appearing more like real-world legal work, and LegalBench being more like an academic exam. It’s quite possible that Gemini can be good at some parts of some domains and poor at other parts of the same domain. Or maybe the CaseLaw evaluation has some unaddressed issues (after all, there seem to be a lot of surprising results in the leaderboard). Or that Gemini hates Canadians.</p>

<p>This all points to one thing—don’t base your decision off benchmark scores. Instead, look at the benchmark contents and methodology, figure out how closely it aligns with what tasks you are handing off to the AI and most importantly, <strong>make your own internal benchmark</strong>, with <strong>metrics aligned to your business case</strong><sup id="fnref:nilbench" role="doc-noteref"><a href="#fn:nilbench" class="footnote" rel="footnote">3</a></sup>.</p>

<figure>
  <img src="/images/blog/eqtweet.png" style="display: block; width: 70%; margin: 0 auto;" />
  <figcaption>Another reason to have internal benchmarks. Not all new models may be better than what came before for your use case.</figcaption>
</figure>

<h2 id="minimum-viable-benchmark">Minimum viable benchmark</h2>

<p>Without getting into the weeds of categorisations, I’d note that internal benchmarks are not all that different from what all the hip and cool new AI Engineering teams like to call <em>evals</em>.</p>

<p>They are not structurally different from public benchmarks. You have your dataset of tasks. You (ideally) have your ground truth for these tasks. You measure your AI system against these tasks and get scores. Unfortunately, building a public benchmark is hard work—you have to collect a lot of data to get signal<sup id="fnref:statsigcount" role="doc-noteref"><a href="#fn:statsigcount" class="footnote" rel="footnote">4</a></sup>, ensure the environments are reproducible and your metrics trustworthy. This <a href="https://www.lesswrong.com/posts/EFQ3F6kmt4WHXRqik/ugh-fields">ugh field</a> has pushed teams away from building evals. Well, at least until it’s too late, when you suddenly have everyone scrambling to do the grunt work of collecting annotated high-quality data when the house is burning.</p>

<p>I’d like to propose an alternate view—your internal benchmarks don’t need to be as sophisticated as the public benchmarks. They only have to be a <em>minimum viable benchmark</em>.</p>

<p>A minimum viable benchmark is not concerned with being an arena for competing AI systems—it is a vehicle for figuring out <em>whether you are building the right product</em> and whether the product works well.</p>

<p>You don’t need to have an intelligent-sounding metric or your LLM eval SaaS vendor figured out in order to get started. You only need to collect your data and annotate it. You can get started and make a lot of progress in a couple of hours, armed with only a spreadsheet and your product and engineering teams in one room.</p>

<p>In your sheet, ensure you have the inputs to your AI system. Add the outputs after a few runs of the system on the tasks you need<sup id="fnref:notevensystem" role="doc-noteref"><a href="#fn:notevensystem" class="footnote" rel="footnote">5</a></sup>. Add free-form commentary in the last column about how it did. Don’t optimise anything yet. Don’t add any “metrics” yet.</p>
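
<p>If it helps to make this concrete, the entire starting artefact fits in a few lines. The column names below are illustrative, not a prescribed schema; a shared spreadsheet works just as well.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv

# A minimum viable benchmark is just rows of inputs, outputs, and notes.
rows = [
    {
        "input": "Draft a reply to this refund request: ...",
        "output": "(paste what the AI system produced)",
        "commentary": "(free-form notes on how it did)",
    },
]

with open("minimum_viable_benchmark.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "commentary"])
    writer.writeheader()
    writer.writerows(rows)
</code></pre></div></div>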

<p>After this exercise, a few things happen:</p>

<ul>
  <li>You realise what your task is actually like and what it might involve.</li>
  <li>You realise whether the AI works at all for your task.</li>
  <li>You realise what it feels like to be a user of your system and get a better sense of where AI is actually helping. This is input for the product team.</li>
  <li>You realise what actually needs to be measured for your benchmark metrics. It’s never the vague, pointless metrics that came with the eval framework you were looking at.</li>
  <li>The valuable metrics inferred from this exercise are often useful product metrics!</li>
  <li>You catch the biggest blind spots of the AI system very early on. Gathering large datasets is needed only when you are trying to catch <em>small effects</em>. Early on, most of the observed effects of any intervention will be quite large!</li>
  <li>Most importantly, you have overcome the Ugh Field! This exercise is often fun.</li>
</ul>

<p>This minimum viable benchmark will have already proven its usefulness early on. Everyone in your team will continue to build on top of it and rely on it when, inevitably, you have to avoid regressions, evaluate a new feature or model, or optimise costs. Over time, your minimum viable benchmark can grow into a useful, strong benchmark that forms the backbone of your AI project.</p>

<p>How we go from a minimum viable benchmark to a maximally useful benchmark would perhaps need its own article. But to give you a taste, ensure you have these properties:</p>

<ul>
  <li>It’s easy to look at the data, and your cross-functional team is involved in reviewing it regularly.</li>
  <li>What you are measuring maps to product outcomes—this may not be the case for public benchmarks.</li>
  <li>There are enough samples to give you a real sense of whether your system has actually improved (a quick way to check this is sketched after this list).</li>
  <li>The tasks have a difficulty ramp-up, so the benchmark keeps capturing improvements to models and systems. If most of your tasks sit at the same difficulty, and a newly released AI model gains the ability to do tasks at that level, your benchmark would get saturated overnight and cease to capture further improvements.</li>
  <li>The metrics are measured either deterministically or with an unbiased estimator<sup id="fnref:llmjudge" role="doc-noteref"><a href="#fn:llmjudge" class="footnote" rel="footnote">6</a></sup>.</li>
</ul>
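
<p>On the “enough samples” point: comparing two variants on the <em>same</em> tasks is far more sensitive than comparing two overall percentages. A minimal sketch of a paired bootstrap check, with made-up per-task pass/fail results:</p>

<pre><code class="language-python">import random

# Pass/fail results for two variants on the SAME tasks (made-up data).
a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # current system
b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]  # candidate (new model, new prompt, ...)

def bootstrap_diff_ci(a, b, iters=10_000, seed=0):
    """95% confidence interval for the improvement in pass rate."""
    rng = random.Random(seed)
    n, diffs = len(a), []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks
        diffs.append(sum(b[i] - a[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for improvement: [{lo:+.2f}, {hi:+.2f}]")
# If the interval contains 0, the "improvement" may just be noise.
</code></pre>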

<h2 id="anyway">Anyway,</h2>

<ul>
  <li>Don’t trust public benchmark numbers without seeing if the methodology and numbers map to your product outcomes.</li>
  <li>Build your own minimum viable benchmark, where what you are measuring maps to your product’s quality.</li>
  <li>It’s not that hard to start with, and it’s really worth it.</li>
</ul>

<hr />

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:gleechpaper" role="doc-endnote">
      <p>As I was writing this article, I came across Gavin Leech’s <a href="https://www.gleech.org/paper#:~:text=Even%20less%20generalisation">Paper AI Tigers</a>, which goes deeper into all the ways in which benchmarks fail to generalise to other tasks. <a href="#fnref:gleechpaper" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:statsig" role="doc-endnote">
      <p>Clearly #1 by a statistically insignificant amount. I’ve almost never seen anyone reason about whether the score differential is due to random noise or an actual effect. <a href="#fnref:statsig" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:nilbench" role="doc-endnote">
      <p>Our internal benchmark for <a href="/blog/2025/09/15/ai-unit-of-work/">StoryMachine</a> has already caught on to the fact that Sonnet 4.5 is a lousy User Acceptance Tester compared to GPT-5. This is not something that would have been obvious from public benchmarks. When Opus 4.5 came out, I was able to immediately run the benchmark and confirm that there was indeed an improvement on that front. This becomes critical as the models get smarter and <a href="https://simonwillison.net/2025/Nov/24/claude-opus/#:~:text=The%20frontier,Diamond">it gets harder to figure out what they are good at</a>. <a href="#fnref:nilbench" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:statsigcount" role="doc-endnote">
      <p>Chip Huyen’s AI Engineering book brought this handy heuristic chart to my attention. It works well for binary classification evals (it makes some assumptions about the data being somewhat independent, so treat it more like a heuristic).</p>
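
      <p>For intuition, here is the standard back-of-the-envelope arithmetic behind charts like this (my gloss, not necessarily the book’s exact assumptions): to estimate a pass rate within a margin of error ε at 95% confidence, you need roughly n = 1.96² · p(1−p) / ε² samples. At the worst case of p = 0.5, that is about 385 samples for ±5%, and about 9,600 samples for ±1%.</p>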

      <p><img src="/images/blog/huyenheuristic.png" alt="" /> <a href="#fnref:statsigcount" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:notevensystem" role="doc-endnote">
      <p>Sometimes you don’t need a working system at all. If your use case supports it, I just paste the prompt we would use into ChatGPT or Claude. Or, if the work is more “agentic”, I’d send it to Claude Code or OpenHands. <a href="#fnref:notevensystem" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:llmjudge" role="doc-endnote">
      <p>I have seen a lot of LLM-as-a-judge setups that are quite unprincipled and do not address the rather basic question of “who judges the judge?”. To date, I have found only two principled ways to do this—the first is Eugene Yan’s <a href="https://eugeneyan.com/writing/product-evals/">Product Evals Recipe</a>, where you measure the judge’s agreement with human annotations, and align the judges accordingly. The other one is <a href="https://arxiv.org/abs/2511.21140v1">this paper</a>, which proposes a statistically sound way to report LLM judge metrics with bias-adjusted accuracy and confidence intervals. Both approaches are complementary. <a href="#fnref:llmjudge" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
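
      <p>As a minimal sketch of the “measure the judge’s agreement” step (my illustration, not Eugene Yan’s exact recipe), Cohen’s kappa over a jointly labelled sample works:</p>

<pre><code class="language-python">from sklearn.metrics import cohen_kappa_score

# Labels for the same samples: human annotations vs the LLM judge (toy data).
human = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass"]

# Kappa corrects raw agreement for agreement expected by chance:
# ~0 means the judge is no better than chance, 1.0 is perfect agreement.
print(cohen_kappa_score(human, judge))
</code></pre>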
    </li>
  </ol>
</div>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[How to work with Product: Taste and Adjust]]></title>
    <link href="http://blog.nilenso.com/blog/2025/11/26/how-to-work-with-product-taste-and-adjust/"/>
    
    <updated>2025-11-26T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/11/26/how-to-work-with-product-taste-and-adjust</id>
    <summary type="html"><![CDATA[
  Eh! All of you, come here! Taste it! Taste it! Taste it! Taste it!

  Gordon Ramsay

]]></summary>
    
    <content type="html"><![CDATA[<blockquote>
  <p>Eh! All of you, come here! Taste it! Taste it! Taste it! Taste it!</p>

  <p><strong><em>Gordon Ramsay</em></strong></p>
</blockquote>

<p>If you want to cook a great dish, you’ve got to taste it every step of the way. Taste the ingredients you buy, the components you prepare, and the spices and seasonings. If you can’t taste it, you smell it, feel it, or listen to it. And then you adjust. Taste and adjust until you create a dish you like.</p>

<p><em>“Taste and adjust”</em> is a form of <a href="https://en.wikipedia.org/wiki/Continual_improvement_process">continuous improvement</a> applied to the creation of food. The hallmark methodologies that have led to some of humanity’s best creations, the <a href="https://en.wikipedia.org/wiki/Scientific_method">scientific method</a>, <a href="https://en.wikipedia.org/wiki/Kaizen">Kaizen</a>, <a href="https://en.wikipedia.org/wiki/Toyota_Production_System">TPS</a>, <a href="https://en.wikipedia.org/wiki/PDCA">PDCA</a>, <a href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a>, <a href="https://www.gv.com/sprint/">design sprints</a>, and <a href="https://en.wikipedia.org/wiki/Extreme_programming">extreme programming</a>, are all forms of continuous improvement. At their core is this principle:</p>

<blockquote>
  <p>Creators need an immediate connection to what they’re creating.</p>

  <p><strong><em>Bret Victor, Inventing on Principle</em></strong></p>
</blockquote>

<p>Bret Victor says that <em>“working in the head doesn’t scale”</em>, and that <a href="https://worrydream.com/LearnableProgramming/">understanding comes from seeing data, flow, and state directly</a>. When building products, can you see the data, flow, and state directly? Can you “taste” your product every step to ensure it’s exactly what you and your users want?</p>

<p>The chef’s line-tasting, what we might call a flywheel, harness, environment, or feedback loop, is the framework in which we apply this principle to product creation. The product and engineering functions must build and maintain this flywheel together, every step of the way.</p>

<p><img src="/images/blog/taste-and-adjust.png" alt="taste-and-adjust" /></p>

<h2 id="the-product-development-flywheel">The product development flywheel</h2>

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p>To build the flywheel, we ask:</p>

  <ul>
    <li>“What is the simplest experiment I can run to validate this hypothesis?”, and then</li>
    <li>“What do I need to run this experiment?”</li>
  </ul>

</div>

<p>The machinery that enables running such experiments frequently and quickly is the flywheel.</p>

<p>It could take the form of an operator’s console that allows product to tweak config on the fly, a prototype, a feature flag allowing tests with beta users, or a new metric that removes a blind spot. Even unit tests that verify whether the code does what product intends are part of this flywheel.</p>
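
<p>To make the feature-flag example concrete, here is a minimal sketch; the flag name, store, and rollout logic are illustrative, not a specific library:</p>

<pre><code class="language-python">import os

def flag_enabled(name: str, user_id: str, beta_users: set) -> bool:
    """A deliberately tiny flag store: env vars now, a config service later."""
    if os.environ.get(f"FLAG_{name.upper()}") == "on":
        return True               # globally on: the experiment graduated
    return user_id in beta_users  # otherwise, only beta users see it

if flag_enabled("new_checkout", user_id="u42", beta_users={"u42", "u7"}):
    ...  # render the experimental flow
else:
    ...  # render the current flow
</code></pre>

<p>The point is not the mechanism but the cost: with a check this cheap, “what is the simplest experiment I can run?” stays a small question.</p>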

<p>While this seems like a simple enough principle to apply, in reality, we are faced with the inherent complexity of working with many people, roles, and tools. A typical product development lifecycle (PDLC) looks like the abstract machine shown below. Each phase has controls and measurements around specific feedback loops (such as Idea ⇄ User), and the phases are interconnected through reinforcing and balancing information channels.</p>

<div style="max-width:80%; margin: 0 auto">

  <p><img src="/images/blog/flywheel-an.png" alt="product-development-flywheel" /></p>

</div>

<p>Here’s a list of some ways to “taste” at each phase, and a healthy level of involvement of product and engineering in each of them.</p>

<div class="full-width-table">

  <table>
    <thead>
      <tr>
        <th>Phase</th>
        <th>Feedback Loop</th>
        <th>Feedback tools (ways to taste, smell, or touch)</th>
        <th>Healthy involvement %</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>1. Explore</td>
        <td>Idea ⇄ User</td>
        <td>Pen + Paper, User Research, Design Sprints, Landing Pages, Campaigns</td>
        <td>90% Product,<br /> 10% Engineering</td>
      </tr>
      <tr>
        <td>2. Validate</td>
        <td>Hypothesis ⇄ User</td>
        <td>Wireframes, Prototypes, Proofs of Concept</td>
        <td>70% Product,<br /> 30% Engineering</td>
      </tr>
      <tr>
        <td>3. Plan</td>
        <td>Idea ⇄ Spec</td>
        <td>Thin slices of work, Experiments, Spikes, Tracer Bullets</td>
        <td>50% Product,<br /> 50% Engineering</td>
      </tr>
      <tr>
        <td>4. Develop</td>
        <td>Spec ⇄ Code</td>
        <td>TDD, Types, Compilation, REPL, AI Assisted Coding</td>
        <td>10% Product,<br /> 90% Engineering</td>
      </tr>
      <tr>
        <td>5. Integrate and release</td>
        <td>Code ⇄ Product</td>
        <td>Previews, Devboxes, Staging, Integration, Quality Analysis</td>
        <td>30% Product,<br /> 70% Engineering</td>
      </tr>
      <tr>
        <td>6. Operate</td>
        <td>Product ⇄ User</td>
        <td>Product Observability, Operator Consoles, Alerts</td>
        <td>50% Product,<br /> 50% Engineering</td>
      </tr>
    </tbody>
  </table>

</div>

<h3 id="fine-tuning-the-flywheel">Fine-Tuning the Flywheel</h3>

<ul>
  <li><strong>Get end-to-end product builders:</strong> You want teams that go together from phase 1 to 6, and then around again, to close the loop on their creation. Look for roles siloed in fewer phases, and work to involve them in all phases.</li>
  <li><strong>Get involved early:</strong> Phases 1 and 2 are the ideation phase, and the most important thing to do here is to listen, and understand the problem deeply. I wrote about this <a href="https://blog.nilenso.com/blog/2025/11/18/how-to-work-with-product-at-the-tea-table/">earlier in the series</a>. Building this phase of the flywheel for new products is cheap, especially with vibe coding. However, keeping experimentation costs low as the product matures can be challenging. Work to keep experiments cheap by using feature flags, or by maintaining experimental or forked versions of applications.</li>
  <li><strong>Get closer to the user:</strong> Phases 3, 4, and 5 make up the typical SDLC (software development lifecycle), and in my experience, engineering is less involved in phases 1, 2 and 6. This is unfortunate because phases 1, 2, and 6 interface with the user and house the most important feedback loops.</li>
  <li><strong>Planning &gt; Speed:</strong> Development (phase 4) is arguably the most expensive part of most tech companies. While there’s a lot of focus on making development <em>faster</em> to reduce costs, <em>reducing</em> work through <a href="https://blog.nilenso.com/blog/2025/09/17/the-common-sense-unit-of-work/#planning">planning</a> (phase 3) is far more effective. Break down problems, find thin slices of work to serve, and prioritise ruthlessly.</li>
  <li><strong>Close outer feedback loops:</strong> Phase 6 should close the loop on business <a href="https://blog.nilenso.com/blog/2025/11/21/how-to-work-with-product-towards-what-port-do-you-sail/">goals</a> through product observability, in addition to the local feedback loops of individual features or initiatives.</li>
</ul>

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <p><strong>Stronger flywheel ⇒ Immediate connection ⇒ Better product</strong></p>

</div>

<p>So, review your flywheel periodically. Lubricate the gears, and tighten the feedback loops. Ultimately, ensure that everyone on the team feels empowered to stop the line, take a spoonful, and say, <em>“Needs more salt.”</em></p>

<div style="display: flex; justify-content: center; align-items: center">

  <p><img src="/images/blog/soup-salt-pepper.png" alt="" /></p>

</div>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[How to work with Product: To What Port Do You Sail?]]></title>
    <link href="http://blog.nilenso.com/blog/2025/11/21/how-to-work-with-product-towards-what-port-do-you-sail/"/>
    
    <updated>2025-11-21T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/11/21/how-to-work-with-product-towards-what-port-do-you-sail</id>
    <summary type="html"><![CDATA[
  If one does not know to which port one is sailing, no wind is favorable.

  Letter 71, Moral Letters to Lucilius, Seneca

]]></summary>
    
    <content type="html"><![CDATA[<blockquote>
  <p>If one does not know to which port one is sailing, no wind is favorable.</p>

  <p><em><a href="https://en.wikisource.org/wiki/Moral_letters_to_Lucilius/Letter_71">Letter 71, Moral Letters to Lucilius, Seneca</a></em></p>
</blockquote>

<p>This might seem quite obvious, cliché even, but it’s surprisingly difficult to apply to life, or to building products. Making big decisions is hard. But then:</p>

<blockquote>
  <p>Hard choices, easy life. Easy choices, hard life.</p>

  <p><em>Jerzy Gregorek</em></p>
</blockquote>

<p>When leadership fails to make the hard choices, the entire team ends up stuck, treading water. Most leaders I know recognise this and try to make those calls. But the day-to-day challenges inside teams are more subtle, and often look like this: <em>“There are two objectives; both are equally important.”</em></p>

<p>If we try to focus on everything, we focus on nothing, as John Doerr puts it in <a href="https://www.whatmatters.com/">Measure What Matters</a>. In the following excerpts from my experience, I’ll illustrate how these issues show up in everyday work.</p>

<p><img src="/images/blog/gemini-generated-image.png" alt="towards-what-port" /></p>

<h3 id="know-when-youre-treading-water">Know when you’re treading water</h3>

<hr />

<p>At <a href="http://simple.org/">Simple.org</a>, one of the recurring debates in the early years was whether we should support patient screening. Screening meant that nurses would travel to villages and towns and check entire communities for cardiovascular issues. While our work was focused on hypertension, once the nurses were already in the field, it felt incomplete to ignore diabetes. Screening promised a better understanding of population health, while our existing product was designed around longitudinal care of patients. At the time, this seemed like a high-level product call that engineering could stay out of, because we figured that core features such as patient search and follow-up scheduling were needed no matter what… but the ambiguity kept creeping in. Patient search behaved predictably inside a clinic because it showed patients registered to that facility, but in the field during screening, we had no shared understanding of what results a search should return. Even entering a blood pressure reading created confusion because the clinic workflow called for follow-up reminders while screening did not. These questions came up repeatedly, each causing a hiccup or delay in progress. And they added up.</p>

<hr />

<p>A similar scenario unfolded when I worked on building <em>order-pooling</em> at Gojek, and it’s perhaps easier to see in two- and three-sided markets. <em>Pooling</em> is where one driver picks up multiple orders from the same restaurant and delivers them to customers who live near one another. Seems like a useful feature, but useful for whom? If the priority was higher utilisation for drivers, then we needed to focus on improving their income and increasing the number of orders they could complete each hour. Customer experience mattered, but only up to the point where it did not slow drivers down. If the priority was increased customer demand, questions of pricing, the ability to opt out of pooling, and the impact on wait times became far more important. We assumed we could make progress without choosing, because, on the surface, the feature looked generic enough to support both paths. But as we made decisions about pricing rules, batching behaviour, driver assignments, and customer communication, the lack of clarity surfaced again and again. Many small decisions depended on a choice we had not made, and a project that looked like three months of straightforward work grew into four.</p>

<hr />

<p>I once joined a team as a fixer, because the CTO believed that nothing meaningful had been delivered for almost a year. On my first day, I asked about the team’s objectives, and I received  vague, hand-wavy answers. The director of engineering said the PMs had never given them dashboards or metrics to work with. The PM, sitting in the same room, said they had plenty of dashboards but the engineers showed no interest in them. Product kept reporting metrics to leadership, and those metrics were flat. Engineering kept reporting completed initiatives, and there were plenty of them, but no one realised why the work was not improving the outcomes that mattered. The team also held several engineering-only meetings every week, intentionally excluding PMs because their presence was considered unnecessary for technical discussions. All I did was highlight this rift to leadership. And the biggest shift in morale came in a single meeting when leadership finally presented one clear north star metric. Half the ongoing initiatives were dropped on the spot because they did not support that goal. We also removed the long list of engineering-only meetings and replaced them with two focused weekly sessions that included product, engineering, design, and QA together. Once everyone saw the same goal and talked in the same room, the team began to move forward again.</p>

<hr />

<h3 id="tugging-on-the-mainsheet">Tugging on the mainsheet</h3>

<p>Asking junior developers or associate PMs simple questions about their work can reveal how clear the goals are, and how aligned everyone in the team is.</p>

<ul>
  <li><em>“Why are you working on feature X?”</em> Ideally, they would open up their task / unit-of-work description, point to the observability section, and navigate to the corresponding leading and lagging metrics on a live dashboard that everyone in the team uses. Quite often though, the answers are <em>“because the PM told me to”</em>, <em>“it’s the most important feature right now”</em>, or <em>“people are complaining about this”</em>. These are not <em>wrong</em> reasons per se, but they are not nearly as specific as we need to be.</li>
  <li><em>“What was the impact of the last initiative you delivered?”</em> This gets fewer good answers in my experience, despite impact being one of the most important and best-documented drivers of satisfaction with one’s work.</li>
  <li><em>“Is there something more important you should be working on?”</em> If the work in the team is transparently prioritised as per the objectives, you would get a straightforward answer reflecting that. In my experience, this often gets murky answers like <em>“Maybe feature Y is more important, but we stopped that work for some reason”</em>, or <em>“They say Z is more important, but X is what we need right now, and I’m not sure why”</em>.</li>
</ul>

<p>Ideally, every single initiative should move the metric, and we respond by reinforcing efforts or correcting course. To do that, every slice of work should incorporate building the observability for it. And for that, the product requirements should be clear about which metrics are expected to move, and why. It is engineering’s responsibility to seek that clarity out.</p>
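
<p>Here is a minimal sketch of what that can look like in code; the event name, fields, and <code>emit_event</code> helper are illustrative stand-ins for whatever analytics client you use:</p>

<pre><code class="language-python">import json
import time

def emit_event(name: str, **props) -> None:
    # Stand-in for your analytics/metrics client of choice.
    print(json.dumps({"event": name, "ts": time.time(), **props}))

def schedule_follow_up(patient_id: str, days: int) -> None:
    ...  # the actual feature work
    # The unit of work ships with the metric it is expected to move:
    emit_event("follow_up_scheduled", patient_id=patient_id, days=days)
</code></pre>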

<hr />

<h3 id="without-involvement-there-is-no-commitment">Without involvement, there is no commitment</h3>

<blockquote>
  <p>“Without involvement, there is no commitment. Mark it down, asterisk it, circle it, underline it. No involvement, no commitment.”</p>

  <p><em>Habit 2, Begin with the End in Mind, Stephen Covey</em></p>
</blockquote>

<p>This quote bears repeating in teams where product and engineering fail to work together. Without involvement from engineering in setting the goals and understanding the metrics, there is no commitment to creating the required observability and the feedback loops necessary to work toward them.</p>

<p>In my experience, engineering is seldom involved in creating OKRs, and it’s usually owned by product and business teams. This is antithetical to forging a deep relationship between the functions. If you’re a leader, use your influence to bring engineering to the table. If you don’t have the influence yet, build it by creating value around the objectives anyway. Understand them, challenge them, and help make them real. Comment on the OKR docs with your views, push to refine that metric you think is poorly defined, or negotiate on that goal you think is too steep. It’s hard, but you have to work your way up to the table.</p>

<h3 id="what-should-i-do-as-an-engineer">What should I do as an engineer?</h3>

<p>Work with product to get clarity on the port. Once you have it, as Seneca suggests, you must steer and watch the stars. It boils down to three simple things:</p>

<ol>
  <li>Demand clarity on goals, ideally in terms of metrics you can track, and participate in the conversations needed to achieve that clarity.</li>
  <li>Make product observability a first-class requirement with every <a href="https://blog.nilenso.com/blog/2025/09/17/the-common-sense-unit-of-work/">unit of work</a>.</li>
  <li>Establish balancing and reinforcing feedback loops in day-to-day work that tie back to those metrics.</li>
</ol>

<p><img src="/images/blog/product-tea-table-image-1-.png" alt="portless-boat" /></p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[How to work with Product: At the Tea Table]]></title>
    <link href="http://blog.nilenso.com/blog/2025/11/18/how-to-work-with-product-at-the-tea-table/"/>
    
    <updated>2025-11-18T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/11/18/how-to-work-with-product-at-the-tea-table</id>
    <summary type="html"><![CDATA[The most impactful and delightful work I’ve done as an engineer has been on projects where I had a great relationship with the product manager(s). The poorer the relationship, the poorer the impact and the experience.
]]></summary>
    
    <content type="html"><![CDATA[<p><em>The most impactful and delightful work I’ve done as an engineer has been on projects where I had a great relationship with the product manager(s). The poorer the relationship, the poorer the impact and the experience.</em></p>

<p><em>This relationship between people reflects the relationship between problem and solution. In real life, problems and solutions evolve together. Each one affects the other. Similarly, the people involved must grow a meaningful relationship together. They need to listen empathically, understand needs, comprehend each other’s writing, create artefacts together, communicate well, and have fun solving problems together.</em></p>

<p><em>Since I have predominantly played the engineering role, I’m writing to fellow engineers based on that experience. This series of posts details when and how engineering can work with product to create the synergy that powers wonderful products.</em></p>

<p><img src="/images/blog/product-tea-table-image.png" alt="product-engineering-tea-table" /></p>

<p>We reflect on our problems over tea. Or coffee. Or <a href="https://www.urbandictionary.com/define.php?term=sutta">sutta</a>, if you’re into that. It’s a casual conversation that’s not bound by the colder confines of an office or workspace, one where we’re calm, and set to truly listen and receive. We allow ourselves to be courageously vulnerable with our real thoughts and ideas. Reflection provides fertile ground for the germination of ideas. It’s in this mental space that people often understand, empathise, and find camaraderie. These are the flaps of the butterfly’s wings, or the rolls of the snowball we build as a product team.</p>

<p>When someone who knows the problem very well speaks about it, even casually, lean in. <a href="https://en.wikipedia.org/wiki/Rogerian_argument">Listen empathically</a>. Diagnose before you prescribe. Once you have a deep enough understanding of the problem, and of the needs of the people therein, you can start thinking about how your engineering skills can add value.</p>

<p>What follows are some autobiographical experiences. Read on, and be inspired! Yeah, no, I’m simply illustrating what I mean, drawing on my experiences. If you’d rather not read my life stories, feel free to skip ahead to the list at the end.</p>

<hr />

<p>It was in this kind of tea-table space that we found the courage to say <em>“screw it, let’s go <a href="https://www.simple.org/blog/offline-first-apps/">offline first</a>”</em>, in our initial efforts at <a href="http://simple.org">simple.org</a>. <a href="https://x.com/dburka">Daniel</a> had spent months in the field with nurses, speaking to true experts, and understanding the problems of treating hypertension in public health. He then expressed to us, the engineering team, a deep need for nurses in rural areas to be independent of internet connectivity. We understood that need, and decided it was worth a try with a “how hard can it be” attitude. Some six years later, we still see that as one of the most important decisions in making the product successful, and truly useful to thousands of health care workers and millions of patients.</p>

<hr />

<p>After a hearty lunch one day on the nilenso terrace, we were talking about how we needed better mechanisms to validate ideas with nurses in the field. <a href="https://x.com/_droov">Dhruv</a> said <em>“Ideally, I just want to be a fly on the wall, and observe the nurse use the app with real patients”</em>. Prototypes didn’t cut it, and user-research questions were based on possibilities that weren’t easy to communicate. I suggested building a separate app that we could use to <a href="https://www.youtube.com/watch?v=1FqOF9P0nEY">run experiments quickly</a> in the field. This would normally take months with a small team (it was 2018), but I had built prototype apps quickly before, and could confidently say <em>“Give me a couple of weeks; no one else needs to be involved and their work can continue”</em>. That was a small enough risk for the leadership to take, and it paid off. We iterated on scanning UUIDs of various densities using low-spec phones, various error-feedback mechanisms, and a custom number keypad for entering blood pressures, tried all of these out with nurses in the field, and then built them into the actual app.</p>

<hr />

<p>At <a href="https://gofood.co.id/en">GoFood</a>, while we were having dinner in Jakarta, I noticed <a href="https://www.linkedin.com/in/hareeshwar-g-86240712/">Hareesh</a> kept returning to what seemed like a small issue: customers accidentally placing orders and immediately cancelling them. It didn’t appear urgent, but it was quietly blocking a major release: restaurants that auto-accepted orders the moment they were placed. He never said this directly, but his fixation was the signal. I asked, <em>“How does Zomato do this?”</em>, and when we checked their app, we found a straightforward pattern: a 10-second confirmation timer and a clear message about non-cancellable orders. Because the app already had all the information needed to decide when to show that timer, I could confidently say, <em>“This can be an app-only change; we can ship it in a few days.”</em> The engineering work was trivial. Identifying that it was a blocker and needed a quick solve was the real contribution.</p>

<hr />

<p>While working at a logistics firm recently, we spent more than a month wrestling with the onboarding journeys for new partners. It was a legacy system where every bug fix spawned new ones, and understanding the possible solutions was even more painful than understanding the problems. Our competitors could onboard new partners in hours while we took days, and the decade-old architecture simply wouldn’t let the product evolve. One evening, after venting about all this with the head of product, I asked him, <em>“If we were to re-implement the entire onboarding flow from scratch, what would you do?”</em>, and it turned out he had been thinking the same thing but couldn’t voice it unprompted. It was too big and too costly a leap to suggest outright. That question unlocked the real conversation: by the end of it, we had enough product clarity and engineering rationale to justify a half-year rewrite.</p>

<hr />

<p>I’ve tried to distil my instincts from such experiences at the ideation phase into what I might tell a fellow engineer:</p>

<ul>
  <li><strong><a href="https://en.wikipedia.org/wiki/The_7_Habits_of_Highly_Effective_People#Seek_first_to_understand,_then_to_be_understood">Seek first to understand</a>.</strong> Listen to the users. Listen to subject matter experts. Listen to people who understand the problems and product very well. Pay attention to emotional cues and subtext. They often point to deeper problems than any ticket or spec reveals. Create casual and trust-heavy settings intentionally, where conversations can surface such problems.</li>
  <li><strong>Be involved early.</strong> The early conversations of ideation, direction setting, and strategising are pivotal in building products, and engineering insight is very valuable in these conversations. Usually engineering is only involved in later stages.</li>
  <li><strong>Find cheap ways to prove hypotheses.</strong> What’s the thinnest <a href="https://blog.nilenso.com/blog/2025/09/17/the-common-sense-unit-of-work/">slice of cake</a> we can offer customers? Can we scrounge up some analytics that can make the case? What’s the simplest experiment we can run? Should we build a PoC, or fire a <a href="https://wiki.c2.com/?TracerBullets">tracer bullet</a> that tells us how expensive this is going to be?</li>
  <li><strong>Don’t wait for product to propose solutions.</strong> You can introduce new solution spaces as engineers that product might not even know exist. Don’t be afraid to push boundaries to make meaningful progress.</li>
  <li><strong>Create space for unsaid ideas.</strong> PMs often have solutions or directions in mind, but don’t voice them because they feel too big, risky, or premature. As engineers, you can unlock strategies by asking questions that give them permission to speak openly about the real problem or the bold solution they’re thinking about.</li>
  <li><strong>Build feasibility intuition:</strong> It is your responsibility to bucket ideas along the spectrum from pipe dream to practical to trivial. If the product manager is informed about which things take hours, days, weeks, or months, they can make much more reasonable decisions than otherwise. If you don’t know enough to make that kind of call, learn, and build that capability. Be resourceful, have trustworthy sources, look at prior art or competitors, and then form your own opinion and share it with product.</li>
  <li><strong>Surface system constraints:</strong> Talk to product about things that are generally considered engineering territory, like missing abstractions, why integration with some third-party is slow and buggy, why adding another workflow is complex, etc. This transparency builds empathy and trust. When you surface these constraints, they can proactively reorient product strategy to account for them. Or perhaps give room to solve for those constraints.</li>
</ul>

<p><img src="/images/blog/germination.png" alt="" /></p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[A Short Lesson in Simpler Prompts]]></title>
    <link href="http://blog.nilenso.com/blog/2025/11/04/a-short-lesson-in-simpler-prompts/"/>
    
    <updated>2025-11-04T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/11/04/some-iterative-prompt-hacking-title</id>
    <summary type="html"><![CDATA[When building context-viewer, I used LLMs to analyse language semantics. I went from 300-word prompts that barely worked to 15-word prompts that worked quite well. I learned to work with LLMs instead of fighting them, and to balance AI with plain old engineering.
]]></summary>
    
    <content type="html"><![CDATA[<p>When building <a href="https://blog.nilenso.com/blog/2025/10/29/fight-context-rot-with-context-observability/">context-viewer</a>, I used LLMs to analyse language semantics. I went from 300-word prompts that barely worked to 15-word prompts that worked quite well. I learned to work with LLMs instead of fighting them, and to balance AI with plain old engineering.</p>

<p>The “secret sauce” is basically:</p>

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <ul>
    <li>Learn <a href="https://blog.nilenso.com/blog/2025/10/14/bitter-lesson-applied-ai/">what the model is good at</a>, and lean into its strengths</li>
    <li>Break down the problem, or mould it to fit the strengths</li>
    <li>Engineer around limitations of the model</li>
  </ul>
</div>

<p>The two main problems I had to solve were segmentation and categorisation.</p>

<h2 id="segmentation">Segmentation</h2>
<p>The problem here is to pull apart a single message, an assistant reply or a user prompt, into meaningful chunks: text and code blocks, instructions, files supplied as context, and so on. Below is a sketch of one way such a call can be wired up, followed by an example input message, which is a typical wall of text in the context window.</p>
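
<p>This sketch is illustrative rather than the exact prompt from the post; <code>call_llm</code> is a stand-in for your model API. It shows the “engineer around limitations” point from the list above: the model only points at boundaries, and the splitting itself stays deterministic, in code:</p>

<pre><code class="language-python">def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your model API of choice

def segment(message: str) -> list:
    lines = message.splitlines()
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines))
    # Short prompt, tiny output: the model marks boundaries, code does the rest.
    reply = call_llm(
        "At which line numbers do new sections start in this message? "
        "Reply with comma-separated numbers only.\n\n" + numbered
    )
    starts = sorted({0, *(int(n) for n in reply.split(",") if n.strip())})
    ends = starts[1:] + [len(lines)]
    return ["\n".join(lines[a:b]) for a, b in zip(starts, ends)]
</code></pre>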

<details div="">
  <summary>
Example `user` input message
</summary>

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Given the <span class="nt">&lt;product_requirements_document&gt;</span>, the <span class="nt">&lt;technical_specification_document&gt;</span>, and the <span class="nt">&lt;repository_structure&gt;</span>, identify what questions the user should know about the codebase when planning to work on this PRD.<span class="se">\n\n</span><span class="nt">&lt;product_requirements_document&gt;</span><span class="se">\n</span># Grand Central — MVP PRD<span class="se">\n\n</span><span class="gs">**DRI:**</span><span class="ni">&amp;#x20;</span><span class="se">\n</span><span class="gs">**Stakeholders:**</span> People Ops, Partners, Engineering, Design, Data<span class="se">\n</span><span class="gs">**Status:**</span> Draft v1<span class="se">\n</span><span class="gs">**Date:**</span><span class="se">\n\n</span>---<span class="se">\n\n</span>## 1) Problem statement<span class="se">\n\n</span>Partners and Execs cannot quickly answer basic workforce questions because employee data is scattered across spreadsheets, lacks a single source of truth, and has incomplete history of state changes. This creates slow decisions, duplicated work, and error-prone analysis.<span class="se">\n\n</span><span class="gs">**Evidence to collect pre-ship**</span><span class="se">\n\n</span><span class="ge">* Time spent per week answering roster questions.\n*</span> Number of duplicate or conflicting records in current Sheets.<span class="se">\n</span><span class="ge">* Top 5 recurring questions stakeholders cannot answer reliably today.\n\n---\n\n## 2) Who is the customer\n\n*</span> <span class="gs">**Primary:**</span> People Ops admin who maintains records and exports data.<span class="se">\n</span><span class="ge">* **Secondary:** Partners/Managers who need roster and trend views.\n*</span> <span class="gs">**Tertiary:**</span> Employees who look up colleague basics.<span class="se">\n\n</span><span class="gs">**Jobs-to-be-done**</span><span class="se">\n\n</span><span class="ge">* Maintain canonical person records with auditable history.\n*</span> Import legacy data once with minimal cleanup.<span class="se">\n</span><span class="ge">* Run simple trend analyses without a data analyst.\n\n---\n\n## 3) Goals and non-goals\n\n**Goals (MVP)**\n\n1. Create, edit, and view **person** records with append-only **changesets** and effective dates.\n2. **One-time bulk import** from Google Sheets with dedupe by work email.\n3. **Directory** with search and CSV export of the **current snapshot**.\n4. **Analytics v0:** monthly headcount, joiners vs leavers, level distribution over time; simple “level vs prior experience” view.\n5. **Security:** Google SSO, domain allowlist, audit log on writes.\n\n**Non-goals (MVP)**\n\n*</span> Payroll, performance reviews, external HRIS sync.<span class="se">\n</span><span class="ge">* Complex RBAC beyond Admin vs Viewer.\n*</span> Compensation reporting UI. 
<span class="ge">*Schema ready; UI deferred to v2.*</span><span class="se">\n\n</span><span class="gs">**Why now**</span><span class="se">\n\n</span><span class="ge">* Operational risk from spreadsheet fragmentation.\n*</span> Quick wins unlock broader analytics later.<span class="se">\n\n</span>---<span class="se">\n\n</span>## 5) Requirements<span class="se">\n\n</span>### 5.1 Functional<span class="se">\n\n</span><span class="ge">* **Auth:** Google SSO; domain allowlist; logout; session security.\n*</span> <span class="gs">**Directory:**</span> list with columns <span class="ge">*Name, Work email, Level, Status, Start date*</span>; search by name/email; link to person detail; CSV export of current snapshot.<span class="se">\n</span><span class="ge">* **Person detail:** core fields, plus **History** tab showing changesets in reverse chronological order; show who changed what and when.\n*</span> <span class="gs">**Create/Edit:**</span> forms capture effective dates; all edits append a changeset; current snapshot recomputed.<span class="se">\n</span><span class="ge">* **Analytics v0:**\n\n  *</span> <span class="ge">*Headcount trend*</span> by month (active count, joineive dates.<span class="se">\n</span>2. <span class="gs">**One-time bulk import**</span> from Google Sheets with dedupe by work email.<span class="se">\n</span>3. <span class="gs">**Directory**</span> with search and CSV export of the <span class="gs">**current snapshot**</span>.<span class="se">\n</span>4. <span class="gs">**Analytics v0:**</span> monthly headcount, joiners vs leavers, level distribution over time; si
mple “level vs prior experience” view.<span class="se">\n</span>5. <span class="gs">**Security:**</span> Google SSO, domain allowlist, audit log on writes.<span class="se">\n\n</span><span class="gs">**Non-goals (MVP)**</span><span class="se">\n\n</span><span class="ge">* Payroll, performance reviews, external HRIS sync.\n*</span> Complex RBAC beyond Admin vs Viewer.<span class="se">\n</span><span class="err">*</span> Compensation reporting U
I. <span class="ge">*Schema ready; UI deferred to v2.*</span><span class="se">\n\n</span><span class="gs">**Why now**</span><span class="se">\n\n</span><span class="ge">* Operational risk from spreadsheet fragmentation.\n*</span> Quick wins unlock broader analytics later.<span class="se">\n\n</span>---<span class="se">\n\n</span>## 5) Requirements<span class="se">\n\n</span>### 5.1 Functional<span class="se">\n\n</span><span class="ge">* **Auth:*</span><span class="err">*</span> Google SSO; domain allowlist; logo
ut; session security.<span class="se">\n</span><span class="ge">* **Directory:** list with columns *</span>Name, Work email, Level, Status, Start date<span class="ge">*; search by name/email; link to person detail; CSV export of current snapshot.\n*</span> <span class="gs">**Person detail:**</span> core fields, plus <span class="gs">**History**</span> tab showing changes
ets in reverse chronological order; show who changed what and when.<span class="se">\n</span><span class="ge">* **Create/Edit:** forms capture effective dates; all edits append a changeset; current snapshot recomputed.\n*</span> <span class="gs">**Analytics v0:**</span><span class="se">\n\n</span>  <span class="ge">* *</span>Headcount trend<span class="err">*</span> by month (active count, joine
rs, leavers).<span class="se">\n</span>  <span class="ge">* *</span>Level mix<span class="ge">* over time (banded junior/mid/senior).\n  *</span> <span class="ge">*Level vs prior experience*</span> scatter.<span class="se">\n</span><span class="ge">* **Import:** one-time CSV importer for legacy Sheets; idempotent; validation report; dedupe by work email; mapping guide.\n\n### 5.2 Data model (MVP)\n\n*</span> <span class="gs">**Person**</span>: id, firstName, lastName, workEmail (unique), phone?, role, level, status, startDate, endDate?, priorWorkExperienceYears.<span class="se">\n</span><span class="ge">* **Changeset**: id, personId, field(s) changed, newValue, effectiveDate, author, createdAt.\n*</span> <span class="gs">**Status enum**</span>: FULLTIME, CONTRACTOR, EXEC, PARTNER, RESIGNED.<span class="se">\n</span><span class="ge">* **Compensation (v2-ready)**: CompChange(personId, amount, currency, effectiveDate, notes).\n*</span> <span class="gs">**Snapshot rule**</span>: latest effective changes per field as of “now”.<span class="se">\n\n</span>### 5.4 UX principles<span class="se">\n\n</span><span class="ge">* Defaults fast data entry over perfect taxonomy.\n*</span> Make history obvious before saving edits.<span class="se">\n</span><span class="ge">* Show what changed, by whom, and when.\n\n---\n\n## 8) Risks and mitigations\n\n*</span> <span class="gs">**Import correctness**</span> → schema mapping guide, dry-run, row-level report.<span class="se">\n</span><span class="ge">* **Duplicate records** → unique email constraint; surface potential duplicates; merge flow later.\n*</span> <span class="gs">**Bad effective dates**</span> → inline validation; preview of resulting history.<span class="se">\n</span><span class="ge">* **OAuth misconfig** → automated env checks in CI; clear runbooks.\n\n---\n\n## 9) Acceptance tests (MVP)\n\n1. **Create person**: Authenticated user submits required fields → person appears in directory; audit entry created; event `person_created` emitted.\n2. **Edit with history**: Update level with effective date → new changeset stored; History tab shows entry; snapshot updated.\n3. **Import**: Run importer on validated CSV → ≥95% rows ingested; reconciliation report shows any rejects with reasons.\n4. **Export**: Click Export on directory → CSV downloads with one row per current person; header spec matches appendix.\n5. **Analytics**: Open Analytics → monthly headcount, joiners vs leavers, and level mix charts render from production data; “level vs experience” view loads.\n6. **Security**: Unauthenticated user → redirected to login; export requires Admin.\n\n---\n\n## 10) Open questions\n\n*</span> Exact mapping of legacy Sheets to entities and enums.<span class="se">\n</span><span class="ge">* Admin vs Viewer permissions beyond export.\n*</span> Compensation governance and who can view amounts in v2.<span class="se">\n</span><span class="ge">* Do managers need edit rights or view-only in v1?\n\n---\n\n## 11) Appendix\n\n**A. CSV header spec (current snapshot)**\n`firstName,lastName,workEmail,phone,role,level,status,startDate,endDate,priorWorkExperienceYears`\n\n**B. 
Glossary**\n\n*</span> <span class="gs">**Changeset**</span>: append-only record of a field change with an effective date.<span class="se">\n</span><span class="ge">* **Snapshot**: latest effective value per field at a point in time.\n*</span> <span class="gs">**Headcount**</span>: number of active employees in a period.<span class="se">\n</span><span class="ge">* **Joiners/Leavers**: counts of start/end effective events in a period.\n\n**C. Decision log**\n\n*</span> Compensation UI deferred to v2; schema included now.<span class="se">\n</span><span class="ge">* Unique workEmail enforced; no merge UI in v1.\n*</span> SQLite acceptable for MVP, to be revisited post-M6.<span class="se">\n\n\n</span><span class="nt">&lt;/product_requirements_document&gt;</span><span class="se">\n\n</span><span class="nt">&lt;technical_specification_document&gt;</span><span class="se">\n</span># Modelling<span class="se">\n\n</span>Describes how to model entities over time.<span class="se">\n\n</span>## Goals<span class="se">\n\n</span>Model an entity with a set of known fields which changes over time such that:<span class="se">\n\n</span>1. The time when a change occurred is tracked separately from when the change<span class="se">\n</span>   was recorded<span class="se">\n\n</span>2. The state of an entity (a view) can be queried for any point in time to<span class="se">\n</span>   fetch the values of fields in that entity at that point in time<span class="se">\n\n</span>3. A change that occurred in the past should affect all views of the entity<span class="se">\n</span>   that are requested after that point in time<span class="se">\n\n</span>4. The end user should not have to declare changes but instead just edits the<span class="se">\n</span>   entity as a whole at a given point in time<span class="se">\n\n</span>## CRDTs<span class="se">\n\n</span>Ref: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type<span class="se">\n\n</span>Typically used for resolving conflicts between information created from two different sources, we are able to use CRDTs for our usecase by treating a change made to an entity's history as a change from a disconnected user that's converging with other changes.<span class="se">\n\n</span>### LWW Element Set<span class="se">\n\n</span>Ref: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#LWW-Element-Set_(Last-Write-Wins-Element-Set)<span class="se">\n\n</span>An LWW Element Set is a well known CRDT. It consists of an 'add set', 'remove<span class="se">\n</span>set' and a time stamp for each element in it.<span class="se">\n\n</span>The timestamp will be the time at which the change occurred. The value is a map<span class="se">\n</span>with all the fields and corresponding values that are updated. In our case, the<span class="se">\n</span>remove set does not exist since nothing is meant to be removed, just edited. We<span class="se">\n</span>may mimic a remove by updating a field with a null value if necessary.<span class="se">\n\n</span>## Schema<span class="se">\n\n</span>The schema to store this in SQLite.<span class="se">\n\n</span><span class="sb">```sql\n\ncreate table changes(\n  -- This would be the personId in our case\n  entityId text,\n\n  -- Time when the event occurred\n  timestamp datetime,\n\n  -- json with fields that have changed and the corresponding values\n  -- eg. 
{'name': 'Name1', 'age': 22}, {'age': 29}\n  value text,\n);\n\n```</span><span class="se">\n\n</span>Here’s a concise tech spec for “Grand Central.” It captures current stack choices and the target architecture, flags risks, and proposes guardrails.<span class="se">\n\n</span># 1) Overview<span class="se">\n\n</span><span class="ge">* Purpose: Single source of truth for employee data. Visual analysis now. Workflow automation later.\n*</span> Users: Internal staff. Auth via Google.<span class="se">\n</span><span class="ge">* Non-goals: Public API, multi-tenant SaaS, heavy analytics.\n\n# 2) Core Requirements\n\n*</span> CRUD for people and changesets.<span class="se">\n</span><span class="ge">* Authenticated access only.\n*</span> Visual graphs of trends.<span class="se">\n</span><span class="ge">* Low-ops deploy. Small team ownership.\n*</span> SQLite durability with simple backups. Path to Postgres when needed.<span class="se">\n\n</span># 3) Tech Stack (current)<span class="se">\n\n</span><span class="ge">* Web framework: Remix (React + SSR). TypeScript everywhere.\n*</span> Styling: Tailwind.<span class="se">\n</span><span class="ge">* ORM: Prisma.\n*</span> DB: SQLite (file URL defaults to <span class="sb">`file:/data/sqlite.db`</span>).<span class="se">\n</span><span class="ge">* Auth: `remix-auth` + Google OAuth.\n*</span> Charts: D3 components.<span class="se">\n</span><span class="ge">* Testing: Vitest for unit, Cypress for E2E.\n*</span> Packaging: Docker multi-stage build. Separate Datasette image for read/admin.<span class="se">\n</span><span class="ge">* Runtime: Node process started by `start.sh` which runs `prisma migrate deploy` then `remix-serve`.\n*</span> Ops: Procfile targets (<span class="sb">`web`</span>, <span class="sb">`datasette`</span>). Fly volume mounted at <span class="sb">`/data`</span>. Healthcheck at <span class="sb">`/healthcheck`</span>.<span class="se">\n</span>  Sources: repo Dockerfiles, README, Fly config.<span class="ni">&amp;#x20;</span><span class="se">\n\n</span># 4) Architecture<span class="se">\n\n</span>## 4.1 Logical components<span class="se">\n\n</span><span class="ge">* **Remix app**\n\n  *</span> Server routes: SSR HTML, actions for mutations, loaders for reads.<span class="se">\n</span>  <span class="ge">* Session: Cookie session storage.\n  *</span> Auth: Google Strategy. 
On first login auto-creates user.<span class="se">\n</span>  <span class="ge">* Data access: Prisma client.\n  *</span> Graphs: D3 line and multiseries components rendered client-side.<span class="se">\n</span><span class="ge">* **SQLite**\n\n  *</span> Deployed at <span class="sb">`/data/sqlite.db`</span> with a persistent volume.<span class="se">\n</span><span class="ge">* **Datasette**\n\n  *</span> Read and exploratory UI at <span class="sb">`/datasette/`</span>.<span class="se">\n</span>  <span class="ge">* Insert bot capability gated by token for controlled writes.\n*</span> <span class="gs">**Reverse proxy**</span><span class="se">\n\n</span>  <span class="ge">* Local dev uses Caddy via `docker-compose.local.yml` to route to Node and Datasette.\n\n## 4.2 Runtime topology\n\n*</span> <span class="gs">**Fly.io app**</span><span class="se">\n\n</span>  <span class="ge">* Service ports: 80/443 -&gt; internal 8080.\n  *</span> Volume mount <span class="sb">`/data`</span> for DB durability.<span class="se">\n</span>  <span class="ge">* Concurrency limits tuned to \\~20–25 connections.\n*</span> <span class="gs">**Datasette sidecar**</span><span class="se">\n\n</span>  <span class="ge">* Runs as a separate process/container when enabled.\n  *</span> Base URL set to <span class="sb">`/datasette/`</span>.<span class="se">\n\n</span>## 4.3 Request flow<span class="se">\n\n</span>1. Browser → Remix route.<span class="se">\n</span>2. Loader checks session. If needed, Google OAuth redirects.<span class="se">\n</span>3. Loader/Action hits Prisma → SQLite.<span class="se">\n</span>4. Response rendered via SSR + client hydration.<span class="se">\n</span>5. Optional: Datasette used for admin/read operations.<span class="se">\n\n</span># 5) Configuration and Secrets<span class="se">\n\n</span><span class="ge">* Required env:\n\n  *</span> <span class="sb">`SESSION_SECRET`</span><span class="se">\n</span>  <span class="ge">* `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`\n  *</span> <span class="sb">`APP_URL`</span> (for OAuth callback)<span class="se">\n</span>  <span class="ge">* `DATABASE_URL` (defaults to `file:/data/sqlite.db`)\n  *</span> <span class="sb">`PORT`</span> (default 8080)<span class="se">\n</span>  <span class="ge">* `INSERT_TOKEN` for Datasette insert bot\n*</span> Secret handling: Store in Fly secrets or CI secrets. Never in repo. Rotate quarterly.<span class="se">\n\n</span># 6) Build and Deploy<span class="se">\n\n</span><span class="ge">* **Build**: Multi-stage Docker builds:\n\n  *</span> <span class="sb">`deps`</span> installs dev deps, <span class="sb">`production-deps`</span> prunes, <span class="sb">`build`</span> runs <span class="sb">`prisma generate`</span> and <span class="sb">`npm run build`</span>, final image copies <span class="sb">`/build`</span>, <span class="sb">`/public`</span>, <span class="sb">`.prisma`</span>.<span class="se">\n</span><span class="ge">* **Start**: `start.sh` applies migrations then serves.\n*</span> <span class="gs">**Datasette image**</span>: Minimal Python Alpine, installs <span class="sb">`datasette`</span>, <span class="sb">`datasette-insert`</span>, <span class="sb">`datasette-auth-tokens`</span>, serves <span class="sb">`/data/sqlite.db`</span>.<span class="se">\n</span><span class="ge">* **CI/CD**: GitHub Actions push images to registry and deploy. Branching model: `dev` → staging, `main` → production.\n  Note: README mentions DigitalOcean registry and droplet; repo also contains `fly.toml`. 
Standardize on one target (recommend Fly for volume + healthchecks).\n\n# 7) Data Model and Migrations\n\n*</span> ORM-driven schema via Prisma.<span class="se">\n</span><span class="ge">* Migration policy:\n\n  *</span> All schema changes via Prisma migrations committed to VCS.<span class="se">\n</span>  <span class="ge">* On boot, `prisma migrate deploy` runs. Fail fast if migration cannot be applied.\n*</span> Backups:<span class="se">\n\n</span>  <span class="ge">* Nightly snapshot of `/data/sqlite.db` to object storage.\n  *</span> Pre-migration snapshot in CI for production.<span class="se">\n\n</span># 8) Observability<span class="se">\n\n</span><span class="ge">* Logs: stdout structured JSON from app. Retain 14–30 days in platform logs.\n*</span> Health: <span class="sb">`/healthcheck`</span> HTTP 200.<span class="se">\n</span><span class="ge">* Metrics (proposed):\n\n  *</span> Basic RED metrics via runtime counters.<span class="se">\n</span>  <span class="ge">* Optionally expose `/metrics` for scrape.\n\n# 9) Security\n\n*</span> HTTPS enforced.<span class="se">\n</span><span class="ge">* Cookie `__session`: httpOnly, sameSite=lax, secure in production.\n*</span> OAuth scopes: minimal email identity.<span class="se">\n</span><span class="ge">* RBAC (future): roles for viewer, editor, admin.\n*</span> Datasette:<span class="se">\n\n</span>  <span class="ge">* Base URL scoped under `/datasette/`.\n  *</span> Token-gated inserts via <span class="sb">`datasette-auth-tokens`</span>.<span class="se">\n</span>  <span class="ge">* Default deny policy except for authenticated actor.\n\n# 10) Performance and Scale\n\n*</span> SQLite fits current workload. Single writer, low read concurrency.<span class="se">\n</span><span class="ge">* Concurrency guard at app level to avoid lock thrash.\n*</span> Thresholds to move to Postgres:<span class="se">\n\n</span>  <span class="ge">* &gt; 10 req/sec sustained writes.\n  *</span> Multi-region rollout.<span class="se">\n</span>  <span class="ge">* Complex reporting joins.\n*</span> Migration path:<span class="se">\n\n</span>  <span class="ge">* Replace `DATABASE_URL` to Postgres.\n  *</span> Run Prisma <span class="sb">`migrate deploy`</span> on Postgres.<span class="se">\n</span>  <span class="ge">* Cutover via maintenance window.\n\n# 11) Testing Strategy\n\n*</span> Unit: Vitest, <span class="sb">`happy-dom`</span> environment.<span class="se">\n</span><span class="err">*</span> E2E: Cypress with <span class="sb">`login()`</span> helper and <span class="sb">`cleanupUse
r()`</span> per test file.<span class="se">\n</span><span class="ge">* CI gates: typecheck, lint, unit, E2E against mocked services.\n\n# 12) Risks and Mitigations\n\n*</span> <span class="gs">**SQLite file lock contention**</span> → keep mutations small; queue bulk imports; consider Postgres if contention spikes.<span class="se">\n</span><span class="ge">* *</span><span class="err">*</span>Dual deploy
ment targets confusion<span class="ge">**</span> → choose Fly or DO. Remove unused manifests.<span class="se">\n</span><span class="ge">* **Secrets leakage** → enforce CI secret scanning; restrict Procfile usage in prod if not needed.\n\n# 13) Open Questions\n\n*</span> Is Datasette exposed in production or only behind VPN?
 If exposed, add auth proxy.<span class="se">\n</span><span class="ge">* Which registry is canonical: DOCR or Fly’s? Align CI/CD.\n*</span> Backup RTO/RPO targets?<span class="se">\n\n</span># 14) Acceptance Criteria<span class="se">\n\n</span><span class="ge">* Auth works with Google in staging and prod.\n*</span> App boots with <span class="sb">`prisma migrate deploy`</span> and serves on 808
0 behind TLS.<span class="se">\n</span><span class="ge">* `/datasette/` loads with token auth.\n*</span> E2E tests pass in CI on each PR to <span class="sb">`dev`</span> and <span class="sb">`main`</span>.<span class="se">\n\n</span>Sources: project Dockerfiles, Procfile, README, Fly config, package.json, Prisma bootstrap scripts.<span class="ni">&amp;#x20;</span><span class="se">\n\n\n</span>&lt;/technical_specification_doc
ument&gt;<span class="se">\n\n</span><span class="nt">&lt;repository_structure&gt;</span><span class="se">\n</span>.dockerignore<span class="se">\n</span>.env.example<span class="se">\n</span>.eslintrc.js<span class="se">\n</span>.github/workflows/deploy.yml<span class="se">\n</span>.gitignore<span class="se">\n</span>.gitpod.Dockerfile<span class="se">\n</span>.gitpod.yml<span class="se">\n</span>.npmrc<span class="se">\n</span>.prettierignore<span class="se">\n</span>Dockerfile<span class="se">\n</span>Dockerfile.datasette<span class="se">\n</span>Procfile<span class="se">\n</span>README.md<span class="se">\n</span>app/charts/line-chart
.ts<span class="se">\n</span>app/charts/multiseries-chart.ts<span class="se">\n</span>app/components/graph.tsx<span class="se">\n</span>app/components/person.tsx<span class="se">\n</span>app/db.server.ts<span class="se">\n</span>app/entry.client.tsx<span class="se">\n</span>app/entry.server.tsx<span class="se">\n</span>app/form-action.server.ts<span class="se">\n</span>app/form.tsx<span class="se">\n</span>app/models/graph.server.test.ts<span class="se">\n</span>app/models/graph.server.ts
<span class="se">\n</span>app/models/graph.ts<span class="se">\n</span>app/models/person.server.test.ts<span class="se">\n</span>app/models/person.server.ts<span class="se">\n</span>app/models/person.ts<span class="se">\n</span>app/models/user.server.ts<span class="se">\n</span>app/root.tsx<span class="se">\n</span>app/routes/auth/google.tsx<span class="se">\n</span>app/routes/auth/google/callback.tsx<span class="se">\n</span>app/routes/graphs/levels.tsx<span class="se">\n</span>app/route
s/graphs/people.tsx<span class="se">\n</span>app/routes/healthcheck.tsx<span class="se">\n</span>app/routes/index.tsx<span class="se">\n</span>app/routes/login.tsx<span class="se">\n</span>app/routes/logout.tsx<span class="se">\n</span>app/routes/people.tsx<span class="se">\n</span>app/routes/people/$id.tsx<span class="se">\n</span>app/routes/people/csv.tsx<span class="se">\n</span>app/routes/people/new.tsx<span class="se">\n</span>app/session.server.ts<span class="se">\n</span>app/utils.
test.ts<span class="se">\n</span>app/utils.ts<span class="se">\n</span>cypress.config.ts<span class="se">\n</span>cypress/.eslintrc.js<span class="se">\n</span>cypress/e2e/smoke.cy.ts<span class="se">\n</span>cypress/fixtures/example.json<span class="se">\n</span>cypress/support/commands.ts<span class="se">\n</span>cypress/support/create-user.ts<span class="se">\n</span>cypress/support/delete-user.ts<span class="se">\n</span>cypress/support/e2e.ts<span class="se">\n</span>cypress/tsconfig
.json<span class="se">\n</span>datasette.metadata.json<span class="se">\n</span>datasette/metadata.json<span class="se">\n</span>deploy/grand-central/Caddyfile<span class="se">\n</span>deploy/grand-central/Caddyfile.local<span class="se">\n</span>deploy/grand-central/docker-compose.caddy.yml<span class="se">\n</span>deploy/grand-central/docker-compose.local.yml<span class="se">\n</span>deploy/grand-central/docker-comp
ose.staging.yml<span class="se">\n</span>deploy/grand-central/docker-compose.yml<span class="se">\n</span>doc/modelling.md<span class="se">\n</span>fly.toml<span class="se">\n</span>mocks/README.md<span class="se">\n</span>mocks/index.js<span class="se">\n</span>package-lock.json<span class="se">\n</span>package.json<span class="se">\n</span>prisma/generated/zod/index.ts<span class="se">\n</span>prisma/migrations/20230120092252_init/migration.sql<span class="se">\n</span>prisma/migrations
/20230125174823_add_changeset/migration.sql<span class="se">\n</span>prisma/migrations/20230128042107_update_changeset/migration.sql<span class="se">\n</span>prisma/migrations/20230223201415_drop_person_table/migration.sql<span class="se">\n</span>prisma/migrations/20230407085725_drop_user_password/migration.sql<span class="se">\n</span>prisma/mig
rations/migration_lock.toml<span class="se">\n</span>prisma/schema.prisma<span class="se">\n</span>prisma/seed.ts<span class="se">\n</span>public/favicon.ico<span class="se">\n</span>public/grand-central.jpg<span class="se">\n</span>remix.config.js<span class="se">\n</span>remix.env.d.ts<span class="se">\n</span>scripts/import_changesets.ts<span class="se">\n</span>start.sh<span class="se">\n</span>tailwind.config.js<span class="se">\n</span>test/global-setup.ts<span class="se">\n</span>test/setup-test-env.ts<span class="se">\n</span>t
sconfig.json<span class="se">\n</span>vitest.config.ts<span class="se">\n</span><span class="nt">&lt;/repository_structure&gt;</span><span class="se">\n</span>

</code></pre></div>  </div>

</details>

<p>An AI conversation is an array of such <code class="language-plaintext highlighter-rouge">messages</code>, each with a <code class="language-plaintext highlighter-rouge">user</code>, <code class="language-plaintext highlighter-rouge">assistant</code>, or <code class="language-plaintext highlighter-rouge">tool</code> role, represented in JSON. I needed to preserve the semantic meaning while chunking, and the flow of text too; I couldn’t change the order of text, since that matters to an LLM. And since I was working with inputs that were prompts or skill markdown files, I could expect delimiters such as XML tags, Markdown headers, and the like.</p>
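<p>For illustration, here’s a minimal sketch of that shape. This is representative, not the exact log format; real logs carry more metadata per message:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// A representative sketch of a conversation log, not the exact format.
type Message = {
  role: "user" | "assistant" | "tool";
  content: string; // a long prompt, model output, or a tool result
};

const conversation: Message[] = [
  { role: "user", content: "&lt;task&gt;…&lt;/task&gt;&lt;sources&gt;…&lt;/sources&gt;" },
  { role: "assistant", content: "Reading the repository structure…" },
  { role: "tool", content: "{ \"files\": [\"prisma/schema.prisma\"] }" },
];
</code></pre></div></div>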

<p>I wanted to get a list of “splits”, so that I could replace a single message in-place with a list of smaller messages. So, I started with this prompt.</p>

<details>

  <summary>
<code>Initial prompt</code>
</summary>

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Given a structured JSON containing message parts, split any part that combines multiple distinct ideas into smaller, self-contained units. Each resulting unit must represent <span class="gs">**one classifiable concept or function**</span>, preserving all meaning, order, and structure. This prepares the data for hierarchical categorisation. Output <span class="gs">**only**</span> the complete replacements JSON object described.

Return <span class="gs">**only**</span> a single JSON object in this format:
{
  "replacements": [
    {
      "source_part_id": "42",
      "target_parts": [
        {
          "id": "42.1",
          "type": same_as_source_type",
          "text": "subpart content 1"
        },
        {
          "id": "42.2",
          "type": same_as_source_type",
          "text": "subpart content 2"
        }
      ]
    }
  ]
}

</code></pre></div>  </div>

</details>

<p>I pasted this prompt into a ChatGPT conversation, attached a messages.json, and started hacking away, trying to find a prompt that worked reasonably well. Here’s what I got:</p>

<details>
  <summary>
Sample response
</summary>

  <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"replacements"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"source_part_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"target_parts"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.1"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;product_requirements_document&gt;, identify questions the user should know about the PRD goals, customer roles, and scope."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.2"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;product_requirements_document&gt;, identify questions about the data model, including Person and Changeset entities, and how snapshots are computed over time."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.3"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;product_requirements_document&gt;, identify questions about analytics features such as headcount trends, joiners vs leavers, and level distribution over time."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.4"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;technical_specification_document&gt;, identify questions regarding the use of CRDTs, specifically LWW Element Sets, and how they handle edits and historical views."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.5"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;technical_specification_document&gt;, identify questions about the tech stack, including Remix, Prisma, SQLite, Fly.io deployment, and Datasette integration."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.6"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;technical_specification_document&gt;, identify questions about authentication via Google OAuth, session management, and secret configuration."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.7"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;technical_specification_document&gt;, identify questions about CI/CD pipelines, Docker builds, migration policies, and backup processes."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.8"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;repository_structure&gt;, identify which source files implement routes, models, and components related to people, graphs, and authentication."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.9"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;repository_structure&gt;, identify where database schema migrations, Prisma configuration, and data seeding occur."</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2.10"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"text"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Given the &lt;repository_structure&gt;, identify files relevant for end-to-end testing, including Cypress and Vitest configurations."</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">

</span></code></pre></div>  </div>
</details>

<p>The issues were:</p>

<ul>
  <li>It only returned a single replacement most of the time
    <ul>
      <li>I thought this was because I only had one replacement in the sample response. I added two replacements to the example response, and the issue reduced, but persisted.</li>
      <li>I asked ChatGPT why it only returned one replacement. It told me that I had said <strong>any</strong> part, not <strong>all</strong> parts. Silly me. I fixed that, but the issue persisted.</li>
    </ul>
  </li>
  <li>It didn’t return full message chunks; it would snip them with <code class="language-plaintext highlighter-rouge">…</code> in between, or write a summary that represented the chunk.</li>
  <li>It was very slow, taking &gt;30s in most cases. The reasoning seemed solid and took its time, but the result didn’t reflect that thought. It was slow even without reasoning.</li>
  <li>It tried to use code tools, which I didn’t want: they took too much time (I specifically wanted low latency), and I wanted a generic solution.</li>
  <li>I additionally instructed it to “output exact text spans”, added principles on why I wanted it that way, etc. Still no luck.</li>
</ul>

<p>There were a few other issues around the JSON structure, preserving additional fields, and so on. I also added a couple of guiding examples.
And at the end of these iterations, here’s the prompt I arrived at:</p>

<details>
  <summary>
<code>Detailed prompt with all the fixes</code>
</summary>

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gu">## **Task**</span>

Segment a structured JSON containing message parts into <span class="gs">**atomic semantic units**</span>.
Each resulting unit must represent a single, self-contained <span class="gs">**intent, fact, or function**</span>, suitable for hierarchical categorisation.
Preserve all original wording, order, and structure.
<span class="p">
---
</span>
<span class="gu">## **Segmentation Rules**</span>
<span class="p">
*</span> Each atomic unit should express <span class="gs">**one topic, question, instruction, or operation**</span>.
<span class="p">*</span> If a part contains multiple such elements, extract them as separate contiguous spans.
<span class="p">*</span> <span class="gs">**Do not paraphrase, omit, or reorder**</span> any text.
<span class="p">*</span> <span class="gs">**Preserve all tokens exactly**</span> as in the source.
<span class="p">*</span> Use existing natural boundaries such as XML/HTML tags, JSON objects, Markdown headers, list items, or paragraphs.
<span class="p">*</span> Code blocks, tool calls, and similar technical sections must remain whole.
<span class="p">*</span> Maintain the original hierarchy and <span class="sb">`type`</span> values.
<span class="p">
---
</span>
<span class="gu">## **Extraction Objective**</span>

Identify and extract spans that each convey a single semantic role.
Think of this as <span class="gs">**semantic segmentation for classification**</span>, not text rewriting.
Output exact text spans that can stand alone and be categorised unambiguously.
<span class="p">
---
</span>
<span class="gu">## **Output Format**</span>

Return only one JSON object in this format:

{
  "replacements": [
    {
      "source_part_id": "42",
      "target_parts": [
        {
          "id": "42.1",
          "type": "same_as_source_type",
          "text": "exact text span 1"
        },
        {
          "id": "42.2",
          "type": "same_as_source_type",
          "text": "exact text span 2"
        }
      ]
    },
    {
      "source_part_id": "84",
      "target_parts": [
        {
          "id": "84.1",
          "type": "same_as_source_type",
          "text": "exact text span 1"
        },
        {
          "id": "84.2",
          "type": "same_as_source_type",
          "text": "exact text span 2"
        }
      ]
    }
  ]
}

Each <span class="sb">`source_part_id`</span> corresponds to one original message part that was segmented.
Each <span class="sb">`target_part`</span> contains one extracted semantic unit, preserving order and meaning.
</code></pre></div>  </div>

</details>

<p>The next day, I woke up thinking: <em>“why is this so difficult, I thought LLMs are good at this stuff”</em>. So I tried the simplest possible prompt, to test the hypothesis that I was using LLMs wrong.</p>

<div class="wrap-code">

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>given the following text, tell me where all you would apply a break.
just give me a json array.

<span class="gu">## result</span>
[
  "<span class="nt">&lt;task&gt;</span>",
  "<span class="nt">&lt;sources&gt;</span>",
  "<span class="nt">&lt;project_requirements_document&gt;</span>",
  "<span class="nt">&lt;technical_specification_document&gt;</span>",
  "<span class="nt">&lt;repository_context&gt;</span>",
  "<span class="nt">&lt;breakdown&gt;</span>",
  "<span class="nt">&lt;reflection&gt;</span>"
]
</code></pre></div>  </div>
</div>

<p>Woo! The results were instant (down to 5s from 20s), and exactly what I expected. The JSON input was likely interfering with its ability to identify semantics, so I sent it only the text. And I didn’t need it to do the actual text-splitting; <code class="language-plaintext highlighter-rouge">string.split</code> could do that. I could also do this in parallel for all the messages that needed to be split. After some more tweaking of instructions, I got to this prompt.</p>

<div class="wrap-code">

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Given the following text, tell me where all you would apply a break. The purpose is semantic chunking in way that's suitable for categorisation. Only give me the top level sections to split the text into coherent topical chunks.

Return ONLY a valid JSON array of regexes with positive lookahead which I can use to run string split in javascript.

Example response format: ["(?=regex-of-section-1)", "(?=regex-of-section2)"]
</code></pre></div>  </div>
</div>

<p>And without even structured outputs, this worked every time, within a few seconds. No reasoning, and no coding tools.</p>
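<p>The splitting itself is then plain engineering. Here’s a minimal sketch of how the returned lookahead regexes can drive <code class="language-plaintext highlighter-rouge">String.prototype.split</code> (the function name is mine, not necessarily what context-viewer does internally):</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Split a message's text on the lookahead regexes returned by the model.
// Positive lookaheads are zero-width, so each delimiter stays at the
// start of its own chunk and no text is lost.
function splitMessage(text: string, lookaheads: string[]): string[] {
  const pattern = new RegExp(lookaheads.join("|"));
  return text.split(pattern).filter((chunk) =&gt; chunk.trim().length &gt; 0);
}

// splitMessage(text, ["(?=&lt;task&gt;)", "(?=&lt;sources&gt;)"])
// → ["…preamble…", "&lt;task&gt;…", "&lt;sources&gt;…"]
</code></pre></div></div>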

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <h3 id="summary">Summary</h3>

  <p>Before:</p>

  <ul>
    <li><strong>AI:</strong> One (overengineered) prompt to identify large messages, identify semantic chunks, and split up messages accordingly.</li>
  </ul>

  <p>After:</p>

  <ul>
    <li><strong>Engineering:</strong> Identify large messages</li>
    <li><strong>Engineering:</strong> Create one prompt per message</li>
    <li><strong>AI:</strong> Identify semantic chunks given plain text, return a JSON array of substrings / regexes</li>
    <li><strong>Engineering:</strong> Split up messages</li>
  </ul>
</div>

<h2 id="categorisation">Categorisation</h2>

<p>After breaking down messages into smaller chunks, I had to categorise them. So, in the same manner, I iterated on my prompt and inputs in ChatGPT until I found something that worked reasonably well.</p>

<p>Here’s a detailed description of my task, which became a prompt:</p>
<details>
  <summary>
<code> Initial Prompt </code>
</summary>

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gs">**Goal**</span>
Produce a <span class="gs">**hierarchical category map**</span> that shows how information is organized in a conversation. Each category aggregates related message parts, summaries, and structure, enabling visualization and navigation of context usage.

<span class="gs">**Instruction**</span>
Given a structured conversation where each message part has a unique <span class="sb">`message_part_id`</span>, build a JSON tree that groups the conversation into semantically coherent categories and subcategories.

Do <span class="gs">**not**</span> use code tools or programmatic parsing for this task. Use reasoning and language understanding only.

<span class="gu">### Your task</span>
<span class="p">1.</span> <span class="gs">**Identify major categories**</span> – infer the dominant conceptual or functional blocks from the conversation (for example: <span class="ge">*Checklist of questions*</span>, <span class="ge">*File reads*</span>, <span class="ge">*Reasoning*</span>, <span class="ge">*Decisions*</span>).
<span class="p">2.</span> <span class="gs">**Decompose recursively**</span> – create subcategories only where the material naturally divides into smaller, meaningful topics.
<span class="p">   -</span> Do <span class="gs">**not**</span> fix the number of levels; infer depth as needed.
<span class="p">3.</span> <span class="gs">**Assign message parts**</span> – tag each message part with exactly one category or subcategory that best represents its content, using its <span class="sb">`message_part_id`</span>.
<span class="p">4.</span> <span class="gs">**Summarize each category**</span> – every category node, including children, must contain:
<span class="p">   -</span> <span class="sb">`id`</span>: unique short identifier, preferably using dot notation to indicate hierarchy (for example: <span class="sb">`checklist`</span>, <span class="sb">`checklist.data_model`</span>, <span class="sb">`analysis.synthesis`</span>)
<span class="p">   -</span> <span class="sb">`name`</span>: concise label
<span class="p">   -</span> <span class="sb">`summary`</span>: one-sentence description of what this category covers
<span class="p">   -</span> <span class="sb">`message_parts`</span>: array of <span class="sb">`message_part_id`</span>s assigned directly to this category
<span class="p">   -</span> <span class="sb">`children`</span>: nested categories, if any
<span class="p">5.</span> <span class="gs">**Preserve domain terminology**</span> – derive category names from the conversation’s subject matter.
<span class="p">6.</span> <span class="gs">**Output**</span> – return a structured, machine-readable JSON array representing the hierarchy, ready for downstream parsing and visualization.
<span class="p">
---
</span>
<span class="gu">### Reflection</span>
Before returning the final JSON, perform the following validation steps:
<span class="p">
1.</span> <span class="gs">**Completeness check**</span> – ensure every <span class="sb">`message_part_id`</span> from the input appears in exactly one category.
<span class="p">2.</span> <span class="gs">**Representativeness check**</span> – verify that the categories and subcategories together capture the overall structure and intent of the conversation, aligned with the goal.
<span class="p">3.</span> <span class="gs">**Domain integrity check**</span> – confirm that terminology and phrasing reflect the conversation’s domain accurately, not abstract generalizations.
<span class="p">4.</span> <span class="gs">**Ground-level identification check**</span> – make sure ground-level material (for example: detailed lists, code, or data) is correctly placed in leaf categories.
<span class="p">5.</span> <span class="gs">**Empty-category check**</span> – remove or merge any category that has no assigned <span class="sb">`message_parts`</span> and no children with assigned parts. Categories without content are not useful.
<span class="p">6.</span> <span class="gs">**Final coherence review**</span> – confirm that summaries are accurate, hierarchy depth is sensible, and the map as a whole provides a faithful and navigable representation of the conversation.
<span class="p">
---
</span>
<span class="gu">### Output specification</span>
Return a <span class="gs">**JSON array**</span> of top-level categories.

[
  {
    "id": "checklist",
    "name": "Checklist of questions",
    "summary": "User’s structured audit checklist.",
    "message_parts": ["mp_12", "mp_13"],
    "children": [
      {
        "id": "checklist.data_model",
        "name": "Data model checks",
        "summary": "Questions about schema alignment.",
        "message_parts": ["mp_14", "mp_15"]
      },
      {
        "id": "checklist.security",
        "name": "Security and authentication checks",
        "summary": "Questions related to authentication and authorization.",
        "message_parts": ["mp_16"]
      }
    ]
  },
  {
    "id": "analysis.synthesis",
    "name": "Analysis and synthesis",
    "summary": "Assistant’s reasoning and conclusions.",
    "message_parts": ["mp_20", "mp_22"]
  },
  {
    "id": "files.reads",
    "name": "File readings and tool calls",
    "summary": "Assistant’s inspection of repository files using tools.",
    "message_parts": ["mp_30", "mp_31"]
  }
]
</code></pre></div>  </div>

</details>

<p>What didn’t work? That reflection list in the prompt is a good list of things that failed!</p>

<p>It wouldn’t return all message parts: some parts would remain without a category, and some categories would be created without their parts. Sometimes the message part ids were hallucinated. And of course, all of this took a long time.</p>

<p>Following the same method, I thought about what the model is actually good at, and went with this one-line prompt:</p>

<div class="wrap-code">

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>given this conversation, give me a list of categories that shows how information is organized. just give me a list of categories in a json array

<span class="gu">## Result</span>
[
  "Problem statement and goals",
  "Customer personas and jobs-to-be-done",
  "Functional and data requirements",
  "Technical architecture and stack",
  "Data modeling and schema design",
  "Authentication and security",
  "Directory, person detail, and history",
  "Import and export processes",
  "Analytics and reporting",
  "Audit logging and event tracking",
  "Testing and CI/CD",
  "Performance, scalability, and operations",
  "Open questions and risks",
  "Repository and code organization"
]
</code></pre></div>  </div>
</div>

<p>These weren’t the categories I wanted. I realised I didn’t want categories that represented the information in the text; I was looking for <em>components</em> of the conversation, because with context-viewer I was not looking to analyse the conversation itself, but <em>how</em> the conversation happened. So I changed that one part:</p>

<div class="wrap-code">

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>given this conversation, give me a list of components it comprises of. just give me a list in a json array

<span class="gu">## Result</span>
[
  "product_requirements_document",
  "technical_specification_document",
  "repository_structure",
  "repository_context",
  "task",
  "sources",
  "breakdown",
  "reflection"
]
</code></pre></div>  </div>

</div>

<p>That worked! While no output specification was strictly necessary, I tweaked the prompt a little as I experimented with some more input files.</p>

<div class="wrap-code">

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>given this conversation, give me a list of all its components
each component can be 3 to 4 words in length

just give me a list in a json array like this example:
["abc_document", "xyz_structure", "foo_context", "task", "sources", "breakdown", "reflection", "files_about_bar", "files_about_baz", "tool_calls_about_quix", "xyz blocks", "pqr list"]
</code></pre></div>  </div>

</div>

<p>And then I used another prompt to get the category mapping for each message part, which worked on the first try.</p>

<div class="wrap-code">

  <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>given this conversation and the list of components, give me a mapping of message part ids in the conversation, to a component from the list, for all the message parts

just give me a simple json object {id: component}

<span class="nt">&lt;conversation&gt;</span>${conversationJson}<span class="nt">&lt;/conversation&gt;</span>
<span class="nt">&lt;components&gt;</span>${componentsJson}<span class="nt">&lt;/components&gt;</span>

<span class="gu">## Result</span>
{
  "24.5": "product_requirements_document",
  "24.6": "technical_specification_document",
  "24.7": "repository_structure",
  "26": "assistant_reasoning_blocks",
  "27": "assistant_reasoning_blocks",
  "28": "assistant_reasoning_blocks",
  "29": "assistant_reasoning_blocks",
  ...
</code></pre></div>  </div>

</div>
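<p>The basic JSON merging mentioned in the summary below is then straightforward. A minimal sketch, with types and names of my own choosing rather than context-viewer’s internals:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Merge the { id: component } mapping back onto the message parts.
// Types and names here are illustrative, not context-viewer's internals.
type Part = { id: string; text: string };
type CategorisedPart = Part &amp; { component: string };

function mergeComponents(
  parts: Part[],
  mapping: Record&lt;string, string&gt;,
): CategorisedPart[] {
  return parts.map((part) =&gt; ({
    ...part,
    component: mapping[part.id] ?? "uncategorised",
  }));
}
</code></pre></div></div>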

<div style="border-left:4px solid #FF3D84; background:#fafafa; padding:0.75em 1em; border-radius:4px; margin:1em 0;">

  <h3 id="summary-1">Summary:</h3>

  <p>Before:</p>

  <ul>
    <li><strong>AI:</strong> A single prompt to identify hierarchical categories, assign categories to message parts, and return the final mapping.</li>
  </ul>

  <p>After:</p>

  <ul>
    <li><strong>AI:</strong> Identify components</li>
    <li><strong>AI:</strong> Assign components to message part ids</li>
    <li><strong>Engineering:</strong> Some basic JSON merging</li>
  </ul>

</div>

<hr />
<p>Overall, I’m glad that the prompts I needed in the end were tiny. I think that’s a signal that I’m using LLMs correctly. I just needed to break the problem down, let the model do what it’s good at, and augment it with some good old-fashioned engineering where needed.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Fight context rot with context observability]]></title>
    <link href="http://blog.nilenso.com/blog/2025/10/29/fight-context-rot-with-context-observability/"/>
    
    <updated>2025-10-29T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/10/29/fight-context-rot-with-context-observability</id>
    <summary type="html"><![CDATA[TL;DR: You can’t fix what you can’t see. I built a tool that pulls apart LLM context into meaningful components that you can see, measure, and then meaningfully engineer.
]]></summary>
    
    <content type="html"><![CDATA[<p><em>TL;DR: You can’t fix what you can’t see. I built a tool that pulls apart LLM context into meaningful components that you can see, measure, and then meaningfully engineer.</em></p>

<h2 id="the-need-for-context-observability">The need for context observability</h2>

<p>Context engineering is one of the most important aspects of AI Engineering.</p>

<ul>
  <li><a href="https://x.com/karpathy/status/1937902205765607626">Andrej Karpathy:</a> “In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window…”, and we need to “pack the context windows just right”.</li>
  <li><a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html">Drew Breunig:</a> “…context is not free. Every token in the context influences the model’s behavior, for better or worse.”, and “The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.”</li>
</ul>

<p>And yet, we don’t have the tools to really <em>see</em> the context, or pull it apart into <em>tangible</em> components we can analyse for relevance. That kind of observability of the context seems to be missing. If we want to fight the “Garbage In, Garbage Out” phenomenon, we should be able to observe the garbage in the context to prune it out.</p>

<p><a href="#on-existing-observability-tools">Existing observability tools</a> seem to be primarily focused on system metrics like latency, cost, error rates, or high-level agent tracing. They might show a flame-graph of requests and responses, but not what kind of content is crowding up the context window over time.</p>

<p>In his <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html">“How Long Contexts Fail”</a> and <a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html">“How to Fix Your Context”</a> posts, Drew Breunig details these failure mechanisms and methods to combat them.</p>

<p><img src="/images/blog/context-observability-image.png" alt="context-rot-problems" /></p>

<p>But we need to detect and diagnose these kinds of issues before we can solve for them. Hrm. If only we had a tool that dealt well with large unstructured text, could segment and classify it, and could also detect these issues.</p>

<h2 id="context-viewer">Context viewer</h2>

<p>So I built a tool to provide this kind of observability, and it is called <a href="http://github.com/nilenso/context-viewer">context-viewer</a> (I like pithy names). It is an open source tool I built over the last week. It doesn’t have a server component, and is designed to work directly in the browser for convenience. Feel free to fork, tweak, and raise PRs.</p>

<p>You drag-drop a conversation JSON log into it, and with some AI assistance, it pulls the conversation apart into tangible components and lets you see their growth over time. Here’s what it looks like for a run of <a href="https://github.com/nilenso/storymachine">StoryMachine</a>, which is currently a simple workflow.</p>

<p>For context, the initial prompt provides product requirements and a tech spec; the workflow then reads the repository to get a relevant summary, and breaks down the work into user stories in stages with some human feedback. There are more details in <a href="#why-i-built-this-and-how-it-has-helped">a section below</a>.</p>

<p>You can see a list of components, and how they change and grow over time. Recency matters in context, so just looking at space won’t do; the view has to span both time and space.</p>

<video src="/videos/components-view.mp4" controls="" preload="" autoplay="" loop="" muted=""></video>

<p>It visualises growth of components as counts and percentages of total token count.</p>

<video src="/videos/timeline-view.mp4" controls="" preload="" autoplay="" loop="" muted=""></video>

<p>And it also lets you filter, sort, and search through messages in a conversation however you’d like. Not all messages have the same prominence even at the same token count; user messages that direct the conversation carry more weight.</p>

<video src="/videos/conversation-view.mp4" controls="" preload="" autoplay="" loop="" muted=""></video>

<p>Here’s the Github link again, if you want to try it out: <a href="https://github.com/nilenso/context-viewer/">nilenso/context-viewer</a></p>

<h2 id="how-it-works">How it works</h2>

<p>It’s a fairly simple process with a few steps.</p>

<ol>
  <li><strong>Parse the conversation</strong>: This is currently one of the OpenAI formats of chat-completions / responses / conversations API logs. It’s easy enough to parse another format. You can also drop multiple conversations to process them in parallel. The interface has basic search, filtering, and sorting too.</li>
  <li><strong>Count the tokens:</strong> It uses <a href="https://github.com/dqbd/tiktoken">dqbd/tiktoken</a>’s WASM bindings to count tokens per message in parallel (see the sketch after this list).</li>
  <li><strong>Segment large messages:</strong> Messages over a certain threshold of tokens (which is 500, somewhat arbitrarily) get broken down into smaller pieces using AI. If there are multiple parts in a single prompt / message, the parts get broken down here. This is basically a dead-simple, single-prompt version of <a href="https://x.com/GregKamradt/status/1738276097471754735">semantic chunking</a>, with a lot of room to improve.</li>
  <li><strong>Find components:</strong>  Given the entire conversation, an AI call identifies the components, and another call assigns components to individual messages. You can also tweak the prompts right in the UI, and iterate on the components until you get to a classification you’re happy with.</li>
  <li><strong>Visualise:</strong> The components view follows <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">anthropic’s visualisation</a> of context. There’s a time-slider that lets one traverse through how context fills up, and there’s also a simple stacked bar graph that shows growth over time.</li>
  <li><strong>Synthesize:</strong> Given the generated data of component growth over time, and a summary of the conversation, an AI can do a pretty great job of telling us about good and bad growth patterns, redundancy, and relevance. It might then also propose reasonable solutions, since there’s sufficient light on the problem now.</li>
</ol>
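<p>To make the token-counting step concrete, here’s a sketch using <a href="https://github.com/dqbd/tiktoken">dqbd/tiktoken</a>. The <code class="language-plaintext highlighter-rouge">cl100k_base</code> encoding is my assumption here, not necessarily what the tool uses:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import { get_encoding } from "@dqbd/tiktoken";

// Count tokens per message. cl100k_base is an assumption; the right
// encoding depends on the model that produced the conversation.
function countTokens(messages: { content: string }[]): number[] {
  const enc = get_encoding("cl100k_base");
  try {
    return messages.map((m) =&gt; enc.encode(m.content).length);
  } finally {
    enc.free(); // WASM-backed encodings must be freed explicitly
  }
}
</code></pre></div></div>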

<p>The <a href="https://github.com/nilenso/context-viewer/blob/main/src/prompts.ts">prompts</a> are very short descriptions of the goals, targeted at what the simplest model is very good at. While I have found these default prompts to be good, they’re meant to be iterated on, to help us find the right segment sizes and the right dimension to split components along. Some might want much more granular components, some purposes need more domain-specific categories, and some might prefer hierarchical categorisation with the ability to zoom in and out of them. Being able to slice and dice the data in a way that fits the needs is essential in observability.</p>

<p>Analysing a ~13k-token conversation through all the above steps takes ~15s. A ~35k-token conversation takes ~40s. I’d say it scales linearly. I didn’t care too much about performance or cost with this tool. But if I find that this is worth building out into something more full-fledged and practical for large-scale usage, then I’ll probably redesign parts of it. Suggestions welcome!</p>

<h2 id="why-i-built-this-and-how-it-has-helped">Why I built this, and how it has helped</h2>

<p>I ran into this need pretty quickly when working on <a href="https://github.com/nilenso/storymachine">StoryMachine</a>. StoryMachine is a fairly simple workflow currently. It takes a PRD, a tech spec, reads the codebase, and generates stories with some user feedback. Still, it’s a fair number of turns in a long conversation. And more importantly, the conversation involves iterating over artefacts of stories, and I knew that every iteration of a story would be adding the whole story to the context again, leading to redundancy.</p>

<p>I wanted to see this happening, so that I could verify that this is indeed a problem, and a big enough problem to warrant an engineering solution. I also wanted this feedback loop for myself so I could prioritise my attention to truly important issues.</p>

<p>A few ways in which I immediately saw benefits:</p>

<ul>
  <li>I could “see” the conversation happen. A summary of what happened in the conversation (how many turns, tool calls, etc.) helped me quickly get a grip on what I was analysing.</li>
  <li>The components it detected were “project_specifications” and “product_specifications”. I had made a typo! And the LLM could have been looking at these as two different docs when they were actually the same doc. And I guess more importantly, this was duplicated context occupying about 13% of the context window.</li>
  <li>The questions used to generate the relevant repository context are not relevant after that activity, but they remain in context unnecessarily. The AI-generated analysis told me this, and I could see it visually too. This was only about 2% in terms of size. But then again, even a small amount of poison can be lethal depending on the type of poison; size isn’t everything.</li>
  <li>And by the end of iterating through 2 stories with user feedback, because the full stories were being regenerated at every turn, it added up to 41%. This validated my hypothesis, and allowed me to take it on as a real problem to solve.</li>
</ul>

<h2 id="about-using-ai-to-analyse">About using AI to analyse</h2>

<p>This is certainly a debatable design choice, so here are my thoughts around it:</p>

<ol>
  <li>It’s a bit annoying to use another AI, burning more tokens and money, to come up with something that could be a first-class concept in an AI SDK. If we segmented our prompts correctly, split them apart using XML tags with attributes identifying components, and also tagged all tool calls and tool responses, then we would have a lot of the data already organised for observability. Similar to emitting rich product events from applications. I think this is a good idea worth exploring. DSPy’s signatures are a layer of abstraction that’s well suited for this sort of annotation on the request side. However, the responses from LLMs also go into context and do need to be broken down, and we don’t really control those.</li>
  <li>The model I’ve been using so far is gpt-4o-mini, which is 2 generations old, but I use it because it is fast and produces good enough results. I bet one could use a gpt-oss model, or some other tiny local model, with similar results. That way this analysis could be fast, and fully private too.</li>
  <li>Semantic chunking and component identification can happen along any arbitrary dimension. It is essential to have a way to modify the process to better suit each conversation / product / workflow’s needs. Using simple AI prompts here is a nice method that permits this. context-viewer lets you change the prompts and re-run the analysis, so you can iteratively get the observability you need.</li>
  <li>At scale, some of this would not be feasible anymore, especially inside a browser tab. If one wanted to analyse 1000 conversations of a kind, running through tens of AI calls per conversation isn’t sustainable or even meaningful. We’d have to engineer around it. Perhaps one could use this kind of tool to explore and figure out how they want to analyse their conversations, and then engineer a custom data pipeline that scales, perhaps without AI. Semantic chunking methods that are not LLM-based could be one way to go.</li>
</ol>

<h2 id="on-existing-observability-tools">On existing observability tools</h2>

<p>The LLM observability space is large, with lots of players in the field like Braintrust, Helicone, Langfuse, Arize, WhyLabs, Fiddler, Evidently, etc. Their focus <a href="https://chatgpt.com/share/6900eed2-01e4-800a-9106-880e1e351144">seems to be largely on</a>:</p>

<ol>
  <li><strong>Logging and tracing:</strong> see input, output, spans, traces, latencies, flame graphs, timelines for agents, etc.</li>
  <li><strong>Token Usage &amp; Cost Monitoring:</strong> observability for optimising unit economics, cost per request, performance, efficiency.</li>
  <li><strong>Latency, Throughput &amp; Error Rate Monitoring:</strong> more traditional engineering systems observability features to help keep systems reliable.</li>
  <li><strong>Anomaly and outlier detection</strong>: To improve the effectiveness of above metrics</li>
  <li><strong>ML Observability:</strong> Data and concept drift detection in input, output, or features</li>
  <li><strong>Output Quality Evaluation:</strong> Broad relevance, quality, success scores</li>
</ol>

<p>Not that these aren’t important, but they don’t give the kind of feedback I’m looking for on my context. I find that almost all of these tools treat the context as a singular entity, and don’t really break it down for analysis.</p>

<p>Further, the kind of analysis I’m referring to here feels more domain- or product-specific: something I’d expect to find in a Mixpanel / Statsig dashboard, not in a Datadog dashboard. Even then, analysing these events is still not quite like the way product metrics are usually analysed.</p>

<p>LangChain’s <a href="https://blog.langchain.com/insights-agent-multiturn-evals-langsmith/">insights agent</a>, which came out last week, comes closest to what I’m looking for.</p>

<hr />

<h2 id="try-it-out">Try it out</h2>

<ul>
  <li>Go to <a href="https://github.com/nilenso/context-viewer/">nilenso/context-viewer</a>, and clone it</li>
  <li>Add AI API keys as environment variables, as the <code class="language-plaintext highlighter-rouge">README.md</code> says, then build and run it.</li>
  <li>
    <p>Export your context as a JSON file. Honestly, this isn’t as straightforward as I would like it to be. Currently I’ve built support for OpenAI formats:</p>

    <ul>
      <li>For the responses format you can use the <a href="https://platform.openai.com/docs/api-reference/responses/input-items">input-items</a> API (see the sketch after this list)</li>
      <li>For the conversations format you can use the <a href="https://platform.openai.com/docs/api-reference/conversations/list-items">list-items</a> API</li>
      <li>For chat-completions, I used an internal dashboard API: <a href="https://api.openai.com/v1/dashboard/chat/completions/chatcmpl-CTqZR1lcRYXtUTRJPVLRzAWKCZpCN"><code class="language-plaintext highlighter-rouge">https://api.openai.com/v1/dashboard/chat/completions/chatcmpl-</code></a><code class="language-plaintext highlighter-rouge">id</code></li>
      <li>For other formats, adding another parser is just <a href="https://github.com/nilenso/context-viewer/blob/main/docs/prompts.md#support-conversation-format">a prompt</a> away.</li>
    </ul>
  </li>
  <li>Drag-drop it into context-viewer in the browser</li>
</ul>
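<p>For the responses format, here’s a rough sketch of pulling input items with <code class="language-plaintext highlighter-rouge">fetch</code>; consult the linked API reference for exact parameters, and note that error handling and pagination are omitted:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Fetch the input items for a response and serialise them as JSON
// that can be dropped into context-viewer. Error handling omitted.
async function exportInputItems(responseId: string): Promise&lt;string&gt; {
  const res = await fetch(
    `https://api.openai.com/v1/responses/${responseId}/input_items`,
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
  );
  const body = await res.json();
  return JSON.stringify(body.data, null, 2); // write this to a .json file
}
</code></pre></div></div>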

<p>I had a lot of fun building this, and I plan to write about that experience. Meanwhile, here’s an end-to-end demo at 2x:</p>

<video src="/videos/full-cv-demo-2x.mp4" controls="" preload="" autoplay="" loop="" muted=""></video>

<p>I’d love to hear what you think. Join the discussion on <a href="https://news.ycombinator.com/item?id=45760437">Hacker News</a>! Contributions welcome through PRs. You can also tweet <a href="https://x.com/nilenso">@nilenso</a>, or email us at <a href="mailto:hello@nilenso.com">hello@nilenso.com</a> to reach us.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[Artisanal shims for the bitter lesson age]]></title>
    <link href="http://blog.nilenso.com/blog/2025/10/14/bitter-lesson-applied-ai/"/>
    
    <updated>2025-10-14T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/10/14/bitter-lesson-applied-ai</id>
    <summary type="html"><![CDATA[tldr? Your AI product must “price in” the knowledge of Sutton’s Bitter Lesson.
]]></summary>
    
    <content type="html"><![CDATA[<p><em>tldr? Your AI product must “price in” the knowledge of Sutton’s Bitter Lesson.</em></p>

<hr />

<p>Everyone is talking about Richard Sutton’s Bitter Lesson once again<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<blockquote>
  <p>The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most  effective, and by a large margin.</p>

  <p>Rich Sutton, <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a></p>

</blockquote>

<p>I highly recommend taking a detour to read Richard’s essay if you haven’t yet, before coming back to this page. It’s very short.</p>

<p>Here’s my observation: The bitter lesson applies to developers building and working with AI applications as well—and many have not yet digested the bitter lesson.</p>

<h2 id="how-not-to-code-with-ai">How not to code with AI</h2>

<p>I’ve observed a type of AI-maximalist programmer often found at vibe coding events, workshops and demos. Their setup often has a folder full of text files that describe “rules”, “modes”, “roles”, prompts, or subagents. It often looks like a dump of all possible individual actions a developer can take—PRD analyser, planner, user story writer, code reviewer, UAT tester, etc. These files are full of long instructions, with lots of “pleading” language (or threats), capitalisation and even step-by-step logic telling an LLM how it should think and act.</p>

<p>The fundamental error in the above methods is that they bake in assumptions of what a workflow should look like, and how the agent should operate. They meddle with the model’s behaviour. It is what Sutton would describe as a “human knowledge based” method.</p>

<p>Some of these tricks were necessary when the models were weaker and less agentic. Today, they can reason well and learn from feedback in the environment. Force-fitting a complex web of workflows and roles is potentially <a href="https://x.com/dbreunig/status/1965855381529436639">fighting against the model weights</a>.</p>

<p>The engineer that has digested the bitter lesson will instead <a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/">set up an environment that can provide feedback loops to the agent</a>. This setup is simpler and better accommodates frontier reasoning models that are scaled with reinforcement learning by getting out of their way.</p>

<h2 id="how-not-to-build-llm-wrappers">How not to build LLM wrappers</h2>

<p>I have observed engineers directly jump to complex workflows, indiscriminate application of prompting tricks and multiple agents with fixed roles when designing an LLM-integrated application. These add unnecessary complexity and should not be the default starting point. To better illustrate why, we can look at how coding agents have evolved over time.</p>

<p>The first generation of AI coding tools (Cursor, Sourcegraph Cody, Codeium<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>, Copilot) heavily relied on the chunk-and-embed paradigm, i.e., using a separate vector-embeddings storage layer that prefills retrieved chunks into the LLM’s context window. <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>Newer AI tools (Cline, Windsurf, Amp, Claude Code, Codex, OpenHands) eschew pre-filled retrievals in favour of agentic search—i.e., tell the AI how to invoke a search, and let it figure it out from there. How the search is performed is an implementation detail. This is a much simpler fundamental architecture. <sup id="fnref:details" role="doc-noteref"><a href="#fn:details" class="footnote" rel="footnote">4</a></sup></p>

<p>The latter approach better embodies the bitter-lesson. Do not bake in your human knowledge assumptions by prefilling items into the agent’s context window.</p>

<p>Reinforcement learning produces goal-seeking agents. Anyone who has digested the bitter lesson knows that more compute is being poured into these LLMs to make goal-seekers (they get a reward signal when they achieve their goal). Leverage this fact. As models get better at goal-seeking in general, they will get better inside applications that mirror this action → feedback loop.</p>
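<p>In code, the bitter-lessoned shape is little more than a loop: a goal goes in, actions come out, and raw feedback from the environment goes back in. A minimal sketch, where every name is an illustrative stand-in rather than any particular SDK:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Illustrative stand-ins, not a real SDK; the point is the shape of the loop.
type Action =
  | { kind: "tool"; name: string; args: unknown }
  | { kind: "done"; result: string };
type Event = { kind: string; text: string };

interface Model { next(history: Event[]): Promise&lt;Action&gt;; }
interface ToolBox { run(action: Action): Promise&lt;Event&gt;; }

// Give the model a goal and some tools, then get out of its way.
async function agentLoop(goal: string, llm: Model, tools: ToolBox): Promise&lt;string&gt; {
  const history: Event[] = [{ kind: "goal", text: goal }];
  while (true) {
    const action = await llm.next(history);   // the model decides the next action
    if (action.kind === "done") return action.result;
    const feedback = await tools.run(action); // the environment answers
    history.push({ kind: "action", text: JSON.stringify(action) }, feedback);
  }
}
</code></pre></div></div>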

<p>We can generalise this for most LLM-enabled applications.</p>

<p>Let’s contrast some human-knowledge-driven “artisanal” architectures against more “bitter-lessoned” architectures, which could represent two ends of a spectrum.</p>

<table>
  <thead>
    <tr>
      <th>Artisanal architecture</th>
      <th>Bitter-lessoned architectures</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Prescriptive workflows</td>
      <td>Take actions, respond to feedback in a loop</td>
    </tr>
    <tr>
      <td>Prefilling tokens into prompts</td>
      <td>Giving models an objective and some tools</td>
    </tr>
    <tr>
      <td>Stages and modes</td>
      <td>Modeless</td>
    </tr>
    <tr>
      <td>Chunk-and-embed</td>
      <td>Agentic search</td>
    </tr>
    <tr>
      <td>Makes assumptions about how a model should operate and think</td>
      <td>Sets up an environment and context that a model can verify itself against</td>
    </tr>
    <tr>
      <td>Imperative</td>
      <td>Declarative</td>
    </tr>
    <tr>
      <td>Specialised tool interfaces</td>
      <td>Code execution</td>
    </tr>
  </tbody>
</table>

<h2 id="signals-affirming-the-bitter-lesson-influencing-application-design">Signals affirming the bitter lesson influencing application design</h2>

<ul>
  <li>GPT-5 Codex resulted in <a href="https://github.com/openai/codex/blob/main/codex-rs/core/prompt.md">a new system prompt</a> in Codex CLI. It shrank from ~300 lines to ~100 lines.</li>
  <li><a href="https://github.com/marckrenn/cc-mvp-prompts/compare/v1.0.128...v2.0.0">The Claude Code prompt</a> also shrank. The multi-edit tool call was removed, further simplifying the program.</li>
  <li>Claude Code creators <a href="https://www.latent.space/i/163091105/memory-and-the-future-of-context">actively avoided prefilled context and chunk-and-embed</a>, directly citing the bitter lesson.</li>
  <li>Cloudflare introduced <a href="https://blog.cloudflare.com/code-mode/">code mode</a>, where it rewrites MCP tool interfaces into TypeScript interfaces, because the LLM is much more competent at writing and composing code. While providing tools is more bitter-lessoned than prefilling context windows, this goes a step further.</li>
</ul>

<h2 id="when-to-use-artisanal-architectures">When to use artisanal architectures</h2>

<p>This is not to say that artisanal architectures are bad. It’s that artisanal architectures must account for the bitter lesson.</p>

<p>When the model isn’t good at your task yet, but may get there eventually under the current scaling regime—design an artisanal architecture to build what is needed today, but do so with the knowledge that some day you may have to throw this functionality away—make the artisanal parts especially easy to remove when the bitter lesson inevitably strikes. <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">5</a></sup></p>

<p>A more permanently artisanal architecture also makes sense when your task does not require a repeated sequence of actions and deep thinking, for example, a classification task in a pipeline or a task to link similar address records.</p>

<h2 id="make-a-note-of-what-is-not-scaling-with-compute">Make a note of what is <em>not</em> scaling with compute</h2>

<p>With current scaling methods, verifiable tasks with clear goals will continue to improve: coding, searching, mathematics. Leave the methods of achieving the goal to the agent.</p>

<p>Current training methods have also not scaled context window sizes as reliably—so you might want to hold on to subagents and context-compaction tricks.</p>

<p>Training methods will also not solve for important parts that are gluing things together, like retries and reliable execution, or good interface design.</p>

<h2 id="summary">Summary</h2>

<ul>
  <li>Make a note of what aspects of models will improve with the current scaling regime.</li>
  <li>Account for the scaling in your AI application—avoid human-informed workflows in favour of action-infer-feedback loops.</li>
  <li>Artisanal architectures and methods may be necessary for many applications—but build them with the knowledge that you might have to throw them away when the next frontier model drops.</li>
  <li>Some things won’t scale with compute—most of your artisanal ingenuity may be required here.</li>
</ul>

<h2 id="similar-articles">Similar articles</h2>

<ul>
  <li><a href="https://rlancemartin.github.io/2025/07/30/bitter_lesson/">Learning the Bitter Lesson</a></li>
  <li><a href="https://vintagedata.org/blog/posts/model-is-the-product">The Model is the Product</a></li>
</ul>

<hr />

<p><em>Thanks to Srihari and Ravi Chandra Padmala for reviewing drafts of this.</em></p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Mostly thanks to the recent <a href="https://www.youtube.com/watch?v=21EYKqUsPfg">Dwarkesh Podcast</a>. I am aware that even though this whole article is going to be LLM-centric, Sutton himself does not believe LLMs are the most “bitter lesson-pilled” AI architecture. But I believe it’s fair to say that there’s a spectrum to the bitter lesson and LLMs are definitely less human-knowledge based than other generalist AI architectures. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Before Codeium became the more agentic Windsurf. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>There’s also Aider—while it does not use embeddings, it inserts a repomap into the LLM context, which makes it a form of prefilling. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:details" role="doc-endnote">
      <p>The devil of course is in the details and the elbow grease. And also the parts which actually do not improve even when AI scaling continues. While the fundamental architecture is <em>simple</em>, it’s not necessarily <em>easy</em> to nail down all the details. But artisanal architectures are neither simple nor easy. <a href="#fnref:details" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>In some instances, you might see a hop from one type of artisanal workflow to another, due to a model improving, but still not improving enough to remove the need for a human-knowledge-informed method, as was the case with <a href="https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges">Devin and Sonnet 4.5</a>. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[What are popular AI coding benchmarks actually measuring?]]></title>
    <link href="http://blog.nilenso.com/blog/2025/09/25/swe-benchmarks/"/>
    
    <updated>2025-09-25T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/09/25/swe-benchmarks</id>
    <summary type="html"><![CDATA[I dug into popular coding benchmarks while building StoryMachine, an experiment in breaking down software tasks into agent-executable units.
]]></summary>
    
    <content type="html"><![CDATA[<p>I dug into popular coding benchmarks while building <a href="https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/#the-storymachine-experiment">StoryMachine</a>, an experiment in breaking down software tasks into agent-executable units.</p>

<p>They measure something narrower than what their names suggest. In general, they are <a href="https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/#the-right-sized-unit-of-work-controls-the-propagation-of-errors">a lot less messy</a> than how we write software. This is why Claude scoring 80% on SWE-bench does not translate to Claude one-shotting 80% of the things I throw at it.</p>

<p>Let’s look at what these benchmarks are actually measuring<sup id="fnref:selection" role="doc-noteref"><a href="#fn:selection" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#swe-bench-verified-and-swe-bench-pro">SWE-bench Verified and SWE-bench Pro</a>
    <ul>
      <li><a href="#what-it-measures">What it measures</a></li>
      <li><a href="#the-specifics">The specifics</a></li>
      <li><a href="#verdict">Verdict</a></li>
    </ul>
  </li>
  <li><a href="#aider-polyglot">Aider Polyglot</a>
    <ul>
      <li><a href="#what-it-measures-1">What it measures</a></li>
      <li><a href="#the-specifics-1">The specifics</a></li>
      <li><a href="#verdict-1">Verdict</a></li>
    </ul>
  </li>
  <li><a href="#livecodebench">LiveCodeBench</a>
    <ul>
      <li><a href="#what-it-measures-2">What it measures</a></li>
      <li><a href="#the-specifics-2">The specifics</a></li>
      <li><a href="#verdict-2">Verdict</a></li>
    </ul>
  </li>
  <li><a href="#other-benchmarks">Other benchmarks</a></li>
  <li><a href="#benchmarking-is-hard-and-this-makes-me-bullish-on-coding-agents">Benchmarking is hard and this makes me bullish on coding agents</a></li>
  <li><a href="#references-and-further-reading">References and further reading</a></li>
</ul>

<h2 id="swe-bench-verified-and-swe-bench-pro">SWE-bench Verified and SWE-bench Pro</h2>

<h3 id="what-it-measures">What it measures</h3>

<p>How well a coding agent can submit a patch for a real-world GitHub issue that passes the unit tests for that issue.</p>
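
<p>Mechanically, the grading is simple. Here is a rough sketch of how a SWE-bench-style harness scores one instance (the <code>FAIL_TO_PASS</code>/<code>PASS_TO_PASS</code> field names follow the dataset; everything else simplifies the real dockerized harness):</p>

<div class="language-python highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
import subprocess

def resolved(instance, model_patch):
    # Apply the agent's patch inside the task's repository checkout.
    subprocess.run(["git", "apply", "-"], input=model_patch,
                   text=True, check=True, cwd=instance["repo_dir"])
    # Tests the patch must fix, plus existing tests it must not break.
    tests = instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"]
    result = subprocess.run(["python", "-m", "pytest", *tests],
                            cwd=instance["repo_dir"])
    return result.returncode == 0
</code></pre>
</div>
</div>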

<h3 id="the-specifics">The specifics</h3>

<p>There are many variants: Full, Verified, Lite, Bash-only, Multimodal. Most labs report on <strong>SWE-bench Verified</strong> in their charts, which is a cleaned and human-reviewed subset.</p>

<p>Notes and quirks of SWE-bench Verified:</p>

<ul>
  <li>It has 500 problems, all in Python. <a href="https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate">Over 40% are issues from the Django source repository</a>; the rest are libraries. Web applications are entirely missing. The repositories that the agents have to operate on are real, hefty open source projects.</li>
  <li>Solutions to these issues are small—think surgical edits or small function additions. The mean solution is 11 lines of code; the median is 4. Amazon found that <a href="https://arxiv.org/pdf/2504.08703">over 77.6% of the solutions touch only one function</a>.</li>
  <li>All the issues are from 2023 and earlier. This data was almost certainly in the training sets, so it’s hard to tell how much of the improvement is due to memorisation.</li>
</ul>

<p><img src="/images/blog/image.png" alt="METR Bar Chart showing SWE-Bench Verified repository distribution" /></p>

<p>Recently, Scale AI published an improved version called <strong>SWE-bench Pro</strong> that tries to address some quirks of Verified. Here are my notes:</p>

<ul>
  <li>1865 problems, from 41 repositories. It’s a mix of copyleft/GPL public repositories and some private repositories. I’m skeptical that choosing copyleft would meaningfully combat contamination, given AI labs <a href="https://www.bbc.com/news/articles/c5y4jpg922qo">are known to resort to piracy to train their models</a>. But this is still an improvement.</li>
  <li>The repositories are Python, Go, JS and TS—we don’t know the distribution, except that it is “not uniform”. They do ensure every repository contributes only 50-100 problem instances.</li>
  <li>They claim to sample repositories “from a diverse range of topics, including consumer applications with complex UI logic, B2B platforms with intricate business rules, and developer tools with sophisticated APIs”—much better than SWE-Bench Verified. That said, I could not find what this distribution looks like.</li>
  <li>The solutions have a mean of 107 lines of code, and a median of 55 lines of code, and usually span an average of 4 files. Also good.</li>
  <li>They actually got humans to rewrite problems based on issues, commits and PRs to ensure there’s no missing information. They also added “requirements [that] are grounded on the unit tests that are used for validation”. They may also add interface code for some problems.</li>
  <li>They also have dockerized environments set up with all the dependencies installed and configured, so the benchmark explicitly does not test whether your agent can set up the repository.</li>
</ul>

<details style="margin-top:1em; background-color:#EEE;">
  <summary><strong>An example problem statement from SWE-Bench Pro</strong></summary>

  <article style="padding:12px 16px;">
    <h3>Title: Email Validation Status Not Handled Correctly in ACP and Confirmation Logic</h3>

<p><strong>Description</strong></p>
<p>
  The Admin Control Panel (ACP) does not accurately reflect the email validation status of users.
  Also, validation and confirmation processes rely on key expiration, which can prevent correct
  verification if the keys expire. There’s no fallback to recover the email if it’s not found under
  the expected keys. This leads to failures when trying to validate or re-send confirmation emails.
</p>

<p><strong>Steps to reproduce</strong></p>
<ol>
  <li>Go to ACP → Manage Users.</li>
  <li>Create a user without confirming their email.</li>
  <li>Attempt to validate or resend confirmation via ACP after some time (allow keys to expire).</li>
  <li>Observe the UI display and backend behavior.</li>
</ol>

<p><strong>What is expected</strong></p>
<ul>
  <li>Accurate display of email status in ACP (validated, pending, expired, or missing).</li>
  <li>Email confirmation should remain valid until it explicitly expires.</li>
  <li>Validation actions should fallback to alternative sources to locate user emails.</li>
</ul>

<p><strong>What happened instead</strong></p>
<ul>
  <li>Expired confirmation keys prevented email validation.</li>
  <li>The email status was unclear or incorrect in ACP.</li>
  <li><q>Validate</q> and <q>Send validation email</q> actions failed when the expected data was missing.</li>
</ul>

<p><strong>Requirements</strong></p>
<ul>
  <li>
    The <code>loadUserInfo(callerUid, uids)</code> function should include logic to retrieve and attach
    <code>email:pending</code> and <code>email:expired</code> flags to each user object. These flags must be
    derived by resolving <code>confirm:byUid:&lt;uid&gt;</code> keys via the new <code>getConfirmObjs()</code>
    function and checking <code>expires</code> timestamps in corresponding <code>confirm:&lt;code&gt;</code> objects.
  </li>
  <li>
    The <code>getConfirmObjs()</code> helper within <code>loadUserInfo()</code> should fetch confirmation codes using
    <code>db.mget()</code> on <code>confirm:byUid:&lt;uid&gt;</code> keys, then retrieve the corresponding
    <code>confirm:&lt;code&gt;</code> objects using <code>db.getObjects()</code>. The mapping must ensure each user’s
    confirmation object is accurately indexed by position.
  </li>
  <li>
    Each database adapter MongoDB, PostgreSQL, and Redis, must implement a
    <code>db.mget(keys: string[]): Promise&lt;string[]&gt;</code> method in their respective <code>main.js</code>
    files. This method takes an array of keys and returns an array of corresponding string values.
  </li>
  <li>
    The <code>db.mget</code> implementation should return <code>null</code> for any keys not found. For Redis, use
    <code>client.mget</code>. For MongoDB, query the objects collection with <code>{ _key: { $in: keys } }</code>.
    For PostgreSQL, join <code>legacy_object_live</code> and <code>legacy_string</code> to retrieve values by key.
  </li>
  <li>
    All adapters must preserve input key order and explicitly return <code>null</code> for missing keys.
  </li>
  <li>
    <code>User.validateEmail</code> should retrieve the user’s email using
    <code>user.email.getEmailForValidation(uid)</code> before calling <code>user.email.confirmByUid(uid)</code>.
    If a valid email is found, save it with <code>user.setUserField(uid, 'email', email)</code>.
  </li>
  <li>
    <code>User.sendValidationEmail</code> must use <code>user.email.getEmailForValidation(uid)</code> and pass the
    email explicitly to <code>user.email.sendValidationEmail</code>.
  </li>
  <li>
    When a user account is deleted, invoke <code>User.email.expireValidation(uid)</code> to remove any pending email
    confirmation data.
  </li>
  <li>
    When generating a new confirmation entry <code>confirm:&lt;code&gt;</code>, store an
    <code>expires</code> field as a Unix timestamp in milliseconds in the confirmation object, not a DB-level TTL.
    This timestamp must be used for all future expiry checks.
  </li>
  <li>
    <code>User.email.getEmailForValidation(uid)</code> must first try <code>user:&lt;uid&gt;</code>. If no email is
    set, fallback to the email in <code>confirm:&lt;code&gt;</code> referenced by
    <code>confirm:byUid:&lt;uid&gt;</code>. Only return the email if the UID matches.
  </li>
  <li>
    <code>User.email.isValidationPending(uid, email)</code> must return <code>true</code> only if the confirmation
    object exists, the current time is before <code>expires</code>, and if provided, the email matches.
  </li>
  <li>
    In <code>User.email.canSendValidation(uid, email)</code>, compare the stored TTL timestamp if available
    (or current time if unavailable) plus the configured interval against the max confirmation period to prevent
    excessive resends.
  </li>
</ul>

<p><strong>New interfaces introduced</strong></p>

<p>
  <strong>Type:</strong> Method<br />
  <strong>Name:</strong> <code>db.mget</code><br />
  <strong>Path:</strong> <code>src/database/mongo/main.js</code>, <code>src/database/postgres/main.js</code>, <code>src/database/redis/main.js</code><br />
  <strong>Input:</strong> <code>keys: string[]</code><br />
  <strong>Output:</strong> <code>Promise&lt;(string | null)[]&gt;</code><br />
  <strong>Description:</strong> A batch retrieval method on the database abstraction layer.
</p>

<p>
  <strong>Type:</strong> Function<br />
  <strong>Name:</strong> <code>user.email.getEmailForValidation</code><br />
  <strong>Path:</strong> <code>src/user/email.js</code><br />
  <strong>Input:</strong> <code>uid: number</code><br />
  <strong>Output:</strong> <code>Promise&lt;string | null&gt;</code><br />
  <strong>Description:</strong> Returns the most appropriate email for admin actions like force validate or resend.
</p></article>
</details>

<h3 id="verdict">Verdict</h3>

<p>Overall, I think SWE-bench is a good, if still very flawed, benchmark (most other benchmarks are a lot worse). I also think SWE-bench Pro addresses some severe problems with Verified (which at this point should just be ignored in any frontier model report). I’ll note that there’s significant drift between what this measures and how I actually work with AI coding agents.</p>

<p>SWE-Bench is measuring how well AI performs on well-defined units of work. So when we say that an agent scores 25% in SWE-bench Pro, we are saying: “In a problem set of well-defined issues with pointed requirements and (the occasional) specification of code interfaces, 25% of the solutions from the agent get the respective problem’s unit test cases to pass”.</p>

<p>This is a useful measurement of progress. But this is not SWE as I understand it—most of the high-leverage parts are in working with product owners to come up with a good specification, translating it into useful interfaces, and then writing secure, maintainable code. With this benchmark we have no idea if the code is maintainable, secure, provably correct, or well-crafted—we just know that the unit test cases for it will pass<sup id="fnref:utboost" role="doc-noteref"><a href="#fn:utboost" class="footnote" rel="footnote">2</a></sup>.</p>

<h2 id="aider-polyglot">Aider Polyglot</h2>

<h3 id="what-it-measures-1">What it measures</h3>

<p>If the coding agent (specifically, Aider) can solve hard-level Exercism problems and apply file edits that pass unit tests after at most one round of feedback.</p>

<h3 id="the-specifics-1">The specifics</h3>

<p>Exercism is a learning platform with “kata-style” programming exercises. It’s not as algorithmic as LeetCode, but still pretty contained.</p>

<details style="margin-bottom:1em; background-color:#EEE;">
  <summary><strong>Example input for Aider Polyglot</strong></summary>
  <article style="padding:12px 16px;">
  <p><strong>Instructions</strong></p>
  <p>
    Your task is to implement bank accounts supporting opening/closing, withdrawals, and deposits of money.
  </p>
  <p>
    As bank accounts can be accessed in many different ways
    (internet, mobile phones, automatic charges), your bank software must
    allow accounts to be safely accessed from multiple threads/processes
    (terminology depends on your programming language) in parallel.
    For example, there may be many deposits and withdrawals occurring in
    parallel; you need to ensure there are no
    <a href="https://en.wikipedia.org/wiki/Race_condition#In_software" target="_blank" rel="noopener">race conditions</a>
    between when you read the account balance and set the new balance.
  </p>
  <p>It should be possible to close an account; operations against a closed account must fail.</p>

  <p><strong>Starting point file: <code>src/main/java/BankAccount.java</code></strong></p>
<div class="language-java highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
class BankAccount {
    void open() throws BankAccountActionInvalidException {
        throw new UnsupportedOperationException("Delete this statement and write your own implementation.");
    }


    void close() throws BankAccountActionInvalidException {
        throw new UnsupportedOperationException("Delete this statement and write your own implementation.");
    }

    synchronized int getBalance() throws BankAccountActionInvalidException {
        throw new UnsupportedOperationException("Delete this statement and write your own implementation.");
    }

    synchronized void deposit(int amount) throws BankAccountActionInvalidException {
        throw new UnsupportedOperationException("Delete this statement and write your own implementation.");
    }

    synchronized void withdraw(int amount) throws BankAccountActionInvalidException {
        throw new UnsupportedOperationException("Delete this statement and write your own implementation.");
    }
}
</code></pre>
</div>

</div>
</article>
</details>

<p>It seems to have far more language diversity than most other popular benchmarks: C++, Java, Go, Python, JavaScript and Rust are covered. The more functional programming languages are still unrepresented.</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Problems</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C++</td>
      <td>26</td>
    </tr>
    <tr>
      <td>Go</td>
      <td>39</td>
    </tr>
    <tr>
      <td>Java</td>
      <td>47</td>
    </tr>
    <tr>
      <td>JavaScript</td>
      <td>49</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>34</td>
    </tr>
    <tr>
      <td>Rust</td>
      <td>30</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>225</strong></td>
    </tr>
  </tbody>
</table>

<p>Glancing at individual examples, it seems to me that most solutions are in the range of 30-200 lines of code and span at most 2 files.</p>

<p>Like SWE-bench, the evaluation is based on how many unit test cases pass. Everything runs on the Aider harness and prompts.</p>
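
<p>That “at most one round of feedback” protocol is small enough to sketch (with <code>ask_model</code> and <code>run_tests</code> as hypothetical stand-ins for the Aider harness and the per-language test runner):</p>

<div class="language-python highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
def solve(problem, ask_model, run_tests):
    # First attempt: instructions plus the starter files.
    solution = ask_model(problem.instructions, problem.starter_files)
    ok, test_output = run_tests(solution)
    if ok:
        return True
    # Second and final attempt: the failing test output is the only feedback.
    solution = ask_model(problem.instructions, solution, feedback=test_output)
    ok, _ = run_tests(solution)
    return ok
</code></pre>
</div>
</div>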

<h3 id="verdict-1">Verdict</h3>

<p>This is a benchmark that will tell you how good a model is at solving small, tight and well-defined problems. It’s a good measure to check how well a model will perform on Aider across a range of languages. But this is nowhere near a benchmark for SWE (nor does it claim to be). Like SWE-bench, it also only checks unit test case pass rate, which does not account for many aspects of correctness and software quality.</p>

<h2 id="livecodebench">LiveCodeBench</h2>

<h3 id="what-it-measures-2">What it measures</h3>

<p>Python competitive-programming skills under hidden test suites with a rolling, “fresh” problem set. Think LeetCode.</p>

<h3 id="the-specifics-2">The specifics</h3>

<ul>
  <li>
    <p>It consists of the following tasks:</p>

    <ul>
      <li>Generate solutions to competitive coding problems from scratch.</li>
      <li>Fix incorrect solutions to coding problems.</li>
      <li>Predict the output of a function given the code (weird but why not!)</li>
      <li>Given only the problem statement and test input, predict the output (okay?)</li>
    </ul>
  </li>
  <li>Everything is Python.</li>
  <li>There’s a fairly even balance between Easy, Medium and Hard problems.</li>
  <li>The evaluations are just like LeetCode: run hidden test cases.</li>
  <li>
    <p>We know there’s little contamination—only problems released after each model’s cutoff date are evaluated.</p>

    <ul>
      <li>That said, because it’s LeetCode style, lots of problems will look quite similar to each other.</li>
    </ul>
  </li>
</ul>

<h3 id="verdict-2">Verdict</h3>

<p>This isn’t a SWE benchmark. It will tell you how good a model is at solving LeetCode-style Python problems, along with a mix of slightly unusual skills like “mental execution” of code and test-case output prediction.</p>

<h2 id="other-benchmarks">Other benchmarks</h2>

<p><strong>TerminalBench</strong>: This is interesting because it exclusively focuses on terminal use. So SWE-Bench paired with TerminalBench will give a broader picture of SWE-like capabilities.</p>

<p><strong>SWE-Lancer</strong>: OpenAI released this earlier in the year, and I thought it was neat because it directly maps the work to economic value by getting the agents to work on Expensify/Upwork tasks. The validation comes from E2E tests rather than unit tests. Unfortunately, their reporting of this benchmark has been quite low-key since, and nobody is running this benchmark on non-OpenAI models anymore (which I’m curious to see, especially since Claude Sonnet 3.5 outshone the o1 model back in the day). There are other flaws with this benchmark, but the framing seems to be in the right direction.</p>

<p><strong>METR’s Long Horizon Benchmark</strong>: This was an interesting framing, as it considered the time horizons of LLMs working autonomously. They also have a detailed rubric for the “messiness” of a task. I have talked about this benchmark in my article about managing units of work for AI agents.</p>

<p><strong>Multi-SWE-Bench</strong>: ByteDance made a polyglot benchmark that works similarly to SWE-Bench and spans seven languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++.</p>

<p><strong>SWE-Bench Multilingual</strong>: Another polyglot benchmark that spans nine languages, compatible with the SWE-Bench harness. Has data from popular C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby and Rust repositories.</p>

<p><strong>HumanEval</strong> (and its variants): An old coding benchmark that should be totally ignored today. The tasks seem to require implementing extremely easy Python toy functions.</p>

<h2 id="benchmarking-is-hard-and-this-makes-me-bullish-on-coding-agents">Benchmarking is hard and this makes me bullish on coding agents</h2>

<p>A large lesson I took away from studying the specifics of popular benchmarks is that designing a good benchmark is highly labour-intensive. Without human review and annotations, it’s nearly impossible to make a good benchmark. The more sophisticated the benchmark gets, the more it seems to require human intervention to ensure that the tasks are high-quality and not nonsensical or impossible.</p>

<p>And then there are the actual evaluation methods. The way to scale up evaluations is to have automated verification across all tasks, so it’s not surprising that most evaluations boil down to “make the unit tests pass”. But this will always fall short of benchmarking what I consider the core work of an SWE—translating a problem into structured, verifiable solutions. There is subjectivity and fuzzy judgement involved in satisfying business needs and timelines, making the right architectural tradeoffs, and ensuring the solution stays good over a long time horizon.</p>

<p>Considering how state-of-the-art benchmarks fall woefully short of capturing the nuance and messiness of SWE work, the coding agents we have are fantastic. One could imagine how much better they would get with benchmarks (and RL environments) that capture more of that work. I’m not sure how we’d solve for the more subjective parts, but until recently we didn’t even have good polyglot benchmarks. There’s still a lot of low-hanging fruit. This suggests to me that we won’t hit any kind of wall for coding abilities in the near future.</p>

<p>Even off the top of my head, I can think of a bunch of ideas for what could be improved in benchmarks:</p>

<ul>
  <li>Validate using generative testing methods, such as property-based testing (PBT) or fuzz testing, instead of fixed unit tests (see the sketch after this list).</li>
  <li>Use formal methods to check for correctness, where possible.</li>
  <li>Validate answers against automated User Acceptance Criteria checks, where possible.</li>
  <li>Start with product-level documents as the input for the benchmark, such as business context, PRDs and technical specifications. Validate against automated UATs and end-to-end tests.</li>
  <li>Create a benchmark that accounts for the information acquisition and clarification that real SWEs have to do—I could imagine having a benchmark that intentionally does not give all the necessary information up front, and it’s up to the agent to present the required clarifications or search for the necessary context.</li>
  <li>Use well-calibrated human judges to score on the fuzzier criteria of quality (this seems quite hard to do right, and “well-calibrated” is doing a lot of heavy lifting in that sentence).</li>
</ul>
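
<p>For the first idea, a minimal sketch of property-based validation using the <code>hypothesis</code> library (<code>solution_sort</code> is a stand-in for agent-submitted code):</p>

<div class="language-python highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
from collections import Counter
from hypothesis import given, strategies as st

def solution_sort(xs):
    # Stand-in for the code an agent submitted.
    return sorted(xs)

@given(st.lists(st.integers()))
def test_sort_properties(xs):
    out = solution_sort(xs)
    # These properties must hold for *any* generated input:
    assert Counter(out) == Counter(xs)                  # a permutation of the input
    assert all(a &lt;= b for a, b in zip(out, out[1:]))  # in non-decreasing order
</code></pre>
</div>
</div>

<p>Unlike a handful of fixed cases, properties like these are much harder to satisfy by overfitting to known test inputs.</p>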

<h2 id="references-and-further-reading">References and further reading</h2>

<p><strong>SWE-bench / Verified</strong></p>

<ul>
  <li><a href="https://huggingface.co/datasets/princeton-nlp/SWE-bench">Dataset</a></li>
  <li><a href="https://www.swebench.com/leaderboard">Leaderboard</a></li>
  <li><a href="https://arxiv.org/pdf/2504.08703">AWS SWE-bench analysis</a></li>
  <li><a href="https://arxiv.org/pdf/2306.03093">Original paper</a></li>
</ul>

<p><strong>SWE-bench Pro</strong></p>

<ul>
  <li><a href="https://scale.com/research/swe_bench_pro">Scale AI blog</a></li>
  <li><a href="https://arxiv.org/pdf/2507.02825">Paper</a></li>
  <li><a href="https://swe-bench-pro.github.io">Leaderboard</a></li>
</ul>

<p><strong>Aider Polyglot</strong></p>

<ul>
  <li><a href="https://aider.chat/2024/12/21/polyglot.html">Aider blog post</a></li>
</ul>

<p><strong>LiveCodeBench</strong></p>

<ul>
  <li><a href="http://livecodebench.github.io/index.html">Project page</a></li>
</ul>

<p><strong>TerminalBench</strong></p>

<ul>
  <li><a href="https://www.tbench.ai/">Official site</a></li>
  <li><a href="https://docs.tbench.ai">Docs</a></li>
</ul>

<p><strong>SWE-Lancer</strong></p>

<ul>
  <li><a href="https://openai.com/research/swe-lancer">OpenAI blog</a></li>
  <li><a href="https://arxiv.org/pdf/2502.12115">Paper</a></li>
</ul>

<p><strong>METR Long-Horizon</strong></p>

<ul>
  <li><a href="https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/">Blog update</a></li>
  <li><a href="https://openreview.net/pdf?id=VTF8yNQM66">Paper</a></li>
</ul>

<p><strong>Polyglot SWE-bench variants</strong></p>

<ul>
  <li><a href="https://github.com/bytedance-research/Multi-SWE-bench">Multi-SWE-bench (ByteDance)</a></li>
  <li><a href="https://arxiv.org/abs/2507.02825v2">ArXiv paper</a></li>
  <li><a href="https://huggingface.co/datasets/ByteDance/Multi-SWE-bench">Hugging Face dataset</a></li>
  <li><a href="https://github.com/princeton-nlp/SWE-bench-Multilingual">SWE-bench Multilingual</a></li>
</ul>

<p><strong>HumanEval</strong></p>

<ul>
  <li><a href="https://arxiv.org/pdf/2107.03374">Paper</a></li>
  <li><a href="https://github.com/openai/human-eval">GitHub</a></li>
</ul>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:selection" role="doc-endnote">
      <p>My criteria for covering these specific benchmarks were roughly: look at recent frontier model releases and see which coding benchmarks they mention in their reports. <a href="#fnref:selection" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:utboost" role="doc-endnote">
      <p>The <a href="https://arxiv.org/abs/2506.09289">UTBoost paper</a> exposes how a lot of tasks pass unit tests in SWE-Bench without resolving the underlying issues. <a href="#fnref:utboost" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[The common sense unit of work]]></title>
    <link href="http://blog.nilenso.com/blog/2025/09/17/the-common-sense-unit-of-work/"/>
    
    <updated>2025-09-17T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/09/17/the-common-sense-unit-of-work</id>
    <summary type="html"><![CDATA[What if we were to model a typical software development lifecycle in code?
]]></summary>
    
    <content type="html"><![CDATA[<p>What if we were to model a typical software development lifecycle in code?</p>

<p>The unit of work would be the fundamental abstraction. We’d build state machines and workflows around it, carrying it from specification to deployment through activities performed by product managers, engineers, designers, and others. The process could be customised to each team’s needs, with all the bells and whistles. But fundamentally, its effectiveness and adaptability depend on how good this central abstraction is.</p>
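
<p>A minimal sketch of that thought experiment (the states, fields, and transitions below are illustrative, not a prescription):</p>

<div class="language-python highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
from dataclasses import dataclass, field
from enum import Enum, auto

class State(Enum):
    SPECIFIED = auto()
    IN_PROGRESS = auto()
    IN_REVIEW = auto()
    DONE = auto()  # in production, in the hands of users

# Which states a unit may legally move to from each state.
TRANSITIONS = {
    State.SPECIFIED: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.IN_REVIEW, State.SPECIFIED},
    State.IN_REVIEW: {State.DONE, State.IN_PROGRESS},
}

@dataclass
class UnitOfWork:
    description: str
    acceptance_criteria: list[str]
    state: State = State.SPECIFIED
    context: list[str] = field(default_factory=list)  # gathered over its life

    def move_to(self, new_state):
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
</code></pre>
</div>
</div>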

<p>Get this abstraction wrong, and complexity scales exponentially. All the processes built around it inherit the dysfunction. Planning becomes chaotic, progress becomes opaque, and coordination becomes an expensive mess.</p>

<p>We deal with leaky abstractions by periodically refactoring them, so why not do the same thing with the unit of work? What makes a good unit of work? Let’s walk through these familiar activities and observe the properties that emerge.</p>

<h2 id="breaking-it-down">Breaking it down</h2>

<p>We typically start with product or feature requirements. We don’t usually take on a full feature in one shot; it’s “too big”. Especially if it’s complex enough to need some technical design and specification written along with it. We break it down into <strong>small</strong> parts that are easier to solve and give us a steady sense of progress.</p>

<p>Now the product requirement is actually a <em>hypothesis</em> for creating business value, and we need to validate the hypothesis as early as possible. So, the small parts need to be <strong>valuable</strong> to the customer.</p>

<p>In other words, we need the unit of work to be a slice of the cake, not a layer.</p>

<p><img src="/images/blog/breaking-it-down.webp" alt="slice-of-cake" /></p>

<p>Of course, bug fixes and refactors don’t provide value in the same way, and that’s okay. Sometimes there are technical tasks that are best left independent. That’s okay too. No need to be dogmatic as long as the broad needs of value and sense of progress are being met.</p>

<h2 id="planning">Planning</h2>

<p>Before starting work, we want to <strong>prioritise</strong>, because it saves a lot of time. We want to ship the most valuable slices first, and perhaps discard some low priority slices. But we can’t prioritise without weighing the business value against the implementation effort. All slices aren’t the same size, so we <strong>estimate</strong> the implementation effort first.</p>

<p>Then, some large slices can have low product value, so we would want to break them into even smaller slices to prioritise parts we care about most. Some other large slices can’t be sliced further meaningfully, and that’s okay. Some smaller slices can’t be engineered independently, so we build the larger slice anyway. The unit needs to be <strong>negotiable</strong>.</p>

<p><img src="/images/blog/planning-cake-cut.webp" alt="planning-with-slices" /></p>

<p>And since we’re doing this as a team, we’ll want to ensure that the slices are as <strong>independent</strong> as possible, so that we can each do our part without waiting, and we don’t step on each other’s toes.</p>

<h2 id="gathering-context">Gathering context</h2>

<p>A unit can be specified today, picked up for execution next month, blocked by another task, and then deprioritised into the backlog. Over its life, it gathers context about various things:</p>

<ul>
  <li>What value it provides, how to verify it</li>
  <li>How it needs to be implemented</li>
  <li>Missing pieces of context that came together after conversations</li>
  <li>Unknowns that were resolved or unresolved</li>
  <li>Who worked on it, what issues they ran into</li>
  <li>What bugs came up in testing, and QA before release</li>
</ul>

<p><img src="/images/blog/gathering-context.png" alt="gathering-context" /></p>

<p>Keeping these pieces of <strong>context collected in a single place</strong> helps in picking it up from where it was left off. When discussing, implementing, or tracking, it’s useful to have the same artifact in front of us.</p>

<h2 id="solving">Solving</h2>

<p>Knowing exactly what we’re solving for is very helpful, so we can build <em>just enough software™️</em>. No more, no less. So we need to define the <strong>acceptance criteria</strong> that we can all agree on.</p>

<p>Then, solve until we meet them.</p>

<p>It’s good to <strong>automate</strong> checking against the acceptance criteria, because we’re going to be doing that an awful lot while solving.</p>

<h2 id="verifying">Verifying</h2>

<p>Confidence usually doesn’t require checking every possible case, only the key ones that capture most of the impact. Yes, we checked this slice at every step of the way, but it is useful to inspect it one last time before serving.</p>

<p>When is a unit considered <strong>done</strong>? When the slice has been served. When it’s in the hands of the user, in production, potentially behind a feature flag.</p>

<p><img src="/images/blog/verifying.webp" alt="verifying" /></p>

<p>And that’s it. To manage the life cycle of software development, we manage the unit of work. Some would say we need to <a href="https://xp123.com/invest-in-good-stories-and-smart-tasks/">INVEST</a> in good units of work. And some of you might rightly recognise that it looks like a <a href="https://c2.com/xp/UserStory.html">User Story</a>. But as long as the described properties and affordances for its users exist, it should make for a decent unit of work regardless of what we call it.</p>

<hr />

<h2 id="does-your-unit-of-work-need-refactoring">Does your unit of work need refactoring?</h2>

<p>We’re fairly aware of the penalties of leaky abstractions in software. The incidental complexity of getting our primary real-world abstractions wrong grows exponentially with each layer of software built over it, until the whole system is slow, sludgy slop that’s difficult to work with. We can hack it here and there, and celebrate minor wins, but the big wins were lost in the ignored opportunities to refactor that central abstraction.</p>

<p>If we apply the same thought process to software development, we’ll see that our core abstraction, the unit of work, might need refactoring.</p>

<p>Big gains in developer productivity are important in this economic weather. Organisations that use DORA to measure deploy or commit frequencies might find those metrics valuable in some dimensions, but they’re not a measure of productivity in terms of outcomes for the customer. I love these last lines in <a href="">Kent Beck’s writing about measuring developer productivity</a>:</p>

<blockquote>
  <p>Be suspicious of anyone claiming to measure developer productivity. Ask who is asking &amp; why. Ask them what unit they are measuring &amp; how those units are connected to profit.<br />
<br />
I am 100% pro-accountability. Weekly delivery of customer-appreciated value is the best accountability, the most aligned, the least distorting.</p>
</blockquote>

<p>And I think a unit of work as defined above could be used to measure productivity holistically. Prioritising by value, eliminating unnecessary work, and validating quickly then become obvious and measurable ways to increase productivity.</p>

<p>Productivity gains through the use of AI assistants are also popularly reported and benchmarked in terms of % of code generated, but that’s not a very valuable dimension for measurement. If the benchmarks for AI productivity revolved around units of work valuable to the customer, then we’d be talking about true productivity gains. AI assistants also need small, well-specified slices of work, and hence will also benefit from a well-defined unit of work. My colleague Atharva has written a wonderful <a href="https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/">blog post</a> about that in detail.</p>

<p>Yeah, this article is mostly about rehashing a two-decade-old pitch for some common sense agile. But I hope it has been worth your time.</p>

<h2 id="annexes">Annexes</h2>

<ul>
  <li>In reality, the workflow isn’t as linear, and there is much back and forth between the steps. I’ve kept it simple to focus on the properties.</li>
  <li>Yes, I’m aware the classic definition of user stories doesn’t have implementation details.</li>
  <li>Slicing can happen across many dimensions, and breaking down a hard problem effectively can itself be a very hard problem.</li>
  <li>
    <p>If you want to read the OG Agile material, you can read:</p>

    <ul>
      <li>Kent Beck introducing story cards in <a href="https://www.goodreads.com/book/show/67833.Extreme_Programming_Explained">XPX</a> (Chapter 15 on planning)</li>
      <li>Bill Wake’s <a href="https://xp123.com/user-stories/">writing</a>, and the INVEST criteria are condensed, quick reads</li>
      <li>The <a href="https://c2.com/xp/UserStory.html">C2 page</a> on User Stories for opinions and some discussions</li>
      <li>Ron Jefferies on <a href="https://ronjeffries.com/xprog/articles/expcardconversationconfirmation/">Card, Conversation and Confirmation</a>.</li>
      <li>Mike Cohn’s <a href="https://www.goodreads.com/book/show/3856.User_Stories_Applied">User Stories Applied</a> is a deep dive.</li>
    </ul>
  </li>
  <li>I like <a href="https://newsletter.pragmaticengineer.com/p/measuring-developer-productivity">Gergely Orosz and Kent Beck’s response to McKinsey</a> on measuring developer productivity. Gergely’s <a href="https://newsletter.pragmaticengineer.com/p/developer-productivity-a-new-framework">writing about DORA, and SPACE</a> is interesting, but I wonder if metrics can be more granular, around this unit of work, and its affordances. That would shift-left the feedback on productivity, to where it matters.</li>
</ul>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[The quality of AI-assisted software depends on unit of work management]]></title>
    <link href="http://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/"/>
    
    <updated>2025-09-15T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work</id>
    <summary type="html"><![CDATA[The craft of AI-assisted software creation is substantially about correctly managing units of work.
]]></summary>
    
    <content type="html"><![CDATA[<p>The craft of AI-assisted software creation is substantially about correctly managing units of work.</p>

<p>When I was new to this emerging craft of AI-assisted coding, I was getting lousy results, despite the models being rather intelligent. Turns out the major bottleneck is not intelligence, but rather providing the correct context.</p>

<p>Andrej Karpathy, <a href="https://youtube.com/clip/Ugkx7m0MVzHTnKXdoDjlqei60zlK4DWCXWr2?si=kIwnm0xQXdSKMQCC">while referencing</a> my <a href="/blog/2025/05/29/ai-assisted-coding/">earlier article on this topic</a>, described the work of AI-assisted engineering as “putting AI on a tight leash”. What does a tight leash look like for a process where AI agents are operating on your code more independently than ever? He dropped a hint: work on small chunks of a single concrete thing.</p>

<h2 id="the-right-sized-unit-of-work-respects-the-context">The right sized unit of work respects the context</h2>

<p>I like the term <a href="https://simonwillison.net/2023/Jan/23/riley-goodside/">context engineering</a>, because it has opened up the vocabulary to better describe why managing units of work is perhaps the most important technique to get better results out of AI tools. It centers our discussion around the “canvas” against which our AI is generating code.</p>

<p>I like <a href="https://docs.anthropic.com/en/docs/build-with-claude/context-windows">Anthropic’s visualisation</a> from their docs:</p>

<p><img src="/images/blog/context-window-thinking-tools.jpg" alt="Anthropic's visualisation of a context window filling up for each turn until it exceeds the window limit" /></p>

<p>The generated output of the LLM is a sample from the next-token probability distribution. Every token we generate is appended to the context window before the next one is sampled. What this context window looks like has a huge influence on the quality of your generated output.</p>
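
<p>In toy form (with <code>sample_next_token</code> as a hypothetical stand-in for the model), the loop looks like this; note that the context only ever grows:</p>

<div class="language-python highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
def generate(prompt_tokens, sample_next_token, max_new=256, eos=0):
    context = list(prompt_tokens)
    for _ in range(max_new):
        # Each new token is conditioned on everything generated so far.
        token = sample_next_token(context)
        context.append(token)
        if token == eos:
            break
    return context
</code></pre>
</div>
</div>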

<p><a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html">Drew Breunig wrote an excellent article</a> about all kinds of things that can go wrong with your context and proposed various techniques to fix them.</p>

<p>The best AI-assisted craftsmen are often thinking about the design and arrangement of their context to get the AI to one-shot a solution. This is tricky and effortful, contrary to what the AI coding hype suggests.</p>

<p>If you don’t provide the necessary information in the context to do a good job, your AI will hallucinate or generate code that is not congruent with the practices of your codebase. It is especially brittle at integration points of your software system.</p>

<p>On the other hand, if you fill up the context with too much information, <a href="https://research.trychroma.com/context-rot">the quality of your output degrades</a> because of a lack of focused attention.</p>

<p>Breaking down your task into “right-sized” units of work, which describe just the right amount of detail is perhaps the most powerful lever to improve your context window, and thus the correctness and quality of the generated code.</p>

<h2 id="the-right-sized-unit-of-work-controls-the-propagation-of-errors">The right sized unit of work controls the propagation of errors</h2>

<p>Time for some napkin maths.</p>

<p>Let’s say your AI agent has a 5% chance of making a mistake. I’m not just referring to hallucinations—it could be a subtle mistake because it forgot to look up some documentation or you missed a detail in your specification.</p>

<p>In an agentic multi-turn workflow, which is what all coding workflows are converging to, this error compounds. If your task takes 10 turns to implement, you will have a 0.95<sup>10</sup> ≈ 59.9% chance of success. Not very high.</p>

<p><a href="https://utkarshkanwat.com/writing/betting-against-agents">Utkarsh Kanwat in his blog post</a> has made the same argument. His conclusion was that any AI agent would need some kind of pause-and-verify gating mechanism at each step for a long-horizon task.</p>

<table>
  <thead>
    <tr>
      <th rowspan="2">Per-action<br />error rate</th>
      <th colspan="4">Overall Success Rate</th>
    </tr>
    <tr>
      <th>5 turns</th>
      <th>10 turns</th>
      <th>20 turns</th>
      <th>50 turns</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>0.1%</td><td>99.5%</td><td>99.0%</td><td>98.0%</td><td>95.1%</td></tr>
    <tr><td>1%</td><td>95.1%</td><td>90.4%</td><td>81.8%</td><td>60.5%</td></tr>
    <tr><td>5%</td><td>77.4%</td><td>59.9%</td><td>35.8%</td><td>7.7%</td></tr>
    <tr><td>10%</td><td>59.0%</td><td>34.9%</td><td>12.2%</td><td>0.5%</td></tr>
    <tr><td>20%</td><td>32.8%</td><td>10.7%</td><td>1.2%</td><td>0.0%</td></tr>
  </tbody>
</table>
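
<p>The table above is plain compounding; a few lines reproduce it, along with the backwards calculation for the METR numbers discussed next:</p>

<div class="language-python highlighter-rouge">
<div class="highlight">
  <pre class="highlight"><code>
def success_rate(per_action_error, turns):
    return (1 - per_action_error) ** turns

for err in (0.001, 0.01, 0.05, 0.10, 0.20):
    row = [round(success_rate(err, t) * 100, 1) for t in (5, 10, 20, 50)]
    print(f"{err:.1%}: {row}")  # e.g. 5.0%: [77.4, 59.9, 35.8, 7.7]

# Working backwards: a ~2-hour task at ~50 turns and a 70% success rate
# implies a per-action error rate of about 0.7%.
per_action_error = 1 - 0.70 ** (1 / 50)
</code></pre>
</div>
</div>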

<p>What does the state of the art for multi-turn error rates look like? METR recently published a popular chart <a href="https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/">describing how AI models are getting better at long-horizon tasks</a>. Currently GPT-5 is at the top of the leaderboard, where it can perform ~2-hour-long tasks at around a 70% success rate. Working backwards (let’s say a 2-hour task is 50+ turns), this would amount to a sub-1% error rate per action.</p>

<p><img src="/images/blog/metr.png" alt="Chart showing results of METR's chart showing task horizons increasing over time" /></p>

<p>Doesn’t a &lt;1% error rate per action seem suspicious to you? As a regular user of agentic coding tools (my current one is Codex CLI), I’ll eat my shoe if GPT-5 starts nailing my tasks 99.9% of the time.</p>

<p>My intuition derived from experience tells me that even the best AI right now isn’t even 95% likely to be correct per action. So where is the difference coming from? A closer look at the actual paper helps:</p>

<blockquote>
  <p>Our tasks typically use environments that do not significantly change unless directly acted upon by the agent. In contrast, real tasks often occur in the context of a changing environment.</p>

  <p>[…]</p>

  <p>Similarly, very few of our tasks are punishing of single mistakes. This is in part to reduce the expected cost of collecting human baselines.</p>
</blockquote>

<p>This is not at all like the tasks I am doing.</p>

<p>METR acknowledges the messiness of the real world. They have come up with a “messiness rating” for their tasks, and the “mean messiness” of their tasks is 3.2/16.</p>

<p>By METR’s definitions, the kind of software engineering work that I’m mostly exposed to would score at least around 7-8, given that software engineering projects are path-dependent, dynamic and without clear counterfactuals. I have worked on problems that get to around 13/16 levels of messiness.</p>

<blockquote>
  <p>An increase in task messiness by 1 point reduces mean success rates by roughly 8.1%</p>
</blockquote>

<p>Extrapolating from METR’s measured effect of messiness (my kind of work sits roughly 4 points above their mean of 3.2, at about 8 percentage points each), GPT-5 would go from 70% to around 40% success rate for 2-hour tasks. This maps to my experienced reality.</p>

<p>I am not certain that pure intelligence can solve for messiness. Robustness to environmental chaos and the fuzzy nature of reality is fundamentally about managing context well. Until we find the magic sauce that solves this, it is clear that we need a workflow that can break down our problem into units of work, with verifiable checkpoints to manage the compounding of errors.</p>

<p>These verifiable checkpoints need to be <em>legible to humans</em>.</p>

<p><img src="/images/blog/unit-of-work-management.jpg" alt="A diagram of boxes that represent units of work, with circles that represent checkpoints where users can verify outcomes and make corrections" /></p>

<h2 id="so-what-is-the-right-sized-unit-of-work">So, what is the “right sized” unit of work?</h2>

<p>The right sized unit of work needs to be small and describe the desired outcome concisely.</p>

<p>The desired outcome on completion of a unit of work needs to be human-legible. I argue that it needs to provide legible <em>business value</em>. Ultimately, the users of software are going to be humans (or systems that model human constructs). Therefore, an elegant way to break down a project is to model it as small units of work that provide legible business value at each checkpoint. This will serve the purpose of respecting the context window of the LLM and help manage the propagation of errors.</p>

<p>Software engineers have already defined a unit of work that provides business value and serves as the placeholder for all the context and negotiation of scope—User Stories. I think they are a good starting point to help us break down a large problem into smaller problems that an LLM can one-shot, while providing a concrete result. They center <em>user outcomes</em>, which unlike “tasks”, are robust to the messy dynamic environment of software development. Srihari has elegantly written about the value of user stories in the software development process, and I recommend <a href="/blog/2025/09/17/the-common-sense-unit-of-work/">reading his post</a> to better understand the properties that make them suitable for the messy work of building software.</p>

<p>Deliverable business value is also what all stakeholders can understand and work with. Software is not built in a vacuum by developers—it needs the coordination of teams, product owners, business people and users. The fact that AI agents work in their own context environment, separate from the other stakeholders, hurts effectiveness and the transfer of their benefits. I think this is an important gap that needs to be bridged.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>unit size</th>
      <th>outcome of completion</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>TODO item</td>
      <td>small</td>
      <td>incremental technical value</td>
    </tr>
    <tr>
      <td>“Plan Mode”</td>
      <td>large</td>
      <td>technical value</td>
    </tr>
    <tr>
      <td>Amazon Kiro Spec</td>
      <td>small</td>
      <td>technical value</td>
    </tr>
    <tr>
      <td>User Story</td>
      <td>small</td>
      <td>business value</td>
    </tr>
  </tbody>
</table>

<p>Most AI agents today have well-functioning “planning” modes. These are good at keeping the agent on rails, but they mostly provide technical value, not necessarily a legible business outcome. I believe planning is complementary to our idea of breaking down a project into small units of business value: my proposed unit of work can be planned with existing planning tools. And I believe this is superior to planning over a large unit of work, due to the context-rot issues described earlier.</p>

<p>Of course, plain old User Stories as described in the Agile canon are not sufficient. They need to be accompanied by “something more” that can nudge the agents to gather the right context in service of the stories’ business-value outcomes. What that “something more” could look like is something we hope to answer in the coming months.</p>

<h2 id="the-storymachine-experiment">The StoryMachine experiment</h2>

<p>To test whether user stories with “something more” can indeed serve as optimal units of work with the properties I described above, we are running an experiment called <a href="https://github.com/nilenso/storymachine">StoryMachine</a>. Currently StoryMachine does not do much—it reads your PRD and tech specs and produces story cards. It is still early days. But we will set up an evaluation system that will help us iterate towards a unit-of-work description that helps us build useful software effortlessly. I hope to share updates on what we find in the coming months.</p>

<p>I want the craft of AI-assisted development to be less effortful and less like a slot-machine. And our best lever to get there is managing the unit of work.</p>
]]></content>
    
  </entry>
  
  <entry>
    
    <title type="html"><![CDATA[My Quarterly System Health Check-in]]></title>
    <link href="http://blog.nilenso.com/blog/2025/09/05/my-quarterly-system-health-check-in-beyond-the-dashboard/"/>
    
    <updated>2025-09-05T00:00:00+00:00</updated>
    <id>http://blog.nilenso.com/blog/2025/09/05/my-quarterly-system-health-check-in-beyond-the-dashboard</id>
    <summary type="html"><![CDATA[It is essential to periodically take a few steps back from the day to day and reflect on where we are against our strategic goals. If you’re an engineering leader, a head of engineering, a director, or a VP, you likely have a recurring meeting to this effect.
]]></summary>
    
    <content type="html"><![CDATA[<p>It is essential to periodically take a few steps back from the day to day and reflect on where we are against our strategic goals. If you’re an engineering leader, a head of engineering, a director, or a VP, you likely have a recurring meeting to this effect.</p>

<p>In this post, I propose a structure for this operational exercise (complementing a business review) that lasts 2-4 hours, every month or quarter. I see quality as solving for the Pareto front with the tangible dimensions of reliability, performance, cost, delivery and security, and the more intangible dimensions of simplicity and social structures. For each dimension, go through the list of questions below and try to answer them together. The questions are:</p>

<ul>
  <li>Intentionally informal to provoke honest discussion.</li>
  <li>Intuitive proxies for metrics. Numbers matter, and we should look at dashboards during discussions, but we need to go beyond the numbers and talk about the problems they represent.</li>
  <li>Intentionally <em>instinctive</em> and <em>emotional</em>. They work by poking at symptoms and leveraging the learned senses of trusted engineers, rather than breaking everything down to the raw facts. How they <em>feel</em> is an important signal, and it is a leader’s job to <em>listen</em> to them.</li>
  <li>Per system, where a system refers to a software service, or a group of them that you want to treat as a whole. I would suggest separate review meetings for systems or teams you want to treat independently.</li>
  <li>Meant to be answered by people who actively work on the software every day.</li>
  <li>About <em>effectiveness</em> at the <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto front</a>, not necessarily <em>efficiency</em>.</li>
  <li>Not novel. They’re what I hope most experienced developers would consider common sense.</li>
</ul>

<hr />

<h2 id="simplicity">Simplicity</h2>

<p>This is the most important dimension for reflecting on quality, but it is best treated as intangible: it is hard to measure objectively. A simple system is performant, cheap, and reliable, and does one thing well. And a good engineer knows this intuitively.</p>

<ol>
  <li>What would a new engineer experience?
    <ul>
      <li>Can we explain the system’s responsibility in plain English, within 5 minutes?</li>
      <li>Can they form a correct mental model in under one hour using only docs and diagrams?</li>
      <li>Do we find ourselves apologising when explaining how the system works?</li>
      <li>How long does it take for a new engineer to be onboarded? Is the time to first PR acceptable?</li>
    </ul>
  </li>
  <li>Is the domain simple? And is the domain modelling simple?
    <ul>
      <li>What are its core domain entities (<a href="https://martinfowler.com/bliki/DDD_Aggregate.html">aggregates</a>), and would you say there are many?</li>
      <li>Is the current design made of small and composable components?</li>
      <li>Do simple modifications you expect in hours take many days? And is that surprising?</li>
      <li>Are there Architecture Decision Records (ADRs) that you keep referring to for key decisions?</li>
      <li>Does a small feature need modifications in many places, or in multiple modules?</li>
      <li>Do users mostly figure out how to use, or build on, the system through its readme and interfaces, or do we often have to answer questions about usage?</li>
    </ul>
  </li>
  <li>Is it simple to observe, debug, and diagnose?
    <ul>
      <li>Do you need to look at values from the database or cache to diagnose most issues?</li>
      <li>Can you reproduce bugs simply by making the same request again?</li>
      <li>If the system is stateful, how is state represented? Is it simple to reconstruct the state at a given point in time?</li>
    </ul>
  </li>
  <li>Is the domain inherently complex, or is the software incidentally complex?
    <ul>
      <li>Are we afraid of making changes in this system because we might break things we don’t understand?</li>
      <li>How often do we need to confer with the one person in the team who remembers why things are the way they are?</li>
      <li>If we squash a bug, do two others take its place?</li>
      <li>Given a chance to rewrite, what exactly would we change and why?</li>
    </ul>
  </li>
</ol>
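
<p>To make that last question concrete, here is a minimal Python sketch of one shape “simple to reconstruct” can take: state as a pure replay of an event log. The <code>Event</code> type, fields, and amounts are made up for illustration.</p>

<pre><code>from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    at: int        # logical timestamp
    kind: str      # "credit" or "debit"
    amount: int

def balance_at(events, t):
    """Replay the log up to time t to reconstruct the balance."""
    balance = 0
    for e in sorted(events, key=lambda e: e.at):
        if e.at > t:
            break
        balance += e.amount if e.kind == "credit" else -e.amount
    return balance

log = [Event(1, "credit", 100), Event(2, "debit", 30), Event(5, "credit", 10)]
assert balance_at(log, 2) == 70  # state at any past point is just a replay
</code></pre>

<p>If answering “what was the state at time t?” needs anything more exotic than a replay or a snapshot, that is usually a symptom worth discussing.</p>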

<hr />

<h2 id="delivery">Delivery</h2>

<p>That is, delivery of business value, not code. It’s easy to deliver quickly when software is newer and smaller. As it gets older and larger, how do we maintain that speed? It’s mostly about the flywheel: how quickly do we get feedback, and how much of it do we get before going to production?</p>

<ol>
  <li>Does it feel like we are moving slowly?
    <ul>
      <li>If so, what do you think makes us slow?</li>
      <li>Are we happy with our deploy frequency and rollback rate?</li>
      <li>How much time does it take, on average, for a user story (business value) to go from in-progress to done (in production)? Has this been improving or deteriorating? (See the sketch after this list.)</li>
      <li>How confident are we about our estimates? Is our predictability improving? What’s our track record of sticking to estimated timelines?</li>
    </ul>
  </li>
  <li>How much confidence do we have in our tests, builds and deployments?
    <ul>
      <li>Do green builds on main automatically deploy to production?</li>
      <li>Will we deploy during peak hours, or on a Friday night?</li>
      <li>Can we refactor anything as long as tests pass? What are we afraid of refactoring?</li>
      <li>Do tests pass consistently on CI and on everyone’s machines?</li>
      <li>What % of the codebase is abandoned, irrelevant, or barely used in production?</li>
    </ul>
  </li>
  <li>How good is our local setup?
    <ul>
      <li>Does everyone in the team run their service locally on their machine?</li>
      <li>Do service dependencies like databases, queues, and caches also run locally? Are they in-memory so they’re faster?</li>
      <li>How much time does it take to run tests locally?</li>
      <li>How much time does it take to install dependencies and build a service locally?</li>
      <li>Is the local setup a single command, and does it “just work”?</li>
    </ul>
  </li>
  <li>What’s the % split of feature / bug / chore (tech-task)?
    <ul>
      <li>Do we understand the reasons for the current split?</li>
      <li>Does the split reflect our strategic focus?</li>
    </ul>
  </li>
</ol>
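
<p>For the cycle-time question above, even a rough number beats a feeling. A minimal Python sketch, assuming a hypothetical export of per-story start and finish dates from your tracker:</p>

<pre><code>from datetime import date
from statistics import mean

# Hypothetical tracker export: (in-progress, done) dates per user story.
stories = [
    (date(2025, 1, 2), date(2025, 1, 5)),
    (date(2025, 1, 3), date(2025, 1, 10)),
    (date(2025, 1, 8), date(2025, 1, 9)),
]

cycle_times = [(done - started).days for started, done in stories]
print(f"average cycle time: {mean(cycle_times):.1f} days")
</code></pre>

<p>Plotting this month over month answers “improving or deteriorating” far more honestly than memory does.</p>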

<hr />

<h2 id="reliability">Reliability</h2>

<p>This dimension is usually fairly well quantified, so it’s possible to be objective: look at the numbers for SLOs, uptime, and so on, and make meaningful judgements from them. Still, I’ll write out questions for the subjective, qualitative aspects that should be evaluated in addition to the objective ones.</p>

<ol>
  <li>Is our incident management healthy?
    <ul>
      <li>Do we ignore alerts because they’re noisy?</li>
      <li>How many alerts do we receive every week? Does it feel like we’re constantly firefighting?</li>
      <li>How many of those alerts were for previously diagnosed issues?</li>
      <li>How often are incidents detected by users rather than alerts?</li>
      <li>Which stage needs the most attention: detection, triage, diagnosis, recovery, or review?</li>
      <li>Does a newcomer know what to do when they receive an alert?</li>
    </ul>
  </li>
  <li>Are our reliability expectations reasonable, and clear?
    <ul>
      <li>What are the critical user journeys supported by this system?</li>
      <li>Do we have product-level SLOs for user journeys end-to-end that are well monitored?</li>
      <li>Are the reliability issues due to essential or incidental complexity?</li>
      <li>How can we get away with weaker reliability requirements?</li>
      <li>Do we need to be transactionally consistent, or eventually consistent? If eventually consistent, within how much time? Can that be relaxed for better reliability?</li>
      <li>Do we feel like we’re reinventing the wheel? Would existing or managed solutions be more reliable?</li>
    </ul>
  </li>
  <li>How much control do we have on our reliability?
    <ul>
      <li>How dependent are our systems on external systems? Are those dependencies optional or required?</li>
      <li>Are we happy with our timeouts, fallbacks, and defaults? (See the sketch after this list.)</li>
    </ul>
  </li>
  <li>Fault isolation, graceful degradation, and automatic recovery.
    <ul>
      <li>When one system fails, do others fail? And is that failure cascade reasonable, given their domain responsibilities?</li>
      <li>At loads over capacity, does the system continue to work at capacity?</li>
      <li>What parts of recovery are manual?</li>
    </ul>
  </li>
</ol>
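
<p>On timeouts, fallbacks, and defaults, a useful test is whether every external call looks roughly like the sketch below. This is a minimal Python illustration; the endpoint URL and the fallback value are hypothetical.</p>

<pre><code>import json
import urllib.request

FALLBACK_RATE = 1.0  # a documented default, used when the dependency is down

def exchange_rate(url="https://pricing.internal/rate"):  # hypothetical endpoint
    try:
        # Explicit timeout, so a slow dependency cannot stall this system.
        with urllib.request.urlopen(url, timeout=0.5) as resp:
            return json.load(resp)["rate"]
    except Exception:
        # Degrade gracefully instead of cascading the failure upstream.
        return FALLBACK_RATE
</code></pre>

<p>External calls with no explicit timeout, or with a fallback nobody has agreed on, are exactly what this part of the review should surface.</p>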

<hr />

<h2 id="performance">Performance</h2>

<p>We just need to be performant enough to enable the next growth curve of the business. This section cares about that kind of performance. Effective, not efficient.</p>

<ol>
  <li>Are the performance expectations clear, and reasonable?
    <ul>
      <li>Does the workload or throughput actually match business metrics, or is it inflated incidentally? Reducing scale is the best way to deal with scale.</li>
      <li>Are the performance expectations for the system clear? Are SLOs set beforehand? Are they reasonable? Begin with the end in mind.</li>
      <li>Is performance of the journey user-bound, or system-bound? If user-think-time dominates, performance of the system isn’t the most important concern.</li>
    </ul>
  </li>
  <li>Is the team truly aware of the current state?
    <ul>
      <li>Do people know (by memory) the approximate p99s of critical operations? (See the sketch after this list.)</li>
      <li>Does the team know the normal throughput patterns (morning and evening peaks, for example) well enough to tell that something is amiss just from the shape of throughput over time?</li>
      <li>What are the normal and peak resource utilisations of the system? Do we have leading indicators of trouble, or just lagging ones?</li>
      <li>Which components will require redesign before horizontal scaling is viable?</li>
    </ul>
  </li>
  <li>Do we know what direction to improve performance in?
    <ul>
      <li>Which resource is the bottleneck? Compute, memory, I/O, network, or something else?</li>
      <li>Is the synchronous request path tight, with everything non-critical kicked to async tasks?</li>
      <li>What kind of throughput disrupts performance? Unpredictable spikes and bursts, or more predictable plateaus and sawtooths?</li>
      <li>What accuracy, or consistency requirements constrain performance?</li>
      <li>Has optimisation created rigidity? Are we boxed into micro-optimisations that block larger design moves?</li>
    </ul>
  </li>
</ol>
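
<p>On p99s: the number itself is simple enough to compute that there is no excuse for it not being on a dashboard. A minimal nearest-rank sketch in Python, with made-up latency samples:</p>

<pre><code>import math

def percentile(samples, p):
    """Nearest-rank percentile; p ranges from 0 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 980, 14]  # per-request samples
print("p50:", percentile(latencies_ms, 50), "ms")  # 14 ms
print("p99:", percentile(latencies_ms, 99), "ms")  # 980 ms
</code></pre>

<p>In practice these come from your metrics system, but if nobody can state the p99 of a critical operation to within an order of magnitude, the team is not truly aware of the current state.</p>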

<hr />

<h2 id="organisation">Organisation</h2>

<p>Software architecture is socio-technical. Organisation design, system design, and process design are deeply connected. How people are organised, and how they communicate, is <a href="https://en.wikipedia.org/wiki/Conway%27s_law">reflected in the software architecture</a>, and vice-versa. Yet, these aspects are often seen as independent, or unrelated.</p>

<p>In order to enable high ownership and agency, we should be willing to restructure teams or rescope responsibilities, just as much as we’re willing to change software architecture.</p>

<ol>
  <li>Is the team’s responsibility and ownership clear?
    <ul>
      <li>How dissonant is the product &lt;&gt; team &lt;&gt; system responsibility overlap?</li>
      <li>If this system is responsible for a journey, does the team own the journey and funnel too?</li>
      <li>Does the system fit into the boundaries of the team’s business domain?</li>
    </ul>
  </li>
  <li>Are we fighting against organisational structure with software structure?
    <ul>
      <li>What are the downstreams and upstreams of the system, and are they owned by teams that are close to this team in the org-chart?</li>
      <li>What is the average number of people required to own a system, and what are the outliers on both ends?</li>
      <li>Is the system responsible for many things?</li>
      <li>Is the work-life balance for people different based on the systems they work on? Is that warranted?</li>
    </ul>
  </li>
  <li>Do our processes complement or contrast our architecture?
    <ul>
      <li>Do incentives reward short term velocity over long term manoeuvrability?</li>
      <li>How many meetings could have been avoided if we had clear contracts?</li>
      <li>Do users build on the system by composing and configuring, or do they need to collaborate and coordinate with us to get work done?</li>
      <li>Which people have “meetings all day”, and is their time best spent that way? Is it compensating for poor system or org structure?</li>
      <li>Which processes would we drop if we trusted everyone like we trust ourselves?</li>
      <li>Is there a CONTRIBUTING.md defined?</li>
      <li>What is the average number of PRs open at any point in time?</li>
    </ul>
  </li>
</ol>

<h2 id="cost-security">Cost, Security</h2>

<p>These are important dimensions, and very much part of the Pareto frontier. Unfortunately, I haven’t built much intuition for them. I understand them enough to work on related problems, but not enough to write a health-check questionnaire about them. I hope someone does; it would be useful.</p>

<p>Write in if you would like to collaborate on these sections with me!</p>

<hr />

<p>The questions will make conversations happen, but it is up to you to truly listen, understand, and make the most of them. I would suggest using the meeting to focus on the problems; you can prioritise and solve them later.</p>

<p>I’m also assuming you’re doing the work of ensuring the effort is focused on the right problem to begin with. Climbing a ladder fast isn’t useful if it’s against the wrong wall.</p>

<p><em>Thanks to Atharva for his thoughtful review of this post.</em></p>
]]></content>
    
  </entry>
  
</feed>