Trajectory shapes

Why are we managing our coding agents based on vibes instead of their actual work habits?

Leaderboards tell us how many tasks a model solved, but not how. The trajectories the models leave behind tell the story of how, yet they aren't studied nearly as much. The tools a model prefers and the overall "shape" of its workflow are both visible through an empirical lens.

I analyzed the latest SWE-Bench Pro trajectories I could find: runs from October 2025 for Sonnet 4.5 and GPT-5, 730 task trajectories per model. I deterministically classified each step into activities like understand, edit, verify, and cleanup, using only tool calls, literal command/filename matches, and regex heuristics, and then computed each activity's share over time. The resulting "trajectory shape" chart is very interesting!
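The classification itself can be sketched in a few lines. This is a simplified, hypothetical version of the rules (the real ruleset, per the appendix, is larger and lives in scripts/classify_intent.py); the point is that every label comes from string matching, never from model inference:

```python
import re

# Simplified, illustrative version of the deterministic step classifier.
# The real ruleset is larger; the key property is that every label comes
# from tool names, literal command matches, and regexes -- no model inference.
RULES = [
    ("verify",     re.compile(r"\b(pytest|go test|npm test|jest|mocha)\b")),
    ("reproduce",  re.compile(r"\b(python|node|sh|bash)\s+\S*(repro|demo)\w*\.")),
    ("understand", re.compile(r"\b(grep|rg|ag|find|cat|head|tail|ls|tree)\b")),
    ("cleanup",    re.compile(r"\b(rm|mv|cp|chmod|git)\b")),
]

def classify_step(tool: str, command: str) -> str:
    """Map one trajectory step to a coarse activity; first match wins."""
    if tool == "str_replace_editor":  # SWE-agent's file editor tool
        return "understand" if command.startswith("view") else "edit"
    for label, pattern in RULES:
        if pattern.search(command):
            return label
    return "other"
```

First match wins, so the more specific patterns (test runners, repro scripts) are checked before the generic readers.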

Here are the main work habits I see in this chart:

  1. Sonnet starts editing early (at 35%), GPT starts editing much later (at 50%)
  2. Sonnet is done with the implementation early (at 62%)
  3. Sonnet spends a LOT of time verifying after the last source edit
  4. Sonnet has to cleanup temporary files, GPT doesn’t have to
  5. GPT-5 reads a LOT to front-load context, before it starts editing (at 50%)
  6. GPT-5 barely does any verification

If I were to condense the above to “vibes”, I might say “Claude starts editing early and figures it out in the loop. GPT reads first, then goes for the one-shot.” That is also close to what people on my timeline were saying when these runs were current. For example, Eric Provencher wrote:

Now I can say the same thing with data, and with a better sense of the nuances. And once we see these habits clearly, we can stop managing agents on vibes and start steering them on evidence.

With newer models and a maintainer

I wanted to replicate this chart for Opus 4.6 and GPT-5.4, but the SWE-Bench Pro trajectories (or any other SWE-agent trajectories) for them are unavailable. I then remembered that Mario Zechner has been publishing his Pi trajectories on Hugging Face, and downloaded them.

Luckily, he works through GitHub issues methodically. Each session is one issue and one model. It kicks off with the same analysis prompt, waits for his go-ahead, involves a varied amount of steering to address the issue, and then wraps up explicitly to ship a fix, close the issue, leave a triage comment, and so on. Each one starts with git work (read the PR, read the comments) and ends with git work (push the change, close the issue, or do the final triage), with the source-editing loop in between.

While the issue fixes make for trajectories similar to SWE-Bench Pro's, I'm still changing the models and the harness, and adding a maintainer, so this is not exactly an apples-to-apples comparison. To keep the trajectory-shape comparison closer to the benchmark, the Pi panels below use only the strict single-model sessions whose final title starts with Issue:. That said, the observations from the previous section still hold. The bars below each Pi panel give a rough sense of where Mario is steering more during the run; taller bars mean more intervention around that part of the trajectory. The labels mark Mario's go-ahead and wrap-up moments.

I think it’s likely that the explicit analysis prompt is pushing the understand phase to be longer. The first edit is pushed from 35% to 47%. After that, there’s a fair amount of steering, QA, critique, and validation, with verify + edit cycles, after which it wraps up. The push to verify in the SWE-agent prompt likely causes the benchmark trajectory to go on longer, and is fairly redundant.

As with Claude, the analysis prompt pushes GPT's first edit even later, from 50% to 63%. However, the edit duration has shrunk significantly, from 40% to 23%. I suspect this is a combination of the model upgrade and Mario's steering towards "minimal" and "concise" solutions.

Why compare trajectory shapes?

Trajectories are traces of learned workflows. RL shapes both micro preferences, like which tools a model uses, and macro preferences, like the overall workflow it follows from problem statement to verified solution. To study those macro patterns, one trajectory is not enough; you need a corpus. And to compare a corpus, you need normalization. So, each trajectory is mapped onto the same 0–100% start-to-finish frame so that the aggregate shape reflects the learned workflow. For instance, see the difference in trajectory lengths across the four runs:
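Concretely, the normalization can be sketched as follows: each trajectory is resampled onto a fixed number of position bins, and activity counts are converted to per-bin shares. This is my own minimal reconstruction, not the analysis repo's code:

```python
import numpy as np

def trajectory_shape(trajectories, activities, bins=50):
    """Map every trajectory onto the same 0-100% frame and average.

    trajectories: list of per-step label sequences, e.g. ["understand", ...]
    Returns an (activities x bins) array: the share of each activity at
    each normalized position across the corpus.
    """
    shape = np.zeros((len(activities), bins))
    index = {a: i for i, a in enumerate(activities)}
    for labels in trajectories:
        n = len(labels)
        for step, label in enumerate(labels):
            if label not in index:
                continue  # ignored buckets such as "other" or "failed"
            b = min(int(step / n * bins), bins - 1)  # position on 0-100%
            shape[index[label], b] += 1
    totals = shape.sum(axis=0)
    totals[totals == 0] = 1  # avoid dividing empty bins by zero
    return shape / totals    # per-bin activity shares
```

Because every trajectory is stretched to the same frame, a 30-step Pi session and a 90-step benchmark run contribute equally to the aggregate shape.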

[Figure: trajectory-length distributions on a fixed 0–100 step scale; mean steps as the dot, p25–p75 as the bar.]

| Run | Corpus | Mean steps | p25–p75 |
| --- | --- | --- | --- |
| SWE agent Sonnet 4.5 | 730 tasks · 43.7% resolved | 77.5 | 63–89 |
| SWE agent GPT-5 | 730 tasks · 36.3% resolved | 59.5 | 34–76 |
| Mario + Pi · Opus 4.5/4.6 | 49 sessions · 98.0% completed | 30.9 | 16–35 |
| Mario + Pi · GPT-5.4 | 77 sessions · 96.1% completed | 40.6 | 27–52 |

Further, tool calls are normalized by structural intent rather than literal tool identity, so that equivalent actions can be compared across harnesses. In this study, I normalized tool calls across SWE-agent and Pi; the taxonomy table in the appendix documents that mapping.
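As a small, illustrative excerpt of that normalization (the raw labels follow the appendix taxonomy, but this mapping is trimmed for illustration and is not the full table):

```python
# Trimmed excerpt of the cross-harness normalization: raw labels from each
# harness map onto one shared intent class. Raw labels follow the appendix
# taxonomy; this dict is a small illustration, not the full mapping.
INTENT_MAP = {
    "read-file-range":    "read",    # exists on both SWE-agent and Pi
    "view-directory":     "read",    # SWE-agent only
    "search-keyword":     "search",
    "run-test-suite":     "verify",
    "edit-source":        "edit",
    "git-github-context": "git",     # Pi only
    "git-publish":        "git",     # Pi only
}

def to_intent(raw_label: str) -> str:
    """Collapse a harness-specific raw label into the shared intent class."""
    return INTENT_MAP.get(raw_label, "other")
```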

So, if you recognize these patterns in your agents and they don't fit the task at hand, you can steer them accordingly.

| Observed pattern | Possible steering move |
| --- | --- |
| Claude starts editing early | Ask for explicit analysis first |
| Claude finishes implementation early | Ask it to verify before concluding |
| Claude over-verifies after the last edit | Specify exact success criteria |
| Claude leaves cleanup behind | Tell it not to create temporary docs or scripts |
| GPT front-loads context gathering | Usually fine, or narrow what context it should gather |
| GPT is not verifying enough | Ask for red-green TDD |

Appendix

Analysis code and data: github.com/nilenso/swe-bench-pro-cost-token-time-analysis.

  • SWE-Bench Pro

  • Pi transcripts

1. GPT-5’s tool failures

I have ignored these failures for the purpose of the charts, but they are significant enough to mention here. GPT-5's tool calls fail in about 20% of the steps throughout the SWE-agent trajectory. Most notably, it expects the apply_patch tool to be present, and keeps calling that. apply_patch is the main editing tool within Codex, and GPT-5’s own system prompts inside Codex had counterweights for this behavior. GPT-5 manages to work through the failures and still achieve a decent resolution rate on the task. However, we do not know what effect those failing tool calls have on the outcome.
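Given the (failed) label variants from the appendix taxonomy, the per-trajectory failure share is a one-liner; a minimal sketch (the function name is mine, for illustration):

```python
def failure_rate(labels):
    """Share of steps whose label carries the deterministic (failed) suffix.

    `labels` is one trajectory's step labels, e.g.
    ["edit-source", "bash-command(failed)", ...].
    """
    if not labels:
        return 0.0
    failed = sum(1 for label in labels if label.endswith("(failed)"))
    return failed / len(labels)
```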

Example failures and data: failure modes in the reference appendix.

2. High-level action frequencies

On SWE-Bench Pro, even if we ignore when actions happen and just count what the models spend their steps on, the same signature shows up. GPT-5 spends much more of its trajectory reading, while Sonnet 4.5 spends much more of it verifying. That makes the difference in the trajectory-shape chart feel less like a visual artifact and more like a stable workflow habit.

  • To “understand”, GPT reads more, while both models search about the same amount
  • Verification leans the other way. Sonnet 4.5 runs ~3 verify steps per edit, GPT-5 runs ~1 verify step per 3 edits
  • Git + housekeeping is effectively Sonnet-only (6.2% vs 0.2%). GPT-5’s trajectory has no bookkeeping tail at all.
  • Pre-edit information gathering (read + search + reproduce) is 61% of GPT-5’s steps vs 49% of Sonnet’s. GPT-5 front-loads context; Sonnet spreads work more evenly across the trajectory.
  • Around 40% of GPT-5 trajectories don’t do ANY verification at all
  • Sonnet 4.5 re-runs the same passing tests many times, without any changes in source
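These position-agnostic counts are straightforward to compute once steps carry the high-level labels; a sketch of the two aggregates used above (function names are mine, and the input is assumed to be already-classified trajectories):

```python
from collections import Counter

def action_mix(trajectories):
    """Overall share of each high-level action, ignoring position.

    trajectories: list of per-step label sequences, already collapsed
    to high-level classes like "read", "edit", "verify".
    """
    counts = Counter(label for labels in trajectories for label in labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def verify_per_edit(trajectories):
    """Verify steps per edit step across the whole corpus."""
    counts = Counter(label for labels in trajectories for label in labels)
    return counts["verify"] / max(counts["edit"], 1)
```

On these aggregates, Sonnet 4.5's ~3 verify steps per edit versus GPT-5's ~1 per 3 edits falls straight out of the label counts.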

3. Models have tool preferences

On SWE-Bench Pro, the tools provided in the RLVR environment likely shape the tool preferences of models. Here are the various tools used per category, compared across Sonnet 4.5 / GPT-5. I wish I had SWE-Bench Pro trajectories for Opus 4.6 and GPT-5.4. With the same tasks and harness, I could compare tool-call preferences, costs, latencies, and more.

4. Intent Classification Taxonomy

The benchmark charts above and the Pi / Mario charts later in the post use the same high-level taxonomy, but not always the same raw labels. So this appendix merges both into one deterministic reference table. Raw labels stay in monospace under each intent class so chart clicks can still land on the exact label.

Unlike the trajectory-shape panels, which use the stricter issue-only Pi subset, the Pi counts in this appendix use the broader cut of all strict single-model sessions: 171 analyzed Opus 4.5/4.6 trajectories and 133 analyzed GPT-5.4 trajectories. The benchmark side remains the full SWE-Bench Pro pair, with 730 trajectories per model.

How to read this table:

  • Both sides are deterministic. No model inference is used in either taxonomy.
  • Rows are merged by intent class. If SWE-agent and Pi use the same raw label, it appears once. If they use different labels for the same kind of action, both labels appear under the same row with SWE / Pi badges.
  • Blank counts mean no separate label on that side. For example, Pi has no distinct insert-source bucket, while SWE-agent has none of Pi's git workflow labels like git-github-context and git-publish.
  • Counts are per dataset. The left pair is the SWE-agent benchmark set (730 Sonnet 4.5 trajectories, 730 GPT-5 trajectories). The right pair is the all-single-model Pi set (171 Opus 4.5/4.6 trajectories, 133 GPT-5.4 trajectories).
  • The trajectory-shape panels are narrower. Earlier Pi charts stay on the issue-only subset because that keeps the comparison closer to the benchmark’s issue-fix workflow.
Each row below gives the intent description, the raw label(s) in monospace, the matching rule, and then the counts in this order: SWE-agent Sonnet 4.5 (730 trajs) · SWE-agent GPT-5 (730 trajs) · Pi Opus 4.5/4.6 (171 trajs) · Pi GPT-5.4 (133 trajs).
read (understand)
View an entire file
read-file-full
SWE: str_replace_editor view <file> after test/config/range/truncation cases are ruled out. Pi: whole-file read(path). 3,125 5,020 455803
View a specific range
read-file-range
SWE uses --view_range; Pi uses read(path, offset, limit). 5,974 5,997 524546
Whole-file read, but truncated
read-file-full(truncated)
File was too large to show in full; the read/view output was abbreviated. 198 245 119109
Read a config / manifest file
read-config-file
Filename match such as package.json, pytest.ini, setup.cfg, setup.py, go.mod, Makefile, or config.json. 26 206 1928
Read a test file as a distinct class
SWEread-test-file
SWE-only filename match for test_*, *_test.*, or conftest*. 644 635
Read via shell command
read-via-bash
Shell read commands like cat, head, tail, sed -n, nl, or awk. 2,345 2,974 7614
Read via inline snippet
read-via-inline-script
Inline code does a pure read-and-print, e.g. .read(), open(...,'r'), or readFileSync, with no write. 76 373 010
Browse a directory through the editor interface
SWEview-directory
SWE-only: str_replace_editor view where the path has no extension, or the observation says “files and directories”. 1,137 2,133
search (understand)
List a directory from the shell
list-directory
Directory / cwd inspection via ls, tree, or pwd. 843 708 8511
Search for a keyword / pattern
search-keyword
Pattern search via grep, rg, or ag. 7,002 6,499 699584
Find files by name
search-files-by-name
find ... -name without a grep / xargs content-search pipeline. 1,792 49 4734
Find files by content via find/grep pipelines
search-files-by-content
find ... -exec grep or find ... | xargs grep. 3,254 10 252
Inspect file metadata
inspect-file-metadata
Metadata checks like wc, file, or stat. 246 22 458
Check runtime / tool version
SWEcheck-version
Tiny version probes such as --version, -V, sys.version, or node -v. 6 2
Search the web
Piweb-search
Pi-only external lookup via the brave-search skill / search.js-style web search call. 25
reproduce
Create a repro artifact
create-repro-script
Create a file whose name matches repro*, reproduce*, or demo*. 157 463 07
Run a repro artifact
run-repro-script
Run a named python/node/sh/bash/go run script whose basename matches repro* / reproduce* / demo*. 375 1,067 07
Run a residual inline snippet
run-inline-snippet
Inline python -c, python - <<, or node -e that did not match a more specific inline read/edit/verify pattern. 472 193 218
edit
Edit existing source
edit-source
Source-file edit on a path that does not match test / repro / verify / check heuristics. 5,217 4,983 528337
Insert into source
SWEinsert-source
SWE-only str_replace_editor insert action. 12 803
Apply a patch blob
SWEapply-patch
SWE-only applypatch command path, mostly GPT-specific. 0 94
Create a new non-test file
create-file
Create/write a file that does not match repro / test / verify / documentation filename heuristics. 595 326 3852
Edit via inline script
edit-via-inline-script
Inline script reads a file, changes text via things like .replace() or re.sub(), then writes it back. 5 245 03
Create a file via inline script
create-file-via-inline-script
Inline script writes a file with no prior read. 21 41 319
verify
Run a broad test suite
run-test-suite
Broad runner commands such as pytest, go test, npm test, jest, mocha, yarn test, or python -m unittest. 5,942 585 452
Run targeted tests
run-test-specific
Test command narrowed by ::, -k, or an explicit file/filter target. 1,105 370 2351
Create a regression / test file
create-test-script
Create a file such as test_*, *test.py, *test.js, or *test.go. 2,633 18 913
Run a named verify / check script
SWErun-verify-script
Run a named script whose basename contains test_, verify, check, validate, or edge_case. 3,420 113
Create a named verify / check script
SWEcreate-verify-script
Create a file matching verify*, check*, or validate*. 321 47
Edit a test or repro file
edit-test-or-repro
Edit a file whose path/name matches test / repro / verify / check heuristics. 712 243 4760
Run a custom named script
run-custom-script
Run a named python/node/sh/bash/go script that does not match repro/test/verify patterns. 476 111 3128
Syntax-only check
SWEsyntax-check
Syntax / compile probes such as py_compile, compileall, or node -c. 183 18
Build / compile / typecheck
compile-build
Build-ish commands like go build, go vet, make, tsc, npx tsc, npm run build, yarn build; Pi also captures repo-native checks like npm run check, biome, eslint, or tsgo. 1,088 41 22486
Inline verify / assertion probe
run-inline-verify
Inline tsx/node/python snippet that imports project code or runs ad hoc assertions / prints as a behavior check. 999 696 12129
git (cleanup)
Review the current diff
SWEgit-diff
Pigit-diff-review
git diff review of the current changes. 538 23 4531
Inspect repo state
SWEgit-status-log
Pigit-repo-inspect
Local repo inspection such as git status, git show, git log; Pi also folds in things like git branch and git worktree. 652 23 281274
Change local repo state
SWEgit-stash
Pigit-local-state-change
Mutating local git state. SWE only breaks out git stash; Pi groups a broader set like git add, commit, stash, reset, checkout, and switch. 28 0 20180
Read or update GitHub task context
Pigit-github-context
Pi-only GitHub workflow via gh issue, gh pr, or gh api. 382389
Sync or integrate upstream changes
Pigit-sync-integrate
Pi-only integration work like git fetch, pull, rebase, merge, or cherry-pick. 6326
Publish finished work
Pigit-publish
Pi-only publish step: git push. 7133
housekeeping (cleanup)
General file cleanup
file-cleanup
Filesystem cleanup / movement such as rm, mv, cp, or chmod. 1,554 17 2517
Create documentation / summary artifact
create-documentation
Create documentation-like files whose names match *summary*, *readme*, *changes*, or *implementation*. In Pi this comes from the write tool using those doc-like filenames. 661 2 01
Start a service or wait process
start-service
Environment setup commands such as redis-server, redis-cli, mongod, or sleep. 26 4 13
Install dependencies
install-deps
Package install / env setup such as pip install, pip list, npm install, go get, or apt. 20 0 180
Check whether a tool exists
SWEcheck-tool-exists
Capability probe via which or type. 16 2
Manage a tmux / background session
Pitmux-session
Pi-only tmux usage for long-running / detached processes. 111
failed (ignored)
Search command failed at the shell level
search-keyword(failed)
A grep/find-style search whose observation shows a shell-level error. 46 2,748 12
Read attempt failed
SWEread-via-bash(failed)
Piread-file-failed
The attempted read failed: SWE on shell readers like cat/head/sed/tail/ls; Pi on the read tool itself (missing path, permission, etc.). 23 994 1210
Script run failed
run-script(failed)
python/node execution whose observation shows a shell-level error. 47 759 08
Test-runner command failed
SWErun-test-suite(failed)
A test runner command whose observation shows a shell-level error. 6 155
Source edit failed
Piedit-source(failed)
Pi-only edit tool failure on a source file. 517
Test / repro edit failed
Piedit-test-or-repro(failed)
Pi-only edit tool failure on a test / repro / verification-support file. 28
Generic bash command failed
bash-command(failed)
Residual failed shell command after the more specific failure buckets are ruled out. 32 1,217 1114
other (ignored)
Echo / printf
echo
Output-only commands like echo or printf. 140 69 810
Other unclassified bash
bash-other
Final fallback for bash commands that matched no more specific rule. 928 631 7966
Fetch a URL / call an HTTP endpoint
Pifetch-url
Pi-only curl / HTTP request step. 211
Submit the patch
SWEsubmit
SWE-only terminal action whose first line starts with submit. 656 537
Empty action / context-window exit
SWEempty
SWE-only blank action string, typically rate-limit or context-window exit. 770 854
Undo an editor change
SWEundo-edit
SWE-only str_replace_editor undo_edit action. 4 39

The label describes what the command is, derived deterministically from tool calls, filenames, command heads, and simple output heuristics. No positional context (before/after first edit) and no model-side intent inference is used.

(failed) variants classify by intended action, not outcome quality. They require a shell-level or tool-level failure on that side’s classifier.

run-inline-snippet remains a residual bucket. Inline snippets (python -c, python - <<, node -e) are first routed to more specific inline read/edit/verify buckets when their code shape makes that obvious.

Canonical benchmark source: scripts/classify_intent.py and docs/intent-classification-rules.md. The Pi side follows the deterministic labeler used in the Pi reference tables.

5. Mario’s analysis prompt

Every Pi session in the analyzed set kicks off with this prompt. The <issue-number> is filled in per session; Mario steers from there.

Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/<issue-number> you will have to pull down the image and read it as well to understand.

For each issue:

1. Read the issue in full, including all comments and linked issues/PRs.

2. **For bugs**:
   - Ignore any root cause analysis in the issue (likely wrong)
   - Read all related code files in full (no truncation)
   - Trace the code path and identify the actual root cause
   - Propose a fix

3. **For feature requests**:
   - Read all related code files in full (no truncation)
   - Propose the most concise implementation approach
   - List affected files and changes needed

Do NOT implement unless explicitly asked. Analyze and propose only.

6. SWE-agent issue-resolution prompt

For comparison, the SWE-agent setup used in SWE-Bench Pro includes the following issue-resolution scaffold in its default prompt (source):

Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
3. Edit the source code of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well