Trajectory shapes

Why are we managing our coding agents based on vibes instead of their actual work habits?

Leaderboards tell us how many tasks a model solved, but not how. The trajectories the models leave behind tell the story of how, yet they aren't studied nearly as much. The tools a model prefers and the overall "shape" of its workflow are both visible through an empirical lens.

I analyzed the latest SWE-Bench Pro trajectories I could find: runs from October 2025 for Sonnet 4.5 and GPT-5, 730 task trajectories per model. I deterministically classified each step into activities like understand, edit, verify, and cleanup, using only tool calls, literal command/filename matches, and regex heuristics, and then computed each activity's share over time. The resulting "trajectory shape" chart is very interesting!
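The classification itself can be sketched in a few lines. This is a simplified, hypothetical version of the rules (the real ruleset, per the appendix, is larger and lives in scripts/classify_intent.py); the point is that every label comes from string matching, never from model inference:

```python
import re

# Simplified, illustrative version of the deterministic step classifier.
# The real ruleset is larger; the key property is that every label comes
# from tool names, literal command matches, and regexes -- no model inference.
RULES = [
    ("verify",     re.compile(r"\b(pytest|go test|npm test|jest|mocha)\b")),
    ("reproduce",  re.compile(r"\b(python|node|sh|bash)\s+\S*(repro|demo)\w*\.")),
    ("understand", re.compile(r"\b(grep|rg|ag|find|cat|head|tail|ls|tree)\b")),
    ("cleanup",    re.compile(r"\b(rm|mv|cp|chmod|git)\b")),
]

def classify_step(tool: str, command: str) -> str:
    """Map one trajectory step to a coarse activity; first match wins."""
    if tool == "str_replace_editor":  # SWE-agent's file editor tool
        return "understand" if command.startswith("view") else "edit"
    for label, pattern in RULES:
        if pattern.search(command):
            return label
    return "other"
```

First match wins, so the more specific patterns (test runners, repro scripts) are checked before the generic readers.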

Here are the main work habits I see in this chart:

  1. Sonnet starts editing early (at 35%), GPT starts editing much later (at 50%)
  2. Sonnet is done with the implementation early (at 62%)
  3. Sonnet spends a LOT of time verifying after the last source edit
  4. Sonnet has to cleanup temporary files, GPT doesn’t have to
  5. GPT-5 reads a LOT to front-load context, before it starts editing (at 50%)
  6. GPT-5 barely does any verification

If I were to condense the above to “vibes”, I might say “Claude starts editing early and figures it out in the loop. GPT reads first, then goes for the one-shot.” That is also close to what people on my timeline were saying when these runs were current. For example, Eric Provencher wrote:

Now I can say the same thing with data, and with a better sense of the nuances. And once we see these habits clearly, we can stop managing agents on vibes and start steering them on evidence.

With newer models and a maintainer

I wanted to replicate this chart for Opus 4.6 and GPT-5.4, but the SWE-Bench Pro trajectories (or any other SWE-agent trajectories) for them are unavailable. I then remembered that Mario Zechner has been publishing his Pi trajectories on Hugging Face, and downloaded them.

Luckily, he works through GitHub issues methodically. Each session is one issue and one model. It kicks off with the same analysis prompt, waits for his go-ahead, involves a varied amount of steering to address the issue, and then wraps up explicitly to ship a fix, close the issue, leave a triage comment, and so on. Each one starts with git work (read the PR, read the comments) and ends with git work (push the change, close the issue, or do the final triage), with the source-editing loop in between.

While the issue fixes make for trajectories similar to SWE-Bench Pro's, I'm still changing the models and the harness, and adding a maintainer, so this is not exactly an apples-to-apples comparison. To keep the trajectory-shape comparison closer to the benchmark, the Pi panels below use only the strict single-model sessions whose final title starts with Issue:. That said, the observations from the previous section still hold. The bars below each Pi panel give a rough sense of where Mario is steering more during the run; taller bars mean more intervention around that part of the trajectory. The labels mark Mario's go-ahead and wrap-up moments.

I think it’s likely that the explicit analysis prompt is pushing the understand phase to be longer. The first edit is pushed from 35% to 47%. After that, there’s a fair amount of steering, QA, critique, and validation, with verify + edit cycles, after which it wraps up. The push to verify in the SWE-agent prompt likely causes the benchmark trajectory to go on longer, and is fairly redundant.

As with Claude, the analysis prompt pushes GPT's first edit even later, from 50% to 63%. However, the edit duration has shrunk significantly, from 40% to 23%. I suspect this is a combination of the model upgrade and Mario's steering towards "minimal" and "concise" solutions.

Why compare trajectory shapes?

Trajectories are traces of learned workflows. RL shapes both micro preferences, like which tools a model uses, and macro preferences, like the overall workflow it follows from problem statement to verified solution. To study those macro patterns, one trajectory is not enough; you need a corpus. And to compare a corpus, you need normalization. So, each trajectory is mapped onto the same 0–100% start-to-finish frame so that the aggregate shape reflects the learned workflow. For instance, see the difference in trajectory lengths across the four runs:
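Concretely, the normalization can be sketched as follows: each trajectory is resampled onto a fixed number of position bins, and activity counts are converted to per-bin shares. This is my own minimal reconstruction, not the analysis repo's code:

```python
import numpy as np

def trajectory_shape(trajectories, activities, bins=50):
    """Map every trajectory onto the same 0-100% frame and average.

    trajectories: list of per-step label sequences, e.g. ["understand", ...]
    Returns an (activities x bins) array: the share of each activity at
    each normalized position across the corpus.
    """
    shape = np.zeros((len(activities), bins))
    index = {a: i for i, a in enumerate(activities)}
    for labels in trajectories:
        n = len(labels)
        for step, label in enumerate(labels):
            if label not in index:
                continue  # ignored buckets such as "other" or "failed"
            b = min(int(step / n * bins), bins - 1)  # position on 0-100%
            shape[index[label], b] += 1
    totals = shape.sum(axis=0)
    totals[totals == 0] = 1  # avoid dividing empty bins by zero
    return shape / totals    # per-bin activity shares
```

Because every trajectory is stretched to the same frame, a 30-step Pi session and a 90-step benchmark run contribute equally to the aggregate shape.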

[Figure: trajectory-length distributions on a fixed 0–100 step scale; mean steps as the dot, p25–p75 as the bar.]

| Run | Corpus | Mean steps | p25–p75 |
| --- | --- | --- | --- |
| SWE agent Sonnet 4.5 | 730 tasks · 43.7% resolved | 77.5 | 63–89 |
| SWE agent GPT-5 | 730 tasks · 36.3% resolved | 59.5 | 34–76 |
| Mario + Pi · Opus 4.5/4.6 | 49 sessions · 98.0% completed | 30.9 | 16–35 |
| Mario + Pi · GPT-5.4 | 77 sessions · 96.1% completed | 40.6 | 27–52 |

Further, tool calls are normalized by structural intent rather than literal tool identity, so that equivalent actions can be compared across harnesses. In this study, I normalized tool calls across SWE-agent and Pi; the taxonomy table in the appendix documents that mapping.
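As a small, illustrative excerpt of that normalization (the raw labels follow the appendix taxonomy, but this mapping is trimmed for illustration and is not the full table):

```python
# Trimmed excerpt of the cross-harness normalization: raw labels from each
# harness map onto one shared intent class. Raw labels follow the appendix
# taxonomy; this dict is a small illustration, not the full mapping.
INTENT_MAP = {
    "read-file-range":    "read",    # exists on both SWE-agent and Pi
    "view-directory":     "read",    # SWE-agent only
    "search-keyword":     "search",
    "run-test-suite":     "verify",
    "edit-source":        "edit",
    "git-github-context": "git",     # Pi only
    "git-publish":        "git",     # Pi only
}

def to_intent(raw_label: str) -> str:
    """Collapse a harness-specific raw label into the shared intent class."""
    return INTENT_MAP.get(raw_label, "other")
```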

So, if you recognize these patterns in your agents and they don't fit the task at hand, you can steer them accordingly.

| Observed pattern | Possible steering move |
| --- | --- |
| Claude starts editing early | Ask for explicit analysis first |
| Claude finishes implementation early | Ask it to verify before concluding |
| Claude over-verifies after the last edit | Specify exact success criteria |
| Claude leaves cleanup behind | Tell it not to create temporary docs or scripts |
| GPT front-loads context gathering | Usually fine, or narrow what context it should gather |
| GPT is not verifying enough | Ask for red-green TDD |

Appendix

Analysis code and data: github.com/nilenso/swe-bench-pro-cost-token-time-analysis.

  • SWE-Bench Pro

  • Pi transcripts

1. GPT-5’s tool failures

I have ignored these failures for the purpose of the charts, but they are significant enough to mention here. GPT-5's tool calls fail in about 20% of the steps throughout the SWE-agent trajectory. Most notably, it expects the apply_patch tool to be present, and keeps calling that. apply_patch is the main editing tool within Codex, and GPT-5’s own system prompts inside Codex had counterweights for this behavior. GPT-5 manages to work through the failures and still achieve a decent resolution rate on the task. However, we do not know what effect those failing tool calls have on the outcome.
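Given the (failed) label variants from the appendix taxonomy, the per-trajectory failure share is a one-liner; a minimal sketch (the function name is mine, for illustration):

```python
def failure_rate(labels):
    """Share of steps whose label carries the deterministic (failed) suffix.

    `labels` is one trajectory's step labels, e.g.
    ["edit-source", "bash-command(failed)", ...].
    """
    if not labels:
        return 0.0
    failed = sum(1 for label in labels if label.endswith("(failed)"))
    return failed / len(labels)
```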

Example failures and data: failure modes in the reference appendix.

2. High-level action frequencies

On SWE-Bench Pro, even if we ignore when actions happen and just count what the models spend their steps on, the same signature shows up. GPT-5 spends much more of its trajectory reading, while Sonnet 4.5 spends much more of it verifying. That makes the difference in the trajectory-shape chart feel less like a visual artifact and more like a stable workflow habit.

  • To “understand”, GPT reads more, while both models search about the same amount
  • Verification leans the other way. Sonnet 4.5 runs ~3 verify steps per edit, GPT-5 runs ~1 verify step per 3 edits
  • Git + housekeeping is effectively Sonnet-only (6.2% vs 0.2%). GPT-5’s trajectory has no bookkeeping tail at all.
  • Pre-edit information gathering (read + search + reproduce) is 61% of GPT-5’s steps vs 49% of Sonnet’s. GPT-5 front-loads context; Sonnet spreads work more evenly across the trajectory.
  • Around 40% of GPT-5 trajectories don’t do ANY verification at all
  • Sonnet 4.5 re-runs the same passing tests many times, without any changes in source
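These position-agnostic counts are straightforward to compute once steps carry the high-level labels; a sketch of the two aggregates used above (function names are mine, and the input is assumed to be already-classified trajectories):

```python
from collections import Counter

def action_mix(trajectories):
    """Overall share of each high-level action, ignoring position.

    trajectories: list of per-step label sequences, already collapsed
    to high-level classes like "read", "edit", "verify".
    """
    counts = Counter(label for labels in trajectories for label in labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def verify_per_edit(trajectories):
    """Verify steps per edit step across the whole corpus."""
    counts = Counter(label for labels in trajectories for label in labels)
    return counts["verify"] / max(counts["edit"], 1)
```

On these aggregates, Sonnet 4.5's ~3 verify steps per edit versus GPT-5's ~1 per 3 edits falls straight out of the label counts.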

3. Models have tool preferences

On SWE-Bench Pro, the tools provided in the RLVR environment likely shape the tool preferences of models. Here are the various tools used per category, compared across Sonnet 4.5 / GPT-5. I wish I had SWE-Bench Pro trajectories for Opus 4.6 and GPT-5.4. With the same tasks and harness, I could compare tool-call preferences, costs, latencies, and more.

4. Intent Classification Taxonomy

The benchmark charts above and the Pi / Mario charts later in the post use the same high-level taxonomy, but not always the same raw labels. So this appendix merges both into one deterministic reference table. Raw labels stay in monospace under each intent class so chart clicks can still land on the exact label.

Unlike the trajectory-shape panels, which use the stricter issue-only Pi subset, the Pi counts in this appendix use the broader cut of all strict single-model sessions: 171 analyzed Opus 4.5/4.6 trajectories and 133 analyzed GPT-5.4 trajectories. The benchmark side remains the full SWE-Bench Pro pair, with 730 trajectories per model.

How to read this table:

  • Both sides are deterministic. No model inference is used in either taxonomy.
  • Rows are merged by intent class. If SWE-agent and Pi use the same raw label, it appears once. If they use different labels for the same kind of action, both labels appear under the same row with SWE / Pi badges.
  • Blank counts mean no separate label on that side. For example, Pi has no distinct insert-source bucket, while SWE-agent has none of Pi's git workflow labels like git-github-context and git-publish.
  • Counts are per dataset. The left pair is the SWE-agent benchmark set (730 Sonnet 4.5 trajectories, 730 GPT-5 trajectories). The right pair is the all-single-model Pi set (171 Opus 4.5/4.6 trajectories, 133 GPT-5.4 trajectories).
  • The trajectory-shape panels are narrower. Earlier Pi charts stay on the issue-only subset because that keeps the comparison closer to the benchmark’s issue-fix workflow.
Each row below gives the intent description, the raw label(s) in monospace, the matching rule, and then the counts in this order: SWE-agent Sonnet 4.5 (730 trajs) · SWE-agent GPT-5 (730 trajs) · Pi Opus 4.5/4.6 (171 trajs) · Pi GPT-5.4 (133 trajs).
read (understand)
View an entire file
read-file-full
SWE: str_replace_editor view <file> after test/config/range/truncation cases are ruled out. Pi: whole-file read(path). 3,125 5,020 455803
View a specific range
read-file-range
SWE uses --view_range; Pi uses read(path, offset, limit). 5,974 5,997 524546
Whole-file read, but truncated
read-file-full(truncated)
File was too large to show in full; the read/view output was abbreviated. 198 245 119109
Read a config / manifest file
read-config-file
Filename match such as package.json, pytest.ini, setup.cfg, setup.py, go.mod, Makefile, or config.json. 26 206 1928
Read a test file as a distinct class
SWEread-test-file
SWE-only filename match for test_*, *_test.*, or conftest*. 644 635
Read via shell command
read-via-bash
Shell read commands like cat, head, tail, sed -n, nl, or awk. 2,345 2,974 7614
Read via inline snippet
read-via-inline-script
Inline code does a pure read-and-print, e.g. .read(), open(...,'r'), or readFileSync, with no write. 76 373 010
Browse a directory through the editor interface
SWEview-directory
SWE-only: str_replace_editor view where the path has no extension, or the observation says “files and directories”. 1,137 2,133
search (understand)
List a directory from the shell
list-directory
Directory / cwd inspection via ls, tree, or pwd. 843 708 8511
Search for a keyword / pattern
search-keyword
Pattern search via grep, rg, or ag. 7,002 6,499 699584
Find files by name
search-files-by-name
find ... -name without a grep / xargs content-search pipeline. 1,792 49 4734
Find files by content via find/grep pipelines
search-files-by-content
find ... -exec grep or find ... | xargs grep. 3,254 10 252
Inspect file metadata
inspect-file-metadata
Metadata checks like wc, file, or stat. 246 22 458
Check runtime / tool version
SWEcheck-version
Tiny version probes such as --version, -V, sys.version, or node -v. 6 2
Search the web
Piweb-search
Pi-only external lookup via the brave-search skill / search.js-style web search call. 25
reproduce
Create a repro artifact
create-repro-script
Create a file whose name matches repro*, reproduce*, or demo*. 157 463 07
Run a repro artifact
run-repro-script
Run a named python/node/sh/bash/go run script whose basename matches repro* / reproduce* / demo*. 375 1,067 07
Run a residual inline snippet
run-inline-snippet
Inline python -c, python - <<, or node -e that did not match a more specific inline read/edit/verify pattern. 472 193 218
edit
Edit existing source
edit-source
Source-file edit on a path that does not match test / repro / verify / check heuristics. 5,217 4,983 528337
Insert into source
SWEinsert-source
SWE-only str_replace_editor insert action. 12 803
Apply a patch blob
SWEapply-patch
SWE-only applypatch command path, mostly GPT-specific. 0 94
Create a new non-test file
create-file
Create/write a file that does not match repro / test / verify / documentation filename heuristics. 595 326 3852
Edit via inline script
edit-via-inline-script
Inline script reads a file, changes text via things like .replace() or re.sub(), then writes it back. 5 245 03
Create a file via inline script
create-file-via-inline-script
Inline script writes a file with no prior read. 21 41 319
verify
Run a broad test suite
run-test-suite
Broad runner commands such as pytest, go test, npm test, jest, mocha, yarn test, or python -m unittest. 5,942 585 452
Run targeted tests
run-test-specific
Test command narrowed by ::, -k, or an explicit file/filter target. 1,105 370 2351
Create a regression / test file
create-test-script
Create a file such as test_*, *test.py, *test.js, or *test.go. 2,633 18 913
Run a named verify / check script
SWErun-verify-script
Run a named script whose basename contains test_, verify, check, validate, or edge_case. 3,420 113
Create a named verify / check script
SWEcreate-verify-script
Create a file matching verify*, check*, or validate*. 321 47
Edit a test or repro file
edit-test-or-repro
Edit a file whose path/name matches test / repro / verify / check heuristics. 712 243 4760
Run a custom named script
run-custom-script
Run a named python/node/sh/bash/go script that does not match repro/test/verify patterns. 476 111 3128
Syntax-only check
SWEsyntax-check
Syntax / compile probes such as py_compile, compileall, or node -c. 183 18
Build / compile / typecheck
compile-build
Build-ish commands like go build, go vet, make, tsc, npx tsc, npm run build, yarn build; Pi also captures repo-native checks like npm run check, biome, eslint, or tsgo. 1,088 41 22486
Inline verify / assertion probe
run-inline-verify
Inline tsx/node/python snippet that imports project code or runs ad hoc assertions / prints as a behavior check. 999 696 12129
git (cleanup)
Review the current diff
SWEgit-diff
Pigit-diff-review
git diff review of the current changes. 538 23 4531
Inspect repo state
SWEgit-status-log
Pigit-repo-inspect
Local repo inspection such as git status, git show, git log; Pi also folds in things like git branch and git worktree. 652 23 281274
Change local repo state
SWEgit-stash
Pigit-local-state-change
Mutating local git state. SWE only breaks out git stash; Pi groups a broader set like git add, commit, stash, reset, checkout, and switch. 28 0 20180
Read or update GitHub task context
Pigit-github-context
Pi-only GitHub workflow via gh issue, gh pr, or gh api. 382389
Sync or integrate upstream changes
Pigit-sync-integrate
Pi-only integration work like git fetch, pull, rebase, merge, or cherry-pick. 6326
Publish finished work
Pigit-publish
Pi-only publish step: git push. 7133
housekeeping (cleanup)
General file cleanup
file-cleanup
Filesystem cleanup / movement such as rm, mv, cp, or chmod. 1,554 17 2517
Create documentation / summary artifact
create-documentation
Create documentation-like files whose names match *summary*, *readme*, *changes*, or *implementation*. In Pi this comes from the write tool using those doc-like filenames. 661 2 01
Start a service or wait process
start-service
Environment setup commands such as redis-server, redis-cli, mongod, or sleep. 26 4 13
Install dependencies
install-deps
Package install / env setup such as pip install, pip list, npm install, go get, or apt. 20 0 180
Check whether a tool exists
SWEcheck-tool-exists
Capability probe via which or type. 16 2
Manage a tmux / background session
Pitmux-session
Pi-only tmux usage for long-running / detached processes. 111
failed (ignored)
Search command failed at the shell level
search-keyword(failed)
A grep/find-style search whose observation shows a shell-level error. 46 2,748 12
Read attempt failed
SWEread-via-bash(failed)
Piread-file-failed
The attempted read failed: SWE on shell readers like cat/head/sed/tail/ls; Pi on the read tool itself (missing path, permission, etc.). 23 994 1210
Script run failed
run-script(failed)
python/node execution whose observation shows a shell-level error. 47 759 08
Test-runner command failed
SWErun-test-suite(failed)
A test runner command whose observation shows a shell-level error. 6 155
Source edit failed
Piedit-source(failed)
Pi-only edit tool failure on a source file. 517
Test / repro edit failed
Piedit-test-or-repro(failed)
Pi-only edit tool failure on a test / repro / verification-support file. 28
Generic bash command failed
bash-command(failed)
Residual failed shell command after the more specific failure buckets are ruled out. 32 1,217 1114
other (ignored)
Echo / printf
echo
Output-only commands like echo or printf. 140 69 810
Other unclassified bash
bash-other
Final fallback for bash commands that matched no more specific rule. 928 631 7966
Fetch a URL / call an HTTP endpoint
Pifetch-url
Pi-only curl / HTTP request step. 211
Submit the patch
SWEsubmit
SWE-only terminal action whose first line starts with submit. 656 537
Empty action / context-window exit
SWEempty
SWE-only blank action string, typically rate-limit or context-window exit. 770 854
Undo an editor change
SWEundo-edit
SWE-only str_replace_editor undo_edit action. 4 39

The label describes what the command is, derived deterministically from tool calls, filenames, command heads, and simple output heuristics. No positional context (before/after first edit) and no model-side intent inference is used.

(failed) variants classify by intended action, not outcome quality. They require a shell-level or tool-level failure on that side’s classifier.

run-inline-snippet remains a residual bucket. Inline snippets (python -c, python - <<, node -e) are first routed to more specific inline read/edit/verify buckets when their code shape makes that obvious.

Canonical benchmark source: scripts/classify_intent.py and docs/intent-classification-rules.md. The Pi side follows the deterministic labeler used in the Pi reference tables.

5. Mario’s analysis prompt

Every Pi session in the analyzed set kicks off with this prompt. The <issue-number> is filled in per session; Mario steers from there.

Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/<issue-number> you will have to pull down the image and read it as well to understand.

For each issue:

1. Read the issue in full, including all comments and linked issues/PRs.

2. **For bugs**:
   - Ignore any root cause analysis in the issue (likely wrong)
   - Read all related code files in full (no truncation)
   - Trace the code path and identify the actual root cause
   - Propose a fix

3. **For feature requests**:
   - Read all related code files in full (no truncation)
   - Propose the most concise implementation approach
   - List affected files and changes needed

Do NOT implement unless explicitly asked. Analyze and propose only.

6. SWE-agent issue-resolution prompt

For comparison, the SWE-agent setup used in SWE-Bench Pro includes the following issue-resolution scaffold in its default prompt (source):

Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to find and read code relevant to the <pr_description>
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the bash tool, to confirm the error
3. Edit the source code of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well