Checking my model vibes against SWE-Bench Pro

If you choose coding models for production work, resolve rate is only part of the decision; cost, tokens, and runtime decide whether a model is workable day to day. My intuition was that GPT models felt slow and token-hungry while Claude models felt faster on similar tasks, so Claude should be cheaper. Similar claims appear in public write-ups, for example the OpenHands Index, where Opus 4.5 is noted as finishing tasks quickly despite its size.

Some benchmark operators already publish these operational metrics: SWE-rebench and the OpenHands Index include cost and runtime alongside accuracy. I focused on SWE-Bench Pro because it is the benchmark I understand best; our earlier write-up covers why Pro is the best available alternative, and OpenAI has made the same shift. SWE-Bench Pro publishes trajectories, but it does not publish a consolidated cost-token-time report.

So I built one from the public data. The key methodological choice is pairing: I compare models only on instances where both submitted, then compute Sonnet 4.5 / GPT-5 ratios per task and summarize those ratios. This keeps the comparison grounded, because every ratio comes from the same task under the same harness setup. I analyzed the October 2025 paired runs in SWE-Bench Pro, and I would like to repeat the analysis on current leaderboard pairs such as GPT-4 vs Opus 4.6 once comparable paired data is available.

My intuitions were wrong.

High-level results

[Chart: resolve rates and median cost, token, and time ratios across paired tasks]

Across 616 paired tasks, resolve rates are close, but the operating profiles are not. The results below come from the October 2025 GPT-5 vs Sonnet 4.5 paired trajectories.

  • Cost: median ratio 6.33x (Sonnet 4.5 / GPT-5)
  • Total tokens: median ratio 1.15x
  • Tool time: median ratio 1.35x
  • Resolve rate: GPT-5 42.5%, Sonnet 4.5 44.5%

Accuracy is similar here. The larger differences show up in cost, tokens, and runtime.

Cost

[Chart: cost ratio (Sonnet 4.5 / GPT-5) per paired task]

Cost shows the largest gap in this dataset. Sonnet 4.5 is more expensive on most tasks, with many tasks clustered between 4x and 10x.

Caveats:

  • Costs come from benchmark run logs, not a rerun done today.
  • These are litellm proxy costs from the benchmark environment, not public list pricing.
  • Discounts, caching, and contract terms can change absolute dollars.
  • This is one benchmark setup, not every real-world coding workflow.

Tokens

[Chart: total token ratio (Sonnet 4.5 / GPT-5) per paired task]

Total token usage is mixed and much less extreme than cost. Sonnet 4.5 uses more total tokens in 58.0% of tasks, with a broad spread around 1x. The largest difference is in output tokens, not input: in the deeper breakdown, Sonnet 4.5 emits far more output per task, and much of it is tool-call content and temporary file creation that gets discarded and never appears in the final submitted patch.

Caveats:

  • This chart uses total tokens from run stats: `tokens_sent + tokens_received`.
  • SWE-Agent’s built-in `tokens_received` counts only `message.content` and misses tool-call arguments, so in the deeper token breakdown I re-count output tokens with tiktoken (`cl100k_base`) across `message.content`, `message.thought`, and `tool_calls[].function.arguments`.
  • Hidden reasoning tokens are not visible in trajectories.
  • Token behavior depends on harness strategy, turn limits, and repo shape.
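The re-count described in the caveats can be sketched as below. The message layout (`content`, `thought`, `tool_calls[].function.arguments`) follows the SWE-Agent trajectory format referenced above; the tokenizer is pluggable, so a naive whitespace splitter stands in here for illustration, with `tiktoken.get_encoding("cl100k_base").encode` passed in for the real count.

```python
def count_output_tokens(messages, encode=lambda text: text.split()):
    """Re-count assistant output tokens across message content, thought,
    and tool-call arguments (which a content-only counter would miss).

    `encode` is any callable returning a token sequence; swap in
    tiktoken.get_encoding("cl100k_base").encode for a real count.
    """
    total = 0
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        parts = [msg.get("content") or "", msg.get("thought") or ""]
        for call in msg.get("tool_calls") or []:
            parts.append(call.get("function", {}).get("arguments") or "")
        total += sum(len(encode(p)) for p in parts if p)
    return total
```

With the whitespace stand-in, an assistant turn containing a short `content`, a `thought`, and one tool call is counted across all three fields, while user turns are ignored.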

Time

[Chart: tool-time ratio (Sonnet 4.5 / GPT-5) per paired task]

Time shows a smaller gap than cost, but is noisier across tasks. Sonnet 4.5 is slower in 58.9% of tasks, with a median ratio of 1.35x.

Caveats:

  • This is tool execution time from trajectories.
  • It is not full wall-clock latency.
  • Model inference latency is not fully captured in this metric.
  • Test-suite length and repo workflows add variance.

Conclusion

This SWE-Bench Pro slice is a signal, not a universal ranking. I do not expect October 2025 GPT-5 and Sonnet 4.5 behavior to generalize cleanly to newer model releases, but it is enough to challenge my default, so I will be trying GPT models more often now. The broader takeaway: measure on your own use case with cost, token, and time data, and do not let timeline takes or out-of-context benchmark reporting drive the decision.

Annex: methodology and additional data

This annex lists implementation details, additional ratio charts, and full comparative tables for the same paired set used above.

Analysis code and data: github.com/nilenso/swe-bench-pro-cost-token-time-analysis.

A.1 Data and pairing protocol

  • Source: public SWE-Bench Pro trajectories from Scale AI.
  • Pairing unit: same `instance_id`, requiring both GPT-5 and Sonnet 4.5 to have `submitted=true`.
  • Paired sample size: 616 instances.
  • Unit of comparison: per-instance ratio, Sonnet 4.5 / GPT-5.
  • Aggregation: medians and IQR over per-instance ratios, not pooled cross-task means.
  • Scope: October 2025 model runs in one harness configuration.
  • Environment note: costs are benchmark-environment proxy costs, not current list pricing.
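The pairing and aggregation steps above can be sketched roughly as follows. The field names (`submitted`, a per-run metric such as `cost`) are assumptions standing in for the run-stats schema; the quartile computation is a simple positional one for illustration.

```python
from statistics import median

def paired_ratios(gpt5_runs, sonnet_runs, field):
    """Per-instance Sonnet 4.5 / GPT-5 ratios over the paired set.

    Each argument maps instance_id -> per-run stats. An instance is
    included only when both models submitted and both values are > 0.
    """
    ratios = {}
    for iid in gpt5_runs.keys() & sonnet_runs.keys():
        g, s = gpt5_runs[iid], sonnet_runs[iid]
        if g.get("submitted") and s.get("submitted") and g[field] > 0 and s[field] > 0:
            ratios[iid] = s[field] / g[field]
    return ratios

def summarize(ratios):
    """Median and rough positional quartiles over per-instance ratios
    (not pooled cross-task means)."""
    vals = sorted(ratios.values())
    n = len(vals)
    return {"n": n, "median": median(vals),
            "q1": vals[n // 4], "q3": vals[(3 * n) // 4]}
```

Summarizing per-instance ratios rather than pooling raw values keeps each comparison within a single task, so repo size and test-suite length cancel out of the ratio.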

A.1.1 Inclusion and denominator audit

All annex charts and tables use the same paired set unless stated otherwise.

| Metric | Included (n) | Excluded | Inclusion rule |
| --- | --- | --- | --- |
| Cost ratio | 616 | 0 | both costs > 0 |
| Total token ratio (`tokens_sent + tokens_received`) | 616 | 0 | both totals > 0 |
| Tool-time ratio | 616 | 0 | both tool times > 0 |
| Input token ratio | 616 | 0 | both `tokens_sent` > 0 |
| Output token ratio (`output_tokens.total`) | 616 | 0 | both totals > 0 |
| Step-count ratio | 616 | 0 | both step counts > 0 |

Note: denominator stays constant across ratio charts, so cross-metric comparisons are on the same paired sample.

A.1.2 Paired outcome decomposition

  • Both solved: 206
  • GPT-5 only solved: 56
  • Sonnet 4.5 only solved: 68
  • Neither solved: 286

Note: this decomposition separates overlap from directional wins on the same instances.

A.1.3 Uncertainty checks

  • Resolve rate, Wilson 95% CI:
    • GPT-5: 42.5% (262/616), CI 38.7%-46.5%
    • Sonnet 4.5: 44.5% (274/616), CI 40.6%-48.4%
  • Paired discordant outcomes: GPT-only 56, Sonnet-only 68; McNemar exact p-value 0.323.
  • Median ratio 95% bootstrap CI (4,000 resamples):
    • Cost: 6.33x, CI 5.91x-6.68x
    • Total tokens: 1.15x, CI 1.06x-1.26x
    • Tool time: 1.35x, CI 1.18x-1.55x

Note: in this paired slice, cost, total-token, and tool-time ratio signals are stable; resolve-rate differences are less decisive.
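Both uncertainty checks on the resolve rates can be reproduced from the counts alone. A minimal sketch of the Wilson interval and the exact McNemar test on the discordant pairs:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant pair counts
    (b = first-model-only wins, c = second-model-only wins)."""
    n = b + c
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Plugging in this paired set, `wilson_ci(262, 616)` gives roughly 38.7%–46.5% for GPT-5, and `mcnemar_exact(56, 68)` gives roughly 0.32, matching the values above.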

A.2 Additional ratio charts (not shown above)

[Chart: input token ratio per paired task]

[Chart: output token ratio per paired task]

[Chart: step-count ratio per paired task]

A.3 Comparative tables

[Table: cost summary across all paired tasks]

[Table: token and patch summary]

[Table: action breakdown summary]

[Table: execution summary]

[Table: repo-level resolve counts]

[Table: per-instance comparison]