SWE-Bench Analyser
==================
# Data
## Submissions
Take SWE-bench Verified experiments: https://github.com/SWE-bench/experiments/tree/main/evaluation/verified
(See https://openai.com/index/introducing-swe-bench-verified/)
A typical metadata.yaml (https://github.com/SWE-bench/experiments/blob/main/evaluation/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/metadata.yaml):
```yaml
assets:
  logs: s3://swe-bench-submissions/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/logs
  trajs: s3://swe-bench-submissions/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/trajs
info:
  authors: Haifeng Ruan, Yuntong Zhang
  logo: https://assets-eu-01.kc-usercontent.com/55017e37-262d-017b-afd6-daa9468cbc30/8e59bcad-6e39-41dc-abd9-a0e251e8d63f/Sonar%20%282%29.svg?w=128&h=32&dpr=2&fit=crop&q=80
  name: Sonar Foundation Agent + Claude 4.5 Sonnet
  report: https://www.sonarsource.com/blog/introducing-sonar-foundation-agent/
  site: https://www.sonarsource.com
tags:
  checked: false
  model:
  - claude-sonnet-4-5
  org:
  - Sonar
  os_model: false
  os_system: false
  system:
    attempts: 1
```
Submission id: the directory name, e.g. `20251103_sonar-foundation-agent_claude-sonnet-4-5`
Date: get from the id (e.g. `20251103`)
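The id-to-date convention above can be parsed mechanically; a minimal sketch (the function name is mine, and it assumes every directory follows the `YYYYMMDD_<slug>` convention):

```python
from datetime import date

def parse_submission_id(submission_id: str) -> tuple[date, str]:
    """Split a submission directory name into its date prefix and slug.

    Assumes the observed YYYYMMDD_<slug> convention.
    """
    date_part, _, slug = submission_id.partition("_")
    return date(int(date_part[:4]), int(date_part[4:6]), int(date_part[6:8])), slug
```

For example, `parse_submission_id("20251103_sonar-foundation-agent_claude-sonnet-4-5")` yields `(date(2025, 11, 3), "sonar-foundation-agent_claude-sonnet-4-5")`.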
## Results
Data dir: https://github.com/SWE-bench/experiments/tree/main/evaluation/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/results
results.json: https://github.com/SWE-bench/experiments/blob/main/evaluation/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/results/results.json
- contains lists of instance ids grouped by outcome: `resolved`, `no_generation`, `no_logs`, etc.
Example result:
```json
{
"no_generation": [
"django__django-14631",
"django__django-15037",
"sympy__sympy-13877"
],
"no_logs": [],
"resolved": [
"astropy__astropy-12907",
"astropy__astropy-13033",
"astropy__astropy-13453",
"astropy__astropy-13579",
"astropy__astropy-14096",
"astropy__astropy-14309",
"astropy__astropy-14508",
"astropy__astropy-14539",
...
"sympy__sympy-24562",
"sympy__sympy-24661"
]
}
```
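Given a local copy of a submission's results.json, the leaderboard numbers follow directly; a sketch assuming SWE-bench Verified's 500 instances (the constant and function names are mine):

```python
import json

TOTAL_INSTANCES = 500  # size of SWE-bench Verified

def load_results(path: str) -> dict[str, list[str]]:
    """Load a submission's results/results.json (outcome -> instance ids)."""
    with open(path) as f:
        return json.load(f)

def resolved_stats(results: dict[str, list[str]]) -> tuple[int, float]:
    """Return (number of resolved issues, percentage of the benchmark)."""
    n = len(results.get("resolved", []))
    return n, 100.0 * n / TOTAL_INSTANCES
```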
## Dataset
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
Datum structure:
```
instance_id: (str) - A formatted instance identifier, usually as repo_owner__repo_name-PR-number.
patch: (str) - The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
repo: (str) - The repository owner/name identifier from GitHub.
base_commit: (str) - The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
hints_text: (str) - Comments made on the issue before the creation date of the solution PR's first commit.
created_at: (str) - The creation date of the pull request.
test_patch: (str) - A test-file patch that was contributed by the solution PR.
problem_statement: (str) - The issue title and body.
version: (str) - Installation version to use for running evaluation.
environment_setup_commit: (str) - The commit hash to use for environment setup and installation.
FAIL_TO_PASS: (str) - A json list of strings that represent the set of tests resolved by the PR and tied to the issue resolution.
PASS_TO_PASS: (str) - A json list of strings that represent tests that should pass before and after the PR application.
```
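Note that `FAIL_TO_PASS` and `PASS_TO_PASS` arrive as JSON-encoded strings, not lists, so they need decoding; a sketch (the helper name is mine; fetching via the `datasets` library is one option):

```python
import json

def decode_tests(datum: dict) -> tuple[list[str], list[str]]:
    """Decode the JSON-encoded test lists of one dataset record."""
    fail_to_pass = json.loads(datum["FAIL_TO_PASS"])
    pass_to_pass = json.loads(datum["PASS_TO_PASS"])
    return fail_to_pass, pass_to_pass

# Records could be fetched with the `datasets` library, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
```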
# Features
## Display current leaderboard
Table of experiments/submissions with columns:
- Name
- Number of resolved issues
- % of resolved issues
- Model
- Model family (such as Claude, ChatGPT, etc): detect from the model id
- Org
- Open Weights: os_model
- Open Scaffold: os_system
- Checked
- Site
- Github submission: link to the yaml file
Can sort by any column. Default: number of resolved issues
Can filter by:
- model
- model family
- org
- open weights
- open scaffold
- checked
- name (substring)
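Model-family detection can start as simple prefix matching on the lowercased model id; a sketch with an illustrative (surely incomplete) prefix table:

```python
# Illustrative prefix -> family rules; real submissions will need more entries.
FAMILY_PREFIXES = [
    ("claude", "Claude"),
    ("gpt", "GPT"),
    ("gemini", "Gemini"),
    ("deepseek", "DeepSeek"),
    ("qwen", "Qwen"),
    ("llama", "Llama"),
]

def model_family(model_id: str) -> str:
    """Map a model id from metadata.yaml to a coarse family label."""
    mid = model_id.lower()
    for prefix, family in FAMILY_PREFIXES:
        if mid.startswith(prefix):
            return family
    return "Other"
```

Unrecognized ids fall back to "Other" rather than failing, so new models degrade gracefully in the leaderboard filter.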
## Explorer
### Panoramic view
The table has header rows/columns (which stay fixed during scrolling and sorting) and data cells.
Table of issues from the dataset, with columns:
- issue (header col): instance_id
- score (header col): number of submissions that resolve this issue
- submission: column for every submission
The header row below the column names shows for every submission its score: number of issues it resolves.
In a data cell for a given instance/submission:
- green if this issue is resolved in that submission
- red if not
- shows the status text if the instance is marked 'no_generation', 'no_logs', etc
The instance id cell's color reflects the proportion of green cells in its row: from green (all cells green) to red (all cells red).
The submission cell's color analogously reflects the proportion of green/red cells in its column.
The table is read-only for the user.
At most one cell is in focus, the user can move the focus around the table by clicking or keyboard up/down/left/right keys.
The user can select cells in the normal way by using Shift for continuous ranges and Cmd for individual cells.
The user can copy the contents of the selection in a format pastable into applications like Excel/Google Sheets or text editors.
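The green-to-red header shading described above is a linear interpolation over the proportion of green cells; a sketch with arbitrary endpoint colors (names and hex values are my choices):

```python
RED = (0xD9, 0x30, 0x25)    # 0% of data cells green
GREEN = (0x18, 0x80, 0x38)  # 100% of data cells green

def blend(c0: tuple[int, int, int], c1: tuple[int, int, int], t: float) -> str:
    """Linearly interpolate two RGB colors; returns a CSS hex string."""
    return "#" + "".join(f"{round(a + (b - a) * t):02x}" for a, b in zip(c0, c1))

def header_color(n_green: int, n_total: int) -> str:
    """Shade for a row/column header given its count of green data cells."""
    return blend(RED, GREEN, n_green / n_total if n_total else 0.0)
```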
### Views: Filters / Sorting
Filter columns (submissions) by:
- model
- model family
- open weights
- open scaffold
- checked
- id substring
- year
- month (without year)
- full date
- before a given date (<=)
- after a given date (>=)
- \# of resolved issues (=, >=, <=)
- % of resolved issues (=, >=, <=)
Filter rows (issues):
- repo (multiple choice)
- name (substring)
Filters affect the scores: submission and issue scores are computed over the filtered table, not the full dataset.
Rows can be sorted (asc/desc) by
- id (default: asc)
- score
Columns can be sorted (asc/desc) by
- date (default: desc, i.e. latest on the left)
- score
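Because filters affect the scores, both score axes must be recomputed over the visible matrix only; a sketch where the filtered matrix is a nested dict (the representation and names are my assumptions):

```python
# matrix: submission id -> {instance id -> resolved?}, already filtered.
Matrix = dict[str, dict[str, bool]]

def submission_scores(matrix: Matrix) -> dict[str, int]:
    """Score per submission: green cells in its visible column."""
    return {sub: sum(cells.values()) for sub, cells in matrix.items()}

def issue_scores(matrix: Matrix) -> dict[str, int]:
    """Score per issue: visible submissions that resolve it."""
    scores: dict[str, int] = {}
    for cells in matrix.values():
        for inst, resolved in cells.items():
            scores[inst] = scores.get(inst, 0) + int(resolved)
    return scores

def sort_by_score(scores: dict[str, int], descending: bool = True) -> list[str]:
    """Order ids by score, ties broken by id for a deterministic layout."""
    return sorted(scores, key=lambda i: (-scores[i] if descending else scores[i], i))
```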
The user can save a View (filters + sorting) to re-use it later.
### Reports
For a given Explorer view (defined by filters/sorting):
- map: the same table, but no col headers, no text in data cells, and every data cell is a coloured square
- histograms:
  - number of resolved issues per submission
  - number of resolving submissions per issue
- classes of problems: easy, medium, hard, saturated, unsolved, etc
- https://jatinganhotra.dev/blog/swe-agents/2025/04/15/swe-bench-verified-easy-medium-hard.html
- https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html
- model perf: bubble chart of the best submission per model (color encodes how much the harness matters)
- harness perf: bubble chart of the best submission per harness
- history charts:
- best submission score over time
- problem hardness over time (% of resolving solutions within a quarter)
- model family power (best score) over time:
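The easy/medium/hard/saturated/unsolved buckets can be derived from each issue's score over the visible submissions; a sketch with illustrative thresholds (placeholders of mine, not the cutoffs from the linked posts):

```python
def classify_issue(n_resolving: int, n_submissions: int) -> str:
    """Bucket an issue by the fraction of submissions that resolve it.

    Thresholds are illustrative placeholders, not the linked posts' cutoffs.
    """
    if n_resolving == 0:
        return "unsolved"
    frac = n_resolving / n_submissions
    if frac >= 0.9:
        return "saturated"
    if frac >= 0.5:
        return "easy"
    if frac >= 0.2:
        return "medium"
    return "hard"
```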
## Display Spec
The top menu has an item to show this spec document as highlighted markdown source (not rendered markdown).
# Future benchmarks
- https://multi-swe-bench.github.io/#/