Project specification (spec-prd.md)

SWE-Bench Analyser
==================

# Data

## Submissions

Take SWE-bench Verified experiments: https://github.com/SWE-bench/experiments/tree/main/evaluation/verified 

(See https://openai.com/index/introducing-swe-bench-verified/)

A typical metadata.yaml (https://github.com/SWE-bench/experiments/blob/main/evaluation/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/metadata.yaml):
```yaml
assets:
  logs: s3://swe-bench-submissions/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/logs
  trajs: s3://swe-bench-submissions/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/trajs
info:
  authors: Haifeng Ruan, Yuntong Zhang
  logo: https://assets-eu-01.kc-usercontent.com/55017e37-262d-017b-afd6-daa9468cbc30/8e59bcad-6e39-41dc-abd9-a0e251e8d63f/Sonar%20%282%29.svg?w=128&h=32&dpr=2&fit=crop&q=80
  name: Sonar Foundation Agent + Claude 4.5 Sonnet
  report: https://www.sonarsource.com/blog/introducing-sonar-foundation-agent/
  site: https://www.sonarsource.com
tags:
  checked: false
  model:
  - claude-sonnet-4-5
  org:
  - Sonar
  os_model: false
  os_system: false
  system:
    attempts: 1
```

Submission id: the directory name, e.g. `20251103_sonar-foundation-agent_claude-sonnet-4-5`
Date: parsed from the id's date prefix (e.g. `20251103`)
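Splitting a submission id into its date and slug can be sketched as follows (function name is an assumption, not part of the experiments repo):

```python
from datetime import date

def parse_submission_id(submission_id: str) -> tuple[date, str]:
    """Split an id like '20251103_sonar-foundation-agent_claude-sonnet-4-5'
    into its date prefix and the remaining slug."""
    raw_date, _, slug = submission_id.partition("_")
    parsed = date(int(raw_date[:4]), int(raw_date[4:6]), int(raw_date[6:8]))
    return parsed, slug
```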

## Results

Data dir: https://github.com/SWE-bench/experiments/tree/main/evaluation/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/results

results.json: https://github.com/SWE-bench/experiments/blob/main/evaluation/verified/20251103_sonar-foundation-agent_claude-sonnet-4-5/results/results.json
- contains lists of instance ids grouped by outcome: `resolved`, `no_generation`, `no_logs`, etc.

Example result:
```json
{
  "no_generation": [
    "django__django-14631",
    "django__django-15037",
    "sympy__sympy-13877"
  ],
  "no_logs": [],
  "resolved": [
    "astropy__astropy-12907",
    "astropy__astropy-13033",
    "astropy__astropy-13453",
    "astropy__astropy-13579",
    "astropy__astropy-14096",
    "astropy__astropy-14309",
    "astropy__astropy-14508",
    "astropy__astropy-14539",
    ...
    "sympy__sympy-24562",
    "sympy__sympy-24661"
  ]
}
```
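Loading a results.json and computing a submission's score could look like this (a sketch; the helper names and the fixed total of 500 Verified instances are stated here, not taken from the repo's own tooling):

```python
import json
from pathlib import Path

# SWE-bench Verified contains 500 instances in total.
TOTAL_INSTANCES = 500

def load_results(results_path: Path) -> dict[str, set[str]]:
    """Read a submission's results.json; each key is an outcome bucket
    ('resolved', 'no_generation', 'no_logs', ...) mapped to instance ids."""
    raw = json.loads(results_path.read_text())
    return {status: set(ids) for status, ids in raw.items()}

def resolved_score(results: dict[str, set[str]]) -> tuple[int, float]:
    """Return (count, percentage) of resolved issues."""
    n = len(results.get("resolved", set()))
    return n, 100.0 * n / TOTAL_INSTANCES
```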

## Dataset

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

Datum structure:
```
instance_id: (str) - A formatted instance identifier, usually as repo_owner__repo_name-PR-number.
patch: (str) - The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
repo: (str) - The repository owner/name identifier from GitHub.
base_commit: (str) - The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
hints_text: (str) - Comments made on the issue before the creation date of the solution PR's first commit.
created_at: (str) - The creation date of the pull request.
test_patch: (str) - A test-file patch that was contributed by the solution PR.
problem_statement: (str) - The issue title and body.
version: (str) - Installation version to use for running evaluation.
environment_setup_commit: (str) - commit hash to use for environment setup and installation.
FAIL_TO_PASS: (str) - A json list of strings that represent the set of tests resolved by the PR and tied to the issue resolution.
PASS_TO_PASS: (str) - A json list of strings that represent tests that should pass before and after the PR application.
```
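The dataset can be pulled with the Hugging Face `datasets` library via `load_dataset("princeton-nlp/SWE-bench_Verified", split="test")`. Note that `FAIL_TO_PASS` and `PASS_TO_PASS` arrive as JSON-encoded strings, not lists; a minimal decoder (the function name is an assumption):

```python
import json

def decode_test_lists(datum: dict) -> dict:
    """Return a copy of a dataset row with FAIL_TO_PASS / PASS_TO_PASS
    decoded from JSON strings into Python lists."""
    out = dict(datum)
    for key in ("FAIL_TO_PASS", "PASS_TO_PASS"):
        out[key] = json.loads(datum[key])
    return out
```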

# Features

## Display current leaderboard

Table of experiments/submissions with columns:
- Name
- Number of resolved issues
- % of resolved issues
- Model
- Model family (such as Claude, ChatGPT, etc): detect from the model id
- Org
- Open Weights: os_model
- Open Scaffold: os_system
- Checked
- Site
- Github submission: link to the yaml file

Can sort by any column. Default: number of resolved issues.

Can filter by:
  - model
  - model family
  - org
  - open weights
  - open scaffold
  - checked
  - name (substring)
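The "Model family" column has to be derived from the model id. One possible sketch; the prefix table below is a hypothetical starting point, not an exhaustive mapping of the ids actually found in the experiments repo:

```python
# Hypothetical prefix -> family table; would need extending as new ids appear.
FAMILY_PREFIXES = {
    "claude": "Claude",
    "gpt": "GPT",
    "gemini": "Gemini",
    "deepseek": "DeepSeek",
    "qwen": "Qwen",
}

def model_family(model_id: str) -> str:
    """Map a model id from metadata.yaml tags to a coarse family label."""
    mid = model_id.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if mid.startswith(prefix):
            return family
    return "Other"
```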

## Explorer

### Panoramic view

The table has header rows/columns (which stay fixed during scrolling and sorting) and data cells.

Table of issues from the dataset, with columns:
- issue (header col): instance_id
- score (header col): number of submissions that resolve this issue
- submission: column for every submission

The header row below the column names shows for every submission its score: number of issues it resolves.

In a data cell for a given instance/submission:
  - green if this issue is resolved in that submission
  - red if not
    - with a text label if the issue is marked 'no_generation', 'no_logs', etc

The instance id cell's color reflects the proportion of green cells in its row: from green (all cells green) to red (all cells red).

The submission header cell's color analogously reflects the proportion of green/red cells in its column.
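The red-to-green gradient can be a simple linear interpolation; a sketch (the CSS hex output format is an assumption):

```python
def ratio_color(green: int, total: int) -> str:
    """Interpolate from pure red (no resolved cells) to pure green
    (all resolved), returned as a CSS hex color."""
    frac = green / total if total else 0.0
    r = round(255 * (1 - frac))
    g = round(255 * frac)
    return f"#{r:02x}{g:02x}00"
```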

The table is read-only for the user.

At most one cell is in focus; the user can move the focus around the table by clicking or with the keyboard arrow keys.

The user can select cells in the usual way, using Shift for contiguous ranges and Cmd/Ctrl for individual cells.

The user can copy the contents of the selection, to be pastable to applications like Excel/GSheets or text editors.
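Excel and Google Sheets both paste tab-separated plain text natively, so serializing the selection is a one-liner; a sketch:

```python
def selection_to_tsv(cells: list[list[str]]) -> str:
    """Serialize a rectangular selection as tab-separated rows,
    the plain-text format spreadsheet apps paste natively."""
    return "\n".join("\t".join(row) for row in cells)
```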

### Views: Filters / Sorting

Filter columns (submissions) by:
- model
- model family
- open weights
- open scaffold
- checked
- id substring
- year
- month (without year)
- full date
  - before a given date (<=)
  - after a given date (>=)
- \# of resolved issues (=, >=, <=)
- % of resolved issues (=, >=, <=)
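The date-based column filters above can be evaluated against the date parsed from the submission id; a sketch (function and parameter names are assumptions):

```python
from datetime import date

def matches_date_filters(sub_date: date, *, year=None, month=None,
                         on_or_before=None, on_or_after=None) -> bool:
    """Apply the year / month / full-date filters to a submission's date.
    A filter set to None is ignored."""
    if year is not None and sub_date.year != year:
        return False
    if month is not None and sub_date.month != month:
        return False
    if on_or_before is not None and sub_date > on_or_before:
        return False
    if on_or_after is not None and sub_date < on_or_after:
        return False
    return True
```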

Filter rows (issues): 
 - repo (multiple choice)
 - name (substring)

Filters affect the computed scores for both submissions and issues.

Rows can be sorted (asc/desc) by 
- id (default: asc)
- score

Columns can be sorted (asc/desc) by 
- date (default: desc, i.e. latest on the left)
- score

The user can save a View (filters + sorting) to re-use it later.
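A saved View is just the filter and sort state; one way to model it so it serializes cleanly (all names and fields here are assumptions, not a prescribed schema):

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class View:
    """A saved Explorer configuration: filters plus sort state."""
    name: str
    column_filters: dict = field(default_factory=dict)
    row_filters: dict = field(default_factory=dict)
    row_sort: tuple = ("id", "asc")
    column_sort: tuple = ("date", "desc")

def save_view(view: View) -> str:
    """Serialize a View to a JSON string for storage."""
    return json.dumps(asdict(view))
```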

### Reports

For a given Explorer view (defined by filters/sorting):
- map: the same table, but no col headers, no text in data cells, and every data cell is a coloured square
- histograms: 
  - number of resolved issues per submission
  - number of resolving submissions per issue
- classes of problems: easy, medium, hard, saturated, unsolved, etc
  - https://jatinganhotra.dev/blog/swe-agents/2025/04/15/swe-bench-verified-easy-medium-hard.html
  - https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html
- model perf: bubble chart of the best submission per model (color encodes how much the harness matters)
- harness perf: bubble chart of the best submissions per harness
- history charts: 
  - best submission score over time
  - problem hardness over time (% of resolving submissions within each quarter)
  - model family power (best score) over time

## Display Spec

The top menu has an item to show this spec document as highlighted markdown source (not rendered markdown).

# Future benchmarks
- https://multi-swe-bench.github.io/#/