metr.org
Many SWE-bench-Passing PRs Would Not Be Merged into Main
We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elic…