AI Coding · 2026-02-24 · OpenAI Blog

OpenAI Drops SWE-bench Verified Due to Contamination Issues

OpenAI has announced it will stop using the SWE-bench Verified benchmark to evaluate its models, citing growing concerns over data contamination and flawed measurement. The company identified cases where problems from the benchmark's test set may have leaked into public training data, artificially inflating model performance.

This "contamination" problem is a major challenge in AI benchmarking. If a model has been trained, even indirectly, on test questions, it may memorize solutions rather than demonstrate genuine reasoning ability, making the benchmark an unreliable measure of real coding progress. OpenAI stated that these issues make SWE-bench Verified problematic for assessing the capabilities of frontier AI systems.

Instead, the company is recommending SWE-bench Pro as a more robust alternative. SWE-bench Pro is designed with stricter controls to prevent data leakage, offering a cleaner assessment of a model's ability to solve real-world software engineering tasks. This move reflects a broader industry effort to keep benchmarks trustworthy as models become more capable.
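To make the contamination idea concrete, here is a minimal illustrative sketch of one common leakage heuristic: checking what fraction of a test item's n-grams appear verbatim in the training corpus. This is not OpenAI's actual methodology; the function names and the choice of 8-grams are assumptions for illustration.

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """Return the set of word-level n-grams in a text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(test_item: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams found verbatim in any
    training document -- a crude signal of test-set leakage.
    (Hypothetical helper; real pipelines also normalize text and
    use scalable indexes rather than in-memory sets.)"""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams: set[str] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)
```

A test problem whose statement appears verbatim in a training document scores near 1.0, while genuinely unseen problems score near 0.0; benchmarks with "stricter controls" typically filter or hold out items that exceed a threshold on checks like this.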
