Before it took over her life, Leila Gharani mostly took Microsoft Excel for granted. She was working on a process optimization project for a large paper-products manufacturer—a job that, to hear her ...
We evaluate DeepCode on the PaperBench benchmark (released by OpenAI), a rigorous testbed requiring AI agents to independently reproduce 20 ICML 2024 papers from scratch. The benchmark comprises 8,316 ...
On a new benchmark called Vibe Code Bench, OpenAI’s GPT-5.1 achieved the highest accuracy on a series of software engineering tasks, narrowly beating rival Anthropic’s Claude 4.5 ...
Abstract: In recent years, there has been a notable surge in the generation of coding data on various platforms, including programming competitions and educational institutions. These platforms serve ...