Before it took over her life, Leila Gharani mostly took Microsoft Excel for granted. She was working on a process optimization project for a large paper-products manufacturer—a job that, to hear her ...
We evaluate DeepCode on the PaperBench benchmark (released by OpenAI), a rigorous testbed requiring AI agents to independently reproduce 20 ICML 2024 papers from scratch. The benchmark comprises 8,316 ...
In a new benchmark named Vibe Code Bench, OpenAI’s GPT-5.1 achieved the highest level of accuracy in completing a series of software engineering tasks, narrowly beating rival Anthropic’s Claude 4.5 ...
Abstract: In recent years, there has been a notable surge in the generation of coding data on various platforms, including programming competitions and educational institutions. These platforms serve ...