OpenAI's "Strawberry" model has been delayed again. What is the SWE-bench Verified released early this morning?

WBOY
2024-08-14 17:08:02
Someone said: "We were expecting strawberries, but they released kale." Let's see what this "kale" is for.

The programming capabilities of large models have always attracted a lot of attention, and the emergence of the powerful AI programmer Devin pushed the question "Can AI replace programmers?" to the forefront. Recently, Devin also met a new rival: Genie, the autonomous AI programmer launched by the startup Cosine. The company said that Genie easily outperformed Devin, scoring 30% on the third-party benchmark SWE-bench, while Devin scored only 13.8%.

SWE-bench is a benchmark dataset used to evaluate the ability of LLMs to resolve real software issues from GitHub. It collects 2,294 issue and pull request pairs from 12 popular Python repositories. During testing, the LLM is given a code base and an issue description, and must generate a patch that resolves the problem described in the issue. This dataset has been widely used to evaluate the programming capabilities of AI.

As AI programming capabilities evolve, this benchmark evolves as well. Early this morning, the rumored OpenAI "Strawberry" model was delayed once again, but OpenAI did release something new: SWE-bench Verified, an improved version of SWE-bench.

OpenAI pointed out that the original SWE-bench has some problems that can lead to underestimating a model's autonomous software engineering capabilities. During the improvement process, they therefore worked with the original SWE-bench authors on manual screening and refinement to ensure that the unit tests are appropriately scoped and the problem descriptions are clear.

In new tests on SWE-bench Verified, many AI programming agents scored higher than before. Among them, UIUC's Agentless solution even doubled its score. OpenAI believes this demonstrates that the previous benchmark did indeed underestimate AI programming capabilities.

But for netizens around the world waiting for "Strawberry", this announcement still feels too perfunctory. Someone said: "We were expecting strawberries, but they released kale."

Background on SWE-bench

Each example in the SWE-bench test set was created from a resolved GitHub issue in one of 12 open-source Python repositories on GitHub. Each sample has an associated pull request (PR) that includes the solution code and unit tests to verify its correctness. These unit tests are called FAIL_TO_PASS tests because they fail before the solution code in the PR is added and pass afterwards. Each sample also includes PASS_TO_PASS tests, which pass both before and after the PR is merged and check whether the PR breaks other features in the code base that are unrelated to the issue.

In SWE-bench, the AI agent receives the original text of the GitHub issue, which serves as the problem statement, and has access to the code base. Given this information, the agent must edit files in the code base to solve the problem.

The edit proposed by the AI agent is evaluated by running the FAIL_TO_PASS and PASS_TO_PASS tests. If the FAIL_TO_PASS tests pass, the edit fixed the problem. If the PASS_TO_PASS tests pass, the edit did not break unrelated parts of the code base. To fully resolve the original GitHub issue, both sets of tests must pass.
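
The pass criterion described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the actual SWE-bench harness: the `SampleResult` class and `is_resolved` function are hypothetical names, and the test outcomes are assumed to be already available as booleans.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Hypothetical container: test outcomes after the agent's edit is applied."""
    fail_to_pass: dict[str, bool]  # test name -> did it pass?
    pass_to_pass: dict[str, bool]  # test name -> did it pass?


def is_resolved(result: SampleResult) -> bool:
    """A sample counts as resolved only if every test in both sets passes."""
    fixed_issue = all(result.fail_to_pass.values())     # the edit fixes the issue
    nothing_broken = all(result.pass_to_pass.values())  # unrelated features still work
    return fixed_issue and nothing_broken


# An edit that fixes the issue but breaks an unrelated test is not a resolution.
broken = SampleResult({"test_fix": True}, {"test_unrelated": False})
clean = SampleResult({"test_fix": True}, {"test_unrelated": True})
```

Here `is_resolved(clean)` is true and `is_resolved(broken)` is false, which mirrors the requirement that both sets of tests must pass.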

Three directions for improving the robustness and reliability of SWE-bench

To improve the robustness and reliability of SWE-bench, the development team identified three main areas for improvement:

  • Unit tests used to evaluate the correctness of a solution are often too specific and sometimes not even relevant to the problem. This may result in the correct solution being rejected.
  • The problem description of many samples is not clear enough, leading to ambiguity about what the problem is and how it should be solved.
  • Sometimes it is difficult to reliably set up a SWE-bench development environment for the agent, which can inadvertently cause unit tests to fail regardless of the solution. In this case, a perfectly valid solution may be rated as incorrect.

SWE-bench Verified

To address these issues, OpenAI launched a human annotation campaign in which professional software developers screened samples from the SWE-bench test set to ensure that unit tests are appropriately scoped and problem descriptions are clear and unambiguous.

Together with the authors of SWE-bench, they released SWE-bench Verified: a subset of the original test set of SWE-bench, containing 500 samples that have been verified by human annotators. This version replaces the original SWE-bench and SWE-bench Lite test suites. Additionally, they are releasing human annotations for all SWE-bench test samples.

They also collaborated with the authors of SWE-bench to develop a new evaluation tool for SWE-bench that uses a containerized Docker environment to make evaluation on SWE-bench easier and more reliable.

  • Tool address: https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240627_docker

Improvement method

OpenAI worked with 93 software developers with Python experience to manually screen SWE-bench samples. They annotated 1,699 random samples from the SWE-bench test set, and from these obtained SWE-bench Verified.

Their approach is to annotate the samples in the SWE-bench test set to ensure fairness and accuracy of the test. Specifically, they focus on two key points: first, assessing whether the problem description is detailed enough to prevent an overly vague description from making the test unfair; second, checking whether the FAIL_TO_PASS unit test incorrectly filters out valid solutions.

Each annotation criterion has a label in the range [0, 1, 2, 3] with increasing severity. Labels 0 and 1 are minor; labels 2 and 3 are severe, indicating that the sample is inadequate in some way and should be discarded.

Additionally, OpenAI evaluates the difficulty of each sample by asking annotators to estimate how long it would take developers to decide on and implement a solution, assuming the sample has no issues. Finally, OpenAI provides a free-form input option to flag any other major issues with the sample.

To build SWE-bench Verified, OpenAI filters out any samples from the original test set with a problem statement or FAIL_TO_PASS unit test severity of 2 or above, and also filters out all samples marked with other serious issues.
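
The filtering rule above can be written as a small predicate. The sketch below is a simplification under stated assumptions: the `Annotation` class and its field names are hypothetical, and it assumes a single set of labels per sample, which the article does not specify.

```python
from dataclasses import dataclass

SEVERE = 2  # labels 2 and 3 are treated as severe


@dataclass
class Annotation:
    """Hypothetical per-sample annotation record (field names are assumptions)."""
    problem_statement_severity: int  # 0..3, how underspecified the issue text is
    fail_to_pass_severity: int       # 0..3, how likely the tests reject valid fixes
    other_major_issues: bool         # set via the free-form "other issues" option


def keep_sample(a: Annotation) -> bool:
    """Keep a sample only if neither criterion is severe and nothing else was flagged."""
    return (
        a.problem_statement_severity < SEVERE
        and a.fail_to_pass_severity < SEVERE
        and not a.other_major_issues
    )


annotations = [
    Annotation(0, 1, False),  # minor labels only: kept
    Annotation(2, 0, False),  # unclear problem statement: dropped
    Annotation(1, 3, False),  # unfair unit tests: dropped
    Annotation(0, 0, True),   # other major issue: dropped
]
kept = [a for a in annotations if keep_sample(a)]
```

With these made-up annotations, only the first sample survives the filter.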

Annotation results

Under the new criteria, a large share of the samples in the original SWE-bench do not qualify. As shown in the figure, 38.3% of the samples were flagged because the problem statement was not clear enough, and 61.1% were flagged because the unit tests could unfairly mark valid solutions as incorrect (severity levels 2 and 3 combined). Overall, the annotation process resulted in 68.3% of SWE-bench samples being filtered out due to unclear problem statements, unfair unit tests, or other issues.
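
The two flag rates add up to more than the overall filter rate, so many samples must have been flagged on both criteria. The short calculation below works this out from the reported percentages only. It assumes that the percentages apply to the 1,699 annotated samples and that every sample flagged on either criterion is among those filtered out; the variable names are illustrative.

```python
ANNOTATED = 1699    # annotated samples, as reported
P_UNCLEAR = 0.383   # flagged: problem statement not clear enough
P_UNFAIR = 0.611    # flagged: unit tests may reject valid solutions
P_FILTERED = 0.683  # filtered out for any reason

# Rough number of annotated samples that pass the filter.
kept_estimate = round(ANNOTATED * (1 - P_FILTERED))

# Inclusion-exclusion: both flagged groups lie inside the filtered set,
# so their union is at most P_FILTERED and their overlap is at least this.
min_overlap = P_UNCLEAR + P_UNFAIR - P_FILTERED

print(f"about {kept_estimate} samples kept")
print(f"at least {min_overlap:.1%} flagged on both criteria")
```

Under these assumptions, roughly 539 annotated samples remain after filtering, which is enough to supply the 500 samples in SWE-bench Verified, and at least 31.1% of the annotated samples were flagged for both an unclear problem statement and unfair unit tests.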

The figure below compares the difficulty distribution of the original SWE-bench dataset and the new SWE-bench Verified dataset. OpenAI estimates the difficulty distribution of SWE-bench from the random subset of 1,699 annotated samples.

As can be seen from the figure, in the original SWE-bench dataset, most samples (77.8%) have an estimated completion time of less than one hour of work for an experienced software engineer. SWE-bench Lite and the new SWE-bench Verified dataset further increase this proportion, with less than 10% of problems expected to take more than an hour to solve. However, the mechanisms behind this change are quite different: SWE-bench Lite is a subsample of the original dataset intended to make benchmarking easier, while SWE-bench Verified attempts to remove infeasible samples from the dataset.

Performance of each agent on SWE-bench Verified

On the new SWE-bench Verified dataset, the development team tested GPT-4o's performance using several open-source scaffolds that performed well on the original SWE-bench leaderboards.

It was found that GPT-4o’s performance on the best-performing scaffold reached 33.2% on SWE-bench Verified, which is more than double the 16% score on the original SWE-bench. Overall, this confirms OpenAI's initial suspicion that the original SWE-bench underestimated the agent's capabilities.

It’s worth noting that the jump from SWE-bench Lite to SWE-bench Verified is not that noticeable because after filtering, SWE-bench Lite is already easier than the full dataset.

Performance analysis stratified by difficulty

When evaluated on SWE-bench Verified, the improvement in performance may be partially due to the distribution of test samples being skewed towards simpler samples.

OpenAI investigated this by plotting performance stratified by difficulty. If the new dataset simply shifted the difficulty distribution towards easier samples, the stratified performance within each category would not change, as is the case when moving from the original SWE-bench to SWE-bench Lite.

In contrast, OpenAI observed that the agent's performance improved across difficulty categories when switching to SWE-bench Verified, consistent with the expected effect of removing impossible samples from all categories, rather than simply removing difficult samples.
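
The reasoning behind the stratified analysis can be illustrated with a toy example. All numbers below are made up for illustration and are not OpenAI's results; the function names and difficulty labels are also hypothetical. The sketch contrasts a pure shift in the difficulty mix with the removal of unsolvable samples from every category.

```python
from collections import defaultdict


def resolve_rate_by_difficulty(results):
    """results: list of (difficulty_label, resolved) pairs -> rate per label."""
    totals, solved = defaultdict(int), defaultdict(int)
    for difficulty, resolved in results:
        totals[difficulty] += 1
        solved[difficulty] += int(resolved)
    return {d: solved[d] / totals[d] for d in totals}


def overall_rate(results):
    """Fraction of all samples that were resolved."""
    return sum(resolved for _, resolved in results) / len(results)


def make(difficulty, n_total, n_resolved):
    """Build n_total fake results, the first n_resolved of which are resolved."""
    return [(difficulty, i < n_resolved) for i in range(n_total)]


# Made-up baseline: 10 easy samples (5 resolved), 10 hard samples (2 resolved).
original = make("easy", 10, 5) + make("hard", 10, 2)
# Pure distribution shift: fewer hard samples, same per-category rates.
shifted = make("easy", 10, 5) + make("hard", 5, 1)
# Removing unsolvable (hence unresolved) samples from both categories.
cleaned = make("easy", 8, 5) + make("hard", 8, 2)

for name, data in [("original", original), ("shifted", shifted), ("cleaned", cleaned)]:
    print(name, round(overall_rate(data), 3), resolve_rate_by_difficulty(data))
```

In this toy data the overall rate rises in both the shifted and the cleaned sets, but only the cleaned set shows higher rates within each difficulty category, which is the pattern the article attributes to SWE-bench Verified.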

Reference link: https://openai.com/index/introducing-swe-bench-verified/
