Home >Technology peripherals >AI >Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

王林
王林forward
2023-09-05 23:29:111396browse

Building machines that can write their own code is a goal that pioneers in computer science and artificial intelligence have been pursuing. With the rapid development of GPT-type large models, such a goal is becoming closer than ever.

The emergence of large language models (Large Language Models) has attracted more and more attention from researchers to the programming capabilities of models. Under this situation, the APEX Laboratory of Shanghai Jiao Tong University launched CodeApex - a bilingual benchmark data set focused on assessing the programming understanding and code generation capabilities of LLMs.

To evaluate the programming understanding ability of large language models, CodeApex has designed three types of multiple-choice questions: conceptual understanding, common sense reasoning, and multi-hop reasoning. In addition, CodeApex also utilizes algorithmic questions and corresponding test cases to evaluate the code generation capabilities of LLMs. CodeApex evaluated a total of 14 large language models on coding tasks. Among them, GPT3.5-turbo shows the best programming ability, achieving approximately 50% and 56% accuracy on these two tasks respectively. It can be seen that large language models still have a lot of room for improvement in programming tasks. Building a machine that can write its own code is a very promising future.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

  • ## Website: https://apex.sjtu.edu.cn/codeapex/
  • Code: https://github.com/APEXLAB/CodeApex.git
  • Paper: https://apex.sjtu.edu.cn/codeapex/paper/

Introduction

Programming understanding and code generation are critical tasks in software engineering and play a key role in improving developer productivity, enhancing code quality, and automating the software development process. However, these tasks are still challenging for large models due to the complexity and semantic diversity of the code. Compared with ordinary natural language processing, using LLMs to generate code requires more emphasis on grammar, structure, detail processing and context understanding, and has extremely high requirements for the accuracy of the generated content. Traditional approaches include grammar rule-based models, template-based models, and rule-based models, which often rely on manually designed rules and heuristic algorithms that are limited in coverage and accuracy.

In recent years, with the emergence of large-scale pre-trained models such as CodeBERT and GPT3.5, researchers have begun to explore the application of these models in programming understanding and code generation tasks. These models integrate code generation tasks during training, allowing them to understand and generate code. However, a fair assessment of the progress of LLMs in code understanding and generation is difficult due to the lack of standard, publicly available, high-quality, and diverse benchmark datasets. Therefore, establishing a benchmark dataset that broadly covers code semantics and structure is crucial to promote research in programming understanding and code generation.

Existing code benchmark datasets have applicability and diversity issues when applied to LLMs. For example, some datasets are more suitable for evaluating Bert-type, bidirectional language modeling LLMs. However, existing multilingual code benchmark data sets (such as Human-Eval) contain relatively simple problems, lack diversity, and can only implement some basic functional codes.

In order to fill the above gaps, the APEX Data and Knowledge Management Laboratory of Shanghai Jiao Tong University built a new evaluation benchmark for large model code understanding and generation-CodeApex. As a groundbreaking bilingual (English, Chinese) benchmark dataset, CodeApex focuses on evaluating the programming understanding and code generation capabilities of LLMs.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

The overall experimental scenario of CodeApex is shown in the picture above.

The first task, Programming Comprehension, includes 250 multiple-choice questions, divided into conceptual understanding, common sense reasoning and multi-hop reasoning. The questions used for testing are selected from the final exam questions of different courses (programming, data structures, algorithms) in colleges and universities, which greatly reduces the risk that the data is already in the LLMs training corpus. CodeApex tested the code understanding ability of LLMs in three scenarios: 0-shot, 2-shot, and 5-shot, and also tested the impact of Answer-Only and Chain-of-Thought modes on the ability of LLMs.

The second task code generation includes 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search, depth-first search, etc. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function. CodeApex also provides two scenarios: function-only and function-with-context. The difference between them is that the former only has a description of the target function, while the latter, in addition to the description of the target function, is also provided with the calling code and time of the target function. Space constraints, input and output description.

Experimental results show that different models perform differently in code-related tasks, and GPT3.5-turbo shows excellent competitiveness and obvious advantages. Furthermore, CodeApex compared the performance of LLMs in bilingual scenarios, revealing different results. Overall, there is still considerable room for improvement in the accuracy of LLMs in the CodeApex rankings, indicating that the potential of LLMs in code-related tasks has not yet been fully exploited.

Code Understanding

To fully integrate large language models into actual code production scenarios, programming understanding is essential. Programming understanding requires the ability to understand the code from all aspects, such as mastering the syntax, understanding the code execution flow, and understanding the execution algorithm.

CodeApex extracted 250 multiple-choice questions from college final exam questions as test data. These test data are divided into three categories: conceptual understanding, common sense reasoning, and multi-hop reasoning.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Test mode includes two categories: Answer-Only and Chain-of-Thought.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Experimental results and conclusions

The Chinese and English evaluation results of CodeApex on the code understanding task are as follows shown in the two tables. (The best performing model is shown in bold; the next best performing model is underlined.)

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

## The following conclusions can be drawn from it:

  • Comparison of bilingual abilities. The Chinese version scored higher than the English version. There are two main reasons: (1) The source question descriptions come from the final exams of Chinese universities, so the test questions were originally presented in Chinese. Even if translated into English, they still contain some language habits unique to Chinese people. Therefore, when these biased English questions are input into LLMs, some noise may be introduced into the model's encoding results. (2) Most of the evaluated models are mainly trained on Chinese data, which leads to poor results.
  • Comparison of abilities of different question types. Across these three problem categories, approximately half of the models performed best on conceptual understanding, suggesting that they likely contained knowledge of programming concepts while being trained. Most models score higher on commonsense reasoning compared to multi-hop reasoning, indicating that the power of LLMs decreases significantly with increasing inference steps.
  • The role of CoT thinking chain model. The accuracy of most models in CoT mode is close to or lower than Answer-Only mode. There are two reasons for this phenomenon: (1) The evaluated model size does not reach the model size with CoT emergence capability. Previous research believed that the emergence of CoT requires LLMs to have at least 60B parameters. When the number of parameters is insufficient, the CoT setup may introduce additional noise and the response generated by LLMs is unstable. GPT3.5-turbo has reached the point of emergence of emergent capabilities and can achieve higher accuracy in CoT settings. (2) When answering conceptual understanding and common sense reasoning questions, multi-step reasoning is less necessary. Therefore, the CoT capabilities of LLMs cannot help with this type of problem. However, for multi-hop inference problems, some models (such as ChatGLM2, educhat, and GPT3.5-turbo) have significantly improved accuracy in CoT scenarios. (CodeApex excludes CodeT5 from the CoT setup due to its inability to generate responses via thought chains.)

Code Generation

Training large language models to generate accurate and executable code is a challenging task. CodeApex primarily evaluates the ability of LLMs to generate algorithms based on a given description and automatically evaluates the correctness of the generated code through unit tests.

CodeApex’s code generation tasks include 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search and graph algorithms. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

CodeApex provides two scenarios: Function-only and Function-with-context. The Function-only scenario only provides a description of the target function, while the Function-with-context scenario not only provides a description of the target function, but also provides the calling code, time and space constraints, and input and output description of the target function.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Experimental results and conclusions

Each language version uses two Prompt strategies (Function -Only and Function-with-Context). To align with human code testing scenarios, evaluation metrics include AC@1, AC@all and AC rate.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?


Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

#The code generation task results of each model are shown in the following two tables. (Best performance: bold; second best performance: underline.)

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

The following conclusions can be drawn:

  • GPT3.5-turbo performs better than the other 11 LLMs with an average score More than 50%.
  • WizardCoder and StarCoder ranked second and third, highlighting significant improvements in code generation capabilities through code-based fine-tuning.
  • In the code generation task, there is no obvious performance difference between the currently tested models on Chinese and English question types.

Additionally, CodeApex provides the proportion of compileable code in each scenario. After connecting the generated function to the main function, the compiled code is checked through test cases.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

You can see:

  • Most models are able to generate more than 50% of the Compile the code, which demonstrates LLMs' ability to understand function prototypes.
  • Often, providing contextual information about a function can help LLMs generate compilable code.

Conclusion

CodeApex serves as a bilingual benchmark focusing on LLMs’ programming abilities, evaluating programming understanding and code generation of large language models. ability. In terms of programming understanding, CodeApex assessed the abilities of different models in three categories of multiple-choice questions. In terms of code generation, CodeApex uses the pass rate of test code cases to evaluate the model's capabilities. For these two tasks, CodeApex carefully designed Prompt strategies and compared them in different scenarios. CodeApex is experimentally evaluated on 14 LLMs, including general LLMs and specialized LLMs models based on code fine-tuning.

Currently, GPT3.5 has reached a relatively good level in terms of programming capabilities, achieving approximately 50% and 56% accuracy in programming understanding and code generation tasks respectively. CodeApex shows that the potential of large language models for programming tasks has not yet been fully exploited. We expect that leveraging large language models to generate code will revolutionize the field of software development in the near future. As natural language processing and machine learning advance, these models will become more powerful and adept at understanding and generating code snippets. Developers will find they have an unprecedented ally in their coding efforts, as they can rely on these models to automate tedious tasks, increase their productivity, and improve software quality.

In the future, CodeApex will release more tests (such as code correction) for testing the code capabilities of large language models. CodeApex’s test data will also continue to be updated, adding more diverse Code issues. At the same time, human experiments will also be added to the CodeApex list to compare the coding capabilities of large language models with human levels. CodeApex provides a benchmark and reference for the research on large language model programming capabilities, and will promote the development and prosperity of large language models in the code field.

Introduction to APEX Laboratory

Shanghai Jiao Tong University APEX Data and Knowledge Management Laboratory was established in 1996. Its founder is Tou Yu, the head teacher of the ACM class Professor Yong. The laboratory is committed to exploring artificial intelligence technology that effectively mines and manages data and summarizes knowledge. It has published more than 500 international academic papers and pursues practical applications in practical scenarios. Over the past 27 years, APEX Laboratory has become a global pioneer in many world technology waves. The laboratory began to study the core technology of the Semantic Web (now known as the Knowledge Graph) in 2000, and began to study personalized search engines and recommendations in 2003. System technology, began to study transfer learning theory and algorithm in 2006, began to explore deep learning technology in 2009 and developed neural network training library based on GPU. While producing fruitful scientific research and implementation results, APEX Lab has also developed a solid data science and machine learning research team, including Xue Guirong, Zhang Lei, Lin Chenxi, Liu Guangcan, Wang Haofen, Li Lei, Dai Wenyuan, Li Zhenhui, Chen Tianqi, Zhang Weinan, Yang Diyi and other outstanding alumni in the field of artificial intelligence.

The above is the detailed content of Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:51cto.com. If there is any infringement, please contact admin@php.cn delete