Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?
Building machines that can write their own code has long been a goal of pioneers in computer science and artificial intelligence. With the rapid development of GPT-type large models, that goal is closer than ever.
The emergence of large language models (LLMs) has drawn growing attention from researchers to the programming capabilities of these models. Against this background, the APEX Laboratory of Shanghai Jiao Tong University launched CodeApex, a bilingual benchmark dataset focused on assessing the programming understanding and code generation capabilities of LLMs.
To evaluate the programming understanding ability of large language models, CodeApex designed three types of multiple-choice questions: conceptual understanding, common sense reasoning, and multi-hop reasoning. In addition, CodeApex uses algorithmic questions and corresponding test cases to evaluate the code generation capabilities of LLMs. CodeApex evaluated a total of 14 large language models on coding tasks. Among them, GPT3.5-turbo shows the best programming ability, achieving approximately 50% and 56% accuracy on these two tasks respectively. Large language models therefore still have considerable room for improvement on programming tasks, and building a machine that can write its own code remains a promising prospect.
Programming understanding and code generation are critical tasks in software engineering and play a key role in improving developer productivity, enhancing code quality, and automating the software development process. However, these tasks are still challenging for large models due to the complexity and semantic diversity of the code. Compared with ordinary natural language processing, using LLMs to generate code requires more emphasis on grammar, structure, detail processing and context understanding, and has extremely high requirements for the accuracy of the generated content. Traditional approaches include grammar rule-based models, template-based models, and rule-based models, which often rely on manually designed rules and heuristic algorithms that are limited in coverage and accuracy.
In recent years, with the emergence of large-scale pre-trained models such as CodeBERT and GPT3.5, researchers have begun to explore the application of these models in programming understanding and code generation tasks. These models integrate code generation tasks during training, allowing them to understand and generate code. However, a fair assessment of the progress of LLMs in code understanding and generation is difficult due to the lack of standard, publicly available, high-quality, and diverse benchmark datasets. Therefore, establishing a benchmark dataset that broadly covers code semantics and structure is crucial to promote research in programming understanding and code generation.
Existing code benchmark datasets have applicability and diversity issues when applied to LLMs. For example, some datasets are better suited to evaluating BERT-type models based on bidirectional language modeling. Meanwhile, existing multilingual code benchmark datasets (such as Human-Eval) contain relatively simple problems, lack diversity, and only cover the implementation of basic functional code.
To fill these gaps, the APEX Data and Knowledge Management Laboratory of Shanghai Jiao Tong University built CodeApex, a new evaluation benchmark for code understanding and generation by large models. As a groundbreaking bilingual (English, Chinese) benchmark dataset, CodeApex focuses on evaluating the programming understanding and code generation capabilities of LLMs.
The overall experimental scenario of CodeApex is shown in the picture above.
The first task, Programming Comprehension, includes 250 multiple-choice questions, divided into conceptual understanding, common sense reasoning, and multi-hop reasoning. The questions are selected from the final exams of different university courses (programming, data structures, algorithms), which greatly reduces the risk that the data is already in the LLMs' training corpus. CodeApex tested the code understanding ability of LLMs in three scenarios (0-shot, 2-shot, and 5-shot) and also tested the impact of the Answer-Only and Chain-of-Thought modes on the ability of LLMs.
The second task, code generation, includes 476 C-based algorithm problems covering common algorithm knowledge points such as binary search and depth-first search. CodeApex gives a description of the problem and a function prototype that implements it, and requires LLMs to complete the main part of the function. CodeApex also provides two scenarios: function-only and function-with-context. The former only has a description of the target function, while the latter additionally provides the calling code, the time and space constraints, and the input and output description of the target function.
Experimental results show that different models perform differently in code-related tasks, and GPT3.5-turbo shows excellent competitiveness and obvious advantages. Furthermore, CodeApex compared the performance of LLMs in bilingual scenarios, revealing different results. Overall, there is still considerable room for improvement in the accuracy of LLMs in the CodeApex rankings, indicating that the potential of LLMs in code-related tasks has not yet been fully exploited.
To fully integrate large language models into actual code production scenarios, programming understanding is essential. Programming understanding requires the ability to understand the code from all aspects, such as mastering the syntax, understanding the code execution flow, and understanding the execution algorithm.
CodeApex extracted 250 multiple-choice questions from college final exam questions as test data. These test data are divided into three categories: conceptual understanding, common sense reasoning, and multi-hop reasoning.
Test mode includes two categories: Answer-Only and Chain-of-Thought.
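The combination of shot count (0, 2, or 5) and test mode can be pictured as a prompt template. The following is a minimal sketch, not CodeApex's actual prompt format: the `Question` class, the `build_prompt` function, and the `Reasoning:` / `Answer:` labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    """Hypothetical container for one multiple-choice question."""
    text: str
    options: List[str]   # e.g. ["A. Queue", "B. Stack"]
    answer: str = ""     # gold letter, only needed for few-shot examples
    rationale: str = ""  # worked explanation, only used in Chain-of-Thought mode


def format_question(q: Question) -> str:
    """Render the question text followed by one option per line."""
    return q.text + "\n" + "\n".join(q.options)


def build_prompt(target: Question, shots: List[Question], chain_of_thought: bool) -> str:
    """Assemble a k-shot prompt, where k = len(shots) (0, 2, or 5 in the article).

    Answer-Only mode ends with "Answer:" so the model replies with a letter;
    Chain-of-Thought mode ends with "Reasoning:" so the model explains first.
    """
    parts = []
    for ex in shots:
        block = format_question(ex) + "\n"
        if chain_of_thought:
            block += f"Reasoning: {ex.rationale}\n"
        block += f"Answer: {ex.answer}"
        parts.append(block)
    tail = "Reasoning:" if chain_of_thought else "Answer:"
    parts.append(format_question(target) + "\n" + tail)
    return "\n\n".join(parts)
```

Calling `build_prompt(target, [], False)` gives the 0-shot Answer-Only prompt; passing two or five solved examples and `chain_of_thought=True` gives the corresponding few-shot Chain-of-Thought prompt.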
The Chinese and English evaluation results of CodeApex on the code understanding task are shown in the following two tables. (The best performing model is shown in bold; the next best performing model is underlined.)
The following conclusions can be drawn:
Training large language models to generate accurate and executable code is a challenging task. CodeApex primarily evaluates the ability of LLMs to generate algorithms based on a given description and automatically evaluates the correctness of the generated code through unit tests.
CodeApex’s code generation tasks include 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search and graph algorithms. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function.
CodeApex provides two scenarios: Function-only and Function-with-context. The Function-only scenario only provides a description of the target function, while the Function-with-context scenario not only provides a description of the target function, but also provides the calling code, time and space constraints, and input and output description of the target function.
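The difference between the two scenarios comes down to which fields reach the prompt. The sketch below illustrates that under stated assumptions: the `Problem` class, its field names, and the section labels are hypothetical and are not taken from the CodeApex data format.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical record for one algorithm problem."""
    description: str          # description of the target function
    prototype: str            # function prototype the model must complete
    calling_code: str = ""    # code that calls the target function
    time_limit: str = ""      # time constraint
    memory_limit: str = ""    # space constraint
    io_description: str = ""  # input and output description


def build_generation_prompt(p: Problem, with_context: bool) -> str:
    """Function-only uses just the description and prototype;
    Function-with-context adds calling code, limits, and I/O description."""
    lines = ["Problem description:", p.description]
    if with_context:
        lines += [
            "Calling code:", p.calling_code,
            "Time limit: " + p.time_limit,
            "Memory limit: " + p.memory_limit,
            "Input and output:", p.io_description,
        ]
    lines += ["Complete the body of this function:", p.prototype]
    return "\n".join(lines)
```

Keeping both variants behind one flag makes it easy to run the same problem set in both scenarios and compare the results.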
Each language version uses two Prompt strategies (Function-only and Function-with-context). To align with human code testing scenarios, evaluation metrics include AC@1, AC@all and AC rate.
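The article names the three metrics without defining them. One plausible reading, assumed here and not confirmed by the source, is that AC@1 counts a problem as solved when the generated code passes at least one test case, AC@all when it passes every test case, and AC rate is the share of all test cases passed. A minimal sketch under that assumption, with hypothetical function names:

```python
from typing import List

# results[i][j] is True when the code generated for problem i passed test case j.
Results = List[List[bool]]


def ac_at_1(results: Results) -> float:
    """Fraction of problems whose code passes at least one test case (assumed definition)."""
    return sum(any(cases) for cases in results) / len(results)


def ac_at_all(results: Results) -> float:
    """Fraction of problems whose code passes every test case (assumed definition)."""
    return sum(bool(cases) and all(cases) for cases in results) / len(results)


def ac_rate(results: Results) -> float:
    """Passed test cases divided by all test cases, pooled over problems (assumed definition)."""
    total = sum(len(cases) for cases in results)
    passed = sum(sum(cases) for cases in results)
    return passed / total


example = [
    [True, True, True],    # fully accepted
    [True, False, False],  # partially correct
    [False, False],        # fails everything
]
print(ac_at_1(example), ac_at_all(example), ac_rate(example))
```

For the three problems in `example`, two pass at least one case, one passes all cases, and four of the eight test cases pass in total.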
The code generation task results of each model are shown in the following two tables. (Best performance: bold; second best performance: underline.)
The following conclusions can be drawn:
Additionally, CodeApex reports the proportion of compilable code in each scenario. After the generated function is connected to the main function, the code that compiles is checked against the test cases.
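The step of joining a generated function to a main function and checking whether the result compiles could look roughly like the sketch below. It is an illustration only: the function names are hypothetical, and the use of `gcc -fsyntax-only` as the compile check is an assumption, not a description of CodeApex's actual harness.

```python
import os
import shutil
import subprocess
import tempfile
from typing import List, Optional


def assemble_program(generated_function: str, main_code: str) -> str:
    """Place the model-generated function ahead of the provided main function."""
    return generated_function.rstrip() + "\n\n" + main_code.lstrip()


def compiles(source: str, compiler: str = "gcc") -> Optional[bool]:
    """Return True/False for whether the source compiles, or None if no compiler is installed.

    Assumes a gcc-compatible compiler; -fsyntax-only checks the code without linking.
    """
    if shutil.which(compiler) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.c")
        with open(path, "w") as f:
            f.write(source)
        proc = subprocess.run([compiler, "-fsyntax-only", path], capture_output=True)
        return proc.returncode == 0


def compilable_rate(flags: List[bool]) -> float:
    """Proportion of generated programs that compiled."""
    return sum(flags) / len(flags) if flags else 0.0
```

Only the programs for which `compiles` returns True would then be run against the test cases; the rest count toward the non-compilable share.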
You can see:
CodeApex serves as a bilingual benchmark focusing on LLMs’ programming abilities, evaluating the programming understanding and code generation abilities of large language models. In terms of programming understanding, CodeApex assessed the abilities of different models on three categories of multiple-choice questions. In terms of code generation, CodeApex uses the pass rate on test cases to evaluate the model's capabilities. For these two tasks, CodeApex carefully designed Prompt strategies and compared them in different scenarios. CodeApex is experimentally evaluated on 14 LLMs, including general LLMs and specialized LLMs fine-tuned on code.
Currently, GPT3.5 has reached a relatively good level in terms of programming capabilities, achieving approximately 50% and 56% accuracy in programming understanding and code generation tasks respectively. CodeApex shows that the potential of large language models for programming tasks has not yet been fully exploited. We expect that leveraging large language models to generate code will revolutionize the field of software development in the near future. As natural language processing and machine learning advance, these models will become more powerful and adept at understanding and generating code snippets. Developers will find they have an unprecedented ally in their coding efforts, as they can rely on these models to automate tedious tasks, increase their productivity, and improve software quality.
In the future, CodeApex will release more tasks (such as code correction) for testing the coding capabilities of large language models. CodeApex’s test data will also continue to be updated with more diverse coding problems. At the same time, human experiments will be added to the CodeApex leaderboard to compare the coding capabilities of large language models with human levels. CodeApex provides a benchmark and reference for research on the programming capabilities of large language models, and will promote their development in the code field.
Shanghai Jiao Tong University APEX Data and Knowledge Management Laboratory was established in 1996. Its founder is Professor Yu Yong, the head teacher of the ACM class. The laboratory is committed to exploring artificial intelligence technology that effectively mines and manages data and summarizes knowledge. It has published more than 500 international academic papers and pursues applications in real-world scenarios. Over the past 27 years, APEX Laboratory has been a global pioneer in many technology waves. The laboratory began to study the core technology of the Semantic Web (now known as the Knowledge Graph) in 2000, personalized search engines and recommender system technology in 2003, and transfer learning theory and algorithms in 2006; in 2009 it began to explore deep learning technology and developed a GPU-based neural network training library. While producing fruitful scientific research and implementation results, APEX Lab has also developed a solid data science and machine learning research team, including Xue Guirong, Zhang Lei, Lin Chenxi, Liu Guangcan, Wang Haofen, Li Lei, Dai Wenyuan, Li Zhenhui, Chen Tianqi, Zhang Weinan, Yang Diyi and other outstanding alumni in the field of artificial intelligence.