Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?-AI-php.cn

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

王林

Sep 05, 2023 pm 11:29 PM

dataModel

Building machines that can write their own code is a goal that pioneers in computer science and artificial intelligence have been pursuing. With the rapid development of GPT-type large models, such a goal is becoming closer than ever.

The emergence of large language models (Large Language Models) has attracted more and more attention from researchers to the programming capabilities of models. Under this situation, the APEX Laboratory of Shanghai Jiao Tong University launched CodeApex - a bilingual benchmark data set focused on assessing the programming understanding and code generation capabilities of LLMs.

To evaluate the programming understanding ability of large language models, CodeApex has designed three types of multiple-choice questions: conceptual understanding, common sense reasoning, and multi-hop reasoning. In addition, CodeApex also utilizes algorithmic questions and corresponding test cases to evaluate the code generation capabilities of LLMs. CodeApex evaluated a total of 14 large language models on coding tasks. Among them, GPT3.5-turbo shows the best programming ability, achieving approximately 50% and 56% accuracy on these two tasks respectively. It can be seen that large language models still have a lot of room for improvement in programming tasks. Building a machine that can write its own code is a very promising future.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

## Website: https://apex.sjtu.edu.cn/codeapex/
Code: https://github.com/APEXLAB/CodeApex.git
Paper: https://apex.sjtu.edu.cn/codeapex/paper/

Introduction

Programming understanding and code generation are critical tasks in software engineering and play a key role in improving developer productivity, enhancing code quality, and automating the software development process. However, these tasks are still challenging for large models due to the complexity and semantic diversity of the code. Compared with ordinary natural language processing, using LLMs to generate code requires more emphasis on grammar, structure, detail processing and context understanding, and has extremely high requirements for the accuracy of the generated content. Traditional approaches include grammar rule-based models, template-based models, and rule-based models, which often rely on manually designed rules and heuristic algorithms that are limited in coverage and accuracy.

In recent years, with the emergence of large-scale pre-trained models such as CodeBERT and GPT3.5, researchers have begun to explore the application of these models in programming understanding and code generation tasks. These models integrate code generation tasks during training, allowing them to understand and generate code. However, a fair assessment of the progress of LLMs in code understanding and generation is difficult due to the lack of standard, publicly available, high-quality, and diverse benchmark datasets. Therefore, establishing a benchmark dataset that broadly covers code semantics and structure is crucial to promote research in programming understanding and code generation.

Existing code benchmark datasets have applicability and diversity issues when applied to LLMs. For example, some datasets are more suitable for evaluating Bert-type, bidirectional language modeling LLMs. However, existing multilingual code benchmark data sets (such as Human-Eval) contain relatively simple problems, lack diversity, and can only implement some basic functional codes.

In order to fill the above gaps, the APEX Data and Knowledge Management Laboratory of Shanghai Jiao Tong University built a new evaluation benchmark for large model code understanding and generation-CodeApex. As a groundbreaking bilingual (English, Chinese) benchmark dataset, CodeApex focuses on evaluating the programming understanding and code generation capabilities of LLMs.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

The overall experimental scenario of CodeApex is shown in the picture above.

The first task, Programming Comprehension, includes 250 multiple-choice questions, divided into conceptual understanding, common sense reasoning and multi-hop reasoning. The questions used for testing are selected from the final exam questions of different courses (programming, data structures, algorithms) in colleges and universities, which greatly reduces the risk that the data is already in the LLMs training corpus. CodeApex tested the code understanding ability of LLMs in three scenarios: 0-shot, 2-shot, and 5-shot, and also tested the impact of Answer-Only and Chain-of-Thought modes on the ability of LLMs.

The second task code generation includes 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search, depth-first search, etc. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function. CodeApex also provides two scenarios: function-only and function-with-context. The difference between them is that the former only has a description of the target function, while the latter, in addition to the description of the target function, is also provided with the calling code and time of the target function. Space constraints, input and output description.

Experimental results show that different models perform differently in code-related tasks, and GPT3.5-turbo shows excellent competitiveness and obvious advantages. Furthermore, CodeApex compared the performance of LLMs in bilingual scenarios, revealing different results. Overall, there is still considerable room for improvement in the accuracy of LLMs in the CodeApex rankings, indicating that the potential of LLMs in code-related tasks has not yet been fully exploited.

Code Understanding

To fully integrate large language models into actual code production scenarios, programming understanding is essential. Programming understanding requires the ability to understand the code from all aspects, such as mastering the syntax, understanding the code execution flow, and understanding the execution algorithm.

CodeApex extracted 250 multiple-choice questions from college final exam questions as test data. These test data are divided into three categories: conceptual understanding, common sense reasoning, and multi-hop reasoning.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Test mode includes two categories: Answer-Only and Chain-of-Thought.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Experimental results and conclusions

The Chinese and English evaluation results of CodeApex on the code understanding task are as follows shown in the two tables. (The best performing model is shown in bold; the next best performing model is underlined.)

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

## The following conclusions can be drawn from it:

Comparison of bilingual abilities. The Chinese version scored higher than the English version. There are two main reasons: (1) The source question descriptions come from the final exams of Chinese universities, so the test questions were originally presented in Chinese. Even if translated into English, they still contain some language habits unique to Chinese people. Therefore, when these biased English questions are input into LLMs, some noise may be introduced into the model's encoding results. (2) Most of the evaluated models are mainly trained on Chinese data, which leads to poor results.
Comparison of abilities of different question types. Across these three problem categories, approximately half of the models performed best on conceptual understanding, suggesting that they likely contained knowledge of programming concepts while being trained. Most models score higher on commonsense reasoning compared to multi-hop reasoning, indicating that the power of LLMs decreases significantly with increasing inference steps.
The role of CoT thinking chain model. The accuracy of most models in CoT mode is close to or lower than Answer-Only mode. There are two reasons for this phenomenon: (1) The evaluated model size does not reach the model size with CoT emergence capability. Previous research believed that the emergence of CoT requires LLMs to have at least 60B parameters. When the number of parameters is insufficient, the CoT setup may introduce additional noise and the response generated by LLMs is unstable. GPT3.5-turbo has reached the point of emergence of emergent capabilities and can achieve higher accuracy in CoT settings. (2) When answering conceptual understanding and common sense reasoning questions, multi-step reasoning is less necessary. Therefore, the CoT capabilities of LLMs cannot help with this type of problem. However, for multi-hop inference problems, some models (such as ChatGLM2, educhat, and GPT3.5-turbo) have significantly improved accuracy in CoT scenarios. (CodeApex excludes CodeT5 from the CoT setup due to its inability to generate responses via thought chains.)

Code Generation

Training large language models to generate accurate and executable code is a challenging task. CodeApex primarily evaluates the ability of LLMs to generate algorithms based on a given description and automatically evaluates the correctness of the generated code through unit tests.

CodeApex’s code generation tasks include 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search and graph algorithms. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

CodeApex provides two scenarios: Function-only and Function-with-context. The Function-only scenario only provides a description of the target function, while the Function-with-context scenario not only provides a description of the target function, but also provides the calling code, time and space constraints, and input and output description of the target function.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Experimental results and conclusions

Each language version uses two Prompt strategies (Function -Only and Function-with-Context). To align with human code testing scenarios, evaluation metrics include AC@1, AC@all and AC rate.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

#The code generation task results of each model are shown in the following two tables. (Best performance: bold; second best performance: underline.)

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

The following conclusions can be drawn:

GPT3.5-turbo performs better than the other 11 LLMs with an average score More than 50%.
WizardCoder and StarCoder ranked second and third, highlighting significant improvements in code generation capabilities through code-based fine-tuning.
In the code generation task, there is no obvious performance difference between the currently tested models on Chinese and English question types.

Additionally, CodeApex provides the proportion of compileable code in each scenario. After connecting the generated function to the main function, the compiled code is checked through test cases.

Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

You can see:

Most models are able to generate more than 50% of the Compile the code, which demonstrates LLMs' ability to understand function prototypes.
Often, providing contextual information about a function can help LLMs generate compilable code.

Conclusion

CodeApex serves as a bilingual benchmark focusing on LLMs’ programming abilities, evaluating programming understanding and code generation of large language models. ability. In terms of programming understanding, CodeApex assessed the abilities of different models in three categories of multiple-choice questions. In terms of code generation, CodeApex uses the pass rate of test code cases to evaluate the model's capabilities. For these two tasks, CodeApex carefully designed Prompt strategies and compared them in different scenarios. CodeApex is experimentally evaluated on 14 LLMs, including general LLMs and specialized LLMs models based on code fine-tuning.

Currently, GPT3.5 has reached a relatively good level in terms of programming capabilities, achieving approximately 50% and 56% accuracy in programming understanding and code generation tasks respectively. CodeApex shows that the potential of large language models for programming tasks has not yet been fully exploited. We expect that leveraging large language models to generate code will revolutionize the field of software development in the near future. As natural language processing and machine learning advance, these models will become more powerful and adept at understanding and generating code snippets. Developers will find they have an unprecedented ally in their coding efforts, as they can rely on these models to automate tedious tasks, increase their productivity, and improve software quality.

In the future, CodeApex will release more tests (such as code correction) for testing the code capabilities of large language models. CodeApex’s test data will also continue to be updated, adding more diverse Code issues. At the same time, human experiments will also be added to the CodeApex list to compare the coding capabilities of large language models with human levels. CodeApex provides a benchmark and reference for the research on large language model programming capabilities, and will promote the development and prosperity of large language models in the code field.

Introduction to APEX Laboratory

Shanghai Jiao Tong University APEX Data and Knowledge Management Laboratory was established in 1996. Its founder is Tou Yu, the head teacher of the ACM class Professor Yong. The laboratory is committed to exploring artificial intelligence technology that effectively mines and manages data and summarizes knowledge. It has published more than 500 international academic papers and pursues practical applications in practical scenarios. Over the past 27 years, APEX Laboratory has become a global pioneer in many world technology waves. The laboratory began to study the core technology of the Semantic Web (now known as the Knowledge Graph) in 2000, and began to study personalized search engines and recommendations in 2003. System technology, began to study transfer learning theory and algorithm in 2006, began to explore deep learning technology in 2009 and developed neural network training library based on GPU. While producing fruitful scientific research and implementation results, APEX Lab has also developed a solid data science and machine learning research team, including Xue Guirong, Zhang Lei, Lin Chenxi, Liu Guangcan, Wang Haofen, Li Lei, Dai Wenyuan, Li Zhenhui, Chen Tianqi, Zhang Weinan, Yang Diyi and other outstanding alumni in the field of artificial intelligence.

The above is the detailed content of Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

Can't use ChatGPT! Explaining the causes and solutions that can be tested immediately [Latest 2025]May 14, 2025 am 05:04 AM

ChatGPT is not accessible? This article provides a variety of practical solutions! Many users may encounter problems such as inaccessibility or slow response when using ChatGPT on a daily basis. This article will guide you to solve these problems step by step based on different situations. Causes of ChatGPT's inaccessibility and preliminary troubleshooting First, we need to determine whether the problem lies in the OpenAI server side, or the user's own network or device problems. Please follow the steps below to troubleshoot: Step 1: Check the official status of OpenAI Visit the OpenAI Status page (status.openai.com) to see if the ChatGPT service is running normally. If a red or yellow alarm is displayed, it means Open

Calculating The Risk Of ASI Starts With Human MindsMay 14, 2025 am 05:02 AM

On 10 May 2025, MIT physicist Max Tegmark told The Guardian that AI labs should emulate Oppenheimer’s Trinity-test calculus before releasing Artificial Super-Intelligence. “My assessment is that the 'Compton constant', the probability that a race to

An easy-to-understand explanation of how to write and compose lyrics and recommended tools in ChatGPTMay 14, 2025 am 05:01 AM

AI music creation technology is changing with each passing day. This article will use AI models such as ChatGPT as an example to explain in detail how to use AI to assist music creation, and explain it with actual cases. We will introduce how to create music through SunoAI, AI jukebox on Hugging Face, and Python's Music21 library. Through these technologies, everyone can easily create original music. However, it should be noted that the copyright issue of AI-generated content cannot be ignored, and you must be cautious when using it. Let’s explore the infinite possibilities of AI in the music field together! OpenAI's latest AI agent "OpenAI Deep Research" introduces: [ChatGPT]Ope

What is ChatGPT-4? A thorough explanation of what you can do, the pricing, and the differences from GPT-3.5!May 14, 2025 am 05:00 AM

The emergence of ChatGPT-4 has greatly expanded the possibility of AI applications. Compared with GPT-3.5, ChatGPT-4 has significantly improved. It has powerful context comprehension capabilities and can also recognize and generate images. It is a universal AI assistant. It has shown great potential in many fields such as improving business efficiency and assisting creation. However, at the same time, we must also pay attention to the precautions in its use. This article will explain the characteristics of ChatGPT-4 in detail and introduce effective usage methods for different scenarios. The article contains skills to make full use of the latest AI technologies, please refer to it. OpenAI's latest AI agent, please click the link below for details of "OpenAI Deep Research"

Explaining how to use the ChatGPT app! Japanese support and voice conversation functionMay 14, 2025 am 04:59 AM

ChatGPT App: Unleash your creativity with the AI assistant! Beginner's Guide The ChatGPT app is an innovative AI assistant that handles a wide range of tasks, including writing, translation, and question answering. It is a tool with endless possibilities that is useful for creative activities and information gathering. In this article, we will explain in an easy-to-understand way for beginners, from how to install the ChatGPT smartphone app, to the features unique to apps such as voice input functions and plugins, as well as the points to keep in mind when using the app. We'll also be taking a closer look at plugin restrictions and device-to-device configuration synchronization

How do I use the Chinese version of ChatGPT? Explanation of registration procedures and feesMay 14, 2025 am 04:56 AM

ChatGPT Chinese version: Unlock new experience of Chinese AI dialogue ChatGPT is popular all over the world, did you know it also offers a Chinese version? This powerful AI tool not only supports daily conversations, but also handles professional content and is compatible with Simplified and Traditional Chinese. Whether it is a user in China or a friend who is learning Chinese, you can benefit from it. This article will introduce in detail how to use ChatGPT Chinese version, including account settings, Chinese prompt word input, filter use, and selection of different packages, and analyze potential risks and response strategies. In addition, we will also compare ChatGPT Chinese version with other Chinese AI tools to help you better understand its advantages and application scenarios. OpenAI's latest AI intelligence

5 AI Agent Myths You Need To Stop Believing NowMay 14, 2025 am 04:54 AM

These can be thought of as the next leap forward in the field of generative AI, which gave us ChatGPT and other large-language-model chatbots. Rather than simply answering questions or generating information, they can take action on our behalf, inter

An easy-to-understand explanation of the illegality of creating and managing multiple accounts using ChatGPTMay 14, 2025 am 04:50 AM

Efficient multiple account management techniques using ChatGPT | A thorough explanation of how to use business and private life! ChatGPT is used in a variety of situations, but some people may be worried about managing multiple accounts. This article will explain in detail how to create multiple accounts for ChatGPT, what to do when using it, and how to operate it safely and efficiently. We also cover important points such as the difference in business and private use, and complying with OpenAI's terms of use, and provide a guide to help you safely utilize multiple accounts. OpenAI

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

How to fix KB5055612 fails to install in Windows 10?

4 weeks agoByDDD

Roblox: Grow A Garden - Complete Mutation Guide

3 weeks agoByDDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Nordhold: Fusion System, Explained

4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SublimeText3 Chinese version

Chinese version, very easy to use

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software