Shanghai Jiao Tong University releases CodeApex, a large-model bilingual programming evaluation benchmark. Have machines really begun to challenge humans in writing code?

Building machines that can write their own code is a goal that pioneers in computer science and artificial intelligence have long pursued. With the rapid development of GPT-style large models, that goal is closer than ever.

The emergence of large language models (LLMs) has drawn growing research attention to the programming capabilities of these models. Against this backdrop, the APEX Laboratory of Shanghai Jiao Tong University has launched CodeApex, a bilingual benchmark dataset focused on assessing the programming understanding and code generation capabilities of LLMs.

To evaluate the programming understanding ability of large language models, CodeApex designed three types of multiple-choice questions: conceptual understanding, commonsense reasoning, and multi-hop reasoning. In addition, CodeApex uses algorithmic problems and corresponding test cases to evaluate the code generation capabilities of LLMs. CodeApex evaluated a total of 14 large language models on these coding tasks. Among them, GPT3.5-turbo showed the best programming ability, achieving approximately 50% and 56% accuracy on the two tasks respectively. Clearly, large language models still have substantial room for improvement on programming tasks, and a machine that can write its own code remains a very promising prospect.


  • Website: https://apex.sjtu.edu.cn/codeapex/
  • Code: https://github.com/APEXLAB/CodeApex.git
  • Paper: https://apex.sjtu.edu.cn/codeapex/paper/

Introduction

Programming understanding and code generation are critical tasks in software engineering and play a key role in improving developer productivity, enhancing code quality, and automating the software development process. However, these tasks remain challenging for large models due to the complexity and semantic diversity of code. Compared with ordinary natural language processing, generating code with LLMs places far greater emphasis on grammar, structure, detail handling, and context understanding, and demands extremely high accuracy in the generated content. Traditional approaches include grammar-rule-based, template-based, and rule-based models, which often rely on manually designed rules and heuristic algorithms with limited coverage and accuracy.

In recent years, with the emergence of large-scale pre-trained models such as CodeBERT and GPT3.5, researchers have begun to explore their application to programming understanding and code generation tasks. These models incorporate code generation tasks during training, allowing them to both understand and generate code. However, fairly assessing the progress of LLMs in code understanding and generation is difficult due to the lack of standard, publicly available, high-quality, and diverse benchmark datasets. Establishing a benchmark dataset that broadly covers code semantics and structure is therefore crucial to advancing research in programming understanding and code generation.

Existing code benchmark datasets suffer from applicability and diversity issues when applied to LLMs. For example, some datasets are better suited to evaluating BERT-style, bidirectional language models. Meanwhile, existing multilingual code benchmarks (such as HumanEval) contain relatively simple problems, lack diversity, and only exercise basic functional code.

To fill these gaps, the APEX Data and Knowledge Management Laboratory of Shanghai Jiao Tong University built CodeApex, a new evaluation benchmark for large-model code understanding and generation. As a groundbreaking bilingual (English and Chinese) benchmark dataset, CodeApex focuses on evaluating the programming understanding and code generation capabilities of LLMs.

[Figure: the overall experimental scenario of CodeApex]

The overall experimental scenario of CodeApex is shown in the figure above.

The first task, programming comprehension, includes 250 multiple-choice questions divided into conceptual understanding, commonsense reasoning, and multi-hop reasoning. The test questions are drawn from final-exam questions of various university courses (programming, data structures, algorithms), which greatly reduces the risk that the data already appears in LLMs' training corpora. CodeApex tests the code understanding ability of LLMs in three settings: 0-shot, 2-shot, and 5-shot, and also tests the impact of Answer-Only and Chain-of-Thought modes on LLM performance.
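The actual questions are not reproduced here, but a multi-hop reasoning item might, for instance, present a short program and ask what value it computes, forcing the model to chain several execution steps together. The following is a hypothetical illustration in this style, not an actual CodeApex question:

```c
/* Hypothetical multi-hop reasoning item (illustration only):
 * "What value does run() return?"
 * Answering requires tracing the recursion and the loop together. */

int f(int n) {
    if (n <= 1) return 1;
    return n * f(n - 2);   /* n * (n-2) * (n-4) * ... down to 1 */
}

int run(void) {
    int total = 0;
    for (int i = 1; i <= 5; i += 2)
        total += f(i);     /* f(1)=1, f(3)=3, f(5)=15 */
    return total;          /* 19 */
}
```

A model answering in Answer-Only mode must produce 19 directly; in Chain-of-Thought mode it can spell out the intermediate values of f first.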

The second task, code generation, includes 476 C-based algorithm problems covering common algorithmic topics such as binary search and depth-first search. CodeApex gives a description of the problem and a function prototype, and requires LLMs to complete the main part of the function. CodeApex also provides two scenarios: function-only and function-with-context. The former provides only a description of the target function, while the latter additionally provides the code that calls the target function, time and space constraints, and input/output descriptions.
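As a hypothetical illustration of the function-only setting (the description and prototype below are modeled on the paper's setup, not taken from the dataset), a problem might supply just a prototype and a short task description, leaving the body for the model to fill in:

```c
/* Hypothetical function-only problem: "Return the index of `target` in
 * the sorted array `a` of length `n`, or -1 if absent."
 * CodeApex would supply the prototype; the body below is the kind of
 * completion an LLM is expected to generate. */
int binary_search(const int *a, int n, int target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
        if (a[mid] == target) return mid;
        if (a[mid] < target)  lo = mid + 1;
        else                  hi = mid - 1;
    }
    return -1;
}
```

In the function-with-context setting, the prompt would additionally include the caller's code, time and space limits, and input/output descriptions.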

Experimental results show that different models perform differently on code-related tasks, with GPT3.5-turbo showing excellent competitiveness and clear advantages. Furthermore, CodeApex compared the performance of LLMs in the two language settings, revealing differing results. Overall, there is still considerable room for improvement in the accuracy of LLMs on the CodeApex leaderboard, indicating that the potential of LLMs on code-related tasks has not yet been fully exploited.

Code Understanding

To fully integrate large language models into real code-production scenarios, programming understanding is essential. It requires the ability to understand code from all angles, such as mastering the syntax, following the execution flow, and understanding the algorithm being executed.

CodeApex extracted 250 multiple-choice questions from college final exam questions as test data. These test data are divided into three categories: conceptual understanding, common sense reasoning, and multi-hop reasoning.


Test modes include two categories: Answer-Only and Chain-of-Thought.


Experimental results and conclusions

The Chinese and English evaluation results of CodeApex on the code understanding task are shown in the two tables below. (The best-performing model is shown in bold; the second-best is underlined.)

[Tables: code understanding results in the Chinese and English settings]

The following conclusions can be drawn:

  • Comparison of bilingual abilities. The models score higher on the Chinese version than on the English version, for two main reasons: (1) The source questions come from final exams at Chinese universities, so the test questions were originally written in Chinese. Even after translation into English, they still carry language habits peculiar to Chinese writers, and these biased English questions may introduce noise into the models' encodings. (2) Most of the evaluated models are trained mainly on Chinese data, which hurts their English results.
  • Comparison across question types. Across the three categories, roughly half of the models performed best on conceptual understanding, suggesting that they likely acquired knowledge of programming concepts during training. Most models score higher on commonsense reasoning than on multi-hop reasoning, indicating that the capability of LLMs drops significantly as the number of inference steps increases.
  • The role of Chain-of-Thought (CoT) prompting. The accuracy of most models in CoT mode is close to or lower than in Answer-Only mode. There are two reasons for this: (1) The evaluated models have not reached the scale at which CoT ability emerges. Previous research suggested that CoT emergence requires LLMs of at least 60B parameters; with fewer parameters, the CoT setup may introduce extra noise and the responses LLMs generate become unstable. GPT3.5-turbo has reached the point of emergence and achieves higher accuracy in the CoT setting. (2) Conceptual understanding and commonsense reasoning questions rarely require multi-step reasoning, so the CoT abilities of LLMs cannot help on this type of problem. For multi-hop reasoning problems, however, some models (such as ChatGLM2, educhat, and GPT3.5-turbo) show significantly improved accuracy in the CoT setting. (CodeApex excluded CodeT5 from the CoT setup because it cannot generate chain-of-thought responses.)

Code Generation

Training large language models to generate accurate and executable code is a challenging task. CodeApex primarily evaluates the ability of LLMs to generate algorithms based on a given description and automatically evaluates the correctness of the generated code through unit tests.
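This unit-test-based evaluation loop can be sketched as follows. This is a simplified, hypothetical harness (`solve`, `run_tests`, and the test cases are illustrative names); the actual CodeApex pipeline compiles the model's output and runs it against the dataset's own test cases:

```c
/* Simplified sketch of test-case-based scoring. `solve` stands in for
 * a generated function; the real harness would compile LLM output
 * instead of linking a hard-coded body. */
static int solve(int x) { return x * 2; }   /* hypothetical generated code */

struct test_case { int input; int expected; };

/* Returns how many of the n test cases the candidate passes. */
int run_tests(const struct test_case *cases, int n) {
    int passed = 0;
    for (int i = 0; i < n; i++)
        if (solve(cases[i].input) == cases[i].expected)
            passed++;
    return passed;
}
```

A problem's score then depends on how many of its test cases the generated function passes.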

CodeApex’s code generation tasks include 476 C-based algorithm problems, covering common algorithm knowledge points, such as binary search and graph algorithms. CodeApex gives a description of the problem and a function prototype that implements the problem, and requires LLMs to complete the main part of the function.


CodeApex provides two scenarios: Function-only and Function-with-context. The Function-only scenario only provides a description of the target function, while the Function-with-context scenario not only provides a description of the target function, but also provides the calling code, time and space constraints, and input and output description of the target function.


Experimental results and conclusions

Each language version uses two prompt strategies (Function-Only and Function-with-Context). To align with human code-testing scenarios, the evaluation metrics include AC@1, AC@all, and AC rate.
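Under one plausible reading of these metrics (an assumption on my part; consult the paper for the exact definitions), AC@1 counts a problem if at least one of its test cases passes, AC@all counts it only if every test case passes, and AC rate is the fraction of individual test cases passed across all problems. A sketch under those assumed definitions:

```c
/* Sketch of the three accuracy metrics under ASSUMED definitions
 * (AC@1: >=1 test case passed; AC@all: all passed; AC rate:
 * overall fraction of test cases passed). For problem i, passed[i]
 * and total[i] give its passed and total test-case counts. */
void compute_metrics(const int *passed, const int *total, int n,
                     double *ac_at_1, double *ac_at_all, double *ac_rate) {
    int any = 0, all = 0, case_pass = 0, case_total = 0;
    for (int i = 0; i < n; i++) {
        if (passed[i] > 0) any++;
        if (passed[i] == total[i]) all++;
        case_pass  += passed[i];
        case_total += total[i];
    }
    *ac_at_1   = (double)any / n;
    *ac_at_all = (double)all / n;
    *ac_rate   = (double)case_pass / case_total;
}
```

The three numbers give progressively stricter views of the same runs: AC rate rewards partial credit, while AC@all only rewards fully correct solutions.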


The code generation results of each model are shown in the following two tables. (Best performance in bold; second-best underlined.)

[Tables: code generation results in the Chinese and English settings]

The following conclusions can be drawn:

  • GPT3.5-turbo outperforms the other 11 LLMs, with an average score above 50%.
  • WizardCoder and StarCoder ranked second and third, highlighting significant improvements in code generation capabilities through code-based fine-tuning.
  • On the code generation task, there is no obvious performance difference between the tested models on the Chinese and English versions of the problems.

Additionally, CodeApex reports the proportion of compilable code in each scenario. After connecting the generated function to the main function, the compiled code is checked against test cases.

[Figure: proportion of compilable code in each scenario]

The following can be seen:

  • Most models can generate compilable code for more than 50% of the problems, demonstrating LLMs' ability to understand function prototypes.
  • Often, providing contextual information about a function can help LLMs generate compilable code.

Conclusion

CodeApex serves as a bilingual benchmark focused on LLMs' programming abilities, evaluating the programming understanding and code generation capabilities of large language models. For programming understanding, CodeApex assesses models across three categories of multiple-choice questions. For code generation, CodeApex uses the pass rate on test cases to evaluate model capability. For both tasks, CodeApex carefully designed prompt strategies and compared them across scenarios. CodeApex evaluates 14 LLMs, including general-purpose LLMs and specialized models fine-tuned on code.

Currently, GPT3.5 has reached a relatively good level in terms of programming capabilities, achieving approximately 50% and 56% accuracy in programming understanding and code generation tasks respectively. CodeApex shows that the potential of large language models for programming tasks has not yet been fully exploited. We expect that leveraging large language models to generate code will revolutionize the field of software development in the near future. As natural language processing and machine learning advance, these models will become more powerful and adept at understanding and generating code snippets. Developers will find they have an unprecedented ally in their coding efforts, as they can rely on these models to automate tedious tasks, increase their productivity, and improve software quality.

In the future, CodeApex will release more tests (such as code correction) for assessing the code capabilities of large language models, and its test data will be continuously updated with more diverse code problems. Human experiments will also be added to the CodeApex leaderboard to compare the coding capabilities of large language models against human level. CodeApex provides a benchmark and reference for research on the programming capabilities of large language models, and will promote their development and prosperity in the code domain.

Introduction to APEX Laboratory

The APEX Data and Knowledge Management Laboratory of Shanghai Jiao Tong University was established in 1996 by its founder, Professor Yu Yong, head teacher of the ACM class. The laboratory is committed to exploring artificial intelligence techniques that effectively mine and manage data and distill knowledge; it has published more than 500 international academic papers and pursues applications in practical scenarios. Over the past 27 years, APEX Laboratory has been a global pioneer in multiple technology waves: it began researching core Semantic Web technology (now known as knowledge graphs) in 2000, personalized search engine and recommender system technology in 2003, and transfer learning theory and algorithms in 2006, and in 2009 began exploring deep learning technology and developed a GPU-based neural network training library. While producing fruitful research and deployment results, APEX Lab has also cultivated a strong data science and machine learning research team, including Xue Guirong, Zhang Lei, Lin Chenxi, Liu Guangcan, Wang Haofen, Li Lei, Dai Wenyuan, Li Zhenhui, Chen Tianqi, Zhang Weinan, Yang Diyi, and other outstanding alumni in the field of artificial intelligence.
