Entrance exam questions become a data set for Chinese large models: 20,477 questions, each with 4 candidate answers


PHPz

May 27, 2023 pm 09:13 PM


As Chinese large language models have shown strong performance in natural language understanding and generation, the existing Chinese evaluation benchmarks built for specific natural language processing tasks are no longer sufficient to evaluate them effectively. Traditional Chinese evaluation benchmarks mainly focus on a model's ability to understand simple common sense (such as needing to bring an umbrella when going out on a rainy day) and surface-level semantics (such as whether a basketball game report is sports news or technology news), while ignoring the mining and use of complex human knowledge. At present, there is a lack of data sets for evaluating the complex knowledge of Chinese large models, especially professional knowledge at different levels and in different fields under China's education system.

To bridge this gap, the Tianjin University Natural Language Processing Laboratory and Huawei Noah's Ark Laboratory jointly released the M3KE (A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models) benchmark data set, which tests the ability of Chinese large models to master multi-level, multi-disciplinary knowledge in zero-shot and few-shot settings.


Paper link: https://arxiv.org/abs/2305.10263
Data link: https://github.com/tjunlp-lab/M3KE

M3KE Dataset

Dataset Introduction

M3KE collects 20,477 real standardized test questions, each with 4 candidate answers, covering 71 tasks. The questions come from elementary school, junior high school, high school, university, and graduate entrance examinations, and involve humanities, history, politics, law, education, psychology, science, engineering technology, art and other disciplines. The distribution is shown in Fig. 1.
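To make the shape of such a question concrete, here is a minimal sketch of how one item could be represented in Python. The class name and all field names are hypothetical and are not taken from the M3KE repository. The sample record is adapted from the primary school math example in this article, and its answer label is derived by simple arithmetic, not read from the data set.

```python
from dataclasses import dataclass
from typing import Tuple

VALID_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Question:
    """One multiple-choice item; all field names are hypothetical."""
    task: str                           # e.g. "Primary school math"
    stage: str                          # e.g. "primary school"
    stem: str                           # the question text
    options: Tuple[str, str, str, str]  # exactly four candidate answers
    answer: str                         # gold label, one of A-D

    def __post_init__(self) -> None:
        # Enforce the "4 candidate answers" format described in the article.
        if len(self.options) != 4:
            raise ValueError("each question needs exactly 4 candidate answers")
        if self.answer not in VALID_LABELS:
            raise ValueError("answer must be one of A, B, C, D")


sample = Question(
    task="Primary school math",
    stage="primary school",
    stem=("The price of a product is first increased by 20%, and then "
          "reduced by 20%. How does the current price compare with the "
          "original price?"),
    options=("Increased", "Decreased", "Unchanged", "Don't know"),
    # Label derived by arithmetic (1.2 * 0.8 = 0.96), not read from the dataset.
    answer="B",
)
```

Keeping the task and the education stage on each record would make it straightforward to report results both by subject category and by stage, as the article does later.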


Researchers constructed the M3KE data set based on two criteria:

1. In line with the Chinese education system, covering multiple education stages

The researchers followed the educational path of Chinese students, namely primary school, junior high school, high school, university and other major education stages, aiming to evaluate the performance of Chinese large models at each stage. Since the knowledge points to be mastered differ between stages (for example, in the Chinese language subject, the knowledge and test points of primary school differ markedly from those of junior high school), M3KE includes the same subject at different education stages. To improve the coverage of subject knowledge points, the researchers selected unified examination questions from China's entrance examinations, including real questions from the primary-to-junior-high entrance examination, the high school entrance examination, the college entrance examination, the graduate entrance examination and the Chinese civil service examination.

2. Covering multiple disciplines

To improve the subject coverage of the data set, the researchers built it around three major categories, humanities and arts, social sciences, and natural sciences, covering literature, science, history, politics, law, education, psychology, engineering technology, art and other disciplines. To further enrich the data set, the researchers added tasks such as traditional Chinese medicine, religion, and the computer grade examination.

Dataset Statistics

Table 3 shows the overall statistics of M3KE. The four subject categories contain 12, 21, 31 and 7 tasks respectively, and 3,612, 6,222, 8,162 and 2,126 questions respectively. The largest task contains 425 questions and the smallest contains 100. Questions in social sciences and natural sciences are generally longer than those in arts and humanities and other subjects, while their answer options are shorter.


Introduction and examples of M3KE from a multidisciplinary perspective

Humanities and Arts

The humanities and arts category includes subjects from multiple fields such as Chinese language, art, and history. These subjects focus on the analysis and interpretation of literary and cultural works. Taking primary school Chinese language as an example, the test questions are designed to assess the language use and literary appreciation abilities of students aged 7 to 13, such as the ability to use synonyms and antonyms. The history subject covers Chinese and world history from ancient to modern times. In addition to the humanities, M3KE also includes art subjects, such as dance, fine art, music, and film. Art is an important part of human culture, so it is equally important to evaluate the performance of Chinese large models in this field.

Art task example:

Which of the following statements about the Lascaux cave paintings is incorrect?

A. This mural was discovered in France

B. There are more than 100 animal images found

C. The time of discovery was 1940

D. The color of the mural is mainly black

Modern world history task example:

It took more than two centuries to go from the Dutch Revolution to the French Revolution, yet only half a century later capitalism had begun to form a world system. This is mainly because:

A. The influence of the French Revolution was widely spread

B. The Vienna System intensified social conflicts in various countries

C. The Industrial Revolution rapidly increased the power of capitalism

D. Colonial rule spread across all continents of the world

Social Science

Social science focuses on the application of the humanities, such as law, politics, education, and psychology. Politics courses run through multiple education stages, including junior high school, high school, university, and postgraduate education, while the other subjects are mainly university-level courses. Social sciences also include economics and management tasks. The test questions for these tasks are selected from the Economics Joint Examination and the Management Joint Examination of the Chinese Graduate Entrance Examination, and cover microeconomics, macroeconomics, management, logic, etc.

Criminal Law Task Example:

A wants to kill B, so he puts poison into B's food. After B eats it, A regrets his action, quickly explains the situation and sends B to the hospital. On examination, the hospital finds that the "poison" administered by A was not toxic at all, and B is safe and sound. A's conduct constitutes:

A. Does not constitute a crime

B. Attempted crime

C. Crime discontinued

D. Completed crime

Principles of education task example:

What is the most basic and most commonly used research method in educational research?

A. Educational observational research

B. Educational survey research

C. Educational measurement research

D. Educational experimental research

Natural Science

Natural sciences include engineering, science, medicine and basic subjects such as mathematics, physics, chemistry and biology. These subjects often require complex computational, analytical and logical reasoning skills. In China's education system, the same subject involves different types of knowledge at different stages. For example, primary school mathematics focuses on basic arithmetic operations, while high school mathematics covers more advanced concepts such as sequences, derivatives, and geometry.

Animal Physiology Task Example:

Anesthetizing nerve fibers with procaine affects which characteristic of excitation conduction in nerve fibers?

A. Physiological integrity

B. Insulation

C. Bidirectional conductivity

D. Relatively fatigue-free

Operating system task example:

The directory structure has a great impact on file retrieval efficiency. Which of the following is the most advanced directory structure?

A. Single-level directory

B. Two-level directory

C. Three-level directory

D. Tree directory

Others

Other types of tasks include religion, the Chinese civil service examination, the computer grade examination, etc. These tasks require knowledge that is not limited to a single level or discipline described above. For example, the Chinese civil service examination involves general knowledge, humanities, and logic, so the researchers regard these tasks as an assessment of the comprehensive knowledge of Chinese large models.

Chinese Civil Service Examination Task Example:

Several previous studies have shown that eating chocolate increases the likelihood of heart disease in those who eat it. A new, more reliable study concludes that chocolate consumption is not associated with heart disease rates. It is estimated that after the results of this research are released, the consumption of chocolate will increase significantly. The above inference is based on which of the following assumptions?

A. Some people eat chocolate even though they know it increases the likelihood of heart disease

B. People have never believed that eating chocolate makes one more likely to suffer from heart disease

C. Now many people eat chocolate because they have not heard that chocolate can cause heart disease

D. Nowadays, many people do not eat chocolate simply because they believe that chocolate can induce heart disease

Traditional Chinese Medicine Task Example:

Ginseng greatly replenishes vital energy and qi, but which medicine is often used as its substitute in chronic debilitating diseases?

A. Salvia

B. Codonopsis pilosula

C. Astragalus

D. Taizishen (Pseudostellaria root)

Introduction and examples of M3KE from the perspective of multiple education stages

The researchers divided the data set into stages according to the Chinese education system, including primary school, junior high school, high school, university, and graduate entrance examinations. The researchers also selected some examination subjects outside the education system, such as the computer grade examination and the Chinese civil service examination.

Primary school

Primary school Chinese language task example:

Which of the following groups of words is written entirely correctly?

A. The sound of nature, the flowing clouds and flowing water, the pen and the dragon and the snake, rummaging through boxes and cabinets

B. The mountains and flowing water, singing and dancing, the finishing touch, unique ideas

C. The sound lingers, the skills are clever, the pen is full of flowers, restless

D. Huang Zhongda Lu is vivid, lifelike, elite troops and reduced government

Primary school math task example:

The price of a product is first increased by 20%, and then reduced by 20%. How does the current price compare with the original price?

A. Increased

B. Decreased

C. Unchanged

D. Don’t know

Junior high school

Junior high school Chinese language task example:

Which of the following statements is correct?

A. "The Most Painful and the Most Happy" is selected from "Selected Works of Liang Qichao". The author Liang Qichao is a thinker and scholar in the Ming Dynasty

B. "Zou Ji satirizes the King of Qi and accepts advice" is selected from "Warring States Policy". "Warring States Policy" is a compilation of the strategies and opinions of lobbyists during the Warring States Period. It was compiled into thirty-three chapters by Liu Xiang of the Eastern Han Dynasty

C. Ci poetry is also called "long and short sentences", and its lines vary in length. It flourished in the Song Dynasty. Su Shi and Xin Qiji were representatives of the bold school, while Li Qingzhao was a representative of the graceful school

D. ..., which embodies the author's idea of sharing joy with the people

Junior high school politics task example:

The class is producing a blackboard newspaper on the theme "advocating the spirit of the rule of law", and Xiaolan is responsible for writing the "Practicing Equality" section. Which of the following materials she collected is suitable for selection?

A. There are priority seats on the bus for "the elderly, the weak, the sick, and pregnant women"

B. Middle school students visit a revolutionary tradition education base to take part in study activities

C. People's Liberation Army soldiers braved severe cold and heat to guard the borders of the motherland

D. Students used holidays to clear small advertisements on the streets

High School

High school Chinese language task example:

Shen Kuo said in "Mengxi Bi Tan": "The changes of heaven and earth, cold and heat, wind and rain, floods, droughts, locusts, all have laws." What is the philosophical meaning of this sentence?

A. Laws are the root cause of changes in objective things

B. Laws are objective and universal

C. Learn to look at problems from the perspective of connection

D. Learn to look at issues from the perspective of development

High school biology task example:

Environmental carrying capacity depends on the environmental conditions in which a population lives. Which of the following statements is correct?

A. The environmental carrying capacity of gray magpie populations in two places must be the same

B. The environmental carrying capacity of East Asian migratory locusts living in a certain grassland may be the same in different years

C. When a population approaches its environmental carrying capacity, the death rate increases and the birth rate remains unchanged

D. The environmental carrying capacity of crucian carp and snakehead fish living in Weishan Lake is the same

University

University stomatology task example:

Which oral cancer ranks first in China?

A. Alveolar mucosal cancer

B. Buccal mucosal cancer

C. Lip cancer

D. Tongue cancer

University comprehensive economics task example:

Which of the following items should be included in GDP?

A. Government transfer payment

B. Purchase of a used car

C. Loan and bond interest paid by the business

D. 10,000 yuan won from buying lottery tickets

Others

Computer grade examination (computer fundamentals) task example:

A worksheet contains so much data that the title in the first row is no longer visible when scrolling. What is the fastest way to keep the title row always visible?

A. Set "Print Title"

B. Freeze Pane

C. Freeze the first row

D. Freeze the first column

Religion task example:

What is the political basis on which religion can adapt to a socialist society?

A. The establishment of the people's democratic dictatorship state power

B. The vast majority of religious believers support the socialist system, and their fundamental interests are consistent with those of the people of the whole country

C. The establishment of the leadership and ruling status of the Communist Party of China

D. Be independent and run your own church

Experiment

Evaluation model

Zero-shot/Few-shot evaluation

In the zero-shot setting, the model is required to answer the question directly. In the few-shot setting, the model is first given several examples from the same task to guide in-context learning. In M3KE, all questions are scored by accuracy.
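The difference between the two settings, and the accuracy metric, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prompt template ("Answer:" suffix, lettered options) and the function names are hypothetical and are not the prompts used by the M3KE authors, and the call to an actual model is left out.

```python
LETTERS = "ABCD"


def format_question(stem, options, answer=None):
    """Render one question; append the gold label only for demonstrations."""
    lines = [stem]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise.

    `target` is (stem, options); each example is (stem, options, answer).
    The template is an assumption, not the one used in the paper.
    """
    parts = [format_question(s, o, a) for s, o, a in examples]
    parts.append(format_question(target[0], target[1]))
    return "\n\n".join(parts)


def accuracy(predictions, gold):
    """Fraction of predicted labels that match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Demonstration and target adapted from example questions in this article.
demo = (
    "The price of a product is first increased by 20%, and then reduced "
    "by 20%. How does the current price compare with the original price?",
    ("Increased", "Decreased", "Unchanged", "Don't know"),
    "B",  # derived by arithmetic: 1.2 * 0.8 = 0.96
)
target = (
    "Which of the following is the most advanced directory structure?",
    ("Single-level directory", "Two-level directory",
     "Three-level directory", "Tree directory"),
)

zero_shot_prompt = build_prompt(target)
few_shot_prompt = build_prompt(target, examples=[demo])
```

In both settings the prompt ends with an open "Answer:" line for the model to complete; the few-shot prompt simply places solved examples from the same task in front of it.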

Evaluation results under different subject categories


Evaluation results under different education stages


Analysis of results

1. In the zero-shot evaluation (Tables 4 and 6), the accuracy of all pre-trained language models (without fine-tuning) with fewer than 10B parameters is lower than the random result (25%). The few-shot setting (Tables 5 and 7) helps improve model performance. However, GLM130B performs better in the zero-shot evaluation than in the few-shot evaluation. The reason may be that GLM130B used some instruction data during pre-training, so it already has good zero-shot learning ability.

2. Most of the fine-tuned Chinese large models only reach the level of the random result (25%), even on the primary school level test (Tables 6 and 7). This shows that knowledge at lower education levels is still one of the weaknesses of current Chinese large models.

3. In the zero-shot evaluation, BELLE-7B-2M achieved the best results among the Chinese large models, but still trailed GPT-3.5-turbo by 14.8%. The number of supervised fine-tuning instructions is also an important factor: BELLE-7B-2M, fine-tuned with two million instructions, outperforms BELLE-7B-0.2M, fine-tuned with two hundred thousand instructions (Table 4).

4. The few-shot setting does not improve performance in most cases (Tables 5 and 7 vs. Tables 4 and 6), especially for language models trained with instruction fine-tuning or reinforcement learning from human feedback. This shows that instruction fine-tuning of a pre-trained language model can significantly improve its zero-shot learning ability, so that it does not need additional examples to understand the intent of an instruction or question.
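The 25% random result cited in the analysis follows directly from the format: with four candidate answers and one correct label, a uniform random guess is right with probability 1/4. The short simulation below is a sketch of that baseline only; the gold labels are synthetic placeholders, not M3KE data, and the function name is hypothetical.

```python
import random


def random_guess_accuracy(gold, num_options=4, seed=0):
    """Accuracy of guessing uniformly at random among the first options."""
    rng = random.Random(seed)          # seeded for reproducibility
    letters = "ABCD"[:num_options]
    hits = sum(rng.choice(letters) == g for g in gold)
    return hits / len(gold)


# Analytical baseline for four candidate answers.
expected = 1 / 4

# Synthetic gold labels (placeholders, not taken from the dataset).
gold = ["ABCD"[i % 4] for i in range(20000)]
simulated = random_guess_accuracy(gold)
```

A model whose accuracy sits at or below this level on a task is, by this measure, doing no better than chance on that task.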

Conclusion

The researchers proposed a new benchmark, M3KE, to evaluate the knowledge mastery of Chinese large models across multiple disciplines and education stages. M3KE contains 71 tasks and 20,477 questions. The researchers found that all of the open-source Chinese large models evaluated lag significantly behind GPT-3.5. The researchers hope that M3KE will help uncover knowledge gaps in Chinese large models and promote their further development.

All tasks in M3KE


Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
