Moving entrance exam questions into a data set for Chinese large models: 20,477 questions, each with 4 candidate answers

As Chinese large language models have demonstrated strong performance in natural language understanding and generation, existing Chinese evaluation benchmarks built for specific natural language processing tasks are no longer sufficient to evaluate them effectively. Traditional Chinese benchmarks mainly focus on a model's grasp of simple common sense (such as needing an umbrella when going out on a rainy day) and surface semantics (such as whether a basketball game report is sports news or technology news), while ignoring the mining and use of complex human knowledge. At present, there is a lack of data sets for evaluating the complex knowledge of Chinese large models, especially professional knowledge at different levels and in different fields under China's education system.

To bridge this gap, the Natural Language Processing Laboratory of Tianjin University and Huawei Noah's Ark Lab jointly released the M3KE benchmark data set (A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models), which tests how well Chinese large models master multi-level, multi-disciplinary knowledge in zero-shot and few-shot settings.



  • Paper link: https://arxiv.org/abs/2305.10263
  • Data link: https://github.com/tjunlp-lab/M3KE
M3KE Dataset

Dataset Introduction

M3KE collects 20,477 real standardized test questions, each with 4 candidate answers, covering 71 tasks. The questions come from elementary school, junior high school, high school, university, and graduate entrance examinations and involve humanities, history, politics, law, education, psychology, science, engineering technology, art, and other disciplines. The distribution is shown in Fig. 1.
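To make the format concrete, one item of this kind can be thought of as a question with exactly four candidate answers and one gold label. The sketch below is a hypothetical in-memory representation, not the layout of the released files: the class name `Question` and the field names `task`, `stage`, `question`, `options` and `answer` are assumptions made for illustration. The sample item reuses the primary school math question quoted later in this article.

```python
from dataclasses import dataclass

LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one four-option question (field names are assumed)."""
    task: str
    stage: str
    question: str
    options: tuple
    answer: str  # gold label, one of LABELS

    def __post_init__(self):
        # Every item must carry exactly four candidate answers.
        if len(self.options) != len(LABELS):
            raise ValueError("expected exactly 4 candidate answers")
        if self.answer not in LABELS:
            raise ValueError("answer must be one of A, B, C, D")

    def render(self) -> str:
        """Return the question followed by its lettered options."""
        lines = [self.question]
        lines += [f"{label}. {text}" for label, text in zip(LABELS, self.options)]
        return "\n".join(lines)


example = Question(
    task="primary school math",
    stage="primary school",
    question=(
        "The price of a product is first increased by 20%, and then reduced "
        "by 20%. How does the current price compare with the original price?"
    ),
    options=("Increased", "Reduced", "Unchanged", "Don't know"),
    answer="B",  # 1.20 * 0.80 = 0.96, so the final price is lower
)
print(example.render())
```

Validating the option count at construction time keeps malformed items from reaching the scoring step.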


Researchers constructed the M3KE data set based on two criteria:

1. Alignment with the Chinese education system, covering multiple education stages

The researchers followed the educational path of Chinese students, that is, the major stages of primary school, junior high school, high school, and university, aiming to evaluate the performance of Chinese large models at different education stages. Because the knowledge to be mastered differs at each stage (for example, in the Chinese subject, the knowledge and test points of primary school and junior high school differ markedly), M3KE includes the same subjects at different stages. To improve the coverage of subject knowledge, the researchers selected questions from China's unified examinations, including real questions from the primary-to-junior-high entrance examination, the high school entrance examination, the college entrance examination, the graduate entrance examination, and the Chinese civil service examination.

2. Coverage of multiple disciplines

To improve the subject coverage of the data set, the researchers built it around three major categories (humanities and arts, social sciences, and natural sciences), covering literature, science, history, politics, law, education, psychology, engineering technology, art, and other disciplines. To further enrich the data set, the researchers added tasks such as traditional Chinese medicine, religion, and the computer grade examination.

Dataset Statistics

Table 3 shows the overall statistics of M3KE. The four subject categories contain 12, 21, 31 and 7 tasks and 3,612, 6,222, 8,162 and 2,126 questions, respectively. The largest task contains 425 questions and the smallest 100. Questions in social sciences and natural sciences are generally longer than those in arts and humanities and other subjects, while their answer options are shorter.
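As a quick arithmetic check on these figures, the short sketch below recomputes the total number of tasks and the average number of questions per task. It uses only the numbers quoted in the paragraph above. Pairing them with category names follows the order in which this article presents the categories, which is an assumption.

```python
# Figures as quoted in the paragraph above: (number of tasks, number of questions).
# The category names are assigned in the article's presentation order (assumed).
stats = {
    "Humanities and Arts": (12, 3612),
    "Social Sciences": (21, 6222),
    "Natural Sciences": (31, 8162),
    "Others": (7, 2126),
}

total_tasks = sum(tasks for tasks, _ in stats.values())
avg_per_task = {name: questions / tasks for name, (tasks, questions) in stats.items()}

for name, (tasks, _) in stats.items():
    print(f"{name}: {tasks} tasks, {avg_per_task[name]:.1f} questions per task on average")
print("total tasks:", total_tasks)
```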


Introduction and examples of M3KE from a multidisciplinary perspective

Humanities and Arts

The humanities and arts disciplines include subjects in multiple fields such as Chinese, art, and history. These subjects focus on the analysis and interpretation of literary and cultural artifacts. Taking primary school Chinese as an example, the test questions are designed to assess the language use and literary appreciation abilities of students aged 7 to 13, such as the ability to use synonyms and antonyms. The history subject covers Chinese and world history from ancient times to modern times. In addition to humanities, M3KE also includes art subjects, such as dance, art, music, film, etc. Art is an important part of human culture, and it is equally important to evaluate the performance of Chinese large models in the art field.

Art task example:

Which of the following statements about the Lascaux cave paintings is incorrect?

A. This mural was discovered in France

B. There are more than 100 animal images found

C. The time of discovery was 1940

D. The color of the mural is mainly black

Modern world history task example:

It took more than two centuries to get from the Dutch Revolution to the French Revolution, yet only half a century after that, capitalism had initially formed a world system. What is the main reason for this?

A. The influence of the French Revolution was widely spread

B. The Vienna System intensified social conflicts in various countries

C. The Industrial Revolution rapidly increased the power of capitalism

D. Colonial rule spread across all continents of the world

Social Sciences

Social sciences focus on the application of the humanities, such as law, politics, education, and psychology. Politics courses run through multiple education stages, including junior high school, high school, university, and postgraduate education, while the other subjects are mainly found at university level. Social sciences also include economics and management tasks, whose questions are selected from the Economics Joint Examination and the Management Joint Examination of the Chinese Graduate Entrance Examination. The knowledge involved includes microeconomics, macroeconomics, management, and logic.

Criminal Law Task Example:

A wants to kill B and puts poison into B’s food. After B eats it, A regrets the act, quickly explains the situation, and sends B to the hospital. On examination, the hospital finds that the "poison" administered by A was not toxic at all, and B is unharmed. What does A’s conduct constitute?

A. Does not constitute a crime

B. Attempted crime

C. Crime discontinued

D. Completed crime

Principles of education task example:

What is the most basic and most commonly used research method in educational research?

A. Educational observational research

B. Educational survey research

C. Educational measurement research

D. Educational experimental research

Natural Science

Natural sciences include engineering, science, medicine, and basic subjects such as mathematics, physics, chemistry, and biology. These subjects often require complex computational, analytical, and logical reasoning skills. In China's education system, the same subject involves different types of knowledge at different stages. For example, primary school mathematics focuses on basic arithmetic operations, while high school mathematics covers more advanced concepts such as sequences, derivatives, and geometry.

Animal Physiology Task Example:

Using procaine to anesthetize nerve fibers affects which characteristic of nerve fiber conduction excitation?

A. Physiological integrity

B. Insulation

C. Bidirectional conductivity

D. Relatively fatigue-free

Operating system task example:

The directory structure has a great impact on file retrieval efficiency. Which of the following is the most advanced directory structure?

A. Single-level directory

B. Two-level directory

C. Three-level directory

D. Tree directory

Others

Other types of tasks include religion, the Chinese civil service examination, and the computer grade examination. These tasks require knowledge that is not limited to a single level or discipline described above. For example, the Chinese civil service examination involves general knowledge, humanities, and logic, so the researchers regard these tasks as an assessment of the comprehensive knowledge of Chinese large models.

Chinese Civil Service Examination Task Example:

Several previous studies have shown that eating chocolate increases the likelihood of heart disease in those who eat it. A new, more reliable study concludes that chocolate consumption is not associated with heart disease rates. It is estimated that after the results of this research are released, the consumption of chocolate will increase significantly. The above inference is based on which of the following assumptions?

A. Some people eat chocolate even though they know it increases the likelihood of heart disease

B. People have never believed that eating chocolate makes them more likely to suffer from heart disease

C. Now many people eat chocolate because they have not heard that chocolate can cause heart disease

D. Nowadays, many people do not eat chocolate simply because they believe that chocolate can induce heart disease

Traditional Chinese Medicine Task Example:

Ginseng greatly replenishes vital qi, but which medicine is often used as its substitute in treating chronic debilitating diseases?

A. Salvia

B. Codonopsis pilosula

C. Astragalus

D. Taizishen (Pseudostellaria root)

Introduction and examples of M3KE from the perspective of multiple education stages

The researchers divided the data set into stages according to the Chinese education system: primary school, junior high school, high school, university, and the graduate entrance examination. The researchers also chose some examination subjects outside the education system, such as the computer grade examination and the Chinese civil service examination.

Primary school

Primary school Chinese task example:

Which of the following groups of words is written entirely correctly?

A. The sound of nature, the flowing clouds and flowing water, the pen and the dragon and the snake, rummaging through boxes and cabinets

B. The mountains and flowing water, singing and dancing, the finishing touch, unique ideas

C. The sound lingers, the skills are clever, the pen is full of flowers, restless

D. Huang Zhongda Lu is vivid, lifelike, elite troops and reduced government

Primary school math task example:

The price of a product is first increased by 20%, and then reduced by 20%. How does the current price compare with the original price?

A. Increased

B. Reduced

C. Unchanged

D. Don’t know

Junior high school

Example of Chinese language tasks for junior high school:

Which of the following statements is correct?

A. "The Most Painful and the Most Happy" is selected from "Selected Works of Liang Qichao". The author Liang Qichao is a thinker and scholar in the Ming Dynasty

B. "Zou Ji satirizes the King of Qi and accepts advice" is selected from "Warring States Policy". "Warring States Policy" is a compilation of the strategies and opinions of lobbyists during the Warring States Period. It was compiled into thirty-three chapters by Liu Xiang of the Eastern Han Dynasty

C. Ci poetry is also called "long and short sentences" because its lines vary in length. It flourished in the Song Dynasty. Su Shi and Xin Qiji were representatives of the bold school, while Li Qingzhao was a representative of the graceful school

D. …, which embodies the author’s idea of sharing joy with the people

Junior high school politics task example:

The class is producing a blackboard newspaper on the theme of "promoting the spirit of the rule of law", and Xiaolan is responsible for writing the "Practicing Equality" section. Which of the following materials she collected is suitable for it?

A. There are priority seats on the bus for "the elderly, the weak, the sick and pregnant women"

B. Middle school students take part in study activities at a revolutionary tradition education base

C. People's Liberation Army soldiers braved severe cold and heat to guard the borders of the motherland

D. Students used holidays to clear small advertisements on the streets

High School

Example of high school Chinese language task:

Shen Kuo said in "Mengxi Bi Tan": "The changes of heaven and earth, cold and heat, wind and rain, floods, droughts, locusts, all have laws." What is the philosophical meaning of this sentence?

A. Laws are the root cause of changes in objective things

B. Laws are objective and universal

C. Learn to look at problems from the perspective of connection

D. Learn to look at issues from the perspective of development

High school biology task example:

Environmental capacity depends on the environmental conditions in which a population lives. Which of the following statements is correct?

A. The environmental capacity of gray magpie populations in two different places must be the same

B. The environmental capacity of East Asian migratory locusts living on a given grassland may be the same in different years

C. When a population approaches the environmental capacity, the death rate increases while the birth rate remains unchanged

D. Crucian carp and snakehead fish living in Weishan Lake have the same environmental capacity

University

University stomatology task example:

Which oral cancer ranks first in China?

A. Alveolar mucosal cancer

B. Buccal mucosal cancer

C. Lip cancer

D. Tongue cancer

University comprehensive economics task example:

Which of the following items should be included in GDP?

A. Government transfer payment

B. Purchase of a used car

C. Loan and bond interest paid by the business

D. 10,000 yuan won from buying lottery tickets

Others

Computer grade examination (computer fundamentals) task example:

A worksheet contains so much data that the title in the first row scrolls out of view. What is the fastest way to keep the title row always visible?

A. Set "Print Title"

B. Freeze Pane

C. Freeze the first row

D. Freeze the first column

Religion task example:

What is the political basis on which religion can adapt to a socialist society?

A. The establishment of the people's democratic dictatorship state power

B. The majority of believers support the socialist system, and their fundamental interests are consistent with those of the people of the whole country

C. The establishment of the leadership and ruling status of the Communist Party of China

D. Be independent and run your own church

Experiment

Evaluated models

  • GLM-335M/10B/130B, a pre-trained large language model developed by Tsinghua University that supports both Chinese and English. The researchers chose three Chinese-version GLM models with 335M, 10B and 130B parameters, respectively.
  • BLOOM-7.1B, a multilingual large model launched by Hugging Face and developed by hundreds of researchers.
  • ChatGLM-6B, a language model developed at Tsinghua University, fine-tuned on instruction data and further trained with reinforcement learning from human feedback.
  • MOSS-16B-SFT, a language model developed by Fudan University; the instruction-fine-tuned version, MOSS-moon-003-SFT, was used in the experiments.
  • BELLE-7B-0.2M, a language model built on BLOOMZ-7.1B-mt and fine-tuned with 200,000 instructions.
  • BELLE-7B-2M, a language model built on BLOOMZ-7.1B-mt and fine-tuned with 2 million instructions.
  • GPT-3.5-turbo, a language model developed by OpenAI and trained with reinforcement learning from human feedback on manually constructed, high-quality instruction data.

Zero-shot/Few-shot evaluation

In the zero-shot setting, the model is required to answer the question directly. In the few-shot setting, the model is first given several examples from the same task to guide in-context learning. In M3KE, all questions are scored by accuracy.
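The two settings and the accuracy metric can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the prompt wording, the `Answer:` suffix, and the helper names `format_item`, `build_prompt` and `accuracy` are all assumptions, and the demonstration question is a made-up toy item.

```python
LABELS = ("A", "B", "C", "D")


def format_item(question, options, answer=None):
    """Render one question; the gold label is shown only for demonstrations."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(LABELS, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(target, demonstrations=()):
    """Zero-shot when `demonstrations` is empty, few-shot otherwise."""
    parts = [format_item(q, opts, ans) for q, opts, ans in demonstrations]
    parts.append(format_item(*target))  # the target is left unanswered
    return "\n\n".join(parts)


def accuracy(predictions, gold):
    """Fraction of predicted labels that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy demonstration and target, invented for this sketch.
demo = ("2 + 3 = ?", ("4", "5", "6", "7"), "B")
target = ("10 - 4 = ?", ("5", "6", "7", "8"))

zero_shot_prompt = build_prompt(target)
few_shot_prompt = build_prompt(target, demonstrations=[demo])
score = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(few_shot_prompt)
print(score)
```

Keeping demonstrations and the target in the same layout means the only difference between the two settings is whether answered examples precede the final, unanswered question.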

Evaluation results under different subject categories


Evaluation results under different education stages


Analysis of results

1. In zero-shot evaluation (Tables 4 & 6), the accuracy of all pre-trained language models (without fine-tuning) with fewer than 10B parameters is lower than random chance (25%). The few-shot setting (Tables 5 & 7) helps improve model performance. However, GLM-130B performs better in zero-shot evaluation than in few-shot evaluation. A possible reason is that GLM-130B used some instruction data during pre-training, which already gives it stronger zero-shot ability.

2. Most of the fine-tuned Chinese large models only reach the level of random chance (25%), even on primary school level tests (Tables 6 & 7). This shows that knowledge from the lower education stages is still a weakness of current Chinese large models.

3. In zero-shot evaluation, BELLE-7B-2M achieved the best results among the Chinese large models but still trailed GPT-3.5-turbo by 14.8%. The number of supervised fine-tuning instructions is also an important factor: BELLE-7B-2M, fine-tuned with two million instructions, outperforms BELLE-7B-0.2M, fine-tuned with two hundred thousand (Table 4).

4. The few-shot setting does not improve performance in most cases (Tables 5 & 7 vs. Tables 4 & 6), especially for language models trained with instruction fine-tuning or reinforcement learning from human feedback. This shows that instruction fine-tuning of a pre-trained language model can significantly improve its zero-shot ability, so that it does not need additional examples to understand the intent of an instruction or question.
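The 25% reference point used in the analysis above is the expected accuracy of guessing uniformly among four options. The small simulation below shows that figure emerging. The answer key is randomly generated and the seed is fixed only for reproducibility, so neither reflects M3KE's actual answer distribution.

```python
import random

LABELS = ("A", "B", "C", "D")


def random_baseline(n_questions, seed=0):
    """Accuracy of uniform random guessing against a random answer key."""
    rng = random.Random(seed)
    gold = [rng.choice(LABELS) for _ in range(n_questions)]
    guesses = [rng.choice(LABELS) for _ in range(n_questions)]
    return sum(g == p for g, p in zip(gold, guesses)) / n_questions


expected = 1 / len(LABELS)          # analytic value for four options
simulated = random_baseline(20_000)  # empirical estimate, close to the analytic value
print(f"expected: {expected:.2%}, simulated: {simulated:.2%}")
```

A model scoring below this level on a task is doing no better than chance on that task.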

Conclusion

The researchers proposed a new benchmark, M3KE, to evaluate how well Chinese large models master knowledge across multiple disciplines and education stages. M3KE contains 71 tasks and 20,477 questions. The researchers found that all of the open-source Chinese large models evaluated lag significantly behind GPT-3.5. They hope that M3KE will help uncover knowledge gaps in Chinese large models and promote their further development.

All tasks in M3KE


Statement
This article is reproduced from 51CTO.COM. In case of infringement, please contact admin@php.cn for removal.