Baidu Wenxin Yiyan ranks last among domestic models? I was confused

Original article from Xi Xiaoyao Technology Talk

Author | Selling Mengjiang

In recent days, a screenshot of a so-called SuperCLUE evaluation has been circulating in our public-account community, and iFlytek even promoted it on its official account:

[Screenshot: iFlytek promoting the SuperCLUE ranking on its official account]

Since the iFlytek Spark model has only just been released, the author hasn't used it much and dares not draw conclusions about whether it really is the strongest model made in China.

But in this evaluation screenshot, Baidu's Wenxin Yiyan, currently the most popular domestic model, cannot even beat ChatGLM-6B, a small open-source academic model. Not only is this seriously inconsistent with the author's own experience; in our professional NLP technical community, everyone was equally confused:

[Screenshots: reactions from the NLP technical community]

Out of curiosity, the author went to the GitHub repo behind this SuperCLUE leaderboard to see how the conclusion was reached: https://www.php.cn/link/97c8dd44858d3568fdf9537c4b8743b2

First of all, the author noticed that the repo already has some issues raising the same doubts:

[Screenshots: existing issues under the SuperCLUE repo]

So the author is not the only one who finds this outrageous; the public's eyes are sharp after all...

The author then took a closer look at the leaderboard's evaluation method:

[Screenshot: the SuperCLUE evaluation methodology]

Good guy, it turns out this so-called evaluation of generative large models consists entirely of having the models answer multiple-choice questions...

Obviously, this multiple-choice evaluation method was designed for the discriminative AI models of the BERT era. Models of that era generally lacked generation ability and only had discrimination ability (e.g., determining which category a piece of text belongs to, which option is the correct answer to a question, or whether two pieces of text are semantically consistent).
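
To make the contrast concrete, here is a minimal sketch of this discriminative-style scoring: each item is a question with labeled options and a gold answer, and the benchmark simply reports accuracy. The item format and the toy "model" below are invented for illustration.

```python
def multiple_choice_accuracy(items, pick_option):
    """items: list of dicts with 'question', 'options', 'answer' (an option letter).
    pick_option: function (question, options) -> chosen option letter."""
    correct = sum(
        1 for it in items
        if pick_option(it["question"], it["options"]) == it["answer"]
    )
    return correct / len(items)

# Toy example with a trivial "model" that always picks option "A".
items = [
    {"question": "2+2=?", "options": {"A": "4", "B": "5"}, "answer": "A"},
    {"question": "Capital of France?", "options": {"A": "Rome", "B": "Paris"}, "answer": "B"},
]
always_a = lambda q, opts: "A"
print(multiple_choice_accuracy(items, always_a))  # 0.5
```

Note that this setup never inspects any generated text; it only checks whether the chosen label matches the key, which is exactly why it says little about generation ability.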

The evaluation of generative models is quite different from the evaluation of discriminative models.

For example, for specialized generation tasks such as machine translation, metrics such as BLEU are generally used to measure the word- and phrase-level overlap (n-gram overlap) between the model's output and a reference response. However, very few generation tasks come with reference responses the way machine translation does; the vast majority of generative evaluation requires human judgment.
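
As a rough illustration of the overlap idea behind BLEU, here is a from-scratch sketch of clipped n-gram precision. Real BLEU combines precisions for n = 1 through 4 with a geometric mean and a brevity penalty; this shows only the core ingredient.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a candidate sentence against one reference."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    if not cand_ngrams:
        return 0.0
    # Each candidate n-gram is credited at most as many times as it appears
    # in the reference ("clipping").
    overlap = sum(min(cnt, ref_ngrams[g]) for g, cnt in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

print(ngram_precision("the cat sat on the mat", "the cat is on the mat", n=1))
```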

Generation tasks such as open-ended chat, text style transfer, passage generation, title generation, and summarization require each model under evaluation to generate responses freely; humans then compare the quality of the responses from the different models, or judge whether each response meets the task requirements.
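
This kind of pairwise human comparison is often summarized as a win rate per model. A hedged sketch, with invented model names and verdict labels:

```python
from collections import Counter

def win_rates(judgments):
    """judgments: list of (model_a, model_b, verdict) where verdict is
    'a', 'b', or 'tie'. Returns wins / total comparisons for each model."""
    wins, totals = Counter(), Counter()
    for a, b, verdict in judgments:
        totals[a] += 1
        totals[b] += 1
        if verdict == "a":
            wins[a] += 1
        elif verdict == "b":
            wins[b] += 1  # ties count toward totals but award no win
    return {m: wins[m] / totals[m] for m in totals}

judgments = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "tie"),
    ("model_y", "model_x", "b"),
]
print(win_rates(judgments))  # model_x wins 2 of 3, model_y wins 0 of 3
```

In practice the hard part is everything this sketch hides: blinding annotators to model identity, writing judging guidelines, and measuring inter-annotator agreement.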

The current round of AI competition is a contest of generation ability, not discrimination ability. The most convincing evaluation is real users' word of mouth, no longer cold academic leaderboards. Still less a leaderboard that does not test generation ability at all.

Looking back at the past few years:

In 2019, when OpenAI released GPT-2, we were busy stacking tricks to climb leaderboards;

In 2020, when OpenAI released GPT-3, we were still stacking tricks to climb leaderboards;

In 2021-2022, when instruction tuning and RLHF work such as FLAN, T0, and InstructGPT broke out, many of our teams still insisted on stacking tricks to climb leaderboards...

I hope we will not repeat the same mistakes in this wave of the generative-model arms race.

So how should generative AI models be evaluated?

Unfortunately, as noted above, unbiased evaluation is very, very difficult to achieve, arguably harder than building a generative model yourself. Where do the difficulties lie? Consider a few concrete questions:

  • How should evaluation dimensions be divided? By understanding, memory, reasoning, and expression? By professional domain? Or along traditional NLP generative-evaluation tasks?
  • How do you train evaluators? For test questions with extremely high professional thresholds, such as coding, debugging, mathematical derivation, and financial, legal, or medical Q&A, how do you recruit qualified judges?
  • How do you define evaluation criteria for highly subjective questions (such as generating Xiaohongshu-style marketing copy)?
  • Can a few general writing questions represent a model's text-generation/writing ability?
  • When examining a model's text-generation sub-capabilities, are passage generation, question-and-answer generation, translation, summarization, and style transfer all covered? Are the tasks evenly weighted? Are the judging criteria clear? Are the results statistically significant?
  • Within the question-and-answer generation sub-task, are vertical domains such as science, healthcare, automotive, maternal and child care, finance, engineering, politics, military, and entertainment all covered? Are they evenly weighted?
  • How is conversational ability evaluated? How do you design tasks that probe the consistency, diversity, topic depth, and human-likeness of a dialogue?
  • For a given ability, are simple questions, medium-difficulty questions, and complex, long and difficult questions all covered? How are these defined, and in what proportions?
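
At least the coverage-and-proportion questions above can be checked mechanically once a draft benchmark exists. A small sketch, with invented category names, that measures how evenly items cover task categories:

```python
from collections import Counter

def category_proportions(items):
    """items: list of dicts with a 'category' field. Returns {category: share}."""
    counts = Counter(it["category"] for it in items)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# A deliberately skewed toy benchmark.
draft_benchmark = [
    {"category": "summarization"}, {"category": "summarization"},
    {"category": "translation"},
    {"category": "qa_generation"},
]
props = category_proportions(draft_benchmark)
print(props)  # summarization is over-represented at 50%
```

The harder questions, such as defining difficulty levels and judging criteria, have no such mechanical check.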

These are only a few of the basic problems to solve; in actual benchmark design, one faces a large number of problems far harder than these.

Therefore, as an AI practitioner, the author calls on everyone to view the various AI model leaderboards rationally. There is not even an unbiased evaluation benchmark yet, so what use is such a ranking?

To repeat: whether a generative model is good or not is decided by real users.

No matter how high a model ranks on a leaderboard, if it cannot solve the problems you care about, it is just a mediocre model to you. Conversely, if a bottom-ranked model is very strong in the scenario you care about, it is a treasure of a model for you.

Here, the author is releasing a hard-case test set collected and written by our team, focused on models' ability to solve difficult problems and instructions.

This hard-case set covers language understanding, understanding and following complex instructions, text generation, complex content generation, multi-turn dialogue, contradiction detection, commonsense reasoning, mathematical reasoning, counterfactual reasoning, harmful-information identification, legal and ethical awareness, knowledge of Chinese literature, cross-lingual ability, coding ability, and more.

To repeat: this is a case set the author's team made to probe generative models' ability to solve hard examples. The results only represent "which model feels better to the author's team" and are far from an unbiased conclusion. If you want an unbiased conclusion, first answer the evaluation questions raised above, and then define an authoritative benchmark.

Friends who want to run the evaluation themselves can reply with the password [AI Evaluation] in the backend of this public account, "Xi Xiaoyao Technology," to download the test file.

Below are the results for the three most controversial models on the SuperCLUE list: iFlytek Spark, Wenxin Yiyan, and ChatGPT:

[Screenshots: the three models' responses on the hard-case test set]

## Hard-case solve rate

  • ChatGPT (GPT-3.5-turbo): 11/24 = 45.83%
  • Wenxin Yiyan (2023.5.10 version): 13/24 = 54.17%
  • iFlytek Spark (2023.5.10 version): 7/24 = 29.17%

Is this meant to prove that iFlytek Spark is not as good as Wenxin Yiyan? If you have read the article above carefully, you will understand what the author actually wants to say.

Indeed, on our team's internal hard-case set, the Spark model did not do as well as Wenxin Yiyan. But this does not mean one is definitively better than the other overall; it only shows that on our internal hard-case test set, Wenxin Yiyan performed best, even solving two more hard cases than ChatGPT.

On simple questions, there is actually not much gap between the domestic models and ChatGPT. On hard problems, each model has its own strengths. In the author's team's overall experience, Wenxin Yiyan is more than enough to beat open-source models intended for academic use such as ChatGLM-6B; some of its capabilities fall short of ChatGPT's, and some surpass ChatGPT's.

The same is true of domestic models from other major vendors, such as Alibaba's Tongyi Qianwen and iFlytek Spark.

To say it once more: there is not even an unbiased evaluation benchmark right now, so what use is ranking the models?

Rather than arguing over various biased leaderboards, it is better to build a test set for the problems you care about, as the author's team did.

A model that can solve your problem is a good model.


Statement: This article is reproduced from 51CTO.COM.