
Baidu Wenxinyiyan ranks last among domestic models? I was confused

2023-05-24

Xi Xiaoyao Technology Talk | Original
Author | Maimengjiang

In recent days, a screenshot of a so-called SuperCLUE evaluation has been making the rounds in our public account community, and iFlytek even promoted it on its official account:

[Screenshot: the SuperCLUE leaderboard promoted by iFlytek]

Since the iFlytek Spark model has only just been released, the author has not used it much. Is it really the strongest model made in China? The author dares not draw any conclusions.

But in this evaluation screenshot, Baidu Wenxin Yiyan, currently the most popular domestic model, cannot even beat ChatGLM-6B, a small academic open-source model. Not only is this seriously inconsistent with the author's own experience, but members of our professional NLP community were equally confused:

[Screenshots: confused reactions in the NLP community]

Out of curiosity, the author went to the GitHub repository of this SuperCLUE leaderboard to see how the conclusion was reached: https://www.php.cn/link/97c8dd44858d3568fdf9537c4b8743b2

First of all, the author noticed that there are already some issues under this repo:

[Screenshots: GitHub issues questioning the evaluation]

It seems the author is not the only one who finds this outrageous; the public's eyes are sharp after all...

The author further took a look at the evaluation method of this list:

[Screenshot: the SuperCLUE evaluation methodology]

Good grief: it turns out this so-called evaluation of generative large models simply has the models answer multiple-choice questions...

Obviously, this multiple-choice style of evaluation was designed for the discriminative AI models of the BERT era. Models back then generally could not generate text; they could only discriminate, for example, determine which category a piece of text belongs to, which option is the correct answer to a question, or whether two pieces of text are semantically consistent.
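To make the contrast concrete, here is a minimal sketch of what multiple-choice ("discriminative") scoring of a language model typically looks like: each candidate option is scored by the likelihood the model assigns to it, and the highest-scoring option is taken as the answer. The model (gpt2), prompt format, and question below are illustrative assumptions, not SuperCLUE's actual implementation.

```python
# Sketch of multiple-choice evaluation by option likelihood (assumed setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `question`."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of each token given its preceding context
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the positions belonging to the option continuation
    option_len = full_ids.shape[1] - q_ids.shape[1]
    return token_lp[0, -option_len:].sum().item()

question = "The capital of France is"
options = ["Paris", "Berlin", "Madrid", "Rome"]
prediction = max(options, key=lambda o: option_logprob(question, o))
print(prediction)  # the option the model considers most likely
```

Note that nothing in this procedure asks the model to produce free-form text, which is exactly the author's complaint.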

The evaluation of generative models is quite different from the evaluation of discriminative models.

For example, for special generation tasks such as machine translation, metrics such as BLEU are generally used to measure the overlap of words and phrases between the model's output and reference answers. However, very few generation tasks come with reference answers the way machine translation does, and the vast majority of generative evaluation still requires human judgment.
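As a small illustration of reference-based scoring, the snippet below computes corpus-level BLEU with sacrebleu, one common implementation (`pip install sacrebleu`); the sentences are made-up examples, not drawn from any real benchmark.

```python
# BLEU measures n-gram overlap between system outputs and reference translations.
import sacrebleu

hypotheses = [
    "The cat sat on the mat.",
    "He bought three apples at the market.",
]
references = [[
    "The cat is sitting on the mat.",
    "He bought three apples from the market.",
]]  # one reference per sentence; sacrebleu accepts multiple reference sets

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # higher = more overlap with the references
```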

For example, tasks such as chat-style dialogue generation, text style transfer, long-form article generation, title generation, and text summarization require each model under evaluation to generate responses freely; humans then compare the quality of the responses from the different models, or judge whether each response meets the task requirements.
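As a rough sketch of how such human evaluation is often aggregated in practice, the snippet below tallies pairwise human judgments into per-model win rates; the model names and verdicts are invented placeholders.

```python
# Aggregate anonymized pairwise human judgments into win rates.
from collections import Counter

# each record: (prompt_id, model_a, model_b, verdict), verdict chosen by a human
judgments = [
    (1, "model_x", "model_y", "a"),
    (2, "model_x", "model_y", "b"),
    (3, "model_x", "model_y", "a"),
    (4, "model_x", "model_y", "tie"),
]

wins = Counter()
comparisons = Counter()
for _, model_a, model_b, verdict in judgments:
    comparisons[model_a] += 1
    comparisons[model_b] += 1
    if verdict == "a":
        wins[model_a] += 1
    elif verdict == "b":
        wins[model_b] += 1
    else:  # ties count as half a win for each side
        wins[model_a] += 0.5
        wins[model_b] += 0.5

for model in comparisons:
    print(f"{model}: win rate = {wins[model] / comparisons[model]:.2f}")
```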

The current round of AI competition is a contest of generation ability, not discrimination ability. The most convincing evaluation is real users' word of mouth, not cold academic leaderboards, let alone a leaderboard that does not test generation ability at all.

Looking back over the past few years:

In 2019, when OpenAI released GPT-2, we were still stacking tricks to climb leaderboards;

In 2020, when OpenAI released GPT-3, we were still stacking tricks to climb leaderboards;

In 2021-2022, when instruction tuning and RLHF work such as FLAN, T0, and InstructGPT broke out, many of our teams were still insisting on stacking tricks to climb leaderboards...

I hope we will not repeat the same mistakes in this wave of the generative-model arms race.

So how should generative AI models be evaluated?

Unfortunately, as mentioned before, unbiased evaluation is extremely difficult, arguably even harder than developing a generative model yourself. Where do the difficulties lie? Here are a few concrete questions:

  • How should evaluation dimensions be divided? By understanding, memory, reasoning, and expression? By area of expertise? Or by traditional NLP generative evaluation tasks?
  • How do you train evaluators? For test questions with extremely high professional barriers, such as coding, debugging, mathematical derivation, and financial, legal, or medical Q&A, how do you recruit people to grade them?
  • How do you define evaluation criteria for highly subjective questions (such as generating Xiaohongshu-style copywriting)?
  • Can a few general writing questions represent a model's text generation and writing ability?
  • When examining a model's text generation sub-capabilities, are article generation, Q&A generation, translation, summarization, and style transfer all covered? Are the tasks evenly weighted? Are the judging criteria clear? Are the results statistically significant?
  • In the Q&A generation sub-task above, are verticals such as science, healthcare, automobiles, parenting, finance, engineering, politics, military affairs, and entertainment all covered? Are they evenly weighted?
  • How do you evaluate conversational ability? How do you design tasks that test the consistency, diversity, topic depth, and human-likeness of a dialogue?
  • For the same capability, are easy questions, medium-difficulty questions, and complex, long, difficult questions all covered? How are these defined, and in what proportions?

These are just some of the basic problems to be solved; in actual benchmark design, one faces a great many problems that are far harder than these.
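As a toy illustration only, the sketch below shows one way to make such design decisions explicit, with dimensions, domains, difficulty tiers, and item counts laid out so that gaps and skews become visible; every name and number in it is invented and not part of any real benchmark.

```python
# A toy benchmark specification: make coverage and proportions explicit.
from dataclasses import dataclass

@dataclass
class EvalCell:
    dimension: str    # generation sub-capability being tested
    domain: str       # vertical the prompts are drawn from
    difficulty: str   # "easy" | "medium" | "hard"
    n_items: int      # number of prompts in this cell

benchmark = [
    EvalCell("summarization", "finance", "easy", 30),
    EvalCell("summarization", "medical", "hard", 30),
    EvalCell("multi-turn dialogue", "general", "medium", 50),
    EvalCell("style transfer", "marketing copy", "medium", 20),
]

total = sum(c.n_items for c in benchmark)
for c in benchmark:
    print(f"{c.dimension:>20} / {c.domain:<15} {c.difficulty:<6}"
          f"{c.n_items:4d} items ({c.n_items / total:.0%} of total)")
```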

Therefore, as an AI practitioner, the author calls on everyone to view the rankings of various AI models rationally. There isn't even an unbiased test benchmark, so what's the use of this ranking?

Again, whether a generative model is good or not depends on real users.

No matter how high a model ranks on a list, if it cannot solve the problems you care about, it is just an average model to you. Conversely, if a model ranked at the bottom is very strong in the scenario you care about, then it is a treasure of a model for you.

Here, the author shares a hard-case (difficult example) test set that our team has accumulated and written in-house. This test set focuses on a model's ability to solve difficult problems and instructions.

This hard-case set covers language understanding, understanding and following complex instructions, text generation, complex content generation, multi-turn dialogue, contradiction detection, commonsense reasoning, mathematical reasoning, counterfactual reasoning, harmful-information identification, legal and ethical awareness, knowledge of Chinese literature, cross-lingual ability, coding ability, and more.

Again, this is a case set our team built to probe generative models' ability to solve difficult examples. The results can only represent "which model feels better to the author's team"; they are far from an unbiased conclusion. If you want an unbiased conclusion, first answer the evaluation questions raised above and then define an authoritative benchmark.

Readers who want to verify the results themselves can send the keyword [AI Evaluation] to the "Xi Xiaoyao Technology" public account to download the test file.

The following are the evaluation results for the three most controversial models on the SuperCLUE list: iFlytek Spark, Wenxin Yiyan, and ChatGPT:

[Screenshots: hard-case evaluation results for the three models]

Hard-case resolution rate:

  • ChatGPT (GPT-3.5-turbo): 11/24 = 45.83%
  • Wenxin Yiyan (2023.5.10 version): 13/24 = 54.17%
  • iFlytek Spark (2023.5.10 version): 7/24 = 29.17%

Is this meant to prove that iFlytek Spark is not as good as Wenxin Yiyan? If you have read the preceding discussion carefully, you will understand what the author actually wants to say.

Indeed, although the Spark model did worse than Wenxin Yiyan on our team's hard-case set, this does not mean one model is definitively better than the other overall. It only shows that on our in-house hard-case set, Wenxin Yiyan performed best, even solving two more hard cases than ChatGPT.
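As a rough sanity check on how little 24 cases can distinguish, the sketch below computes 95% Wilson confidence intervals for the resolution rates reported above; the intervals come out wide and overlapping, which is consistent with the author's caution. The statistical machinery here is generic and not part of the original post.

```python
# How much uncertainty is there in "k out of 24 hard cases solved"?
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for name, solved in [("ChatGPT", 11), ("Wenxin Yiyan", 13), ("iFlytek Spark", 7)]:
    lo, hi = wilson_interval(solved, 24)
    print(f"{name:>14}: {solved}/24 solved, 95% CI roughly {lo:.0%} to {hi:.0%}")
```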

On simple questions, there is actually not much difference between domestic models and ChatGPT. On difficult problems, each model has its own strengths. Judging from our team's overall experience, Wenxin Yiyan is more than enough to beat academic open-source models such as ChatGLM-6B; some of its capabilities fall short of ChatGPT, while others surpass it.

The same is true for domestic models from other major vendors, such as Alibaba's Tongyi Qianwen and iFlytek's Spark.

To say it once more: there is not even an unbiased evaluation benchmark right now, so what is the point of ranking the models?

Rather than arguing over various biased rankings, it is better to build a test set of the problems you actually care about, as our team did.

A model that can solve your problem is a good model.
