
The secret of the domestic ChatGPT "shell" has now been found

WBOY · 2023-05-30 18:09


"iFlytek is a cover-up for ChatGPT!" "Baidu Wenxin is a cover-up for Stable Diffusion!" "SenseTime's big model is actually plagiarism!"...

This is far from the first time that outsiders have cast doubt on domestically produced large models.

Industry insiders explain the phenomenon this way: there is a genuine shortage of high-quality Chinese datasets, so when training models, teams can only rely on purchased foreign-language annotated datasets to "act as foreign aid". If the datasets used for training overlap, similar outputs are bound to appear, leading to exactly these embarrassing incidents.

Other workarounds have their own problems: using existing large models to help generate training data tends to suffer from insufficient data cleaning; reusing tokens leads to overfitting; and training only sparse large models is not a long-term solution.

The industry has gradually formed a consensus:

The road to AGI will continue to place extremely high demands on both the quantity and the quality of data.

Driven by this situation, over the past two months many domestic teams have successively open-sourced Chinese datasets. Beyond general-purpose datasets, dedicated open-source Chinese datasets have also been released for vertical domains such as programming and medicine.

High-quality datasets: available, but scarce

New breakthroughs in large models rely heavily on high-quality and rich data sets.

The scaling laws in OpenAI's paper "Scaling Laws for Neural Language Models" show that increasing the amount of training data on its own makes a pre-trained model perform better.
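As a rough point of reference (the exact formula is not spelled out in the article), the data-scaling relation reported in that paper takes a power-law form; the constants below are the paper's fitted values, quoted here only for illustration:

```latex
% Data-limited loss scaling from Kaplan et al. (2020),
% "Scaling Laws for Neural Language Models" (fitted values, illustrative).
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_D \approx 0.095,\quad D_c \approx 5.4\times 10^{13}\ \text{tokens}
```

In other words, test loss keeps falling, smoothly if slowly, as the number of training tokens D grows, which is exactly why the volume of usable Chinese text matters.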


Nor is this just OpenAI's view.

DeepMind likewise pointed out in its Chinchilla paper that most earlier large models were undertrained, and it proposed a compute-optimal training formula that has since become a widely recognized standard in the industry.
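As a concrete illustration of that formula (the figures come from the Chinchilla paper, not from this article), the compute-optimal recipe works out to roughly 20 training tokens per model parameter:

```latex
% Chinchilla rule of thumb (Hoffmann et al., 2022): tokens ~ 20 x parameters.
D_{\text{opt}} \approx 20\,N
\quad\Rightarrow\quad
N = 70\times 10^{9}\ \text{parameters}
\;\Rightarrow\;
D_{\text{opt}} \approx 1.4\times 10^{12}\ \text{tokens}
```

Feeding a 70-billion-parameter model on the order of 1.4 trillion tokens is exactly the scale at which the scarcity of cleaned Chinese text becomes an immediate bottleneck.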



Among mainstream large models, Chinchilla has the fewest parameters but the most thorough training

However, the mainstream training datasets are mostly in English, such as Common Crawl, BooksCorpus, Wikipedia and ROOTS; in the most popular of them, Common Crawl, Chinese accounts for only 4.8% of the data.

What is the situation of the Chinese data set?

It is not that public datasets do not exist. QbitAI confirmed as much with Zhou Ming, founder and CEO of Langboat Technology and one of the most accomplished Chinese researchers in NLP today: there are, for instance, the named-entity datasets MSRA-NER and Weibo-NER, as well as CMRC2018, CMRC2019 and ExpMRC2022, all findable on GitHub. But their overall volume is a drop in the bucket compared with English datasets.

Moreover, some of them are dated and do not reflect the latest NLP research concepts (research on new concepts tends to appear only in English, on arXiv).

Although high-quality Chinese datasets exist, they are few in number and cumbersome to use. This is a harsh reality that every team doing large-model research has to face. At an earlier forum held by Tsinghua University's Department of Electronic Engineering, Tang Jie, a professor in Tsinghua's Department of Computer Science, shared that when preparing pre-training data for the 100-billion-parameter model ChatGLM-130B, the team found that after cleaning, less than 2TB of Chinese data was usable.

It is urgent to solve the lack of high-quality data sets in the Chinese-speaking world.

One of the effective solutions is to directly use English data to train large models.

On Chatbot Arena, the anonymous large-model arena leaderboard rated by human players, GPT-3.5 ranks second on the non-English leaderboard (first place is GPT-4). Bear in mind that 96% of GPT-3.5's training data is in English; setting aside other languages, the share of Chinese data used in training is so small that it is measured in thousandths.


A PhD candidate on a large-model team at one of China's top three universities revealed that, if this approach is taken and one does not mind the extra trouble, one can even bolt a translation layer onto the model: every incoming language is first converted into English, the model's output is then translated into Chinese, and the result is returned to the user.
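A minimal sketch of that wrapper idea, assuming hypothetical translate() and english_llm() helpers (none of these names come from the article or from any specific library):

```python
# Illustrative only: pipe Chinese input through translation, query an
# English-trained model, then translate the answer back. The helpers
# are placeholders, not a real API.

def translate(text: str, source: str, target: str) -> str:
    """Placeholder for any machine-translation service."""
    raise NotImplementedError

def english_llm(prompt: str) -> str:
    """Placeholder for a large model trained mostly on English data."""
    raise NotImplementedError

def answer_in_chinese(user_input_zh: str) -> str:
    prompt_en = translate(user_input_zh, source="zh", target="en")  # zh -> en
    answer_en = english_llm(prompt_en)                              # model "thinks" in English
    return translate(answer_en, source="en", target="zh")           # en -> zh, back to the user
```

The weakness described next falls straight out of this design: everything the user says and everything the model replies passes through translation twice.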

However, a large model fed this way always thinks in English. When it encounters content with distinctly Chinese characteristics, such as reworking idioms, understanding colloquialisms or rewriting articles, it often handles them poorly, producing translation errors or cultural deviations.

The other solution is to collect, clean and label Chinese corpora, build new high-quality Chinese datasets, and feed them to large models.

Open-source datasets: everyone gathers firewood

Having recognized this situation, many domestic large-model teams decided to take the second path and began building datasets from their private data stores.

Baidu has content ecological data, Tencent has public account data, Zhihu has Q&A data, and Alibaba has e-commerce and logistics data.

These different accumulations of private data make it possible to build core competitive barriers in specific scenarios and fields, and strict collection, sorting, filtering, cleaning and labeling of that data helps ensure the effectiveness and accuracy of the trained models.

Meanwhile, large-model teams whose private-data advantages are less obvious have begun crawling data across the entire web (predictably, the volume of crawled data is enormous).

To build its Pangu large model, Huawei crawled 80TB of text from the Internet and eventually cleaned it down to a 1TB Chinese dataset; the Chinese dataset used to train Inspur's Yuan 1.0 reached 5,000GB (versus the 570GB training dataset of GPT-3); and the recently released Tianhe Tianyuan large model is likewise the product of the Tianjin Supercomputing Center gathering global web data and folding in various open-source training data and domain-specific datasets.
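The crawl-then-clean funnel implied by those numbers (80TB of raw text shrinking to 1TB of usable Chinese) can be pictured with a toy filter like the one below; the heuristics and thresholds are invented for illustration and have nothing to do with Huawei's actual pipeline:

```python
# Toy illustration of web-corpus cleaning: language filtering, length
# filtering and exact deduplication. Real pipelines add much more
# (quality scoring, near-duplicate removal, PII and toxicity filters).
import hashlib
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs

def mostly_chinese(doc: str, threshold: float = 0.3) -> bool:
    """Keep documents where a reasonable share of characters are CJK."""
    return bool(doc) and len(CJK.findall(doc)) / len(doc) >= threshold

def clean_corpus(docs):
    seen = set()
    for doc in docs:
        doc = doc.strip()
        if len(doc) < 200:               # drop tiny fragments
            continue
        if not mostly_chinese(doc):      # drop pages that are not Chinese
            continue
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen:               # drop exact duplicates
            continue
        seen.add(digest)
        yield doc
```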

At the same time, an "everyone gathers firewood" movement has emerged around Chinese datasets over the past two months: many teams have successively released open-source Chinese datasets to make up for the deficiencies and imbalances in what is currently available.

Some of them are organized as follows:

  • CodeGPT: a dataset of code-related conversations generated by GPT talking to GPT; the institution behind it is Fudan University.
  • CBook-150k: a Chinese book corpus, with download and extraction methods for 150,000 Chinese books covering fields such as the humanities, education, technology, military affairs and politics; the organization behind it is Fudan University.
  • RefGPT: to avoid the high cost of manual annotation, the authors propose a method for automatically generating fact-grounded dialogues and release part of the data, including 50,000 multi-turn dialogues in Chinese; behind it are NLP practitioners from Shanghai Jiao Tong University, Hong Kong Polytechnic University and other institutions.
  • COIG: short for "Chinese Open Instruction Generalist", a larger and more diverse instruction-tuning corpus whose quality is ensured by manual verification; the institutions behind it include the Beijing Academy of Artificial Intelligence (BAAI), the University of Sheffield, the University of Michigan, Dartmouth College, Zhejiang University, Beihang University and Carnegie Mellon University.
  • Awesome Chinese Legal Resources: Chinese legal data resources, collected and organized by Shanghai Jiao Tong University.
  • Huatuo: a Chinese medical instruction dataset built from a medical knowledge graph and the GPT-3.5 API; LLaMA has been instruction-fine-tuned on it to improve its question-answering performance in the medical domain; the project comes from Harbin Institute of Technology.
  • Baize: uses a small number of "seed questions" to let ChatGPT chat with itself, automatically collecting the results into a high-quality multi-turn conversation dataset (see the sketch below); a team from the University of California, San Diego (UCSD), working with Sun Yat-sen University and MSRA, has open-sourced the data collected with this method.
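
A rough sketch of the self-chat idea behind Baize, with a placeholder chat_model() standing in for the ChatGPT API (the prompts and function names are illustrative and not taken from the Baize repository):

```python
# Illustrative self-chat collection: one model alternately plays the user
# and the assistant, starting from a seed question; the transcript becomes
# a single multi-turn training dialogue.

def chat_model(system: str, history: list[dict]) -> str:
    """Placeholder for a call to a chat model such as the ChatGPT API."""
    raise NotImplementedError

def self_chat(seed_question: str, turns: int = 4) -> list[dict]:
    history = [{"role": "user", "content": seed_question}]
    for _ in range(turns):
        answer = chat_model("You are a helpful assistant.", history)
        history.append({"role": "assistant", "content": answer})
        follow_up = chat_model(
            "You are the user; ask a natural follow-up question.", history
        )
        history.append({"role": "user", "content": follow_up})
    return history  # one multi-turn dialogue for the dataset
```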

As more Chinese datasets are open-sourced and brought into the spotlight, the industry's attitude has been one of welcome and delight, as in this view expressed by Zhang Peng, founder and CEO of Zhipu AI:

High-quality Chinese data has simply been hidden away out of sight. Now that everyone is aware of the problem, corresponding solutions will naturally follow, such as open-sourcing data.
In short, things are moving in a good direction, aren't they?

It is worth noting that in addition to pre-training data, human feedback data is also indispensable at this stage.

A ready-made example is right in front of us:

Compared with GPT-3, the crucial buff ChatGPT layered on is RLHF (reinforcement learning from human feedback), which uses high-quality human-labeled data for fine-tuning and yields a large model aligned with human intent.
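For reference, the standard RLHF recipe (as popularized by InstructGPT; the article does not detail it) first trains a reward model on pairs of human-ranked responses, then fine-tunes the language model, for example with PPO, to maximize that reward. A commonly used reward-model objective looks like this:

```latex
% Reward-model loss on preference pairs: y_w is the human-preferred response,
% y_l the rejected one, sigma the logistic function.
\mathcal{L}_{\mathrm{RM}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
```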

The most direct way to provide human feedback is to tell the AI assistant "your answer is wrong", or simply to click like or dislike next to the reply it generates.
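Turning those likes and dislikes into training signal starts with something as simple as logging them next to the prompt and reply; the schema below is a hypothetical example, not any vendor's actual format:

```python
# Illustrative feedback logger: each thumbs-up/down becomes one record that
# can later be aggregated into preference data for RLHF-style tuning.
import json
import time

def log_feedback(path: str, prompt: str, reply: str, rating: str) -> None:
    """rating is 'up', 'down', or a free-text correction from the user."""
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "reply": reply,
        "rating": rating,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: a user dislikes an answer about a Chinese idiom.
# log_feedback("feedback.jsonl", "解释一下'画蛇添足'", "……", "down")
```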


Whoever gets used first can collect a wave of user feedback and start the snowball rolling, which is one reason everyone is rushing to release large models.

Today, domestic ChatGPT-like products, from Baidu's Wenxin Yiyan and Fudan's MOSS to Zhipu's ChatGLM, all provide feedback options.

But in the eyes of most users who try them, the defining attribute of these large-model products is still "toy".

When they get an incorrect or unsatisfactory answer, they simply close the dialogue window, which does little to help the large model behind it collect human feedback.

