The secret of the domestic ChatGPT 'shell' has now been found
"iFlytek is just a shell around ChatGPT!" "Baidu Wenxin is just a shell around Stable Diffusion!" "SenseTime's large model is actually plagiarized!"...
This is far from the first time that outsiders have questioned China's domestically produced large models.
Industry insiders explain the phenomenon this way: there is a real shortage of high-quality Chinese data sets, so when training a model, teams can only bring in purchased foreign-language annotated data sets as "foreign aid". If the data sets used for training overlap with someone else's, the models produce similar results, which leads to this kind of own goal.
Other approaches have their own problems: using existing large models to help generate training data is prone to insufficient data cleaning, reusing tokens leads to overfitting, and training only sparse large models is not a long-term solution.
The industry has gradually formed a consensus:
The road to AGI will continue to put forward extremely high requirements for both data quantity and data quality.
Driven by this situation, many domestic teams have open sourced Chinese data sets one after another over the past two months. In addition to general-purpose data sets, dedicated open source Chinese data sets have also been released for vertical domains such as programming and medicine.
New breakthroughs in large models rely heavily on high-quality and rich data sets.
According to the scaling law that large models follow, as set out in OpenAI's "Scaling Laws for Neural Language Models", increasing the amount of training data on its own can improve the performance of a pre-trained model.
This is not the view of OpenAI alone.
DeepMind also pointed out in the Chinchilla paper that most earlier large models were undertrained, and it proposed an optimal training formula that has become a recognized standard in the industry.
△ Among mainstream large models, Chinchilla has the fewest parameters but is the most thoroughly trained
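As a rough illustration of what "thoroughly trained" means here, the Chinchilla result is often summarized as a rule of thumb of about 20 training tokens per parameter, with training compute approximated as C ≈ 6·N·D FLOPs. The sketch below applies those two approximations only. The constant and the function names are illustrative assumptions, not the paper's fitted formula.

```python
# Rule-of-thumb sketch only; this is NOT the fitted formula from the Chinchilla paper.
TOKENS_PER_PARAM = 20  # assumption: commonly cited approximation of the Chinchilla result


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate number of training tokens suggested for n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens a model actually saw per parameter."""
    return n_tokens / n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    # Chinchilla: about 70B parameters trained on about 1.4T tokens.
    print(tokens_per_param(70e9, 1.4e12))
    # GPT-3: about 175B parameters trained on about 300B tokens.
    print(tokens_per_param(175e9, 300e9))
```

By this measure a model can be far larger than Chinchilla and still see far fewer tokens per parameter, which is the sense in which earlier models were undertrained.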
However, the mainstream data sets used for training are mainly in English, such as Common Crawl, BooksCorpus, Wikipedia and ROOT. Even in Common Crawl, the most popular of them, Chinese data accounts for only 4.8%.
What is the situation for Chinese data sets?
Public data sets do exist, as Qubits confirmed with Zhou Ming, founder and CEO of Lanzhou Technology and one of the most accomplished Chinese researchers in the NLP field today. Examples include the named entity data sets MSRA-NER and Weibo-NER, as well as CMRC2018, CMRC2019 and ExpMRC2022, which can be found on GitHub. But the overall number is a drop in the bucket compared with English data sets.
Moreover, some of them are dated and may not cover the latest NLP research concepts (research on new concepts appears only in English on arXiv).
High-quality Chinese data sets exist, but they are few in number and cumbersome to use. This is a serious problem that every team doing large-model research has to face. At an earlier forum held by the Department of Electronics at Tsinghua University, Tang Jie, a professor in the university's Department of Computer Science, said that when preparing pre-training data for the 100-billion-parameter-scale model ChatGLM-130B, the team found that less than 2TB of Chinese data was usable after cleaning.
Solving the shortage of high-quality data sets in the Chinese-speaking world is urgent.
One effective solution is simply to train large models on English data.
On the Chatbot Arena leaderboard, an anonymous arena for large models rated by human users, GPT-3.5 ranks second in the non-English rankings (first place goes to GPT-4). Bear in mind that 96% of GPT-3.5's training data is in English. Once other languages are set aside, the amount of Chinese data used for training is so small that it is measured in thousandths.
A PhD candidate on a large-model team at one of China's top 3 universities revealed that, with this approach, a team that does not mind the extra trouble can even attach translation software to the model: every input language is converted into English, and the model's output is converted into Chinese before it is returned to the user.
However, a large model fed in this way always thinks in English. Content with distinctly Chinese characteristics, such as idiom rewriting, understanding colloquial language, and article rewriting, is often handled poorly, resulting in translation errors or potential cultural bias.
Another solution is to collect, clean and label Chinese corpora, build new high-quality Chinese data sets, and supply them to large models.
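The translation round trip mentioned above (translate the input into English, query the model, translate the answer back) can be sketched as a thin wrapper. This is a minimal sketch under stated assumptions: `answer_via_english`, the `Translator` and `Model` callable types, and the language codes are hypothetical names introduced for illustration, not the API of any real translation service or model.

```python
from typing import Callable

# Hypothetical interfaces, assumed for illustration only.
Translator = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Model = Callable[[str], str]                 # English prompt -> English answer


def answer_via_english(
    user_text: str,
    translate: Translator,
    model: Model,
    user_lang: str = "zh",
) -> str:
    """Route a non-English request through an English-only model."""
    # Step 1: convert the user's input into English.
    english_prompt = translate(user_text, user_lang, "en")
    # Step 2: let the model reason and answer in English.
    english_answer = model(english_prompt)
    # Step 3: convert the answer back into the user's language.
    return translate(english_answer, "en", user_lang)
```

The wrapper makes the weakness visible: the model only ever sees English, so anything lost in step 1 (idioms, colloquial phrasing) cannot be recovered in step 3.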
Having recognized this situation, many domestic large-model teams decided to take the second path and began building data sets from their private data.
Baidu has content ecosystem data, Tencent has public account data, Zhihu has Q&A data, and Alibaba has e-commerce and logistics data.
Because each company has accumulated different private data, each can build a core competitive barrier in specific scenarios and fields. Rigorous collection, sorting, filtering, cleaning and labeling of this data helps ensure the effectiveness and accuracy of the trained model.
Large-model teams without such an obvious private-data advantage began crawling data from across the entire web (and the volume of crawled data is predictably very large).
To build the Pangu large model, Huawei crawled 80TB of text from the Internet and cleaned it down to a 1TB Chinese data set. The Chinese data set used to train Inspur Source 1.0 reached 5000GB (by comparison, the GPT-3 training data set was 570GB). The recently released Tianhe Tianyuan large model is likewise the product of global web data collected by the Tianjin Supercomputing Center, combined with various open source training data and professional-domain data sets.
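To give a sense of why a crawl can shrink so drastically during cleaning, here is a minimal sketch of three common filtering steps: dropping very short texts, dropping texts that are mostly not Chinese, and removing exact duplicates. It is an illustration only, not the pipeline used by any of the teams named above; the function names and thresholds are assumptions.

```python
import hashlib
import re
from typing import Iterable, List

# Assumption: only the basic CJK Unified Ideographs block counts as "Chinese" here.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK_CHAR.match(c)) / len(chars)


def clean_corpus(
    docs: Iterable[str],
    min_chars: int = 10,             # illustrative threshold
    min_chinese_ratio: float = 0.5,  # illustrative threshold
) -> List[str]:
    """Keep documents that are long enough, mostly Chinese, and not exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        # Collapse runs of whitespace so trivially different copies compare equal.
        text = re.sub(r"\s+", " ", doc).strip()
        if len(text) < min_chars:
            continue
        if chinese_ratio(text) < min_chinese_ratio:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Real pipelines typically add near-duplicate detection, quality scoring and content filtering on top of steps like these, which is where most of the remaining reduction comes from.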
At the same time, over the past two months a collective effort has taken shape around Chinese data sets, with many hands gathering firewood to feed the fire:
Many teams have released open source Chinese data sets one after another to make up for the deficiencies and imbalances in the Chinese open source data sets currently available.
Some of them are organized as follows:
As more Chinese data sets are open sourced and brought into the spotlight, the industry has responded with welcome and delight. Zhang Peng, founder and CEO of Zhipu AI, put it this way:
High-quality Chinese data has simply been hidden away out of sight. Now that everyone is aware of the problem, corresponding solutions will naturally follow, such as open sourcing data.
In short, it is developing in a good direction, isn't it?
It is worth noting that in addition to pre-training data, human feedback data is also indispensable at this stage.
Ready-made examples are before us:
Compared with GPT-3, the important buff that ChatGPT stacks on top is RLHF (reinforcement learning from human feedback), which produces high-quality labeled data for fine-tuning and makes it possible to develop large models that are aligned with human intent.
The most direct way to provide human feedback is to tell the AI assistant "your answer is wrong", or to click like or dislike right next to the reply it generated.
Whoever gets a product into use first can collect a wave of user feedback and start the snowball rolling. This is one of the reasons everyone is rushing to release large models.
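As one possible illustration of how like/dislike clicks could become training signal, the sketch below groups feedback by prompt and pairs each liked reply with each disliked reply to the same prompt, producing (prompt, chosen, rejected) triples of the kind used to train preference models. The `Feedback` record and `to_preference_pairs` function are hypothetical names; this is not a description of how any of the products mentioned here process feedback.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Feedback:
    """One like/dislike click on a generated reply (hypothetical record format)."""
    prompt: str
    response: str
    liked: bool


def to_preference_pairs(records: Iterable[Feedback]) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) triples from like/dislike feedback."""
    by_prompt: Dict[str, Dict[str, List[str]]] = {}
    for r in records:
        bucket = by_prompt.setdefault(r.prompt, {"liked": [], "disliked": []})
        bucket["liked" if r.liked else "disliked"].append(r.response)

    pairs = []
    for prompt, bucket in by_prompt.items():
        # A pair needs at least one liked and one disliked reply to the same prompt.
        for chosen in bucket["liked"]:
            for rejected in bucket["disliked"]:
                pairs.append((prompt, chosen, rejected))
    return pairs
```

A prompt that only ever receives one kind of click yields no pairs, which is one reason a large volume of feedback is needed before it becomes useful.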
Today, domestic ChatGPT-like products, from Baidu's Wenxin Yiyan and Fudan's MOSS to Zhipu's ChatGLM, all provide feedback options.
But in the eyes of most users who are just trying them out, the defining attribute of these large-model products is that they are "toys".
When users run into an incorrect or unsatisfactory answer, they simply close the dialogue window, which does nothing to help the large model behind it collect human feedback.