The secret of the domestic ChatGPT 'shell' has now been found
"iFlytek is a cover-up for ChatGPT!" "Baidu Wenxin is a cover-up for Stable Diffusion!" "SenseTime's big model is actually plagiarism!"...
This is far from the first time outsiders have cast doubt on domestically produced large models.
Industry insiders explain the phenomenon this way: there is a genuine shortage of high-quality Chinese datasets, so when training their models, teams can only bring in purchased foreign-language annotated datasets as "foreign aid". If the datasets used for training happen to overlap, similar outputs get generated, leading to embarrassing mix-ups like the ones above.
As for the alternatives: using existing large models to help generate training data tends to suffer from insufficient cleaning; reusing tokens leads to overfitting; and training only sparse large models is not a long-term solution.
The industry has gradually formed a consensus:
The road to AGI will keep placing extremely high demands on both the quantity and the quality of data.
It is against this backdrop that, over the past two months, many domestic teams have successively open sourced Chinese datasets. Beyond general-purpose collections, dedicated open source Chinese datasets have also been released for vertical domains such as programming and medicine.
New breakthroughs in large models rely heavily on high-quality and rich data sets.
The scaling law described in OpenAI's paper "Scaling Laws for Neural Language Models" shows that increasing the amount of training data on its own makes the pre-trained model perform better.
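As a rough illustration of that result, the paper fits the pre-training loss as a power law in the dataset size D alone; the constants below are the paper's empirical fits, quoted only approximately:

```latex
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_D \approx 0.095,\quad D_c \approx 5.4 \times 10^{13}\ \text{tokens}
```

In other words, with everything else held fixed, more data alone pushes the loss down, though with diminishing returns.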
And this is not just OpenAI's view.
DeepMind likewise pointed out in its Chinchilla paper that most earlier large models were under-trained, and proposed a compute-optimal training formula that has since become a recognized standard in the industry.
△ Among mainstream large models, Chinchilla has the fewest parameters but the most adequate training
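As a back-of-the-envelope illustration of that formula, here is a minimal sketch assuming the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter and the standard ~6·N·D estimate of training FLOPs (both approximations, not the paper's exact fitted coefficients):

```python
# Rough compute-optimal sizing per the Chinchilla rule of thumb:
# train on ~20 tokens per model parameter, at ~6 * N * D FLOPs in total.
TOKENS_PER_PARAM = 20      # approximate ratio reported by DeepMind
FLOPS_PER_PARAM_TOKEN = 6  # standard estimate for one forward+backward pass

def chinchilla_budget(n_params: float) -> dict:
    """Estimate compute-optimal training tokens and FLOPs for a given model size."""
    tokens = TOKENS_PER_PARAM * n_params
    flops = FLOPS_PER_PARAM_TOKEN * n_params * tokens
    return {"params": n_params, "tokens": tokens, "train_flops": flops}

# Example: a 70B-parameter model calls for on the order of 1.4 trillion tokens,
# which is roughly how Chinchilla itself was trained.
print(chinchilla_budget(70e9))
```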
However, the mainstream datasets used for training are mainly in English, such as Common Crawl, BooksCorpus, Wikipedia and ROOTS; in Common Crawl, the most widely used of them, Chinese accounts for only 4.8% of the data.

What about Chinese datasets, then? Public ones do exist - Qubits confirmed this with Zhou Ming, founder and CEO of Lanzhou Technology and one of the most accomplished Chinese researchers in NLP today - for example the named-entity datasets MSRA-NER and Weibo-NER, as well as CMRC2018, CMRC2019 and ExpMRC2022, all of which can be found on GitHub. But their overall volume is a drop in the bucket compared with English datasets.

Moreover, some of them are dated and do not reflect the latest NLP research concepts (work on new concepts tends to appear first in English on arXiv).

So although high-quality Chinese datasets exist, they are few in number and cumbersome to use - a harsh reality that every team doing large-model research has to face. At an earlier forum held by Tsinghua University's Department of Electronic Engineering, Tang Jie, a professor in Tsinghua's Department of Computer Science, shared that when preparing pre-training data for the 100-billion-parameter model ChatGLM-130B, the team found that after cleaning, the usable Chinese data came to less than 2TB.

Solving the shortage of high-quality datasets in the Chinese-speaking world is urgent. One workable approach is to train the large model directly on English data. In Chatbot Arena, the anonymous large-model "arena" rated by human players, GPT-3.5 ranks second on the non-English leaderboard (GPT-4 is first). Bear in mind that 96% of GPT-3.5's training data is English; within the remainder, the share of Chinese is so small it is measured in thousandths.
A PhD candidate on a large-model team at one of China's top three universities revealed that, if one does not mind the extra hassle, this approach can even be taken further by bolting a translation layer onto the model: translate every input language into English, hand it to the model, then translate the model's output back into Chinese and return it to the user. A large model fed this way, however, always "thinks in English". When it meets content with distinctly Chinese characteristics - rewriting idioms, understanding colloquialisms, reworking articles - it often handles them poorly, producing translation errors or subtle cultural deviations.
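A minimal sketch of the translate-and-wrap pipeline the student describes might look like the following; the translate() and generate() calls are hypothetical placeholders, not any specific API:

```python
def translate(text: str, source: str, target: str) -> str:
    """Hypothetical hook into a machine-translation service."""
    raise NotImplementedError  # plug in any MT system here

class EnglishOnlyModel:
    """Hypothetical stand-in for a large model trained mostly on English."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

def answer_in_chinese(model: EnglishOnlyModel, user_input_zh: str) -> str:
    # 1. Convert the Chinese query into English before it reaches the model.
    prompt_en = translate(user_input_zh, source="zh", target="en")
    # 2. Let the English-trained model do the actual reasoning.
    reply_en = model.generate(prompt_en)
    # 3. Convert the English reply back to Chinese for the user.
    return translate(reply_en, source="en", target="zh")
```

The two translation hops are exactly where idioms and colloquialisms get mangled, which is the weakness the student points out.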
The other solution is to collect, clean and label Chinese corpora, build new high-quality Chinese datasets, and feed those to large models. Having recognized the situation, many domestic large-model teams chose this second path and began building datasets from their private stores of data.
Baidu has content-ecosystem data, Tencent has official-account data, Zhihu has Q&A data, and Alibaba has e-commerce and logistics data.
With their different accumulations of private data, each can build a core competitive barrier in specific scenarios and fields; strictly collecting, sorting, filtering, cleaning and labeling that data helps ensure the effectiveness and accuracy of the resulting model.
Large-model teams without such obvious private-data advantages, meanwhile, have started crawling data across the entire web (and the volume of crawled data is predictably enormous).
To build its Pangu large model, Huawei crawled 80TB of text from the Internet and eventually cleaned it down to a 1TB Chinese dataset; the Chinese dataset used to train Inspur's Source 1.0 reached 5000GB (versus the 570GB used to train GPT-3); and the recently released Tianhe Tianyuan large model is the product of the Tianjin Supercomputing Center collecting global web data and folding in various open source training data and domain-specific datasets.
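None of these teams have published their exact pipelines, but the kind of filtering that shrinks 80TB of raw crawl into roughly 1TB of usable text typically combines normalization, length filtering and deduplication. A minimal, purely illustrative sketch:

```python
import hashlib
import re

def clean_corpus(raw_docs):
    """Illustrative cleaning pass: normalize, filter, and deduplicate documents."""
    seen_hashes = set()
    for doc in raw_docs:
        text = re.sub(r"\s+", " ", doc).strip()   # collapse whitespace
        if len(text) < 50:                        # drop fragments too short to train on
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                 # drop exact duplicates
            continue
        seen_hashes.add(digest)
        yield text

# Example over an in-memory list; real pipelines stream from disk at TB scale.
docs = ["  今天 天气 不错。" * 20, "  今天 天气 不错。" * 20, "短文本"]
print(len(list(clean_corpus(docs))))  # -> 1 (one duplicate and one short doc removed)
```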
At the same time, over the past two months, an "everyone gathering firewood" effect has emerged around Chinese datasets -
Many teams have successively released open source Chinese datasets to make up for the deficiencies and imbalances of the Chinese open source datasets currently available.
Some of them are organized as follows:
As more Chinese datasets are open sourced and brought into the spotlight, the industry's attitude has been one of welcome. Zhang Peng, founder and CEO of Zhipu AI, for example, put it this way:
High-quality Chinese data has simply been hidden away out of sight. Now that everyone is aware of the problem, corresponding solutions will naturally follow - open sourcing data being one of them.
In short, things are moving in a good direction, aren't they?
It is worth noting that in addition to pre-training data, human feedback data is also indispensable at this stage.
A ready-made example is right in front of us:
Compared with GPT-3, the important "buff" ChatGPT adds is RLHF (Reinforcement Learning from Human Feedback): fine-tuning on high-quality labeled data generated with human feedback, which yields a large model aligned with human intentions.
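In that pipeline, human preferences are typically distilled into a reward model trained on pairs of responses, which then guides the reinforcement-learning step. A minimal sketch of the standard pairwise loss (illustrative only; the scores here are toy scalars rather than the output of a real reward network):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred reply's reward above the other's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: for each prompt, the annotator preferred reply A over reply B.
reward_a = torch.tensor([1.2, 0.3, 2.0])   # reward scores for the chosen replies
reward_b = torch.tensor([0.4, 0.9, 1.1])   # reward scores for the rejected replies
print(pairwise_reward_loss(reward_a, reward_b))
```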
The most direct way to provide human feedback is to tell the AI assistant "your answer is wrong", or simply to click like or dislike next to the reply it generates.
Whoever gets their product into users' hands first can collect a wave of feedback and get the snowball rolling - one of the reasons everyone is rushing to release large models.
Now, domestic ChatGPT-like products, from Baidu's Wenxin Yiyan and Fudan's MOSS to Zhipu's ChatGLM, all provide feedback options.
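What those feedback buttons ultimately yield is a stream of labeled interactions. A hypothetical, minimal record format for such signals might look like this (the field names are illustrative, not any product's actual schema):

```python
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One piece of human feedback harvested from the chat interface."""
    prompt: str           # what the user asked
    response: str         # what the model replied
    thumbs_up: bool       # the like/dislike click
    correction: str = ""  # optional free text: "your answer is wrong, it should be ..."

# Records with thumbs_up=False plus a correction are the most useful for RLHF-style tuning.
example = FeedbackRecord(prompt="解释一下“画蛇添足”",
                         response="……",
                         thumbs_up=False,
                         correction="这个成语的意思是多此一举。")
```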
But in the eyes of most users who have tried them, the defining attribute of these large-model products is still "toy".
When they hit an incorrect or unsatisfactory answer, they simply close the chat window - which does little to help the large model behind it collect human feedback.