
To make up for the shortcomings of Stanford's 7 billion parameter "Alpaca", a large model proficient in Chinese is here and has been open source

PHPz
2023-03-31 22:39:14

BELLE is based on Stanford Alpaca and optimized for Chinese. Model tuning uses only data produced by ChatGPT (and no other data).

It has been almost four months since the initial release of ChatGPT. When GPT-4 was released last week, ChatGPT was immediately upgraded to the new model. But it is an open secret that neither ChatGPT nor GPT-4 is likely to be open-sourced. Combined with the huge investment in compute and the massive training data required, the research community faces many hurdles in replicating them.

Faced with the onslaught of large models such as ChatGPT, open-source alternatives are a good choice. At the beginning of this month, Meta "open sourced" a new series of large models, LLaMA (Large Language Model Meta AI), with parameter counts ranging from 7 billion to 65 billion. The 13-billion-parameter LLaMA model outperforms the 175-billion-parameter GPT-3 "on most benchmarks" and can run on a single V100 GPU.

A few days later, Stanford fine-tuned Alpaca, a new 7-billion-parameter model, from LLaMA 7B. They used the technique introduced in the Self-Instruct paper to generate 52K instruction-following examples, with some modifications. In preliminary human evaluations, the Alpaca 7B model performed similarly to text-davinci-003 (GPT-3.5) on the Self-Instruct evaluation set.

Unfortunately, Alpaca's seed tasks are all in English, and the data collected from them is also in English, so the trained model is not optimized for Chinese. Is there a better way to improve a dialogue model's effectiveness in Chinese? Don't worry, the project introduced next solves this problem very well.

BELLE (Bloom-Enhanced Large Language model Engine), an open-source Chinese dialogue model with 7 billion parameters, is here. It is based on Stanford's Alpaca, with optimization for Chinese and some modifications to the data-generation code. Moreover, model tuning uses only data produced by ChatGPT (and no other data).

On the data side, the project open-sourced its Alpaca-based data-collection code. With this code, about 1 million Chinese samples were generated and, combined with Alpaca's 50,000 English samples, used to train the BLOOMZ-7B model. The resulting checkpoint has been uploaded to Hugging Face.

Hugging Face Address: https://huggingface.co/BelleGroup

The project authors state that this project aims to promote the development of the open-source community for Chinese large dialogue models.


Project address: https://github.com/LianjiaTech/BELLE

Project introduction

The project mainly includes the following four parts:

  • 175 Chinese seed tasks
  • Code to generate data
  • 0.5M generated data
  • The final model, fine-tuned from BLOOMZ-7B1-mt

Data release

1. zh_seed_tasks.jsonl: contains 175 seed tasks; a sample is shown below:

{"id ": "seed_task_20", "name": "horror_movie_opening", "instruction": "You need to write a creative opening scene for a horror movie.", "instances": [{"input": "","output ":" The sun had set, leaving behind a dark town. A gentle breeze blew through the empty streets, sending a chill through anyone who ventured outside. The only sound was the slight rustling of leaves blown by the wind. Sound. Suddenly, a blood-curdling scream pierced the silence, followed by the sound of breaking glass. A house turned on its lights, and a figure could be seen running towards the center of the town. When > The figure became more and more When I got closer, I could clearly see that it was a young woman, covered in blood."}],"is_classification": false}

2. prompt_cn.txt: the prompt used to generate the data

3. 0.5M generated data

Data generation

Follow Alpaca’s method:

pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_API_KEY
python generate_instruction.py generate_instruction_following_data

By default, the Completion API is used with the text-davinci-003 model. To use the Chat API with the gpt-3.5-turbo model instead, control it via parameters:

python generate_instruction.py generate_instruction_following_data \
    --api=chat --model_name=gpt-3.5-turbo

The output file is Belle.train.json, which can be manually filtered before use.
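The exact filtering criteria are left to the user. A minimal sketch of such a filtering pass might look like the following; it assumes the Alpaca-style "instruction"/"input"/"output" fields, so adjust it if the actual file schema differs:

import json

# Hedged sketch: a simple manual/heuristic filtering pass over Belle.train.json.
with open("Belle.train.json", encoding="utf-8") as f:
    samples = json.load(f)  # assumed to be a JSON array of Alpaca-style objects

def keep(sample):
    # Example heuristic only: drop empty or suspiciously short outputs.
    output = sample.get("output", "").strip()
    return len(output) >= 5

filtered = [s for s in samples if keep(s)]
print(f"kept {len(filtered)} / {len(samples)} samples")

with open("Belle.train.filtered.json", "w", encoding="utf-8") as f:
    json.dump(filtered, f, ensure_ascii=False, indent=2)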

Model tuning

This project fine-tunes the BLOOMZ-7B1-mt model on Belle.train.json. The specific training parameters are as follows:

[Image: table of specific training parameters]
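The project's actual training code and hyperparameters are those in the repository and the table above. Purely as orientation, a minimal instruction fine-tuning pass on BLOOMZ-7B1-mt with Hugging Face Transformers could be sketched as follows; the hyperparameter values are placeholders, not the project's settings, and Alpaca-style fields are assumed for Belle.train.json:

import json
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

with open("Belle.train.json", encoding="utf-8") as f:
    samples = json.load(f)  # assumed Alpaca-style fields; adjust as needed

def to_text(s):
    # Concatenate instruction (+ optional input) and target output for causal LM training.
    prompt = s["instruction"] + ("\n" + s["input"] if s.get("input") else "")
    return prompt + "\n" + s["output"] + tokenizer.eos_token

dataset = Dataset.from_dict({"text": [to_text(s) for s in samples]})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="belle-sft",
        per_device_train_batch_size=1,     # placeholder values, not BELLE's settings
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,                         # if the hardware supports it
        logging_steps=50,
        save_strategy="epoch",
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()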

In addition, the project also trained the model on instruction-learning datasets of different sizes (200,000, 600,000, 1 million, and 2 million samples); the resulting model versions are as follows:

[Image: table of model versions trained on datasets of different sizes]

Model usage examples

[Image: screenshots of model usage examples]
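To try a released checkpoint, something along the following lines should work. The checkpoint name and the "Human:/Assistant:" prompt format are assumptions for illustration; check the model cards on the BelleGroup Hugging Face page for the exact usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BelleGroup/BELLE-7B-2M"  # example name; pick one from the BelleGroup page
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto")

# "Please write a poem about spring" in the assumed Human/Assistant prompt format.
prompt = "Human: 请写一首关于春天的诗\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True,
                         top_p=0.85, temperature=0.35)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))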

Limitations and usage restrictions

The SFT model trained on the current data and base model still has the following problems:

  • Instructions involving factual knowledge may produce answers that contradict the facts.
  • Hazardous instructions cannot be well identified, resulting in harmful remarks.
  • The model's capabilities still need to be improved in some scenarios involving reasoning, coding, etc.
Given the above limitations, this project requires that developers use the open-source code, data, models, and any derivatives of this project for research purposes only, and not for commercial use or any other purpose that would cause harm to society.

