To make up for the shortcomings of Stanford's 7-billion-parameter 'Alpaca', a large model proficient in Chinese is here and has been open sourced
BELLE is based on Stanford Alpaca and optimized for Chinese. The model is tuned only on data produced by ChatGPT (no other data is used).
It has been almost four months since ChatGPT was first released. When GPT-4 came out last week, ChatGPT was immediately upgraded to the new model. But it is an open secret that neither ChatGPT nor GPT-4 is likely to be open sourced. Combined with the huge investment in compute and the massive training data required, the research community faces many hurdles in replicating them.
Faced with the onslaught of large models such as ChatGPT, open-source alternatives are a good option. At the beginning of this month, Meta "open sourced" a new series of large models, LLaMA (Large Language Model Meta AI), with parameter counts ranging from 7 billion to 65 billion. The 13-billion-parameter LLaMA model outperforms the 175-billion-parameter GPT-3 "on most benchmarks" and can run on a single V100 GPU.
A few days later, Stanford fine-tuned Alpaca, a new 7-billion-parameter model, on top of LLaMA 7B. They used the technique introduced in the Self-Instruct paper, with some modifications, to generate 52K instruction-following examples. In preliminary human evaluation, the Alpaca 7B model performed similarly to text-davinci-003 (GPT-3.5) on the Self-Instruct instruction evaluation.
Unfortunately, Alpaca's seed tasks are all in English, and the collected data is in English as well, so the resulting model is not optimized for Chinese. Is there a better way to improve the effectiveness of such a dialogue model in Chinese? Don't worry, the project introduced next solves this problem nicely.
BELLE (Bloom-Enhanced Large Language model Engine), an open-source Chinese dialogue model with 7 billion parameters, is here. It is based on Stanford's Alpaca, but optimized for Chinese and with some modifications to the data-generation code. Not only that, model tuning uses only data produced by ChatGPT (and no other data).
On the data side, the project open sourced the data collection code, which is based on Alpaca. Using this code, about 1 million Chinese examples were generated and, combined with the 50,000 English examples from Alpaca, used to train the BLOOMZ-7B model. The resulting checkpoint has been uploaded to Hugging Face.
Hugging Face Address: https://huggingface.co/BelleGroup
The project author stated: this project aims to promote the development of the open-source community around Chinese large dialogue models.
Project address: https://github.com/LianjiaTech/BELLE
The project mainly includes the following four parts:
Data release
1. zh_seed_tasks.jsonl: contains 175 seed tasks; a sample is shown below, and a minimal loading sketch follows this list.
{"id ": "seed_task_20", "name": "horror_movie_opening", "instruction": "You need to write a creative opening scene for a horror movie.", "instances": [{"input": "","output ":" The sun had set, leaving behind a dark town. A gentle breeze blew through the empty streets, sending a chill through anyone who ventured outside. The only sound was the slight rustling of leaves blown by the wind. Sound. Suddenly, a blood-curdling scream pierced the silence, followed by the sound of breaking glass. A house turned on its lights, and a figure could be seen running towards the center of the town. When > The figure became more and more When I got closer, I could clearly see that it was a young woman, covered in blood."}],"is_classification": false}
2. prompt_cn.txt: the prompt used to generate the data
3. 0.5M entries of generated data
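As a quick way to inspect the released seed tasks, a few lines of Python are enough. The field names below follow the sample shown above; this snippet is only an illustrative sketch, not part of the project code:

import json

# Read zh_seed_tasks.jsonl; each line is one JSON object with the fields shown
# in the sample above (id, name, instruction, instances, is_classification).
seed_tasks = []
with open("zh_seed_tasks.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            seed_tasks.append(json.loads(line))

print(f"{len(seed_tasks)} seed tasks loaded")   # 175 expected
print(seed_tasks[0]["instruction"])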
Data generation
Follow Alpaca’s method:
pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_API_KEY
python generate_instruction.py generate_instruction_following_data
The Completion API (model text-davinci-003) is used by default. If you want to use the Chat API with the gpt-3.5-turbo model instead, you can control this through parameters:
python generate_instruction.py generate_instruction_following_data \
    --api=chat --model_name=gpt-3.5-turbo
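For reference, a single Chat API call of the kind this script makes might look like the following sketch. It uses the pre-1.0 openai Python package that was current when the project was released, and the prompt text is purely illustrative (the real prompts live in prompt_cn.txt):

import os
import openai  # pre-1.0 interface exposing openai.ChatCompletion

openai.api_key = os.environ["OPENAI_API_KEY"]

# Illustrative prompt only; generate_instruction.py builds its prompts from
# prompt_cn.txt together with sampled seed tasks.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "请生成20个多样化的中文指令任务。"}],
    temperature=1.0,
    max_tokens=2048,
)
print(response["choices"][0]["message"]["content"])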
The output is written to Belle.train.json and can be manually filtered before use.
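A simple automated filtering pass can also help before tuning. The sketch below assumes Belle.train.json is a JSON array of Alpaca-style records with instruction/input/output fields (adjust the loading code if the file is actually JSON-lines):

import json

# Assumed layout: a JSON array of records with "instruction", "input", "output".
with open("Belle.train.json", encoding="utf-8") as f:
    records = json.load(f)

def keep(record):
    # Drop records whose output is empty or very short.
    return len(record.get("output", "").strip()) >= 5

filtered = [r for r in records if keep(r)]
print(f"kept {len(filtered)} / {len(records)} records")

with open("Belle.train.filtered.json", "w", encoding="utf-8") as f:
    json.dump(filtered, f, ensure_ascii=False, indent=2)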
Model tuning
This project fine-tunes the BLOOMZ-7B1-mt model on Belle.train.json. The specific training parameters are as follows:
In addition, the project also trains the model on instruction-learning datasets of different sizes (200,000, 600,000, 1 million, and 2 million samples); the corresponding model versions are as follows:
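To make the tuning step more concrete, here is a minimal, hypothetical sketch of instruction fine-tuning BLOOMZ-7B1-mt on Belle.train.json with the Hugging Face transformers Trainer. Every hyperparameter below is a placeholder rather than the project's actual setting, and a 7-billion-parameter model realistically requires multiple GPUs (e.g. with DeepSpeed) or parameter-efficient methods:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("json", data_files="Belle.train.json")["train"]

def format_and_tokenize(example):
    # Assumed Alpaca-style fields (instruction / input / output); verify against the file.
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return tokenizer(prompt + "\n" + example["output"] + tokenizer.eos_token,
                     truncation=True, max_length=1024)

train_ds = raw.map(format_and_tokenize, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="belle-sft",
    per_device_train_batch_size=2,    # placeholder
    gradient_accumulation_steps=16,   # placeholder
    num_train_epochs=3,               # placeholder
    learning_rate=2e-5,               # placeholder
    fp16=True,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()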
Model usage examples
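As a rough guide to trying the released checkpoints, loading one with the transformers library might look like the sketch below. The checkpoint name and the "Human/Assistant" prompt format are assumptions, so check the model cards under https://huggingface.co/BelleGroup for the actual names and recommended prompts:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "BelleGroup/BELLE-7B-2M"  # assumed checkpoint name; see the BelleGroup page
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # device_map requires accelerate
)

# Assumed instruction format: "Human: ...\n\nAssistant:"
prompt = "Human: 请写一首关于春天的短诗\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True,
                         top_p=0.85, temperature=0.35)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))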
Limitations and usage restrictions
The SFT model trained on the current data and base model still has a number of limitations in terms of effectiveness.