


BELLE is based on Stanford Alpaca and optimized for Chinese. Model tuning only uses data produced by ChatGPT (does not include any other data).
It has been almost four months since the initial release of ChatGPT. When GPT-4 was released last week, ChatGPT immediately launched the new version. But a well-known secret is that neither ChatGPT nor GPT-4 are likely to be open source. Coupled with the huge investment in computing power and massive training data, there are many hurdles for the research community to replicate its implementation process.
Faced with the onslaught of large models such as ChatGPT, open source replacement is a good choice. At the beginning of this month, Meta "open sourced" a new large model series - LLaMA (Large Language Model Meta AI), with parameter sizes ranging from 7 billion to 65 billion. The 13 billion parameter LLaMA model outperforms the 175 billion parameter GPT-3 "on most benchmarks" and can run on a single V100 GPU.
After a few days, Stanford fine-tuned a new model Alpaca with 7 billion parameters based on LLaMA 7B. They used the technology introduced in the Self-Instruct paper to generate 52K instruction data, and made some modifications. , In preliminary human evaluations, the Alpaca 7B model performed similarly to the text-davinci-003 (GPT-3.5) model on Self-Instruct instruction evaluation.
Unfortunately, Alpaca’s seed tasks are all in English, and the data collected are also in English, so the trained model is not optimized for Chinese. In order to improve the effectiveness of the dialogue model in Chinese, is there a better way? Don’t worry, the project introduced next can solve this problem very well.
Open source Chinese dialogue large modelBELLE (Bloom-Enhanced Large Language model Engine) with 7 billion parameters is here. It is based on Stanford's Alpaca, but with Chinese optimization and some modifications to the generated code. Not only that, model tuning only uses data produced by ChatGPT (does not contain any other data).
In terms of data, the project open sourced the data collection code based on Alpaca. Based on this code, about 1 million pieces of Chinese data were generated. Combined with 50,000 pieces of English data from Alpaca, it was trained on the BLOOMZ-7B model. The checkpoint is uploaded to Hugging Face.
Hugging Face Address: https://huggingface.co/BelleGroup
The project author stated: This project aims to promote the development of the Chinese dialogue large model open source community.
Project address: https://github.com/LianjiaTech/BELLE
Project introduction
The project mainly includes the following four parts Content:
- 175 Chinese seed tasks
- Code to generate data
- 0.5M generated data
- Based on BLOOMZ-7B1-mt optimization The final model
Data release
1. zh_seed_tasks.jsonl: Contains 175 seed tasks, the sample is as follows
{"id ": "seed_task_20", "name": "horror_movie_opening", "instruction": "You need to write a creative opening scene for a horror movie.", "instances": [{"input": "","output ":" The sun had set, leaving behind a dark town. A gentle breeze blew through the empty streets, sending a chill through anyone who ventured outside. The only sound was the slight rustling of leaves blown by the wind. Sound. Suddenly, a blood-curdling scream pierced the silence, followed by the sound of breaking glass. A house turned on its lights, and a figure could be seen running towards the center of the town. When > The figure became more and more When I got closer, I could clearly see that it was a young woman, covered in blood."}],"is_classification": false}
2. prompt_cn.txt: The prompt used to generate
3. 0.5M generated data
Data generation
Follow Alpaca’s method:
pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_API_KEY
python generate_instruction.py generate_instruction_following_data
Use Completion API by default, model text-davinci-003. If you want to use the Chat API and use the gpt-3.5-turbo model, you can control it through parameters:
python generate_instruction.py generate_instruction_following_data
--api=chat --model_name=gpt-3.5-turbo
The output file is in Belle.train.json and can be manually filtered before use.
Model tuning
This project is based on the BLOOMZ-7B1-mt model and the Belle.train.json training model. The specific parameters are as follows:
In addition, the project also uses instruction learning data sets of different sizes (200,000, 600,000, 1 million and 2 million samples) to train the model, and the different model versions are as follows:
Model usage examples
##Limitations and usage restrictions
The SFT model trained based on the current data and the basic model still has the following problems in terms of effect:- Instructions involving factuality may produce wrong answers that go against the facts.
- Hazardous instructions cannot be well identified, resulting in harmful remarks.
- The model's capabilities still need to be improved in some scenarios involving reasoning, coding, etc.
- Based on the limitations of the above model, this project requires developers to only use open source code, data, models and subsequent derivatives generated by this project for research purposes, and shall not use them for business or other purposes that will harm society. Harmful uses.
The above is the detailed content of To make up for the shortcomings of Stanford's 7 billion parameter 'Alpaca', a large model proficient in Chinese is here and has been open source. For more information, please follow other related articles on the PHP Chinese website!

Running large language models at home with ease: LM Studio User Guide In recent years, advances in software and hardware have made it possible to run large language models (LLMs) on personal computers. LM Studio is an excellent tool to make this process easy and convenient. This article will dive into how to run LLM locally using LM Studio, covering key steps, potential challenges, and the benefits of having LLM locally. Whether you are a tech enthusiast or are curious about the latest AI technologies, this guide will provide valuable insights and practical tips. Let's get started! Overview Understand the basic requirements for running LLM locally. Set up LM Studi on your computer

Guy Peri is McCormick’s Chief Information and Digital Officer. Though only seven months into his role, Peri is rapidly advancing a comprehensive transformation of the company’s digital capabilities. His career-long focus on data and analytics informs

Introduction Artificial intelligence (AI) is evolving to understand not just words, but also emotions, responding with a human touch. This sophisticated interaction is crucial in the rapidly advancing field of AI and natural language processing. Th

Introduction In today's data-centric world, leveraging advanced AI technologies is crucial for businesses seeking a competitive edge and enhanced efficiency. A range of powerful tools empowers data scientists, analysts, and developers to build, depl

This week's AI landscape exploded with groundbreaking releases from industry giants like OpenAI, Mistral AI, NVIDIA, DeepSeek, and Hugging Face. These new models promise increased power, affordability, and accessibility, fueled by advancements in tr

But the company’s Android app, which offers not only search capabilities but also acts as an AI assistant, is riddled with a host of security issues that could expose its users to data theft, account takeovers and impersonation attacks from malicious

You can look at what’s happening in conferences and at trade shows. You can ask engineers what they’re doing, or consult with a CEO. Everywhere you look, things are changing at breakneck speed. Engineers, and Non-Engineers What’s the difference be

Simulate Rocket Launches with RocketPy: A Comprehensive Guide This article guides you through simulating high-power rocket launches using RocketPy, a powerful Python library. We'll cover everything from defining rocket components to analyzing simula


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Atom editor mac version download
The most popular open source editor

SublimeText3 Linux new version
SublimeText3 Linux latest version

SublimeText3 Mac version
God-level code editing software (SublimeText3)

SublimeText3 English version
Recommended: Win version, supports code prompts!

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.