Large models are ushering in the 'open source season': taking stock of the open source LLMs and datasets of the past month
Some time ago, a leaked internal Google document argued that although OpenAI and Google appear to be chasing each other on large AI models, the real winner may come from neither of them, because a third force is quietly rising. That force is open source.
Around Meta's open source LLaMA model, the community is rapidly building models with capabilities approaching those of OpenAI's and Google's large models. Moreover, open source models iterate faster, are more customizable, and offer more privacy.
Recently, Sebastian Raschka, former assistant professor at the University of Wisconsin-Madison and lead AI educator at the startup Lightning AI, remarked that the past month has been a great one for open source.
However, with so many large language models (LLMs) emerging one after another, it is not easy to keep track of them all. So, in this article, Sebastian shares resources and research insights on the latest open source LLMs and datasets.
So many research papers appeared over the past month that it is hard to pick favorites for in-depth discussion. Sebastian prefers papers that offer additional insights over those that simply demonstrate more powerful models. With that in mind, the first thing that caught his attention was the Pythia paper, co-authored by researchers from EleutherAI, Yale University, and other institutions.
Paper address: https://arxiv.org/pdf/2304.01373.pdf
Pythia: Gaining insights from large-scale training
The open source Pythia family of large models is an interesting alternative to other autoregressive decoder-style models (i.e., GPT-like models). The paper reveals some interesting insights into the training mechanism and introduces the corresponding models, which range from 70M to 12B parameters.
The Pythia model architecture is similar to GPT-3 but includes improvements such as Flash attention (like LLaMA) and rotary positional embeddings (like PaLM). Pythia was trained on 300B tokens from the 800GB diverse text dataset Pile (1 epoch on the regular Pile, 1.5 epochs on the deduplicated Pile).
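For readers who want to poke at these checkpoints directly, below is a minimal sketch of loading one of the Pythia models through Hugging Face transformers. The model names follow the EleutherAI/pythia-&lt;size&gt; naming convention on the Hub (the "-deduped" suffix selecting the deduplicated-Pile variant) — this is an assumption about how the checkpoints are published, not something stated in the paper.

```python
# A minimal sketch of loading a Pythia checkpoint for inspection via Hugging Face
# transformers. Pythia uses the GPT-NeoX architecture, hence GPTNeoXForCausalLM.
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model_name = "EleutherAI/pythia-70m-deduped"  # smallest member of the 70M-12B family
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Count parameters to confirm which member of the family we loaded.
n_params = sum(p.numel() for p in model.parameters())
print(f"{model_name}: {n_params / 1e6:.1f}M parameters")

prompt = "The Pile is a diverse text dataset"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```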
The following are some insights and reflections from the Pythia paper:
Open Source Data
The past month has been particularly exciting for open source AI, with several open source LLM implementations and a number of open source datasets being released. These datasets include Databricks Dolly 15k and OpenAssistant Conversations (OASST1) for instruction fine-tuning, and RedPajama for pre-training. These dataset efforts are especially laudable because data collection and cleaning make up 90% of real-world machine learning projects, yet few people enjoy that work.
Databricks-Dolly-15k dataset
Databricks-Dolly-15k is a dataset for LLM fine-tuning consisting of more than 15,000 instruction pairs written by thousands of Databricks employees (similar to the data used to train systems like InstructGPT and ChatGPT).
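As a quick illustration, here is a sketch of loading the dataset and turning each record into a prompt/response pair for supervised instruction fine-tuning. The dataset id and the field names (instruction, context, response) are assumptions based on how the dataset is commonly published on the Hugging Face Hub, and the prompt template is just one common convention.

```python
# A sketch of loading databricks-dolly-15k with the `datasets` library and formatting
# each record for instruction fine-tuning.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_example(example):
    # Some records carry an optional context passage in addition to the instruction.
    if example["context"]:
        prompt = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Context:\n{example['context']}\n\n### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"prompt": prompt, "response": example["response"]}

formatted = dolly.map(format_example)
print(formatted[0]["prompt"])
```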
OASST1 Dataset
The OASST1 dataset is used to fine-tune pretrained LLMs on a collection of ChatGPT-assistant-like conversations created and annotated by humans. It contains 161,443 messages written in 35 languages and 461,292 quality assessments, organized in more than 10,000 fully annotated dialogue trees.
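Because the conversations are distributed as a flat table of messages, a small amount of bookkeeping is needed to recover the dialogue trees. The sketch below assumes the messages expose message_id, parent_id, text, and role fields (an assumption about the dataset's schema) and that roots have a null parent_id.

```python
# A sketch of reconstructing OASST1 dialogue trees from the flat message table.
from collections import defaultdict
from datasets import load_dataset

oasst = load_dataset("OpenAssistant/oasst1", split="train")

children = defaultdict(list)
roots = []
for msg in oasst:
    if msg["parent_id"] is None:
        roots.append(msg)       # a root message starts a new dialogue tree
    else:
        children[msg["parent_id"]].append(msg)

def print_tree(message, depth=0):
    """Recursively print one dialogue tree, indenting prompter/assistant turns by depth."""
    print("  " * depth + f"[{message['role']}] {message['text'][:60]}")
    for child in children[message["message_id"]]:
        print_tree(child, depth + 1)

print_tree(roots[0])
```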
RedPajama dataset for pre-training
RedPajama is an open source dataset for LLM pre-training, modeled after the data used to train Meta's SOTA LLaMA model. The dataset aims to enable an open source competitor to the most popular LLMs, which are currently either closed-source commercial models or only partially open source.
The bulk of RedPajama consists of CommonCrawl data filtered to English-language websites, while the Wikipedia portion covers 20 different languages.
LongForm Dataset
The paper "The LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction" introduces a collection of human-created documents drawn from existing corpora such as C4 and Wikipedia, together with instructions generated for these documents, thereby creating an instruction-tuning dataset suitable for long text generation.
Paper address: https://arxiv.org/abs/2304.08460
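The core "reverse instruction" idea is simple to sketch: take an existing document and ask an LLM what instruction could have produced it, then use the document itself as the long-form target. The helper `query_llm` below is hypothetical, standing in for whatever completion API or local model you use, and the prompt wording is illustrative rather than the authors' exact template.

```python
# A schematic sketch of the corpus-extraction idea behind LongForm.
def make_instruction_example(document: str, query_llm) -> dict:
    prompt = (
        "Below is a passage of text. Write a single instruction that a user could "
        "have given to produce this passage as the answer.\n\n"
        f"Passage:\n{document}\n\nInstruction:"
    )
    instruction = query_llm(prompt).strip()
    # The original document becomes the long-form target output.
    return {"instruction": instruction, "output": document}
```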
Alpaca Libre project
The Alpaca Libre project aims to recreate the Alpaca project by converting 100k MIT-licensed demonstrations from Anthropic's HH-RLHF repository into an Alpaca-compatible format.
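To make the idea concrete, here is a rough sketch of the kind of conversion involved: turning a single-turn HH-RLHF record into the Alpaca {"instruction", "input", "output"} format. The "chosen" field layout ("\n\nHuman: ...\n\nAssistant: ...") is an assumption about the HH-RLHF dialogue format, and multi-turn records are skipped in this simplified version; it is not the project's actual conversion script.

```python
from typing import Optional

def hh_to_alpaca(record: dict) -> Optional[dict]:
    """Convert one HH-RLHF record to an Alpaca-style instruction example."""
    dialogue = record["chosen"]
    human_tag, assistant_tag = "\n\nHuman: ", "\n\nAssistant: "
    if dialogue.count(human_tag) != 1 or dialogue.count(assistant_tag) != 1:
        return None  # skip multi-turn conversations in this simplified sketch
    human_part, assistant_part = dialogue.split(assistant_tag, 1)
    instruction = human_part.replace(human_tag, "").strip()
    return {"instruction": instruction, "input": "", "output": assistant_part.strip()}
```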
Extending open source datasets
Instruction fine-tuning is key to evolving from a GPT-3-like pretrained base model to a more capable, ChatGPT-like large language model. Open source human-generated instruction datasets such as Databricks-Dolly-15k help achieve this. But how do we scale further without collecting additional data? One approach is to bootstrap an LLM off its own generations. Although the Self-Instruct method was proposed five months ago (and is almost outdated by today's standards), it is still a very interesting approach. It is worth emphasizing that, thanks to Self-Instruct, it is possible to align pretrained LLMs with instructions while requiring almost no annotation.
How does it work? In short, it can be divided into the following four steps:
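Roughly: (1) seed a task pool with a small set of human-written instructions; (2) prompt a pretrained LLM with examples sampled from the pool so it proposes new instructions; (3) have the LLM generate input-output instances for those new instructions; (4) filter out low-quality and near-duplicate results (e.g., via ROUGE similarity to existing tasks) before adding them back to the pool. The collected data is then used to fine-tune the base model. Below is a condensed Python sketch of this loop under those assumptions; `generate_instructions` and `generate_instances` are hypothetical helpers standing in for prompted LLM calls, and the similarity threshold is illustrative rather than the paper's exact setting.

```python
# A condensed sketch of the Self-Instruct bootstrapping loop described above.
import random
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def self_instruct(seed_tasks, generate_instructions, generate_instances,
                  target_size=52_000, sim_threshold=0.7):
    task_pool = list(seed_tasks)  # step 1: start from human-written seed instructions
    while len(task_pool) < target_size:
        examples = random.sample(task_pool, k=min(8, len(task_pool)))
        # step 2: ask the LLM for new instructions, conditioned on in-context examples
        for instruction in generate_instructions(examples):
            # step 4 (filtering): drop near-duplicates of instructions already in the pool
            too_similar = any(
                scorer.score(t["instruction"], instruction)["rougeL"].fmeasure > sim_threshold
                for t in task_pool
            )
            if too_similar:
                continue
            # step 3: let the LLM produce input/output instances for the new instruction
            for inp, out in generate_instances(instruction):
                task_pool.append({"instruction": instruction, "input": inp, "output": out})
    return task_pool  # finally, fine-tune the base LLM on this synthetic dataset
```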
In practice, this works reasonably well when measured by ROUGE scores. For example, a Self-Instruct fine-tuned LLM outperforms the GPT-3 base LLM and can compete with LLMs pretrained on large human-written instruction sets. In addition, Self-Instruct can also benefit LLMs that have already been fine-tuned on human instructions.
Of course, the gold standard for evaluating LLMs is asking human raters. Based on human evaluation, Self-Instruct outperforms base LLMs as well as LLMs trained on human instruction datasets in a supervised manner (such as SuperNI and T0 Trainer). Interestingly, however, Self-Instruct does not perform better than methods trained with reinforcement learning from human feedback (RLHF).
Human-generated vs. synthetic training data
Which is more promising, a human-generated instruction dataset or a self-instruct-style synthetic dataset? Sebastian sees a future in both. Why not start with a human-generated instruction dataset (e.g., the 15k instructions of databricks-dolly-15k) and then extend it using Self-Instruct? The paper "Synthetic Data from Diffusion Models Improves ImageNet Classification" shows that combining a real image training set with AI-generated images can improve model performance. It would be interesting to explore whether the same holds for text data.
Paper address: https://arxiv.org/abs/2304.08466
Recent paper "Better Language" Models of Code through Self-Improvement" is research in this direction. The researchers found that code generation tasks can be improved if a pretrained LLM uses its own generated data.
Paper address: https://arxiv.org/abs/2304.01228
Less is more?
In addition to pre-training and fine-tuning models on ever larger datasets, how can we improve performance using smaller datasets more efficiently? The paper "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes" proposes a distillation mechanism for training smaller task-specific models that use less training data yet exceed the performance of standard fine-tuning.
Paper address: https://arxiv.org/abs/2305.02301
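The essence of the approach is a multi-task objective: a small seq2seq student is trained to produce both the label and an LLM-extracted rationale for each input, with the two losses mixed by a weight. The sketch below illustrates that objective under assumptions of my own (the task-prefix strings, the t5-small student, and the 0.5 weight are illustrative, not the paper's exact settings).

```python
# A minimal sketch of the multi-task objective behind "Distilling Step-by-Step".
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
student = T5ForConditionalGeneration.from_pretrained("t5-small")

def distill_step_by_step_loss(question, label, rationale, rationale_weight=0.5):
    def seq2seq_loss(prefix, target):
        enc = tokenizer(prefix + question, return_tensors="pt", truncation=True)
        tgt = tokenizer(target, return_tensors="pt", truncation=True)
        return student(**enc, labels=tgt.input_ids).loss

    label_loss = seq2seq_loss("[label] ", label)              # learn to predict the answer
    rationale_loss = seq2seq_loss("[rationale] ", rationale)  # learn to predict the LLM's reasoning
    return label_loss + rationale_weight * rationale_loss

loss = distill_step_by_step_loss(
    question="Is the sky blue on a clear day?",
    label="yes",
    rationale="On a clear day, sunlight scattering makes the sky appear blue.",
)
loss.backward()
```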
Tracking open source LLMs
The number of open source LLMs is exploding. On the one hand, this is a very welcome trend (compared to gating models behind paid APIs), but on the other hand, keeping track of them all can be cumbersome. The following four resources provide different summaries of the most relevant models, including their relationships, underlying datasets, and licensing information.
The first resource is the ecosystem graph website based on the paper "Ecosystem Graphs: The Social Footprint of Foundation Models", which provides tables and an interactive dependency graph (not shown here).
This ecosystem graph is the most comprehensive list Sebastian has seen to date, though it can be a bit confusing because it includes many less popular LLMs. Checking the corresponding GitHub repository shows that it has not been updated for at least a month, and it is unclear whether newer models will be added.
The second resource is the beautifully drawn evolutionary tree from the recent paper "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond", which focuses on the most popular LLMs and their relationships.
Although the evolutionary tree is a very clear visualization, there are a few minor caveats. It is not clear why the tree does not start from the original transformer architecture at the bottom. Also, the open source labels are not entirely accurate; for example, LLaMA is listed as open source even though its weights are not available under an open source license (only the inference code is).
Paper address: https://arxiv.org/abs/2304.13712

The third resource is a table drawn by Sebastian's colleague Daniela Dapena, from the blog "The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs...". Although this table is smaller than the other resources, it has the advantage of including model sizes and licensing information, which makes it very useful if you plan to use any of these models in a project.

Blog address: https://lightning.ai/pages/community/community-discussions/the-ultimate-battle-of-language-models-lit-llama-vs-gpt3.5-vs-bloom-vs/

The fourth resource is the LLaMA-Cult-and-More overview table, which provides additional information on fine-tuning methods and hardware costs.
Overview table address: https://github.com/shm007g/LLaMA-Cult-and-More/blob/main/chart.md
Fine-tuning multi-modal LLMs with LLaMA-Adapter V2
Sebastian predicts that we will see many more multi-modal LLMs this month, so it is worth discussing the recently released paper "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model". First, a quick recap: what is LLaMA-Adapter? It is a parameter-efficient LLM fine-tuning technique that modifies a subset of the transformer blocks and introduces a gating mechanism to stabilize training.
Paper address: https://arxiv.org/abs/2304.15010
Using the LLaMA-Adapter method, the researchers were able to fine-tune a 7B-parameter LLaMA model on 52k instruction pairs in just 1 hour (on 8 A100 GPUs). Only the newly added 1.2M parameters (the adapter layers) were fine-tuned, while the 7B LLaMA base model remained frozen.
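To illustrate the gating idea, here is a deliberately simplified sketch (single-head attention, toy shapes) of how learnable adaption prompts can be blended into a frozen model through a zero-initialized gate, so that training starts from the unmodified pretrained model and ramps the adapter in gradually. This is my own illustration of the mechanism, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAdaptionPrompt(nn.Module):
    def __init__(self, embed_dim: int, prompt_len: int = 10):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: no effect at step 0

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, embed_dim) from the frozen transformer block
        batch = hidden_states.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Cross-attention from the token states to the adaption prompt.
        scores = hidden_states @ prompt.transpose(1, 2) / hidden_states.size(-1) ** 0.5
        prompt_out = F.softmax(scores, dim=-1) @ prompt
        # tanh(gate) starts at 0, so the frozen model's output is untouched initially.
        return hidden_states + torch.tanh(self.gate) * prompt_out
```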
The focus of LLaMA-Adapter V2 is multi-modality, i.e., building a visual instruction model that can also take image input. Although the original V1 could receive text tokens and image tokens, image input was not fully explored.
In going from LLaMA-Adapter V1 to V2, the researchers improved the adapter method through three main techniques.
LLaMA-Adapter V2 adds many more parameters (14M) than V1 (1.2M), but it is still lightweight, accounting for only 0.02% of the 65B LLaMA model's total parameters. Particularly impressive is that by fine-tuning only 14M parameters of the 65B LLaMA model, the resulting LLaMA-Adapter V2 performs on par with ChatGPT (when evaluated using GPT-4 as the judge). LLaMA-Adapter V2 also outperforms the 13B Vicuna model, which uses full fine-tuning.
Unfortunately, the LLaMA-Adapter V2 paper omits the computational performance benchmark included in the V1 paper, but we can assume that V2 is still much faster than the fully fine-tuned method.
Other open source LLMs
Large models are developing so fast that we cannot list them all here. Some of the notable open source LLMs and chatbots launched this month include Open-Assistant, Baize, StableVicuna, ColossalChat, and Mosaic's MPT. In addition, below are two particularly interesting multi-modal LLMs.
OpenFlamingo
OpenFlamingo is an open source reproduction of the Flamingo model that Google DeepMind released last year. OpenFlamingo aims to provide multi-modal image reasoning capabilities for LLMs, allowing people to interleave text and image inputs.
MiniGPT-4
MiniGPT-4 is another open source model with visual-language capabilities. It is based on a frozen BLIP-2 visual encoder and the frozen Vicuna LLM.
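The general recipe can be sketched as a single trainable projection that maps visual features from the frozen encoder into the input embedding space of the frozen LLM. The dimensions below are placeholders rather than the actual model's sizes, and the class is a toy illustration of the idea, not MiniGPT-4's code.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable component

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_visual_tokens, vision_dim) from the frozen encoder
        return self.proj(image_features)  # -> (batch, num_visual_tokens, llm_dim)

# The projected visual tokens would then be concatenated with the text token
# embeddings before being fed to the frozen LLM.
projector = VisionToLLMProjector()
dummy_features = torch.randn(1, 32, 768)
print(projector(dummy_features).shape)  # torch.Size([1, 32, 4096])
```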
NeMo Guardrails
With the emergence of these large language models, many companies are considering whether and how to deploy them, and safety concerns are particularly prominent. There are no good solutions yet, but there is at least one more promising approach: NVIDIA has open sourced a toolkit to address the LLM hallucination problem.
In a nutshell, the method works by linking a database to hard-coded prompts that must be curated manually. When a user enters a prompt, it is first matched to the most similar entry in that database; the database then returns a hard-coded prompt, which is passed to the LLM. So if one carefully tests the hard-coded prompts, one can ensure that interactions do not deviate from the allowed topics, and so on.
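A bare-bones sketch of that matching step might look like the following (this is my own illustration of the idea, not NVIDIA's actual NeMo Guardrails API): the user input is embedded, compared against a small database of curated canonical queries, and the corresponding vetted prompt is what actually reaches the LLM. `embed` is a hypothetical sentence-embedding function.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guarded_prompt(user_input: str, prompt_db: dict, embed) -> str:
    """prompt_db maps a canonical user query to a vetted, hard-coded prompt."""
    query_vec = embed(user_input)
    best_key = max(prompt_db, key=lambda k: cosine(embed(k), query_vec))
    # Only the curated prompt is forwarded to the LLM, keeping it on allowed topics.
    return prompt_db[best_key]
```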
This is an interesting but not groundbreaking approach, as it does not provide better or new capabilities for LLMs; it simply limits the ways users can interact with an LLM. Still, until researchers find alternative ways to mitigate hallucinations and undesired behaviors in LLMs, this may be a viable approach. The guardrails approach can also be combined with other alignment techniques, such as the popular reinforcement learning from human feedback (RLHF) training paradigm covered in a previous issue of Ahead of AI.
Consistency Models
To wrap up with an interesting model beyond LLMs: OpenAI has finally open sourced the code for its consistency models: https://github.com/openai/consistency_models. Consistency models are considered a viable and efficient alternative to diffusion models; you can find more information in the consistency models paper.