Non-Transformer architecture stands up! The first pure attention-free large model, surpassing the open source giant Llama 3.1

WBOY
2024-08-13 16:37:46
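A large model built on the Mamba architecture has once again challenged the Transformer.

Will the Mamba architecture finally "stand up" this time? Since its initial release in December 2023, Mamba has emerged as a serious competitor to the Transformer.

Since then, models using the Mamba architecture have kept appearing, such as Codestral Mamba 7B, the first open-source large model based on the Mamba architecture released by Mistral.

Today, the Technology Innovation Institute (TII) in Abu Dhabi released a new open-source Mamba model, Falcon Mamba 7B.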

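First, a summary of the highlights of Falcon Mamba 7B: it can process sequences of arbitrary length without increasing memory storage, and it can run on a single 24GB A10 GPU.

Falcon Mamba 7B is now available to view and use on Hugging Face. This causal decoder-only model uses the novel Mamba State Space Language Model (SSLM) architecture to handle a variety of text generation tasks.

According to the results, Falcon Mamba 7B outperforms leading models of comparable size on several benchmarks, including Meta's Llama 3 8B and Llama 3.1 8B and Mistral 7B.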

Falcon Mamba 7B comes in four variants: a base version, an instruction fine-tuned version, a 4-bit version, and an instruction fine-tuned 4-bit version.

As an open-source model, Falcon Mamba 7B is released under the "Falcon License 2.0", a license based on Apache 2.0, and supports both research and application use.

Hugging Face address: https://huggingface.co/tiiuae/falcon-mamba-7b
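For readers who want to try the model from that address, the following is a minimal sketch of loading it with the Hugging Face `transformers` library. It assumes a recent `transformers` release that includes Falcon Mamba support, an installed PyTorch, and enough memory for a 7B model. The helper name `generate_text` and the default `max_new_tokens` value are illustrative choices, not something the article specifies.

```python
MODEL_ID = "tiiuae/falcon-mamba-7b"  # repository name taken from the address above


def generate_text(prompt, max_new_tokens=30):
    """Load the model and continue `prompt` (downloads the weights on first use)."""
    # Imported lazily so this sketch can be read or imported without the dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the prompt into PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The Mamba architecture is")` would return the prompt followed by the generated continuation. Loading in full precision on CPU is slow; in practice one would likely pass a lower-precision dtype and place the model on a GPU.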

Falcon Mamba 7B is the fourth model open-sourced by TII, following Falcon 180B, Falcon 40B, and Falcon 2, and it is TII's first model based on the Mamba SSLM architecture.

The first general-purpose large-scale pure Mamba model

For a long time, Transformer-based models have dominated generative AI. However, researchers have noticed that the Transformer architecture can struggle to process long text.

Essentially, the attention mechanism in a Transformer understands context by comparing each word (or token) with every other word in the text, so handling a growing context window requires more computing power and memory.
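To make the scaling concrete, here is a small sketch that counts how many token-to-token comparisons attention performs for a given context length. The function name and the `causal` flag are illustrative; the causal count assumes each token attends to itself and all earlier tokens, as in a decoder-only model.

```python
def attention_comparisons(n_tokens, causal=True):
    """Number of query-key comparisons attention makes over n_tokens tokens."""
    if causal:
        # Token i compares against tokens 1..i, so the total is 1 + 2 + ... + n.
        return n_tokens * (n_tokens + 1) // 2
    # Without a causal mask, every token is compared with every token.
    return n_tokens * n_tokens


for n in (2_048, 8_192):
    print(n, attention_comparisons(n))
```

Growing the context from 2,048 to 8,192 tokens (4x) multiplies the unmasked comparison count by 16, which is the quadratic growth the paragraph above describes.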

If computing resources are not scaled up accordingly, inference slows down and text beyond a certain length cannot be processed. To overcome these obstacles, the State Space Language Model (SSLM) architecture, which works by continuously updating a state as it processes words, has emerged as a promising alternative and is being deployed by many institutions, including TII.

Falcon Mamba 7B uses the Mamba SSM architecture originally proposed in a December 2023 paper by researchers at Carnegie Mellon University and Princeton University.

The architecture uses a selection mechanism that allows the model to dynamically adjust its parameters based on the input. In this way, the model can focus on or ignore specific inputs, similar to how the attention mechanism works in Transformer, while providing the ability to process long sequences of text (such as entire books) without requiring additional memory or computing resources.
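The selection mechanism can be illustrated with a heavily simplified, single-channel recurrence. This is a sketch under stated assumptions, not the Mamba implementation: the weights `w_b`, `w_c`, and `w_delta` are hypothetical scalar stand-ins for the learned projections, and the discretization (exponential for the decay term, a simple product for the input term) is one common simplification.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a, w_b, w_c, w_delta, b_delta=0.0):
    """Run a single-channel selective state-space recurrence over xs.

    a        : negative decay rates, one per state dimension
    w_b, w_c : weights that make the input/output maps B_t, C_t depend on x_t
    w_delta  : weight that makes the step size delta_t depend on x_t
    """
    h = [0.0] * len(a)  # fixed-size state, independent of len(xs)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size
        new_h = []
        for i in range(len(a)):
            keep = math.exp(delta * a[i])   # how much of the old state survives
            write = delta * (w_b[i] * x)    # how strongly this input is written
            new_h.append(keep * h[i] + write * x)
        h = new_h
        # Read out with an input-dependent C_t = w_c * x.
        ys.append(sum(w_c[i] * x * h[i] for i in range(len(a))))
    return ys, h


ys, state = selective_scan(
    [1.0, 0.0, -2.0, 0.5],
    a=[-1.0, -0.5], w_b=[1.0, 0.5], w_c=[0.3, 0.7], w_delta=1.0,
)
print(len(ys), len(state))
```

Because `delta`, `write`, and the readout all depend on the current input, the recurrence can retain or overwrite state selectively, while the state `h` keeps the same size no matter how long the sequence is.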

TII noted that this approach makes the model suitable for tasks such as enterprise-level machine translation, text summarization, computer vision and audio processing tasks, and estimation and prediction.

Training Data

Falcon Mamba 7B was trained on up to 5,500 GT (gigatokens) of data, drawn mainly from the RefinedWeb dataset, with added high-quality technical data, code data, and mathematical data from public sources. All data was tokenized with the Falcon-7B/11B tokenizer.

Like other Falcon series models, Falcon Mamba 7B was trained with a multi-stage strategy in which the context length was increased from 2,048 to 8,192. In addition, inspired by the concept of curriculum learning, TII carefully selected the data mixture throughout training, taking the diversity and complexity of the data into account.

In the final training stage, TII used a small set of high-quality curated data (i.e., samples from FineWeb-Edu) to further improve performance.

Training process, hyperparameters

Most of the training of Falcon Mamba 7B was done on 256 H100 80GB GPUs, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO. The figure below shows the model hyperparameter details, including precision, optimizer, maximum learning rate, weight decay, and batch size.

[Figure: Falcon Mamba 7B training hyperparameters]

Specifically, Falcon Mamba 7B was trained with the AdamW optimizer and a WSD (warmup-stable-decay) learning rate schedule. During the first 50 GT of training, the batch size was increased from b_min=128 to b_max=2048.

In the stable phase, TII used the maximum learning rate η_max=6.4×10^−4, then decayed it to its minimum value with an exponential schedule over 500 GT. TII also applied BatchScaling during the ramp-up phase, rescaling the learning rate η so that the Adam noise temperature remained constant.
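The schedule described above can be sketched as follows. Several details are assumptions made for illustration and are not stated in this article: the batch size ramps linearly, the noise temperature is taken to be η/√b, the decay occupies the last 500 GT of a 5,500 GT run, and the minimum learning rate is set to η_max/256.

```python
import math

ETA_MAX = 6.4e-4           # maximum learning rate (from the article)
B_MIN, B_MAX = 128, 2048   # batch-size ramp-up endpoints (from the article)
RAMPUP_GT = 50.0           # ramp-up length in gigatokens (from the article)
DECAY_GT = 500.0           # decay length in gigatokens (from the article)
TOTAL_GT = 5500.0          # ASSUMPTION: decay is the last 500 GT of 5,500 GT
ETA_MIN = ETA_MAX / 256    # ASSUMPTION: illustrative minimum learning rate


def batch_size(tokens_gt):
    """Batch size after tokens_gt gigatokens (ASSUMPTION: linear ramp)."""
    if tokens_gt >= RAMPUP_GT:
        return B_MAX
    frac = max(tokens_gt, 0.0) / RAMPUP_GT
    return int(B_MIN + frac * (B_MAX - B_MIN))


def learning_rate(tokens_gt):
    """Learning rate after tokens_gt gigatokens of training."""
    if tokens_gt < RAMPUP_GT:
        # Batch scaling: keep eta / sqrt(b) fixed while the batch size grows.
        return ETA_MAX * math.sqrt(batch_size(tokens_gt) / B_MAX)
    decay_start = TOTAL_GT - DECAY_GT
    if tokens_gt <= decay_start:
        return ETA_MAX  # stable phase
    # Exponential interpolation from ETA_MAX down to ETA_MIN.
    progress = min((tokens_gt - decay_start) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** progress


for t in (0, 25, 50, 3000, 5250, 5500):
    print(t, batch_size(t), learning_rate(t))
```

Under these assumptions the ratio η/√b is the same at every point of the ramp-up, which is what keeping the noise temperature constant means in this sketch.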

The entire model training took about two months.

Model Evaluation

To see how Falcon Mamba 7B compares with leading Transformer models in its size class, the researchers ran a test to determine the maximum context length each model can handle on a single 24GB A10 GPU.

The results show that Falcon Mamba can fit larger sequences than current Transformer models and can, in theory, fit unlimited context lengths.

Next, the researchers measured generation throughput with a batch size of 1 on an H100 GPU. As the figure below shows, Falcon Mamba generates all tokens at constant throughput without any increase in CUDA peak memory. For Transformer models, peak memory grows and generation slows as the number of generated tokens increases.

[Figure: generation throughput and CUDA peak memory as the number of generated tokens grows]
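The memory behaviour can be approximated with back-of-the-envelope arithmetic. The sketch below compares a Transformer's key-value cache, which stores keys and values for every past token, with a fixed-size recurrent state. All dimensions (layer counts, head sizes, state sizes, 2-byte values) are hypothetical round numbers chosen for illustration; they are not the published configurations of any model named in this article.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV-cache size for a Transformer (hypothetical dimensions)."""
    # Keys and values (factor 2) are kept for every layer and every past token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


def ssm_state_bytes(n_tokens, n_layers=64, d_inner=8192, d_state=16,
                    bytes_per_value=2):
    """Approximate recurrent-state size for an SSM (hypothetical dimensions)."""
    # The state has a fixed shape; n_tokens is accepted only to show that it
    # plays no role in the result.
    return n_layers * d_inner * d_state * bytes_per_value


for n in (1_024, 8_192, 65_536):
    print(n, kv_cache_bytes(n) / 2**20, ssm_state_bytes(n) / 2**20)  # MiB
```

With these illustrative numbers, the cache grows linearly with the number of tokens while the recurrent state stays the same size, which matches the flat memory curve reported for Falcon Mamba.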

On standard industry benchmarks, too, the new model performs better than or close to popular Transformer models as well as pure and hybrid state-space models.

For example, on the ARC, TruthfulQA, and GSM8K benchmarks, Falcon Mamba 7B scored 62.03%, 53.42%, and 52.54% respectively, surpassing Llama 3 8B, Llama 3.1 8B, Gemma 7B, and Mistral 7B. However, Falcon Mamba 7B lags far behind these models on the MMLU and HellaSwag benchmarks.

TII principal investigator Hakim Hacid said in a statement that the launch of Falcon Mamba 7B represents a major step forward for the institute, inspiring new perspectives and furthering the exploration of intelligent systems. At TII, researchers are pushing the boundaries of both SSLM and Transformer models to spur further innovation in generative AI.

To date, TII's Falcon family of language models has been downloaded more than 45 million times, making it one of the most successful LLM releases from the UAE.

The Falcon Mamba 7B paper will be released soon.

Reference link:
https://huggingface.co/blog/falconmamba
https://venturebeat.com/ai/falcon-mamba-7bs-powerful-new-ai-architecture-offers-alternative-to-transformer-models/
