


For many years, language models have been the core of natural language processing (NLP) technology. Given the enormous commercial value behind these models, the technical details of the most advanced ones have not been made public.
Now, a truly, completely open source large model is here!
Researchers from the Allen Institute for Artificial Intelligence, the University of Washington, Yale University, New York University, and Carnegie Mellon University recently collaborated on an important piece of work that will become a milestone for the AI open source community.
They are open sourcing nearly all of the data and information involved in training a large model from scratch!
Paper: https://allenai.org/olmo/olmo-paper.pdf
Weights: https://huggingface.co/allenai/OLMo-7B
Code: https://github.com/allenai/OLMo
Data: https://huggingface.co/datasets/allenai/dolma
Evaluation: https://github.com/allenai/OLMo-Eval
Adaptation: https://github.com/allenai/open-instruct
Training and modeling: complete model weights, training code, training logs, ablation studies, training metrics, and inference code.
Pre-training corpus: an open source pre-training corpus of up to 3T tokens, along with the code used to produce this training data.
Model parameters: the OLMo framework provides four 7B-scale models trained with different architectures, optimizers, and training hardware, plus a 1B-scale model. All models are trained on at least 2T tokens.
The code used for model inference, the metrics recorded during training, and the training logs are also provided.
Evaluation tools: a suite of evaluation tools used during development, including more than 500 checkpoints (one every 1,000 training steps for each model) as well as the evaluation code.
All of this is licensed under Apache 2.0 (free for commercial use).
Performance Evaluation
Judging from the core evaluation results, OLMo-7B is slightly better than comparable open source models. Among the first 9 evaluations, OLMo-7B ranked in the top three on 8 of them and surpassed all other models on 2.
On many generation and reading-comprehension tasks (such as TruthfulQA), OLMo-7B surpasses Llama 2, but on some popular question-answering tasks (such as MMLU or Big-Bench Hard), it performs worse.
The first 9 tasks are the researchers' internal evaluation criteria for the pre-trained model, while the last three were added to align with the HuggingFace Open LLM Leaderboard.
The figure below shows how accuracy on the 9 core tasks evolves over training.
Except for OBQA, accuracy on almost all tasks trends upward as OLMo-7B is trained on more data.
Meanwhile, the core evaluation results for OLMo-1B and comparable models show that OLMo is on the same level as its peers.
Using the Allen Institute for AI's Paloma benchmark and the publicly released checkpoints, the researchers analyzed the relationship between the model's ability to predict language and scale factors such as the number of training tokens.
OLMo-7B is on par with mainstream models here; the lower the bits per byte (BPB), the better.
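As a rough illustration of the metric, bits per byte can be derived from a model's summed cross-entropy loss (in nats) and the byte length of the evaluated text. The sketch below is a generic calculation with made-up example numbers, not Paloma's actual evaluation code.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) over an evaluation set
    into bits per byte; lower values mean the model models the text better."""
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Hypothetical example: 1M evaluation tokens at an average loss of 2.6 nats/token,
# over 4.2M bytes of underlying UTF-8 text.
print(bits_per_byte(total_nll_nats=2.6 * 1_000_000, total_bytes=4_200_000))
```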
Through these analyses, the researchers found that models vary greatly in how efficiently they handle different data sources, and that this mainly depends on how similar the training data is to the evaluation data.
In particular, OLMo-7B performs well on data sources mainly based on Common Crawl (such as C4).
However, OLMo-7B is less efficient than other models on data sources that have little to do with web-scraped text, such as WikiText-103, M2D2 S2ORC, and M2D2 Wikipedia.
RedPajama's evaluation reflects a similar trend, possibly because only 2 of its 7 domains are derived from Common Crawl, and because Paloma gives equal weight to each domain within each data source.
Given that curated data sources such as Wikipedia and arXiv papers provide far less heterogeneous data than web-scraped text, maintaining efficiency on these language distributions will become increasingly difficult as pre-training datasets continue to expand.
OLMo Architecture
In terms of model architecture, the team built on the decoder-only Transformer architecture, adopted the SwiGLU activation function used by PaLM and Llama, introduced Rotary Position Embedding (RoPE), and improved on GPT-NeoX-20B's Byte Pair Encoding (BPE)-based tokenizer to reduce personally identifiable information in model output.
In addition, to ensure the stability of the model, the researchers did not use bias terms (the same choice as PaLM).
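For orientation, the sketch below shows the general shape of two of these choices: a SwiGLU feed-forward block without bias terms, and a rotary position embedding applied to query/key tensors. It is a simplified illustration under assumed shapes and names, not the actual OLMo implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block with no bias terms (PaLM/Llama-style)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding: rotate channel pairs by a position-dependent angle.
    x has shape (batch, seq, heads, head_dim) with an even head_dim."""
    _, seq_len, _, head_dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32, device=x.device)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=x.device).float() / head_dim))
    angles = torch.einsum("s,f->sf", pos, inv_freq)                  # (seq, head_dim/2)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```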
As shown in the table below, the researchers have released two versions, 1B and 7B, and also plan to launch a 65B version soon.
The table below compares the 7B architecture in detail with other models at a similar scale.
Pre-training data set: Dolma
Although researchers have made some progress in opening up model parameters, the openness of pre-training datasets in the open source community is still far from sufficient.
Previous pre-training data has often not been released alongside open source models (let alone closed source ones).
And the documentation about these data often lacks sufficient details that are crucial to replicating the research or fully understanding the related work.
This situation makes language model research more difficult—for example, understanding how training data affects model capabilities and its limitations.
In order to promote open research in the field of language model pre-training, researchers constructed and made public the pre-training data set Dolma.
This is a diverse, multi-source corpus containing 3 trillion tokens obtained from 7 different data sources.
On the one hand, these data sources are common in large-scale language model pre-training, and on the other hand, they are also accessible to the general public.
The table below gives an overview of the data volume from various data sources.
Dolma’s construction process includes six steps: language filtering, quality filtering, content filtering, deduplication, multi-source mixing and tokenization.
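Purely as a conceptual illustration of how these six stages fit together, the sketch below chains them per document. The filter functions passed in are hypothetical placeholders, not Dolma's actual tooling, and a real pipeline runs at corpus scale with far more sophisticated deduplication and mixing.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

def build_corpus(
    documents: Iterable[Dict[str, str]],
    is_english: Callable[[str], bool],
    quality_score: Callable[[str], float],
    is_disallowed: Callable[[str], bool],
    encode: Callable[[str], List[int]],
) -> Iterable[List[int]]:
    """Hypothetical per-document sketch of Dolma's six construction stages."""
    seen = set()
    for doc in documents:
        text = doc["text"]
        if not is_english(text):          # 1. language filtering
            continue
        if quality_score(text) < 0.5:     # 2. quality filtering (threshold is arbitrary)
            continue
        if is_disallowed(text):           # 3. content filtering
            continue
        digest = hashlib.sha1(text.encode()).hexdigest()
        if digest in seen:                # 4. deduplication (exact, document-level here)
            continue
        seen.add(digest)
        # 5. multi-source mixing: in practice, documents from the seven sources are
        #    sampled and combined according to target proportions around this point.
        yield encode(text)                # 6. tokenization
```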
During the curation and final release of Dolma, the researchers ensured that documents from each data source remained independent.
They also open sourced a set of efficient data curation tools, which help further study of Dolma, reproduction of the results, and simpler curation of pre-training corpora.
In addition, researchers have also open sourced the WIMBD tool to facilitate data set analysis.
Web data processing pipeline
Code processing pipeline
Training OLMo
Distributed training framework
The researchers used PyTorch's FSDP framework with the ZeRO optimizer strategy to train the model. This approach effectively reduces memory usage by sharding the model's weights and their corresponding optimizer states across multiple GPUs.
For models up to 7B in size, this technique lets the researchers use micro-batches of 4096 tokens per GPU for more efficient training.
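A minimal sketch of how a model might be wrapped with PyTorch FSDP for this kind of fully sharded training is shown below. The model object and launch details are placeholders, not the OLMo training code.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def wrap_model(model: torch.nn.Module) -> FSDP:
    """Shard parameters, gradients, and optimizer state across all ranks
    (ZeRO-3-style), so each GPU holds only a slice of the full model state."""
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard weights + grads + optimizer state
        device_id=torch.cuda.current_device(),
    )

# Typical launch: torchrun --nproc_per_node=8 train.py, after calling
# torch.distributed.init_process_group("nccl"); each rank then processes
# micro-batches of roughly 4096 tokens, as described above.
```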
For the OLMo-1B and 7B models, the researchers fixed a global batch size of approximately 4M tokens (2048 data instances, each instance containing a sequence of 2048 tokens).
For the OLMo-65B model currently in training, the researchers adopted a batch-size warm-up strategy: starting at about 2M tokens (1024 data instances) and doubling the batch size for every additional 100B training tokens, until it finally reaches about 16M tokens (8192 data instances).
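The batch-size warm-up just described can be written as a simple schedule. The sketch below is a direct transcription of the numbers in the text, not the actual training configuration.

```python
def global_batch_size(tokens_seen: int,
                      start: int = 1024 * 2048,        # ~2M tokens (1024 sequences of 2048)
                      cap: int = 8192 * 2048,          # ~16M tokens (8192 sequences)
                      double_every: int = 100_000_000_000) -> int:
    """Double the global batch size every 100B training tokens until the cap is reached."""
    size = start
    for _ in range(tokens_seen // double_every):
        size = min(size * 2, cap)
    return size

# ~2M tokens at the start, ~4M after 100B tokens, ..., capped at ~16M.
print(global_batch_size(0), global_batch_size(250_000_000_000))
```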
To speed up training, the researchers used mixed-precision training, implemented through FSDP's internal settings and PyTorch's amp module.
This setup ensures that certain key computation steps (such as the softmax function) are always performed in full precision to keep training stable.
Meanwhile, most other computations use the half-precision bfloat16 format to reduce memory usage and increase computational efficiency.
In this configuration, the model weights and optimizer state are kept in full precision on each GPU.
Only during forward and backward propagation, that is, when computing the model's output and its gradients, are the weights within each Transformer block temporarily cast to bfloat16.
In addition, gradient synchronization across GPUs is also performed in full precision to preserve training quality.
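In PyTorch FSDP, this kind of policy can be expressed with the MixedPrecision configuration. The sketch below follows the description above (bfloat16 compute, full-precision master weights and gradient reduction); it is an assumed illustration, not the exact OLMo configuration.

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Parameters are cast to bfloat16 only for the forward/backward compute inside each
# wrapped module, while gradient reduction across GPUs stays in float32 for stability.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # dtype used for parameters during forward/backward
    reduce_dtype=torch.float32,   # dtype used when synchronizing gradients across GPUs
    buffer_dtype=torch.float32,   # keep buffers (e.g. normalization state) in full precision
)
# Passed to FSDP via FSDP(model, mixed_precision=mp_policy, ...); the sharded master
# weights and optimizer state themselves remain in full precision on each GPU.
```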
Optimizer
The researchers used the AdamW optimizer to adjust model parameters.
Regardless of model size, the researchers gradually increase the learning rate over the first 5,000 training steps (roughly 21B tokens), a process known as learning rate warm-up.
After warm-up, the learning rate decays linearly until it reaches one-tenth of the peak learning rate.
In addition, the researchers clip the gradients of the model parameters so that their total L1 norm does not exceed 1.0.
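Put together, the optimizer setup described above (AdamW, a 5,000-step linear warm-up, linear decay to one-tenth of the peak rate, and gradient clipping) could look roughly like the sketch below. The peak learning rate and total step count are placeholder values, not OLMo's published hyperparameters.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model, peak_lr=3e-4, warmup_steps=5_000, total_steps=500_000):
    """AdamW with linear warm-up followed by linear decay to 0.1x the peak rate."""
    optimizer = AdamW(model.parameters(), lr=peak_lr)  # peak_lr/total_steps are placeholders

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                        # linear warm-up over the first 5,000 steps
            return step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max(0.1, 1.0 - 0.9 * min(progress, 1.0))  # decay to one-tenth of the peak

    return optimizer, LambdaLR(optimizer, lr_lambda)

# In the training step, gradients are clipped before optimizer.step(), e.g. with
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0), choosing the
# norm_type to match the norm described above.
```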
In the table below, the researchers compare their optimizer configuration at 7B model scale to other recent large language models using the AdamW optimizer.
Dataset
The researchers constructed their training dataset from a 2T-token sample of the open Dolma dataset.
They concatenated the tokens of each document, appended a special EOS token to the end of each document, and then split the token stream into groups of 2048 to form training samples.
These training samples are shuffled in the same way in every training run. The researchers also provide tools that let anyone recover the exact data order and composition of every training batch.
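The concatenate-and-chunk construction described above is easy to sketch. The EOS token id, the tokenized documents, and the seed below are placeholders, not the actual Dolma preprocessing code.

```python
import random
from typing import Iterable, List

def build_training_samples(tokenized_docs: Iterable[List[int]],
                           eos_id: int = 0,          # placeholder EOS token id
                           seq_len: int = 2048,
                           seed: int = 1234) -> List[List[int]]:
    """Concatenate documents (each followed by EOS), chunk the stream into
    fixed-length sequences, and shuffle the chunks with a fixed seed so the
    exact training order can be reproduced."""
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)
    samples = [stream[i:i + seq_len]
               for i in range(0, len(stream) - seq_len + 1, seq_len)]
    random.Random(seed).shuffle(samples)
    return samples
```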
All released models were trained for at least one epoch (2T tokens). Some were trained further with a second pass over the data, using a different random shuffle order.
According to previous research, the impact of reusing a small amount of data in this way is minimal.
Both NVIDIA and AMD: yes!
In order to ensure that the code base can run efficiently on both NVIDIA and AMD GPUs, the researchers selected two different clusters for model training and testing:
Using the LUMI supercomputer, researchers deployed up to 256 nodes, each equipped with 4 AMD MI250X GPUs. Each GPU has 128GB of memory and a data transfer rate of 800Gbps.
With support from MosaicML (Databricks), the researchers used 27 nodes, each equipped with 8 NVIDIA A100 GPUs; each GPU has 40GB of memory and an 800Gbps data transfer rate.
Although the researchers fine-tuned the batch size to improve training efficiency, after evaluation at 2T tokens there was almost no difference in performance between the two clusters.
Training energy consumption
Summary
Unlike most previous models, which release only model weights and inference code, the researchers open sourced everything about OLMo: the training data, training and evaluation code, training logs, experimental results, important findings, Weights & Biases records, and more.
Additionally, the team is studying how to improve OLMo through instruction tuning and different types of reinforcement learning from human feedback (RLHF). That fine-tuning code and data, as well as the fine-tuned models, will also be open sourced.
The researchers are committed to continuously supporting and developing OLMo and its framework, advancing open language models (LMs), and supporting the open research community. To this end, they plan to introduce more model sizes, modalities, datasets, safety measures, and evaluation methods to enrich the OLMo family.
They hope that through continued thorough open source work in the future, they will strengthen the power of the open source research community and trigger a new wave of innovation.
Team introduction
Yizhong Wang (王义中)
Yizhong Wang is a PhD student at the University of Washington's Paul G. Allen School of Computer Science and Engineering, advised by Hannaneh Hajishirzi and Noah Smith. He is also a part-time research intern at the Allen Institute for Artificial Intelligence.
Previously, he interned at Meta AI, Microsoft Research, and Baidu NLP. He received a master's degree from Peking University and a bachelor's degree from Shanghai Jiao Tong University.
His research focuses on natural language processing, machine learning, and large language models (LLMs):
- Adaptability of LLM: How to more efficiently build and evaluate models that can follow instructions? What factors should we consider when fine-tuning these models, and how do they affect the generalizability of the model? Which types of supervision are both effective and scalable?
- Continuous learning for LLM: Where is the boundary between pre-training and fine-tuning? What architectures and learning strategies can allow LLM to continue to evolve after pre-training? How does existing knowledge within the model interact with newly learned knowledge?
- Application of large-scale synthetic data: Today, when generative models rapidly generate data, what impact does this data have on our model development and even the entire Internet and society? How do we ensure we can generate diverse and high-quality data at scale? Can we distinguish this data from human-generated data?
Yuling Gu
Yuling Gu is a researcher on the Aristo team at the Allen Institute for Artificial Intelligence (AI2).
In 2020, she received her bachelor’s degree from New York University (NYU). In addition to her computer science major, she also minored in an interdisciplinary major, Language and Mind, which combines linguistics, psychology, and philosophy. She subsequently earned a master's degree from the University of Washington (UW).
She is full of enthusiasm for the integration and application of machine learning technology and cognitive science theory.