With 100,000 US dollars + 26 days, a low-cost LLM with 100 billion parameters was born

Large language models (LLMs), including decoder-only architectures (such as the GPT and LLaMA series), encoder-only architectures (such as BERT), and encoder-decoder architectures (such as T5), along with their variants, have achieved remarkable success and are widely used in a variety of language processing and multi-modal tasks.

Despite this success, training an LLM is so expensive that only a few companies can afford it. Moreover, current trends indicate that even larger training datasets will be used in the future, which will further increase the development cost of large models. For example, LLaMA-1 was trained on 1-1.4 trillion tokens, while Llama 2 reached 2 trillion.

Another key challenge in developing an LLM is evaluation. Mainstream evaluation methods fall into two categories: knowledge evaluation (e.g., MMLU and C-Eval) and NLP task evaluation. These methods may not truly reflect a model's capabilities because of possible data leakage, i.e., parts of the evaluation dataset may have been seen during training. Furthermore, knowledge-oriented assessments may not be adequate for measuring intelligence. A fairer and more objective approach is to measure the intelligence quotient (IQ) of the LLM, that is, its ability to generalize to conditions and contexts not seen in the training data.

Growth strategy. To address the training cost problem, several institutions, including the Beijing Academy of Artificial Intelligence (BAAI, Zhiyuan) and the Institute of Computing Technology of the Chinese Academy of Sciences, have recently made a new attempt: training a 100-billion-parameter-level LLM through a growth strategy for the first time. Growth means that the number of parameters is not fixed during training but expands from a smaller model to a larger one.


  • Paper: https://arxiv.org/pdf/2309.03852.pdf

  • Model link: https://huggingface.co/CofeAI/FLM-101B

Figure 1 shows three typical growth-strategy scenarios. Since the FLOPs of an LLM are roughly proportional to its parameter count, the area between the parameter-count curve and the X-axis represents the computational cost of training.


Figure 1(a) shows the standard training strategy without model growth; 1(b) is a linear growth strategy, which saves 50% of the cost; 1(c) is a moderate growth strategy, which saves less than 50%; 1(d) is an aggressive growth strategy, which saves more than 50%. This analysis shows that to save as much compute as possible, an aggressive growth strategy should be adopted.
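Because training FLOPs scale roughly with parameter count, the relative cost of a growth schedule can be estimated as the time-weighted average model size. A minimal sketch with illustrative numbers (not the paper's actual schedules):

```python
# Sketch: relative training cost of growth schedules, assuming
# FLOPs are proportional to parameter count. All fractions below
# are illustrative, not taken from the paper.

def relative_cost(schedule):
    """Time-weighted average model size, relative to training the
    full-size model for the whole run (cost 1.0).
    schedule: list of (fraction_of_training_time, fraction_of_full_params).
    """
    return sum(t * p for t, p in schedule)

# (a) no growth: full model for the whole run
no_growth = relative_cost([(1.0, 1.0)])
# (b) linear growth from ~0 to full size: a piecewise approximation
# of the triangle under the growth line
linear = relative_cost([(0.5, 0.25), (0.5, 0.75)])
# (d) aggressive growth: spend most of the run at small sizes
aggressive = relative_cost([(0.5, 0.16), (0.3, 0.5), (0.2, 1.0)])
```

With these toy numbers, the aggressive schedule costs about 43% of a fixed-size run, matching the intuition that spending most of training small saves the most compute.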

The design of this study's growth operators is inspired by MSG from the paper "2x faster language model pre-training via masked structural growth", a complete set of operations covering all four growth dimensions of the Transformer architecture. More importantly, MSG grows the model while strictly preserving its function. Therefore, although a small model learns quickly in a smaller parameter search space, its knowledge can be inherited by the subsequent larger models. This makes it possible for growth strategies to achieve better performance at the same or lower computational cost.
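The key property of function-preserving growth can be illustrated on a two-layer network: new hidden units receive random incoming weights but zero (masked) outgoing weights, so the output is unchanged at the moment of growth. This is a simplified sketch in the spirit of MSG, not the authors' implementation:

```python
# Sketch: function-preserving width growth in the spirit of MSG
# (masked structural growth). NOT the authors' implementation.
# New hidden units get random incoming weights but zero (masked)
# outgoing weights, so the network's output is unchanged at the
# moment of growth; training can then gradually unmask them.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w1, w2):
    h = np.maximum(w1 @ x, 0.0)  # hidden layer with ReLU
    return w2 @ h

def grow_hidden(w1, w2, new_units):
    # Random incoming weights for the new hidden units.
    w1_new = np.vstack([w1, rng.normal(0, 0.02, (new_units, w1.shape[1]))])
    # Zero outgoing weights: the new units' contribution is masked out.
    w2_new = np.hstack([w2, np.zeros((w2.shape[0], new_units))])
    return w1_new, w2_new

x = rng.normal(size=4)
w1 = rng.normal(0, 0.5, (8, 4))
w2 = rng.normal(0, 0.5, (2, 8))
y_small = forward(x, w1, w2)
w1g, w2g = grow_hidden(w1, w2, 8)   # grow from 8 to 16 hidden units
y_grown = forward(x, w1g, w2g)
assert np.allclose(y_small, y_grown)  # function preserved exactly
```

Because the grown model computes exactly the same function, everything the small model learned is carried over, which is what allows the 16B, 51B, and 101B stages to build on each other.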

Open-source FLM-101B model. Researchers at the Zhiyuan Research Institute trained a 101-billion-parameter LLM through gradual growth, and they stated that they would release the model as open source. The architecture of this model is an evolution of FreeLM, so the researchers named it FLM-101B, where F stands for Free.

The FreeLM framework has two pre-training objectives, guided by language signals and teacher signals respectively. In this new study, the two objectives are unified into a common language-modeling paradigm.

IQ assessment benchmark. In addition to the low-cost training paradigm, the team made another contribution: a systematic set of benchmarks for assessing the intelligence quotient (IQ) of LLMs.

Previous research has shown that while perplexity (PPL) can reflect the quality of generated text to some extent, it is not reliable. Meanwhile, the scale of LLM training data is so large that it is hard to tell whether a model is merely reciting knowledge from its training data or actually achieving human-like reasoning, analysis, and generalization, the capabilities this study defines as the foundation of IQ. Commonly used benchmarks (MMLU for English and C-Eval for Chinese) are clearly knowledge-oriented and cannot fully reflect a model's intelligence.

As a sanity check, the team ran a test: five computer-science researchers from world-renowned universities took an exam using C-Eval's chemistry questions. Their accuracy turned out to be almost no better than random guessing, because most of the volunteers had forgotten what they once learned about chemistry. Therefore, benchmarks that emphasize specialized knowledge are not adequate measures of a model's IQ.

To measure the IQ of an LLM comprehensively, the team developed an IQ assessment benchmark covering four key aspects: symbolic mapping, rule understanding, pattern mining, and anti-interference.
  • Language is symbolic in nature. Some studies have used symbols rather than category labels to assess the intelligence of LLMs. Similarly, the team used a symbolic mapping approach to test an LLM's ability to generalize to unseen contexts.

  • An important ability of human intelligence is to understand given rules and act accordingly; this kind of test is widely used at all levels of human testing. Therefore, rule understanding is the second test here.

  • Pattern mining, which involves both induction and deduction, is an important part of intelligence; it has played a crucial role throughout the history of science, and competition problems often require it. For these reasons, the team chose pattern mining as the third evaluation dimension.

  • The last, and very important, indicator is anti-interference, also one of the core capabilities of intelligence. Studies have pointed out that both language and images are easily disturbed by noise. With this in mind, the team used noise resistance as the final evaluation metric.
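To make the first aspect concrete, here is a hypothetical illustration of symbolic mapping: category labels are replaced with arbitrary symbols the model cannot have memorized, so it must generalize from the in-context examples alone. The labels, symbols, and sentences below are invented for illustration and are not the paper's actual data:

```python
# Sketch of the symbol-mapping idea: replace familiar labels with
# arbitrary symbols so the model cannot rely on memorized label
# names. Hypothetical example, not the paper's benchmark data.

LABELS = {"positive": "<&^>", "negative": "<#@>"}  # arbitrary symbols

def to_symbolic(example):
    text, label = example
    return text, LABELS[label]

few_shot = [
    ("The movie was wonderful.", "positive"),
    ("A dull, lifeless film.", "negative"),
]

# Build a few-shot prompt ending with an unlabeled query; a model
# that truly generalizes should emit the correct symbol.
prompt = "\n".join(f"{t} => {s}" for t, s in map(to_symbolic, few_shot))
prompt += "\nAn instant classic. => "
```

Because the symbols carry no meaning of their own, a correct answer can only come from in-context reasoning, not from recalling label names seen during training.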

Of course, these four indicators are by no means the final word in LLM IQ assessment, but they can serve as a starting point to stimulate subsequent research and are expected to eventually lead to a comprehensive LLM IQ assessment framework.

The main contributions of this study include:
  • The researchers state that this is the first research attempt to train an LLM of more than 100 billion parameters from scratch using a growth strategy. It is also currently the lowest-cost 100-billion-parameter model, costing only about 100,000 US dollars.

  • By improving the FreeLM training objectives, using a promising hyperparameter search method, and applying function-preserving growth, this study addresses the issue of training instability. The researchers believe these methods can also benefit the broader research community.

  • The researchers also experimentally compared the new model with previously strong models, using both knowledge-oriented benchmarks and the newly proposed systematic IQ assessment benchmark. The results show that FLM-101B is competitive and robust.

  • The team will release model checkpoints, code, and related tools to promote research on bilingual Chinese-English LLMs at the 100-billion-parameter scale.

FLM-101B Design Overview

Architecturally, FLM-101B uses FreeLM as its backbone network and integrates xPos. In terms of model size, thanks to the new growth strategy, the researchers obtain models of three sizes, 16B, 51B, and 101B, in a single training run.

As for the pre-training settings, FLM-101B inherits the training strategy of FreeLM.

For the growth strategy, instead of the common practice of independently training models of different sizes, the team sequentially trained three models with 16B, 51B, and 101B parameters, each inheriting the knowledge of the smaller model before it.

For training hardware, a cluster of 24 DGX-A800 GPU (8×80G) servers was used; the training of FLM-101B took less than 26 days. See Tables 1 and 2 below for the multi-parallel strategy and model configurations.

(Table 1: multi-parallel training strategy; Table 2: model configurations)

Training Stability of FLM-101B

To address instability problems such as loss divergence and gradient explosion, the researchers propose a promising solution, briefly described below.

Loss prediction. The newly proposed method to achieve training stability is as follows:

First, determine the distribution of the data before starting FLM-16B training.

Next, perform a grid search over three hyperparameters: the learning rate, the initialization standard deviation, and the softmax temperature of the output layer. The grid search is run on a surrogate model with a hidden-state dimension (model width) of 256, 2 attention heads, and 40 million parameters; all other structural hyperparameters and training data of this surrogate are the same as FLM-16B's. Using data parallelism on 6 nodes, one grid-search run took 24.6 hours, which translates to roughly 6 hours on a 24-node configuration.

Through this grid search, the researchers found the optimal hyperparameters: learning rate = 4e-4, standard deviation = 1.6e-2, softmax temperature = 2.0.
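The search procedure itself can be sketched as follows, with `train_proxy()` standing in for an actual training run of the 40M-parameter surrogate. Its toy loss is constructed to be minimal at the reported optimum; this is an assumption for illustration, not the paper's code:

```python
# Hypothetical sketch of the proxy-model grid search described above.
# train_proxy() stands in for training the 40M-parameter surrogate
# and returning its final loss; here it is a toy analytic stand-in
# whose minimum is placed at the values the paper reports
# (lr=4e-4, init std=1.6e-2, softmax temperature=2.0).
import itertools

def train_proxy(lr, init_std, softmax_temp):
    return ((lr - 4e-4) / 4e-4) ** 2 \
        + ((init_std - 1.6e-2) / 1.6e-2) ** 2 \
        + (softmax_temp - 2.0) ** 2

grid = itertools.product(
    [1e-4, 4e-4, 1e-3],      # learning rate
    [8e-3, 1.6e-2, 3.2e-2],  # initialization standard deviation
    [1.0, 2.0, 4.0],         # output-layer softmax temperature
)
# Pick the configuration with the lowest surrogate loss.
best = min(grid, key=lambda cfg: train_proxy(*cfg))
```

In practice each grid point is a full surrogate training run, which is why the search is done at 40M parameters rather than 16B.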

They then transfer these hyperparameters to the larger models via µP, achieving a seamless training experience that avoids instability. Combined with MSG, FLM-51B and FLM-101B show no divergence problems after growth.
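µP transfer can be illustrated with its two most commonly cited scaling rules: for hidden weight matrices, the learning rate shrinks like 1/width and the initialization standard deviation shrinks like 1/sqrt(width). This is a simplification of the full µP parameterization (which treats different parameter types differently), and the target width below is a hypothetical value, not FLM-16B's actual hidden size:

```python
# Illustrative µP-style transfer of the proxy hyperparameters to a
# wider model. Simplified: real µP distinguishes parameter types;
# shown here are only the common 1/width learning-rate and
# 1/sqrt(width) init-std scalings for hidden weight matrices.
PROXY_WIDTH = 256      # surrogate model width used in the grid search
PROXY_LR = 4e-4        # optimum found on the surrogate
PROXY_STD = 1.6e-2

def transfer(target_width):
    ratio = target_width / PROXY_WIDTH
    return {
        "lr": PROXY_LR / ratio,                # lr scales ~ 1/width
        "init_std": PROXY_STD / ratio ** 0.5,  # std scales ~ 1/sqrt(width)
    }

# Hypothetical target hidden size for the full model:
hp_large = transfer(4096)
```

The point of µP is precisely this: hyperparameters tuned cheaply at width 256 remain near-optimal at the full width, so no expensive re-tuning is needed on the large model.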

Figure 2 shows the complete training loss curve.


Mixed precision via Bfloat16. Mixed precision is used to save memory and time during training; here the team chose Bfloat16.
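Bfloat16 keeps float32's 8-bit exponent (so it has the same dynamic range and suffers fewer overflow problems than float16) but only 7 explicit mantissa bits; a float32 value can be converted by rounding to its top 16 bits. A minimal sketch of the conversion (ignoring NaN handling):

```python
# Sketch: converting float32 to bfloat16 by round-to-nearest-even
# truncation of the low 16 bits. Bfloat16 shares float32's 8-bit
# exponent, so the range is the same; only precision is reduced
# (about 3 decimal digits). NaN handling is omitted for brevity.
import struct

def to_bfloat16(x: float) -> float:
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y

# Same dynamic range as float32: 3.0e38 stays finite...
assert to_bfloat16(3.0e38) != float("inf")
# ...but fine-grained differences are rounded away:
assert to_bfloat16(1.001) == to_bfloat16(1.0)
```

The wide exponent range is what makes bfloat16 attractive for LLM training: gradients and activations rarely overflow, so loss scaling tricks needed for float16 can often be avoided.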
Benchmark Evaluation

Table 3 compares the performance of FLM-101B with strong baseline models (the LLAMA series and GLM-130B).


The researchers state that these results indicate FLM-101B has no particular advantage in factual knowledge, and that its performance would continue to improve if more training data were available.

Table 4 shows the results of eFLM-16B versus the baseline models on the professional-knowledge evaluation.


It turns out that scores on datasets emphasizing specialized knowledge do not reflect an LLM's level of intelligence, since specific training data can make an overwhelming contribution to them.

Table 5 shows the performance of each stage of the FLM model.


As expected, FLM's performance improves as the model grows. FLM-101B performed best on almost every task, which means that each time the model grows, it inherits the knowledge from the previous stage.
IQ experiment

In the experiments, to evaluate the IQ of LLMs more systematically, the team used existing IQ-related datasets with some necessary modifications and also generated some new synthetic data.

Specifically, their proposed IQ assessment considers four aspects: symbolic mapping, rule understanding, pattern mining, and anti-interference. These tasks share one key property: they all rely on reasoning and generalization in new contexts.

The following tables show the results of the IQ experiment:


These tables show that on the four IQ evaluation benchmarks, FLM-101B achieves results comparable to GPT-3 and better than GLM-130B, at a far lower computational cost.

Beyond the influence of the training data, the researchers speculate that this advantage may come from the small early-stage model refining a smaller search space; this advantage then continues to pay off, with enhanced generalization, as the model grows larger and wider.


Statement
This article is reproduced from 机器之心 (Jiqizhixin).