What is Intel® Extension for Transformers?
Intel® Extension for Transformers[1] is an innovative toolkit from Intel that significantly accelerates Transformer-based Large Language Models (LLMs) on Intel® architecture platforms, especially 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids[2], SPR). Its main features include:
- Provides users with a seamless model compression experience by extending the Hugging Face transformers API[3] and leveraging Intel® Neural Compressor[4];
- Provides an LLM inference runtime with low-bit quantization kernels (NeurIPS 2023: Efficient LLM Inference on CPUs[5]), supporting common LLMs such as Falcon, LLaMA, MPT, Llama 2, BLOOM, OPT, ChatGLM2, GPT-J-6B, Baichuan-13B-Base, Baichuan2-13B-Base, Qwen-7B, Qwen-14B, and Dolly-v2-3B[6];
- Provides an advanced compression-aware runtime[7] (NeurIPS 2022: Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM; NeurIPS 2021: Prune Once for All: Sparse Pre-Trained Language Models).
This article focuses on the LLM inference runtime (hereafter "LLM Runtime"), on how to use the Transformers-based API to achieve more efficient LLM inference on Intel® Xeon® Scalable processors, and on how to address the issues LLMs face in chat applications.
LLM Runtime

The LLM Runtime[8] provided by Intel® Extension for Transformers is a lightweight yet efficient LLM inference runtime, inspired by GGML[9] and compatible with llama.cpp[10]. It has the following characteristics:
- Kernels built in and optimized for the AI acceleration technologies of Intel® Xeon® CPUs (such as AMX and VNNI) as well as the AVX512F and AVX2 instruction sets;
- More quantization options, such as different granularities (per channel or per group) and different group sizes (e.g., 32/128);
- Better KV-cache access and memory allocation strategies;
- Tensor parallelism, which helps distributed inference on multi-socket systems.
The sample code below shows how to run INT4 inference through the Transformers-based API:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

The default setting stores weights as 4 bits and performs computation in 8 bits. Different combinations of computation data type (dtype) and weight data type are also supported, and users can modify the settings as needed. Sample code showing how to use this feature is provided below:
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
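For illustration, the same configuration object can also express the other quantization options mentioned earlier, such as per-group quantization with a group size of 32 or 128. This is only a sketch: the `group_size` parameter name and the chosen values below are assumptions that may differ between versions of the library, so please check the installed release:

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v3-1"

# Assumed parameter names (e.g. group_size); verify against your installed version.
woq_int4_g128 = WeightOnlyQuantConfig(
    compute_dtype="int8",   # data type used for computation
    weight_dtype="int4",    # data type used to store the weights
    group_size=128,         # per-group quantization, 128 weights per group
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_int4_g128)
# model.generate(...) can then be called exactly as in the examples above
```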
Performance Test

After continuous effort, the INT4 performance of the optimization scheme described above has improved significantly. This article compares its performance with llama.cpp on a system equipped with a 4th Gen Intel® Xeon® Scalable processor, 256 GB total memory (16 x 16 GB DDR5 4800 MT/s), BIOS 3A14.TEL2P1, microcode 0x2b0001b0, and CentOS Stream 8. The inference performance test results are shown in the table below, where the input size is 32, the output size is 32, and the beam is 1.
△ Table 1. Comparison of inference performance between LLM Runtime and llama.cpp (input size = 32, output size = 32, beam = 1)
The inference performance test results for an input size of 1024, an output size of 32, and a beam of 1 are detailed in the table below:
△ Table 2. Comparison of inference performance between LLM Runtime and llama.cpp (input size = 1024, output size = 32, beam = 1)
According to Table 2 above: compared with llama.cpp running on the same 4th Gen Intel® Xeon® Scalable processor, LLM Runtime significantly reduces latency for both the first token and subsequent tokens; first-token and next-token inference speeds are improved by up to 40x[a] (Baichuan-13B, input size 1024) and 2.68x[b] (MPT-7B, input size 1024), respectively. The llama.cpp tests use its default code base[10].
Combining the results in Table 1 and Table 2: compared with llama.cpp running on the same 4th Gen Intel® Xeon® Scalable processor, LLM Runtime significantly improves the overall performance of many common LLMs, achieving a 3.58x to 21.5x improvement when the input size is 1024, and a 1.76x to 3.43x improvement when the input size is 32[c].
Accuracy Test
Intel® Extension for Transformers can leverage quantization methods in Intel® Neural Compressor such as SignRound[11], RTN, and GPTQ[12], and we verified INT4 inference accuracy with the lambada_openai, piqa, winogrande, and hellaswag datasets. The table below compares the averages of the test results with FP32 accuracy.
△ Table 3. Accuracy comparison between INT4 and FP32
As Table 3 shows, the accuracy loss of INT4 inference performed with LLM Runtime is so small for these models that it can almost be ignored. We verified many models, but only some are listed here due to space limitations. For more information or details, please visit: https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176.
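For reference, a rough sketch of how one of these quantization algorithms might be selected when loading a model is shown below. The `algorithm` parameter name and its accepted values are assumptions that depend on the library version, so treat this as illustrative rather than authoritative:

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v3-1"

# Assumption: the weight-only quantization algorithm (e.g. "RTN" or "GPTQ") is
# chosen via an "algorithm" field; check the installed release for exact names.
woq_config = WeightOnlyQuantConfig(
    compute_dtype="int8",
    weight_dtype="int4",
    algorithm="RTN",
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
```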
More advanced features: meeting the needs of LLM applications in more scenarios
LLM Runtime[8] also offers tensor parallelism across dual-socket CPUs, making it one of the first products with this capability. Dual-node support will be added in the future.
However, LLM Runtime's advantages are not limited to better performance and accuracy. We have also invested significant effort in enhancing its capabilities for chat applications and in solving the following problems that LLMs may encounter in chat scenarios:
- Dialogue concerns more than just LLM inference; dialogue history is also very useful.
- Limited output length: LLMs are mainly pre-trained with a limited sequence length, so accuracy degrades when the sequence length exceeds the attention window size used during pre-training.
- Inefficiency: during decoding, Transformer-based LLMs store the key-value (KV) states of all previously generated tokens, resulting in excessive memory usage and increased decoding latency.
For the first issue, LLM Runtime's dialogue feature solves it by incorporating more dialogue history data and generating more output, something llama.cpp is not yet well equipped to handle.
For the second and third issues, we integrated Streaming LLM into Intel® Extension for Transformers, which significantly optimizes memory usage and reduces inference latency.
Streaming LLM
Unlike the traditional KV-cache algorithm, our approach combines an attention sink (the 4 initial tokens) to stabilize attention computation with a rolling KV cache that retains the most recent tokens, which is crucial for language modeling. The design is highly flexible and can be seamlessly integrated into autoregressive language models that use rotary position embedding (RoPE) or relative position encoding (ALiBi).
△ Figure 2. KV cache of Streaming LLM, which uses attention sinks to implement an efficient streaming language model (image source: [13])
Moreover, unlike llama.cpp, this optimization adds new parameters such as "n_keep" and "n_discard" to enhance the Streaming LLM strategy. Users can use "n_keep" to specify the number of tokens to keep in the KV cache and "n_discard" to determine how many of the generated tokens to discard. To better balance performance and accuracy, the system discards half of the most recent tokens in the KV cache by default.
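To make this policy concrete, the standalone sketch below (it does not call the library) simulates which token positions remain in the KV cache when `n_keep` initial tokens are pinned as attention sinks and `n_discard` of the oldest non-sink entries are dropped whenever the cache fills, with `n_discard=-1` standing for the default "discard half" behavior:

```python
# Toy simulation of the Streaming LLM KV-cache policy described above;
# illustrative only, it does not use intel_extension_for_transformers.
def simulate_kv_cache(total_tokens: int, ctx_size: int, n_keep: int, n_discard: int):
    """Return the token indices still cached after processing total_tokens tokens."""
    cache = []  # indices of tokens whose K/V states are currently held
    for token_idx in range(total_tokens):
        if len(cache) >= ctx_size:
            # Cache full: keep the n_keep attention-sink tokens, drop the oldest
            # generated ones; n_discard == -1 means "drop half of the rest".
            drop = (len(cache) - n_keep) // 2 if n_discard == -1 else n_discard
            cache = cache[:n_keep] + cache[n_keep + drop:]
        cache.append(token_idx)
    return cache

# Sinks 0-3 survive; the oldest middle tokens are evicted as new tokens arrive.
print(simulate_kv_cache(total_tokens=20, ctx_size=16, n_keep=4, n_discard=1))
# -> [0, 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```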
To further improve performance, we have also added Streaming LLM to the MHA fusion pattern. If the model uses rotary position embedding (RoPE), only a "shift operation" needs to be applied to the existing K-cache to avoid repeated computation on previously generated tokens that have not been discarded. This method not only takes full advantage of the full context size when generating long text, but also incurs no additional overhead until the KV-cache context is completely filled.
The "shift operation" relies on the commutativity and associativity of rotations, or equivalently of complex multiplication. For example, if a token's K-tensor was initially placed at position m and rotated by m×θi for i ∈ [0, d/2), then when it needs to move to position m-1, it can be rotated by an additional (-1)×θi for i ∈ [0, d/2). This is exactly what happens each time the cache discards n_discard tokens, after which every remaining token needs to be "shifted" by n_discard positions. The figure below illustrates this process with "n_keep=4, n_ctx=16, n_discard=1" as an example.
△ Figure 3. How the ring-buffer KV cache and Shift-RoPE work
Note that the fused attention layer does not need to be aware of this process: as long as the K-cache and V-cache are shuffled identically, the attention layer outputs almost the same result (with possible tiny differences due to floating-point error).
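As a sanity check on the rotation identity above, the following standalone numpy sketch (illustrative only, using a simplified (even, odd) dimension pairing rather than any particular model's RoPE layout) confirms that rotating a cached K-vector by an extra -n_discard × θ matches recomputing RoPE directly at the shifted position:

```python
import numpy as np

d = 8                                                # head dimension (even)
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)    # per-pair rotation frequencies

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate consecutive (even, odd) pairs of x by pos * theta, treated as complex numbers."""
    pairs = np.ascontiguousarray(x, dtype=np.float64).reshape(-1, 2).view(np.complex128).ravel()
    rotated = pairs * np.exp(1j * pos * theta)
    return rotated.view(np.float64)

k = np.random.default_rng(0).standard_normal(d)
m, n_discard = 7, 3

k_cached = rope(k, m)                    # K stored in the cache at original position m
k_shifted = rope(k_cached, -n_discard)   # "shift operation": extra rotation by -n_discard * theta
k_recomputed = rope(k, m - n_discard)    # K computed directly at the shifted position

print(np.allclose(k_shifted, k_recomputed))  # True, up to floating-point error
```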
You can use the following code to enable Streaming LLM:
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"  # Hugging Face model_id or local model
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

# Recommended: n_keep=4 to keep the attention sinks (four initial tokens) and
# n_discard=-1 to drop half of the most recent tokens when the length threshold is reached
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, ctx_size=100, n_keep=4, n_discard=-1)
```
Conclusion and Outlook

Based on the practice described above, this article provides a solution for efficient low-bit (INT4) LLM inference on Intel® Xeon® Scalable processors, verifies its generality across a range of common LLMs, and demonstrates its performance advantage over other CPU-based open-source solutions. In the future, we will further improve the CPU tensor library and cross-node parallel performance.

You are welcome to try Intel® Extension for Transformers[1] and run LLM inference more efficiently on Intel® platforms! You are also welcome to submit pull requests, issues, or questions to the repository. We look forward to your feedback!
Special Thanks

We would like to thank Intel senior AI manager 张瀚文 and engineers 许震中, 余振滔, 刘振卫, 丁艺, 王哲, and 刘宇澄 for their contributions to this article.
[a] Calculated from the first-token test results for Baichuan-13B in Table 2.
[b] Calculated from the next-token test results for MPT-7B in Table 2.
[c] When the input size is 1024, overall performance = first-token performance + 1023 × next-token performance; when the input size is 32, overall performance = first-token performance + 31 × next-token performance.
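To illustrate how footnote [c] combines the two latency numbers, here is a minimal sketch with hypothetical placeholder latencies (not measured results from Table 1 or Table 2):

```python
def overall_latency_ms(first_token_ms: float, next_token_ms: float, n_next_tokens: int) -> float:
    """Overall performance per footnote [c]: first-token latency plus n next-token latencies."""
    return first_token_ms + n_next_tokens * next_token_ms

# Hypothetical placeholder latencies in milliseconds, purely for illustration.
print(overall_latency_ms(500.0, 50.0, n_next_tokens=1023))  # input size 1024 case in [c]
print(overall_latency_ms(500.0, 50.0, n_next_tokens=31))    # input size 32 case in [c]
```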