Use BigDL-LLM to Instantly Accelerate Inference for LLMs with Tens of Billions of Parameters
We are entering a new era of AI driven by large language models (LLMs). LLMs play an increasingly important role in applications such as customer service, virtual assistants, content creation, and programming assistance.
However, as LLMs continue to grow in scale, the resources required to run them also increase, making them run ever more slowly and posing considerable challenges for AI application developers.
To address this, Intel recently released an open source large-model library called BigDL-LLM[1], which helps AI developers and researchers accelerate and optimize large language models on Intel® platforms and improves the experience of using them there.
The following shows the real-time effect of running Vicuna-33b-v1.3[2], a large language model with 33 billion parameters, accelerated with BigDL-LLM on a server equipped with an Intel® Xeon® Platinum 8468 processor.
△ Actual speed of running a 33-billion-parameter large language model on a server equipped with an Intel® Xeon® Platinum 8468 processor (real-time screen recording)
BigDL-LLM is an open source library focused on optimizing and accelerating large language models. It is part of BigDL and is released under the Apache 2.0 license.
It provides various low-precision optimizations (such as INT4/INT5/INT8) and can leverage the hardware acceleration technologies integrated into Intel® CPUs (AVX/VNNI/AMX, etc.) together with the latest software optimizations, enabling large language models to run more efficiently and faster on Intel® platforms.
An important feature of BigDL-LLM is that, for models based on the Hugging Face Transformers API, only one line of code needs to change to accelerate the model. In principle it can run any Transformers model, which is very friendly to developers already familiar with the Transformers API.
In addition to the Transformers API, many people also use LangChain to develop large language model applications.
For this, BigDL-LLM also provides an easy-to-use LangChain integration[3], allowing developers to use BigDL-LLM both to build new applications and to migrate existing applications based on the Transformers or LangChain APIs.
Furthermore, for general PyTorch large language models (models that use neither the Transformers nor the LangChain API), the BigDL-LLM optimize_model API provides one-line acceleration to improve performance. For details, please refer to the GitHub README[4] and the official documentation[5].
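As a rough illustration, here is a minimal sketch of that one-line optimization, assuming a generic PyTorch LLM already loaded in memory (the checkpoint path and loading code are placeholders, not from the original article):

# A minimal sketch of the optimize_model API; the checkpoint path is a placeholder.
import torch
from bigdl.llm import optimize_model

model = torch.load('/path/to/pytorch_llm.pt')  # any general PyTorch large language model
model = optimize_model(model)  # one line: applies low-precision optimization (INT4 by default)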
BigDL-LLM also provides a large number of acceleration examples for commonly used open source LLMs (e.g., examples using the Transformers API[6] and examples using the LangChain API[7]) as well as tutorials (including companion Jupyter notebooks)[8] to help developers get started quickly.
Installing BigDL-LLM is very convenient; just execute the following command:
pip install --pre --upgrade bigdl-llm[all]
Using BigDL-LLM to accelerate large models is also very easy (here, only the Transformers-style API is used as an example).
To accelerate a model with the BigDL-LLM Transformers-style API, only the model loading part needs to change; everything afterwards is exactly the same as with native Transformers.
Loading a model with the BigDL-LLM API is almost identical to the Transformers API: the user only needs to change the import and set load_in_4bit=True in the from_pretrained parameters.
BigDL-LLM performs 4-bit low-precision quantization while the model is being loaded, and applies various software and hardware acceleration technologies during subsequent inference.
# Load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
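After loading, inference proceeds exactly as with native Transformers. A minimal sketch, assuming the model directory also contains a tokenizer and using an illustrative prompt:

# A minimal sketch of inference after loading; tokenizer path and prompt are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/path/to/model/')
inputs = tokenizer("What is AI?", return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))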
The following uses a common LLM application scenario, the "voice assistant", as an example to show how to quickly build an LLM application with BigDL-LLM. A voice assistant application typically works in two stages:

1. Speech recognition: convert the user's speech into text.
2. Text generation: feed the recognized text to the LLM to generate a reply.
The following describes how this article uses BigDL-LLM and LangChain[11] to build the voice assistant application:
In the speech recognition stage: the first step is to load the preprocessor processor and the speech recognition model recog_model. The recognition model used in this example, Whisper, is a Transformers model.
Simply use AutoModelForSpeechSeq2Seq from BigDL-LLM and set load_in_4bit=True to load and accelerate this model at INT4 precision, which significantly reduces inference time.
# Load Whisper with INT4 optimizations via BigDL-LLM
from transformers import WhisperProcessor
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq

processor = WhisperProcessor.from_pretrained(recog_model_path)
recog_model = AutoModelForSpeechSeq2Seq.from_pretrained(recog_model_path, load_in_4bit=True)
The second step is speech recognition: first use the processor to extract input features from the input audio, then use the recognition model to predict tokens, and finally use the processor again to decode those tokens into natural language text.
input_features = processor(frame_data,
                           sampling_rate=audio.sample_rate,
                           return_tensors="pt").input_features
predicted_ids = recog_model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
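The snippet above assumes frame_data, audio, and forced_decoder_ids have been prepared earlier. As a hedged sketch of how they might be obtained (using the speech_recognition package for microphone capture is an assumption, not part of the original article):

# Hypothetical setup for the variables used above; not from the original example.
import numpy as np
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone(sample_rate=16000) as source:
    audio = recognizer.listen(source)  # capture one utterance from the microphone

# Convert raw 16-bit PCM bytes into the float32 array Whisper expects
frame_data = np.frombuffer(audio.frame_data, dtype=np.int16).astype(np.float32) / 32768.0

# Force Whisper to transcribe in English rather than auto-detecting the language
forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")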
In the text generation stage, first use BigDL-LLM's TransformersLLM API to create a LangChain language model (TransformersLLM is the LangChain LLM integration defined in BigDL-LLM).
This API can be used to load any Hugging Face Transformers model.
llm = TransformersLLM.from_model_id(
    model_id=llm_model_path,
    model_kwargs={"temperature": 0, "max_length": args.max_length, "trust_remote_code": True},
)
Then create a normal conversation chain LLMChain, passing the llm created above as an input parameter.
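The chain below also references a prompt and conversation memory. A minimal sketch of how the prompt template might be defined (the template wording is illustrative, not from the original article):

# Hypothetical prompt template for the voice assistant chain; wording is illustrative.
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

template = """You are a helpful voice assistant.

{history}
Human: {human_input}
Assistant:"""

prompt = PromptTemplate(input_variables=["history", "human_input"], template=template)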
# The following code is exactly the same as in a standard LangChain use case
voiceassistant_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)
The following code uses the chain to record the entire conversation history and format it appropriately as input to the large language model, so that a suitable reply can be generated. Simply pass the text produced by the recognition model in as human_input:
response_text = voiceassistant_chain.predict(human_input=text, stop="\n\n")
Finally, put the speech recognition and text generation steps into a loop to talk with this "voice assistant" over multiple rounds of dialogue. You can visit the link at [12] at the bottom to see the complete sample code and try it on your own computer. Use BigDL-LLM to quickly build your own voice assistant!
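Putting the pieces together, here is a hedged sketch of that multi-turn loop, reusing the illustrative helpers introduced above (microphone capture via speech_recognition remains an assumption):

# A sketch of the multi-turn loop; audio capture details are assumptions.
while True:
    with sr.Microphone(sample_rate=16000) as source:
        audio = recognizer.listen(source)  # wait for the next user utterance
    frame_data = np.frombuffer(audio.frame_data, dtype=np.int16).astype(np.float32) / 32768.0
    input_features = processor(frame_data,
                               sampling_rate=audio.sample_rate,
                               return_tensors="pt").input_features
    predicted_ids = recog_model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    response_text = voiceassistant_chain.predict(human_input=text, stop="\n\n")
    print(response_text)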
Shengsheng Huang (黄晟盛) is a senior architect at Intel; Kai Huang (黄凯) is an AI framework engineer at Intel; Jason Dai (戴金权) is an Intel Fellow, global CTO for big data technologies, and the founder of the BigDL project. All three work on big data and AI.