Large language models are typically deployed following a "pretrain-then-fine-tune" paradigm. However, when the base model is fine-tuned for many tasks (such as personalized assistants), the cost of training and serving grows quickly. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method commonly used to adapt a base model to many tasks, which results in a large number of derived LoRA adapters.
This pattern offers many opportunities for batched inference during serving, and fine-tuning only the adapter weights has been shown to match the performance of full fine-tuning. While serving adapters one at a time enables low latency for a single adapter, serial execution across adapters significantly reduces overall throughput and increases total latency when many adapters must be served concurrently. How to serve these fine-tuned variants at scale therefore remains an open problem.
Recently, researchers from UC Berkeley, Stanford, and other universities proposed a system called S-LoRA to address this problem.
- Paper address: https://arxiv.org/pdf/2311.03285.pdf
- Project address: https://github.com/S-LoRA/S-LoRA
S-LoRA is a system designed for scalable serving of many LoRA adapters. It stores all adapters in main memory and fetches the adapters needed by the currently running batch into GPU memory.
S-LoRA proposes "Unified Paging," which uses a unified memory pool to manage dynamic adapter weights of different ranks and KV cache tensors of different sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels to enable heterogeneous batching of LoRA computations.
These features allow S-LoRA to serve thousands of LoRA adapters (2,000 adapters simultaneously) on a single GPU or across multiple GPUs at a fraction of the cost, while keeping the additional LoRA computation overhead minimal. In contrast, vLLM-packed must maintain multiple copies of the weights and can serve fewer than 5 adapters due to GPU memory limits.
Compared with state-of-the-art libraries such as HuggingFace PEFT and vLLM (with only naive LoRA serving support), S-LoRA increases throughput by up to 4x and raises the number of served adapters by several orders of magnitude. S-LoRA can therefore provide scalable serving for many task-specific fine-tuned models and opens the door to large-scale customized fine-tuning services.
S-LoRA contains three main innovations. Section 4 introduces the batching strategy, which decomposes the computation between the base model and the LoRA adapters. The researchers also address the challenges of request scheduling, including adapter clustering and admission control. Batching across concurrent adapters creates new memory-management challenges. In Section 5, the researchers generalize PagedAttention to Unified Paging, which supports dynamically loading LoRA adapters: the KV cache and adapter weights are stored in a unified memory pool in a paged manner, which reduces fragmentation and balances their dynamically changing sizes. Finally, Section 6 introduces a novel tensor parallelism strategy that efficiently decouples the base model from the LoRA adapters.
The following is the key content:
Batch Processing
For a single adapter, Hu et al. (2021) recommend merging the adapter weights into the base model weights to obtain a new model (see Equation 1). Since the merged model has the same number of parameters as the base model, inference incurs no additional adapter overhead. This was in fact a distinctive feature of the original LoRA work.
The paper points out that merging the LoRA adapter into the base model is inefficient for multi-LoRA, high-throughput serving. Instead, the researchers propose to compute the LoRA contribution xAB on the fly (as shown in Equation 2).
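The two modes can be sketched numerically as follows (a minimal illustration with made-up dimensions, not the paper's implementation):

```python
import numpy as np

# Minimal sketch of the two LoRA inference modes (dimensions are illustrative).
h, r = 16, 4                        # hidden size and LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((h, h))     # base model weight
A = rng.standard_normal((h, r))     # LoRA down-projection
B = rng.standard_normal((r, h))     # LoRA up-projection
x = rng.standard_normal((2, h))     # a small batch of inputs

# Equation 1: merge the adapter into the base weights (ideal for one adapter).
W_merged = W + A @ B
y_merged = x @ W_merged

# Equation 2: leave the base model untouched and add xAB on the fly, so many
# adapters can share a single batched base-model computation.
y_on_the_fly = x @ W + (x @ A) @ B

assert np.allclose(y_merged, y_on_the_fly)
```

Merging is optimal when one adapter is served forever; computing xAB separately trades a small amount of extra work for the ability to batch the dominant base-model GEMM across requests that use different adapters.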
In S-LoRA, the base model computation is batched, and the additional xAB terms for all adapters are then computed separately with a custom CUDA kernel. This process is shown in Figure 1. Instead of padding and calling batched GEMM kernels from a BLAS library, the researchers implemented a custom CUDA kernel that computes LoRA without padding; implementation details are in Section 5.3.
The number of LoRA adapters stored in main memory can be large, but the number of adapters needed by the currently running batch is bounded, since batch size is limited by GPU memory. Taking advantage of this, S-LoRA stores all LoRA adapters in main memory and, when running inference for the current batch, fetches only the adapters that batch requires into GPU RAM. The maximum number of serviceable adapters is then limited by main memory size rather than GPU memory. Figure 2 illustrates this process, and Section 5 discusses techniques for efficient memory management.
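The serving loop described above can be sketched as follows (a simplified illustration; the adapter store, request format, and per-row LoRA loop stand in for the paper's custom CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
h, r = 16, 4

# All adapters live in (CPU) main memory; the base weight is always on "GPU".
cpu_adapter_store = {f"adapter_{i}": (rng.standard_normal((h, r)),
                                      rng.standard_normal((r, h)))
                     for i in range(1000)}
W = rng.standard_normal((h, h))

def serve_batch(requests):
    """requests: list of (adapter_id, input_vector) pairs."""
    # Fetch only the adapters this batch needs into "GPU memory".
    needed = {rid for rid, _ in requests}
    gpu_adapters = {rid: cpu_adapter_store[rid] for rid in needed}

    X = np.stack([x for _, x in requests])
    base_out = X @ W                        # one batched base-model GEMM
    # Heterogeneous LoRA part: each row uses its own adapter's A and B.
    lora_out = np.stack([(x @ gpu_adapters[rid][0]) @ gpu_adapters[rid][1]
                         for rid, x in requests])
    return base_out + lora_out

batch = [("adapter_3", rng.standard_normal(h)),
         ("adapter_500", rng.standard_normal(h))]
out = serve_batch(batch)
```

Note that even though the two requests use different adapters, the expensive base-model multiply is a single batched operation; only the small rank-r LoRA computation is per-adapter.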
Memory Management
Compared with serving a single base model, serving multiple LoRA adapters simultaneously brings new memory-management challenges. To support many adapters, S-LoRA stores them in main memory and dynamically loads the adapter weights needed by the current running batch into GPU RAM.
This process creates two obvious challenges. The first is memory fragmentation, caused by dynamically loading and unloading adapter weights of different sizes. The second is the latency overhead of adapter loading and unloading. To address these problems, the researchers propose "Unified Paging" and overlap I/O with computation by prefetching adapter weights.
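The prefetching idea can be sketched with a background loader thread (a toy illustration; the load function, timings, and queue-based handoff are invented for the example, not S-LoRA's implementation):

```python
import threading, time, queue

# Toy sketch of overlapping adapter I/O with computation via prefetching.
def load_adapter(adapter_id):
    time.sleep(0.01)                 # simulate a host-to-GPU weight copy
    return f"weights_of_{adapter_id}"

def serve(batches):
    prefetched = queue.Queue()

    def prefetch_worker():
        # Loads each batch's adapters ahead of time, while the main
        # thread is still computing the previous batch.
        for adapters in batches:
            prefetched.put({a: load_adapter(a) for a in adapters})

    t = threading.Thread(target=prefetch_worker)
    t.start()
    results = []
    for _ in batches:
        weights = prefetched.get()       # weights are (ideally) already there
        results.append(sorted(weights))  # stand-in for running the batch
    t.join()
    return results

out = serve([["a1", "a2"], ["a3"]])
```

The point is that the copy for batch N+1 proceeds concurrently with the computation for batch N, hiding most of the loading latency.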
Unified Paging
The researchers extend the idea of PagedAttention to Unified Paging, which manages not only the KV cache but also the adapter weights, using a single unified memory pool for both. To implement this, they statically allocate a large buffer for the memory pool, using all available GPU memory except the space reserved for the base model weights and temporary activation tensors. Both the KV cache and the adapter weights are stored in the pool in a paged manner, with each page holding one vector of size H (the hidden dimension). A KV cache tensor for a sequence of length S thus occupies S pages, while a rank-R LoRA weight tensor occupies R pages. Figure 3 shows the memory pool layout, in which KV cache and adapter weights are stored interleaved and non-contiguously. This design greatly reduces fragmentation and allows adapter weights of different ranks to coexist with the dynamically growing KV cache in a structured, systematic way.
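A toy page allocator makes the bookkeeping concrete (illustrative only; the class name, API, and free-list policy are assumptions, not the paper's code):

```python
# Toy sketch of a unified paged memory pool shared by KV cache and adapters.
class UnifiedPool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # each page holds one H-sized vector
        self.owner = {}                     # page index -> tensor id

    def alloc(self, tensor_id, num_pages):
        if len(self.free) < num_pages:
            raise MemoryError("pool exhausted")
        pages = [self.free.pop() for _ in range(num_pages)]
        for p in pages:
            self.owner[p] = tensor_id
        return pages                        # pages may be non-contiguous

    def release(self, tensor_id):
        freed = [p for p, t in self.owner.items() if t == tensor_id]
        for p in freed:
            del self.owner[p]
            self.free.append(p)

pool = UnifiedPool(num_pages=64)
kv_pages = pool.alloc("kv_cache_seq0", num_pages=10)  # sequence length S = 10
lora_pages = pool.alloc("adapter_A_r4", num_pages=4)  # LoRA rank R = 4
pool.release("kv_cache_seq0")                         # pages return to the pool
```

Because both tensor types are carved into identical H-sized pages, a freed KV-cache page can immediately be reused for adapter weights (and vice versa), which is what keeps fragmentation low.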
Tensor Parallel
In addition, the researchers designed a novel tensor parallelism strategy for batched LoRA inference to support multi-GPU inference of large Transformer models. Tensor parallelism is the most widely used parallelization approach because its single-program, multiple-data paradigm simplifies implementation and integration with existing systems. It can reduce per-GPU memory usage and latency when serving large models. In this setting, the additional LoRA adapters introduce new weight matrices and matrix multiplications, which call for new partitioning strategies.
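The general flavor of such a strategy can be simulated with numpy on two "devices" (a sketch of one possible aligned partitioning, not the paper's exact scheme):

```python
import numpy as np

# Toy simulation of tensor parallelism for the LoRA path on 2 "devices".
rng = np.random.default_rng(2)
h, r, devices = 16, 4, 2
W = rng.standard_normal((h, h))
A = rng.standard_normal((h, r))
B = rng.standard_normal((r, h))
x = rng.standard_normal((3, h))

# Column-partition the base weight and, to stay aligned with it, the LoRA
# up-projection B, so each device produces the same slice of both outputs.
W_shards = np.split(W, devices, axis=1)
B_shards = np.split(B, devices, axis=1)

# xA is tiny (rank r), so replicating it on every device is cheap; each device
# then computes its slice of the base output plus its slice of the LoRA output.
xA = x @ A
partial = [x @ W_shards[d] + xA @ B_shards[d] for d in range(devices)]
y_parallel = np.concatenate(partial, axis=1)  # models an all-gather

assert np.allclose(y_parallel, x @ W + (x @ A) @ B)
```

The key property is that the sharded computation reproduces the unsharded result exactly, while the communication added for LoRA involves only small rank-r intermediates.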
Evaluation
Finally, the researchers evaluated S-LoRA by serving Llama-7B/13B/30B/70B.
The results show that S-LoRA can serve thousands of LoRA adapters on a single GPU or multiple GPUs with very small overhead. S-LoRA achieves up to 30x higher throughput than Huggingface PEFT, a state-of-the-art parameter-efficient fine-tuning library. Compared with vLLM, a high-throughput serving system with naive LoRA support, S-LoRA increases throughput by 4x and raises the number of served adapters by several orders of magnitude.
For more research details, please refer to the original paper.
The above is the detailed content of S-LoRA: It is possible to run thousands of large models on one GPU. For more information, please follow other related articles on the PHP Chinese website!
