S-LoRA: It is possible to run thousands of large models on one GPU

The deployment of large language models typically follows a "pretrain-then-fine-tune" paradigm. However, when the base model is fine-tuned for many tasks (such as personalized assistants), the cost of training and serving becomes very high. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method commonly used to adapt a base model to multiple tasks, producing a large number of derived LoRA adapters.

This pattern offers many opportunities for batched inference during serving, and fine-tuning only the adapter weights has been shown to achieve performance comparable to full fine-tuning. While merging adapter weights into the base model enables low-latency inference for a single adapter, executing adapters serially in this way significantly reduces overall throughput and increases overall latency when many adapters are served simultaneously. How to serve these fine-tuned variants at scale therefore remains an open problem.

Recently, researchers from UC Berkeley, Stanford, and other universities proposed a solution in a paper introducing a new serving system called S-LoRA.


  • Paper address: https://arxiv.org/pdf/2311.03285.pdf
  • Project address: https://github.com/S-LoRA/S-LoRA

S-LoRA is a system designed for scalable serving of many LoRA adapters: it stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory.

S-LoRA proposes a "Unified Paging" technique, which uses a unified memory pool to manage dynamic adapter weights of different ranks and KV cache tensors of different sequence lengths. Additionally, S-LoRA employs a new tensor parallelism strategy and highly optimized custom CUDA kernels to enable heterogeneous batching of LoRA computations.

These features allow S-LoRA to serve thousands of LoRA adapters (2,000 adapters simultaneously) on a single GPU or across multiple GPUs at a fraction of the cost, while keeping the additional LoRA computation overhead minimal. In contrast, vLLM-packed must maintain multiple copies of the weights and can serve fewer than five adapters due to GPU memory limitations.

Compared with state-of-the-art libraries such as HuggingFace PEFT and vLLM (with only naive LoRA serving support), S-LoRA can increase throughput by up to 4x and the number of served adapters by several orders of magnitude. S-LoRA can therefore provide scalable serving for many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.


S-LoRA contains three main innovations. Section 4 introduces the batching strategy used to decompose the computation between the base model and the LoRA adapters. The researchers also address the challenges of request scheduling, including adapter clustering and admission control. The ability to batch across concurrent adapters brings new challenges to memory management. In Section 5, the researchers generalize PagedAttention to Unified Paging to support dynamic loading of LoRA adapters. This approach uses a unified memory pool to store the KV cache and adapter weights in a paged manner, which reduces fragmentation and balances the dynamically changing sizes of the KV cache and adapter weights. Finally, Section 6 introduces a new tensor parallelism strategy that efficiently decouples the base model from the LoRA adapters.

The following is the key content:

Batch Processing

For a single adapter, Hu et al. (2021) recommend merging the adapter weights into the base model weights to obtain a new model (see Equation 1). The benefit is that there is no additional adapter overhead during inference, since the new model has the same number of parameters as the base model. In fact, this was a distinctive feature of the original LoRA work.
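The merging idea can be illustrated with a minimal NumPy sketch. The dimensions and variable names below are illustrative, not taken from the paper; the point is that the merged weight gives exactly the base output plus the adapter delta, at the cost of a single matmul.

```python
import numpy as np

# Illustrative sizes: hidden dimension d, LoRA rank r.
d, r = 16, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))   # base model weight
A = rng.standard_normal((d, r))   # LoRA down-projection
B = rng.standard_normal((r, d))   # LoRA up-projection
x = rng.standard_normal((2, d))   # a small batch of inputs

# Equation 1: merge the adapter into the base weight once, offline.
W_merged = W + A @ B

# Inference with the merged model is a single matmul, and is
# mathematically identical to the base output plus the adapter delta.
assert np.allclose(x @ W_merged, x @ W + x @ A @ B)
```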


The paper points out that merging the LoRA adapter into the base model is inefficient for multi-LoRA, high-throughput serving setups. Instead, the researchers propose computing the LoRA contribution xAB on the fly (as shown in Equation 2).

In S-LoRA, the base-model computation is batched across all requests, and the additional xAB term is then computed for each adapter using custom CUDA kernels. This process is shown in Figure 1. Instead of padding and using batched GEMM kernels from a BLAS library to compute LoRA, S-LoRA implements custom CUDA kernels for more efficient computation without padding; implementation details are in Section 5.3.
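The decomposition above can be sketched in plain NumPy. This is not the paper's implementation: a Python loop stands in for S-LoRA's custom CUDA kernel, and all names and sizes are illustrative. It shows how one batched GEMM handles the shared base model while each request applies its own adapter.

```python
import numpy as np

def batched_lora_forward(X, W, adapters, adapter_ids):
    """Compute x @ W + x @ A @ B per request, where each request in the
    batch may use a different LoRA adapter."""
    # The base-model part is batched across all requests in one GEMM.
    Y = X @ W
    # The LoRA part (xAB) is applied per request, without padding.
    # (In S-LoRA this loop is a custom CUDA kernel.)
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        Y[i] += X[i] @ A @ B
    return Y

rng = np.random.default_rng(1)
d, r = 8, 2
W = rng.standard_normal((d, d))
adapters = {k: (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
            for k in range(3)}
X = rng.standard_normal((4, d))   # four requests, three distinct adapters
Y = batched_lora_forward(X, W, adapters, adapter_ids=[0, 2, 1, 0])

# Each row matches what its own adapter's merged weights would produce.
A2, B2 = adapters[2]
assert np.allclose(Y[1], X[1] @ (W + A2 @ B2))
```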


The number of LoRA adapters stored in main memory can be large, but the number of adapters needed by the currently running batch is controllable, since the batch size is bounded by GPU memory. To take advantage of this, S-LoRA stores all LoRA adapters in main memory and, when running inference for the current batch, fetches only the LoRA adapters required by that batch into GPU RAM. In this case, the maximum number of servable adapters is limited by the main memory size. Figure 2 illustrates this process, and Section 5 discusses techniques for efficient memory management.
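The host-to-GPU fetching pattern can be sketched as a toy cache. This is not S-LoRA's actual memory manager; the class, its capacity limit, and the dict-based "copy" are all illustrative assumptions, intended only to show the policy: all adapters live in host memory, and only those needed by the current batch occupy device memory.

```python
class AdapterPool:
    """Toy sketch: adapters live in host memory; only those needed by
    the current batch are resident on the GPU (illustrative names)."""

    def __init__(self, host_adapters, gpu_capacity):
        self.host = dict(host_adapters)  # adapter_id -> weights (host RAM)
        self.gpu = {}                    # adapter_id -> weights (GPU RAM)
        self.capacity = gpu_capacity     # max adapters resident on GPU

    def prepare_batch(self, adapter_ids):
        needed = set(adapter_ids)
        assert len(needed) <= self.capacity, "batch uses too many adapters"
        missing = needed - self.gpu.keys()
        # Evict adapters the current batch does not use until the
        # missing ones fit.
        for aid in [a for a in self.gpu if a not in needed]:
            if len(self.gpu) + len(missing) <= self.capacity:
                break
            del self.gpu[aid]
        # "Copy" missing adapters host -> GPU (a dict move in this toy).
        for aid in missing:
            self.gpu[aid] = self.host[aid]
        return {aid: self.gpu[aid] for aid in needed}
```

A batch that reuses an already-resident adapter pays no transfer cost, while an unused adapter is evicted only when space is needed.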


Memory management

Compared with serving a single base model, serving multiple LoRA adapters simultaneously introduces new memory-management challenges. To support many adapters, S-LoRA stores them in main memory and dynamically loads the adapter weights needed by the currently running batch into GPU RAM.

Two challenges are apparent in this process. The first is memory fragmentation, caused by dynamically loading and unloading adapter weights of different sizes. The second is the latency overhead of adapter loading and unloading. To address these problems, the researchers propose "Unified Paging" and overlap I/O with computation by prefetching adapter weights.

Unified Paging

The researchers extend the idea of PagedAttention to Unified Paging, which manages not only the KV cache but also the adapter weights. Unified Paging uses a unified memory pool to jointly manage both. To implement this, they first statically allocate a large buffer for the memory pool, using all available space except that reserved for the base model weights and temporary activation tensors. Both the KV cache and the adapter weights are stored in the pool in a paged manner, with each page holding one vector of hidden dimension H. A KV cache tensor for a sequence of length S therefore occupies S pages, while a rank-R LoRA weight tensor occupies R pages. Figure 3 shows the memory-pool layout, where KV cache and adapter weights are stored interleaved and non-contiguously. This approach greatly reduces fragmentation and allows adapter weights of different ranks to coexist with the dynamically changing KV cache in a structured, systematic way.
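The page-accounting idea can be sketched as a toy allocator, under the assumption stated above that one page holds one H-dim vector, so a length-S sequence takes S pages and a rank-R adapter takes R pages. The class and its bookkeeping are illustrative, not the paper's implementation.

```python
class UnifiedPool:
    """Toy sketch of a unified memory pool: one page table holds both
    KV-cache pages and adapter-weight pages (illustrative design)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.owner = {}  # page index -> ("kv", seq_id) or ("lora", adapter_id)

    def alloc(self, kind, owner_id, n_pages):
        # n_pages = S for a sequence's KV cache, R for a rank-R adapter.
        if len(self.free) < n_pages:
            raise MemoryError("pool exhausted")
        pages = [self.free.pop() for _ in range(n_pages)]
        for p in pages:
            self.owner[p] = (kind, owner_id)
        return pages  # non-contiguous pages, interleaved in the pool

    def release(self, kind, owner_id):
        for p, o in list(self.owner.items()):
            if o == (kind, owner_id):
                del self.owner[p]
                self.free.append(p)
```

Because KV pages and adapter pages draw from the same free list, finished sequences immediately free space for incoming adapters and vice versa, which is the fragmentation-balancing behavior the paper describes.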


Tensor Parallel

In addition, the researchers designed a novel tensor parallelism strategy for batched LoRA inference to support multi-GPU inference of large Transformer models. Tensor parallelism is the most widely used parallelization approach, because its single-program multiple-data paradigm simplifies implementation and integration with existing systems. It reduces per-GPU memory usage and latency when serving large models. In this setting, the additional LoRA adapters introduce new weight matrices and matrix multiplications, which require new partitioning strategies.


Evaluation

Finally, the researchers evaluated S-LoRA by serving Llama-7B/13B/30B/70B.


The results show that S-LoRA can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with very small overhead. Compared with HuggingFace PEFT, a state-of-the-art parameter-efficient fine-tuning library, S-LoRA achieves up to 30x higher throughput. Compared with vLLM, a high-throughput serving system with naive LoRA support, S-LoRA improves throughput by up to 4x and increases the number of served adapters by several orders of magnitude.

For more research details, please refer to the original paper.

This article is reproduced from 51CTO.COM.