Real-time image generation accelerated 5-10x: Tsinghua's LCM/LCM-LoRA goes viral, with over one million views and over 200,000 downloads
Generative models enter the "real-time" era?
Text-to-image and image-to-image tools are no longer a novelty. However, when using them, we often find that they run slowly, forcing us to wait a while for the generated results.
Recently, however, a model called "LCM" has changed this. In some cases, it can even generate images continuously in real time.
Source: https://twitter.com/javilopen/status/1724398666889224590
The full name of LCM is Latent Consistency Model. It was built by researchers from the Institute for Interdisciplinary Information Sciences at Tsinghua University. Before this model was released, latent diffusion models (LDMs) such as Stable Diffusion were slow to generate images because of the computational cost of the iterative sampling process. Through several innovations, LCM can generate high-resolution images with only a few inference steps. It is estimated that LCM speeds up mainstream text-to-image models by 5-10x, which is what makes real-time generation possible.

Technical report: https://arxiv.org/pdf/2311.05556.pdf
The fast generation capability of latent consistency models opens up new application areas for image generation technology. The model can rapidly process and render images captured in real time according to an input text prompt, so users can customize the scenes or visual effects they want to display. On the X platform, many researchers have posted results obtained with the model, covering image generation, video generation, image editing, real-time video rendering, and other applications.

Image source: https://twitter.com/javilopen/status/1724398666889224590
Image source: https://twitter.com/javilopen/status/1724398708052414748
The LCM team has fully open-sourced the code and released model weights distilled from pre-trained models such as SD-v1.5 and SDXL, along with online demos. In addition, the Hugging Face team has integrated the latent consistency model into the official diffusers repository, updating the code for LCM and LCM-LoRA in two consecutive releases, v0.22.0 and v0.23.0, to provide good support for latent consistency models. The models published on Hugging Face topped the day's trending list, becoming the most popular text-to-image model on the platform and the third most popular model across all categories.
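For readers who want to try it, here is a minimal sketch using the diffusers integration mentioned above (v0.22+). The checkpoint name is the SD-v1.5 LCM distillation the authors published on Hugging Face; the prompt and output filename are illustrative.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# LCM distilled from an SD-v1.5 base (Dreamshaper-v7), published by the authors.
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
).to("cuda")

# Use the LCM scheduler that ships with diffusers v0.22+.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# 4 inference steps are usually enough; the guidance scale is an input the model
# was distilled with, not an extra classifier-free-guidance pass.
image = pipe(
    "a photo of a cat wearing a spacesuit, high detail",
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
image.save("lcm_demo.png")
```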
Below, we introduce the two research results, LCM and LCM-LoRA, in turn.
LCM: Generate high-resolution images with only a few steps of inference
In the AIGC era, diffusion-based text-to-image models such as Stable Diffusion and DALL-E 3 have received widespread attention. Diffusion models generate high-quality images by adding noise to training data and then learning to reverse that process. However, they require multi-step sampling to generate an image, which is relatively slow and increases inference cost. This slow multi-step sampling is a major bottleneck when deploying such models.
The Consistency Model (CM), proposed this year by OpenAI's Yang Song, offers a way to address this problem. Consistency models are designed to generate in a single step, showing great potential for accelerating diffusion-model generation. However, because consistency models are limited to unconditional image generation, many practical applications, including text-to-image and image-to-image generation, still cannot benefit from them.
The Latent Consistency Model (LCM) was proposed to solve these problems. It supports image generation under a given condition, and combines latent encoding, classifier-free guidance, and many other techniques widely used in diffusion models, greatly speeding up the conditional denoising process and opening a path toward many practical applications.
LCM Technical Details
Specifically, the latent consistency model interprets the denoising process of a diffusion model as solving an augmented probability flow ordinary differential equation (PF-ODE).
Traditional diffusion models solve this ODE by numerical iteration. Even with more accurate solvers, the accuracy of each step is limited, and roughly ten or more iterations are needed to obtain satisfactory results.
Unlike traditional iterative ODE solving, the latent consistency model is trained to solve the ODE in a single step, directly predicting the final solution of the equation; in theory, an image can therefore be generated in one step.
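In the notation of the consistency-model line of work, this amounts to learning a consistency function that maps any point on a PF-ODE trajectory to that trajectory's endpoint. A sketch of the defining self-consistency property (not the paper's exact formulation) is:

```latex
% Every point (z_t, t) on the same PF-ODE trajectory is mapped to the same
% solution, i.e. the clean latent, with the boundary condition at t = \epsilon.
f_\theta(z_t, t) = f_\theta(z_{t'}, t') \quad \forall\, t, t' \in [\epsilon, T],
\qquad f_\theta(z_\epsilon, \epsilon) = z_\epsilon .
```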
To train a latent consistency model, the study shows that fast generation can be achieved with very little resource consumption by fine-tuning the parameters of a pre-trained diffusion model (for example, Stable Diffusion). This distillation process is based on optimizing the consistency loss proposed by Yang Song. To obtain better performance and reduce computational overhead on text-to-image tasks, the paper introduces three key techniques:
(1) A pre-trained autoencoder encodes the original image into a latent-space representation, removing redundant information while compressing the image and making the representation more semantically coherent;
(2) The classifier-free guidance scale is distilled into the latent consistency model as an input parameter. The model thus enjoys the better image-text consistency brought by classifier-free guidance, while the computational overhead of guidance at inference time is reduced, since the guidance scale is now just an input;
(3) A skipping-step strategy is used when computing the consistency loss, which greatly speeds up the distillation of the latent consistency model. The pseudocode of the distillation algorithm is shown in the figure below.
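For reference, the latent consistency distillation objective optimized by this algorithm takes roughly the following form (notation following the technical report): Ψ is the ODE solver used to skip k steps, w the guidance scale, c the text condition, d a distance metric, and θ⁻ an EMA copy of the student.

```latex
\mathcal{L}_{\mathrm{LCD}}(\theta, \theta^-; \Psi)
  = \mathbb{E}_{z, c, w, n}\Big[
      d\big( f_\theta(z_{t_{n+k}}, w, c, t_{n+k}),\,
             f_{\theta^-}(\hat{z}^{\Psi, w}_{t_n}, w, c, t_n) \big)
    \Big]
```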
Qualitative and quantitative results show that the latent consistency model can quickly generate high-quality images, needing only 1 to 4 steps. Comparing actual inference time against the FID generation-quality metric shows that, relative to DPM-Solver, one of the fastest existing samplers, the latent consistency model accelerates actual inference by about 4x while maintaining the same generation quality.
Images generated by LCM
LCM-LoRA: A universal Stable Diffusion acceleration module
Building on the latent consistency model, the team then released a technical report on LCM-LoRA. Since the distillation of a latent consistency model can be regarded as fine-tuning the original pre-trained model, efficient fine-tuning techniques such as LoRA can be used for the training. Thanks to the resource savings brought by LoRA, the team distilled SDXL, the model with the largest number of parameters in the Stable Diffusion family, and obtained a latent consistency model that, in a very small number of steps, generates images comparable to what SDXL produces in dozens of steps.

In the paper's introduction, the study points out that although latent diffusion models (LDMs) have been successful at generating images from text and sketches, their slow reverse sampling process limits real-time applications and hurts user experience. Current open-source models and acceleration techniques cannot yet achieve real-time generation on ordinary consumer-grade GPUs. Methods for accelerating LDMs generally fall into two categories: the first uses advanced ODE solvers, such as DDIM, DPM-Solver, and DPM-Solver++, to speed up the generation process; the second distills the LDM to simplify it. ODE solvers reduce the number of inference steps but still require significant computation, especially when classifier-free guidance is employed. Meanwhile, distillation methods such as Guided-Distill, while promising, face practical limits due to their heavy computational requirements. Finding a balance between the speed and quality of LDM-generated images remains a challenge in the field.
Recently, inspired by the Consistency Model (CM), the Latent Consistency Model (LCM) emerged as a solution to the slow sampling problem in image generation. LCM treats the reverse diffusion process as solving an augmented probability flow ODE (PF-ODE). It innovatively predicts the solution in latent space without iterating a numerical ODE solver, and can thus synthesize high-resolution images in only 1 to 4 inference steps. In addition, LCM is efficient to distill: only about 32 A100 GPU hours of training are needed to obtain a model capable of minimal-step inference.
On this basis, a method called Latent Consistency Fine-tuning (LCF) was proposed, which can fine-tune a pre-trained LCM without starting from a teacher diffusion model. For specialized datasets, such as anime, photo-realistic, or fantasy image datasets, additional steps are still required, for example distilling a pre-trained LDM into an LCM using Latent Consistency Distillation (LCD), or fine-tuning the LCM directly with LCF. This additional training, however, can hinder rapid deployment of LCM on different datasets, which raises a key question: can fast, training-free inference be achieved on custom datasets?

To answer this question, the researchers proposed LCM-LoRA. LCM-LoRA is a universal, training-free acceleration module that can be plugged directly into various fine-tuned Stable Diffusion (SD) models or SD LoRAs to support fast inference with minimal steps. Compared with earlier numerical PF-ODE solvers such as DDIM, DPM-Solver, and DPM-Solver++, LCM-LoRA can be viewed as a new class of neural-network-based PF-ODE solver module, and it shows strong generalization across various fine-tuned SD models and LoRAs.
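To make the "plug-in" usage concrete, here is a minimal sketch using the diffusers API (v0.23+) with the LCM-LoRA weights the team published on Hugging Face; the base checkpoint, prompt, and output filename are examples.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Any fine-tuned SDXL checkpoint works; the SDXL base model is used for illustration.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Plug in the LCM-LoRA acceleration module and switch to the LCM scheduler.
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# Few-step inference; LCM-LoRA is typically run with a low guidance scale.
image = pipe(
    "a cinematic photo of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_sdxl.png")
```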
Overview of LCM-LoRA. By introducing LoRA into the distillation process of LCM, the study significantly reduced the memory overhead of distillation, allowing larger models such as SDXL and SSD-1B to be trained with limited resources. More importantly, the LoRA parameters obtained through LCM-LoRA training (an "acceleration vector") can be directly combined with other LoRA parameters obtained by fine-tuning on a dataset of a particular style (a "style vector"). Without any training, the model obtained by linearly combining the acceleration vector and the style vector can generate images in that specific style with a minimal number of sampling steps.
Technical details of LCM-LoRA
In general, a latent consistency model is trained by single-stage guided distillation, which uses the latent space of a pre-trained autoencoder to distill a guided diffusion model into an LCM. This process builds on the augmented probability flow ODE, which can be thought of as a mathematical formulation ensuring that generated samples follow a trajectory that yields high-quality images. Notably, the focus of distillation is to preserve the fidelity of these trajectories while drastically reducing the number of sampling steps required. Algorithm 1 in the paper gives the pseudocode for LCD.
Since LCM distillation is performed on the parameters of a pre-trained diffusion model, latent consistency distillation can be regarded as fine-tuning that diffusion model, so parameter-efficient methods such as LoRA can be applied. LoRA updates a pre-trained weight matrix through a low-rank decomposition: for a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update is expressed as $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \leq \min(d, k)$. During training, $W_0$ stays frozen and gradient updates are applied only to the two matrices $A$ and $B$. Because the update is a product of low-rank matrices, LoRA greatly reduces the number of trainable parameters and thereby the memory usage.
The table below compares the total number of parameters of the full model with the number of trainable parameters when LoRA is used. Clearly, by incorporating LoRA into the LCM distillation process, the number of trainable parameters is reduced significantly, which effectively lowers the memory requirements for training.
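The parameter savings are easy to see in code. The following is a minimal, self-contained sketch of a LoRA-wrapped linear layer (not the paper's implementation); the layer sizes and rank are arbitrary examples.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = x W0^T + x (BA)^T; only A and B are trained."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # W0 (and bias) stay frozen
        d_out, d_in = base.weight.shape              # nn.Linear stores (out, in)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A: r x k
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B: d x r, zero init => no change at start

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # 12,288 trainable vs ~600k total
```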
Through a series of experiments, the study shows that the LCD paradigm adapts well to larger models such as SDXL and SSD-1B. The generation results of the different models are shown in Figure 2.
The authors found that using LoRA makes the distillation process more efficient, and also discovered that the LoRA parameters obtained from training can serve as a general acceleration module that can be combined directly with other LoRA parameters.
As shown in Figure 1 above, the team found that it is enough to simply combine the "style parameters" obtained by fine-tuning on data of a specific style with the "acceleration parameters" obtained by latent consistency distillation. Through a linear combination of the two, one obtains a new latent consistency model that has both fast generation and the particular style. This finding gives a powerful boost to the large number of models that already exist in the open-source community, allowing them to enjoy the acceleration provided by the latent consistency model without any additional training.
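As an illustration of this linear combination in practice, the sketch below uses the diffusers multi-adapter API (with the peft backend) to merge the published LCM-LoRA "acceleration" weights with a community style LoRA; the style LoRA repository, adapter weights, and prompt are examples, not the paper's exact setup.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# Acceleration vector: the LCM-LoRA weights distilled for SDXL.
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm")
# Style vector: an ordinary style LoRA fine-tuned on SDXL (example repository).
pipe.load_lora_weights("TheLastBen/Papercut_SDXL",
                       weight_name="papercut.safetensors", adapter_name="papercut")

# Linear combination of the two adapters; no additional training is needed.
pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8])

image = pipe("papercut, a cute fox in a forest",
             num_inference_steps=4, guidance_scale=1.0).images[0]
image.save("papercut_lcm.png")
```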
The figure below shows the generation results of the new model obtained by applying this method to a "papercut painting style" model.

In short, LCM-LoRA is a universal, training-free acceleration module for Stable Diffusion (SD) models. It functions as a standalone, efficient neural-network-based solver module that predicts the solution of the PF-ODE, enabling fast inference with minimal steps on a variety of fine-tuned SD models and SD LoRAs. Extensive text-to-image experiments demonstrate LCM-LoRA's strong generalization ability and superiority.
Team introduction

All authors of the paper are from Tsinghua University; the two co-first authors are Simian Luo and Yiqin Tan.
Simian Luo is a second-year master's student in the Department of Computer Science and Technology at Tsinghua University, advised by Professor Hang Zhao. He received his bachelor's degree from the School of Data Science at Fudan University. His research focuses on multimodal generative models; he is interested in diffusion models, consistency models, and AIGC acceleration, and works on developing next-generation generative models. He has previously published multiple first-author papers at top conferences such as ICCV and NeurIPS.
Yiqin Tan is a second-year master's student at Tsinghua University, advised by Longbo Huang. He did his undergraduate studies in the Department of Electronic Engineering at Tsinghua University. His research interests mainly cover deep reinforcement learning and diffusion models. He has previously published first-author papers at top conferences such as ICLR, including oral presentations.
It is worth noting that one of the two co-first authors took the Advanced Computer Theory course taught by Jian Li, where the idea of LCM was proposed and eventually presented as the final course project. Of the three advisors, Jian Li and Longbo Huang are associate professors at Tsinghua's Institute for Interdisciplinary Information Sciences, and Hang Zhao is an assistant professor at the same institute.
Front row (left to right): Simian Luo, Yiqin Tan. Back row (left to right): Longbo Huang, Jian Li, Hang Zhao.