


Google optimizes diffusion models: Stable Diffusion runs on a Samsung phone and generates an image in under 12 seconds
Stable Diffusion is as well known in image generation as ChatGPT is among conversational large language models. It can create a realistic image from any input text in tens of seconds. Because Stable Diffusion has more than 1 billion parameters, and because computing and memory resources on devices are limited, the model is mainly run in the cloud.
Without careful design and implementation, running these models on a device can lead to high latency, because of the iterative denoising process, and to excessive memory consumption.
How to run Stable Diffusion on-device has attracted wide research interest. Earlier, some researchers developed an app that uses Stable Diffusion to generate images on the iPhone 14 Pro; it takes about one minute and uses approximately 2 GiB of application memory.
Apple has also done some optimization work: it can generate a 512 × 512 image in about half a minute on iPhone, iPad, Mac, and other devices. Qualcomm followed closely, running Stable Diffusion v1.5 on Android phones and generating a 512 × 512 image in less than 15 seconds.
Recently, in the paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations", Google ran Stable Diffusion 1.4 on-device with GPU acceleration and achieved SOTA inference latency: on a Samsung S23 Ultra, generating a 512 × 512 image with 20 iterations takes only 11.5 seconds. Furthermore, the study is not specific to one device; it is a general approach that is potentially applicable to all diffusion models.
This research opens up many possibilities for running generative AI locally on a phone, without a data connection or a cloud server. Stable Diffusion was only released last fall, and it can already run on devices today, which shows how fast the field is developing.
Paper address: https://arxiv.org/pdf/2304.11267.pdf
To reach this generation speed, Google proposes several optimizations. Let's look at how they work.
Method introduction
This research proposes optimization methods to speed up text-to-image generation with large diffusion models. The optimizations are proposed for Stable Diffusion, but they are also suitable for other large diffusion models.
First, let's look at the main components of Stable Diffusion: the text embedder, noise generation, the denoising neural network, and the image decoder, as shown in Figure 1 below.
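To make the roles of the four components concrete, here is a minimal Python sketch of how they fit together. Every function in it (`embed_text`, `generate_noise`, `denoise_step`, `decode_image`, `generate`) is a hypothetical toy stand-in and not part of Stable Diffusion or of the paper's code. The only point is the control flow: the embedder, the noise generator, and the decoder run once, while the denoising network runs once per iteration, which is why it dominates on-device latency.

```python
import random


def embed_text(prompt):
    # Hypothetical stand-in for the text embedder: one number per word.
    return [float(len(word)) for word in prompt.split()]


def generate_noise(size, seed):
    # Noise generation: a random latent that the denoiser will refine.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def denoise_step(latent, embedding):
    # Hypothetical stand-in for the denoising network. A real denoiser is a
    # large neural network; this toy version only shrinks the latent a little.
    bias = 0.001 * sum(embedding)
    return [0.9 * v + bias for v in latent]


def decode_image(latent):
    # Hypothetical stand-in for the image decoder: map values into 0..255.
    return [max(0, min(255, int(128 + 64 * v))) for v in latent]


def generate(prompt, steps=20, latent_size=16, seed=0):
    embedding = embed_text(prompt)              # runs once
    latent = generate_noise(latent_size, seed)  # runs once
    denoiser_calls = 0
    for _ in range(steps):                      # runs `steps` times
        latent = denoise_step(latent, embedding)
        denoiser_calls += 1
    return decode_image(latent), denoiser_calls  # decoder runs once


if __name__ == "__main__":
    image, calls = generate("a cat on a sofa", steps=20)
    print(len(image), "pixels,", calls, "denoiser calls")
```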
Specialized kernel: Group Norm and GELU
Group Normalization (GN) works by dividing the channels of a feature map into smaller groups and normalizing each group independently, which makes GN less dependent on batch size and suitable for a wide range of batch sizes and network architectures. Instead of performing the reshape, mean, variance, and normalization operations one after another, the study designed a dedicated kernel, in the form of a GPU shader, that performs all of these operations in a single GPU command without any intermediate tensor.
The Gaussian Error Linear Unit (GELU), a commonly used activation function, involves many numerical computations, such as multiplication, addition, and the Gaussian error function. The study uses a dedicated shader to consolidate these computations and the accompanying split and multiplication operations so that they execute in a single draw call.
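As a reference for what these two kernels compute, the following is a plain-Python sketch of the math only. It is not the GPU shader from the paper. It assumes a feature map stored as a list of channels (each a flat list of floats), leaves out the learned scale and shift that practical GN layers usually apply, and uses the exact erf-based form of GELU. The names `group_norm` and `gelu` are illustrative.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize each group of channels to zero mean and unit variance."""
    if len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * per_group:(g + 1) * per_group]
        # Statistics are shared by every value in the group.
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for ch in group:
            out.append([(v - mean) * inv_std for v in ch])
    return out


def gelu(x):
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    feature_map = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
    normalized = group_norm(feature_map, num_groups=2)
    activated = [[gelu(v) for v in ch] for ch in normalized]
    print(activated)
```

In this reference version the mean, variance, and normalization are separate passes over the data; the fused kernel described above performs the same arithmetic without writing intermediate tensors back to memory.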
Improving the efficiency of the attention module
The text-to-image transformer in Stable Diffusion helps model the conditional distribution, which is crucial for text-to-image generation. However, self-attention and cross-attention struggle with long sequences because of their memory and time complexity. The study therefore proposes two optimizations to relieve this computational bottleneck.
On the one hand, to avoid running the entire softmax computation on a large matrix, the study implements the reduction operations in a GPU shader, which greatly reduces the memory footprint of the intermediate tensors and the overall latency. The specific method is shown in Figure 2 below.
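One way to read the first optimization is as a two-pass scheme: a reduction pass that stores only a per-row maximum and a per-row sum of exponentials, followed by a pass that applies the softmax element by element while multiplying with the value matrix, so the full softmax matrix is never stored. The Python sketch below illustrates that reading on the CPU. It is an assumption about the structure, not the paper's shader code, and the function names are hypothetical.

```python
import math


def row_reductions(scores):
    # Pass 1: per-row max and sum of exponentials (an N x 2 result).
    stats = []
    for row in scores:
        row_max = max(row)
        row_sum = sum(math.exp(s - row_max) for s in row)
        stats.append((row_max, row_sum))
    return stats


def fused_softmax_matmul(scores, values, stats):
    # Pass 2: compute softmax weights on the fly and accumulate
    # softmax(scores) @ values without materializing the softmax matrix.
    n_cols = len(values[0])
    out = []
    for row, (row_max, row_sum) in zip(scores, stats):
        acc = [0.0] * n_cols
        for s, v_row in zip(row, values):
            w = math.exp(s - row_max) / row_sum
            for j in range(n_cols):
                acc[j] += w * v_row[j]
        out.append(acc)
    return out


def reference_softmax_matmul(scores, values):
    # Baseline for comparison: build the full softmax matrix, then multiply.
    probs = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    n_cols = len(values[0])
    return [[sum(p * v[j] for p, v in zip(p_row, values)) for j in range(n_cols)]
            for p_row in probs]


if __name__ == "__main__":
    scores = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.5]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    stats = row_reductions(scores)
    print(fused_softmax_matmul(scores, values, stats))
```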
On the other hand, the study uses FlashAttention [7], an IO-aware exact attention algorithm that requires fewer high-bandwidth memory (HBM) accesses than standard attention, improving overall efficiency.
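The core idea that lets FlashAttention stay exact while touching memory less often is an online softmax: keys and values are processed block by block, and a running maximum, running sum, and running output are rescaled whenever a new block raises the maximum. The sketch below shows this for a single query vector in plain Python. It is a simplified CPU illustration under that assumption, not the FlashAttention GPU kernel and not the paper's implementation; `block_size` and the function names are hypothetical.

```python
import math


def attention_row_blockwise(q, keys, values, block_size=2):
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


def attention_row_reference(q, keys, values):
    # Standard attention for one query: all scores are held at once.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
    values = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, 1.0, 0.0],
              [0.5, 0.5, 0.5], [2.0, 2.0, 1.0]]
    print(attention_row_blockwise(q, keys, values, block_size=2))
```

Because the rescaling is exact, the blockwise result matches standard attention up to floating-point rounding, regardless of the block size chosen.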
Winograd Convolution
Winograd convolution converts the convolution operation into a series of matrix multiplications. This reduces the number of multiplications and improves computational efficiency. However, it also increases memory consumption and numerical error, especially with larger tiles.
The backbone of Stable Diffusion relies heavily on 3 × 3 convolutional layers, especially in the image decoder, where they account for 90% of the layers. The study analyzes this in depth to explore the potential benefits of applying Winograd with different tile sizes to 3 × 3 kernel convolutions. It finds that a tile size of 4 × 4 is optimal, offering the best balance between computational efficiency and memory utilization.
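To see where the savings in multiplications come from, consider the smallest one-dimensional Winograd case, usually written F(2, 3): two outputs of a 3-tap filter computed from four inputs with 4 multiplications instead of 6. The Python sketch below implements that textbook case and checks it against direct convolution. It is only an illustration of the principle, not the 2D tiled GPU implementation or the tile configuration evaluated in the study.

```python
def winograd_f2_3(d, g):
    """Two outputs of a 3-tap filter g over four inputs d, using 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2.0
    u2 = (g0 - g1 + g2) / 2.0
    u3 = g2
    # The four element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f2_3(d, g):
    # Direct sliding-window computation: 6 multiplications.
    return [d[i] * g[0] + d[i + 1] * g[1] + d[i + 2] * g[2] for i in range(2)]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [1.0, 0.0, -1.0]
    print(winograd_f2_3(d, g), direct_f2_3(d, g))
```

The extra additions and the transformed filter values are the source of the additional memory use and rounding error mentioned above, and both grow with the tile size.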
The study was benchmarked on two devices: Samsung S23 Ultra (Adreno 740) and iPhone 14 Pro Max (A16). The benchmark results are shown in Table 1 below:
As each optimization is enabled, latency decreases step by step (that is, images take less time to generate). Compared with the baseline, latency drops by 52.2% on the Samsung S23 Ultra and by 32.9% on the iPhone 14 Pro Max. The study also evaluates end-to-end latency on the Samsung S23 Ultra: generating a 512 × 512 image with 20 denoising iterations takes less than 12 seconds, a SOTA result.
Small devices can run their own generative artificial intelligence models. What does this mean for the future? We can expect a wave.