Putting the Stable Diffusion Model on the iPhone: an App That Generates Images in One Minute

Is it difficult to run Stable Diffusion on an iPhone? In the article we introduce today, the author gives the answer: it is not, and the iPhone still has about 50% of its performance in reserve.

As we all know, every year Apple launches a new iPhone that claims to be faster and better in every way, thanks largely to the rapid development of new vision models and image sensors. Take photography as an example: could you take high-quality pictures with an iPhone 10 years ago? No, because technology develops gradually, and 10 years was enough time to transform mobile phone photography.

Because of this gradual pattern of development, there comes a time when some programs are barely usable even on the best computing hardware. But these new programs, enabling brand-new scenarios, attract the attention of some users, and people are willing to study them.

The author of this article is one of them. Over the past 3 weeks, he developed an application that can generate (summon) images with Stable Diffusion, which you can then edit however you like. On the latest iPhone 14 Pro, the app takes just a minute to generate an image, uses about 2 GiB of app memory, and requires an initial download of roughly 2 GiB of data to get started.

App store link: https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820

This result sparked plenty of discussion among netizens. Some began to worry about phone power consumption, joking: "This is cool, but it seems like a great way to drain a phone battery."


"I have never been so happy to feel the heat of my iPhone."

"In this cold winter, you can use your phone as a hand warmer."

But while everyone was making fun of the phones heating up, they also rated the work very highly.

"This is incredible. It takes about 45 seconds to generate a complete image on my iPhone SE 3, which is almost as fast as the original version on my M1 Pro MacBook!"


Optimize memory and hardware at the same time

How is this done? Next, let’s take a look at the author’s implementation process:

If you want to run Stable Diffusion on the iPhone while keeping 50% of performance in reserve, a major challenge is that the program nominally needs 6 GiB of RAM. 6 GiB sounds like a lot, but iOS will kill your app if it uses more than 2.8 GiB on a 6 GiB device, or 2 GiB on a 4 GiB device.

So how much memory does the Stable Diffusion model require for inference?

That starts with the structure of the model. A Stable Diffusion model usually contains 4 parts: 1. a text encoder, which generates text feature vectors to guide image generation; 2. an optional image encoder, which encodes images into latent space (for image-to-image generation); 3. a denoiser model, which gradually removes noise from the latent representation of the image; 4. an image decoder, which decodes the image from that latent representation.
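The four stages can be wired together as a sketch. Every function here is a toy stub standing in for a real network; none of the names come from the app or any real API.

```python
import random

# Stub pipeline mirroring the four-part structure described above.
# Every "network" is a toy placeholder, not a real model.

def text_encoder(prompt):                  # 1. prompt -> guidance features
    return [float(ord(c)) for c in prompt]

def image_encoder(image):                  # 2. optional: image -> latent
    return [p * 0.18 for p in image]

def denoiser(latent, step, cond):          # 3. one denoising step
    return [x * 0.9 for x in latent]       #    (pretend noise removal)

def image_decoder(latent):                 # 4. latent -> pixels
    return [x / 0.18 for x in latent]

def generate(prompt, steps=3, init_image=None):
    cond = text_encoder(prompt)
    latent = (image_encoder(init_image) if init_image is not None
              else [random.gauss(0.0, 1.0) for _ in range(4)])
    for step in range(steps, 0, -1):            # the denoiser runs many times;
        latent = denoiser(latent, step, cond)   # the other parts run once
    return image_decoder(latent)
```

Even at this scale the key property is visible: the encoders and decoder each run once per image, while the denoiser sits inside the loop, which is why it is the part worth keeping resident in RAM.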

Modules 1, 2 and 4 run only once during inference and need at most about 1 GiB. The denoiser model takes about 3.2 GiB (in full floating point) and is executed many times, so the author wants to keep it in RAM as long as possible.
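The 3.2 GiB figure is consistent with the denoiser's parameter count. Assuming the commonly cited ~860 million parameters for the Stable Diffusion v1 UNet (a public figure, not stated in this article), the arithmetic checks out:

```python
GIB = 1024 ** 3

def weights_gib(n_params: int, bytes_per_element: int) -> float:
    """RAM needed just to hold the weights, in GiB."""
    return n_params * bytes_per_element / GIB

unet_params = 860_000_000  # assumed SD v1 UNet parameter count

print(f"FP32: {weights_gib(unet_params, 4):.1f} GiB")  # ~3.2 GiB
print(f"FP16: {weights_gib(unet_params, 2):.1f} GiB")  # ~1.6 GiB
```

Halving the element width to FP16 is what makes the 1.6 GiB variant discussed below possible.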

The original Stable Diffusion model required close to 10 GiB to perform single-image inference. Between the single input (2x4x64x64) and output (2x4x64x64) lie many intermediate layer outputs. Not all of their memory can be reused immediately: some outputs must be kept around for later use (residual connections).

For some time, researchers have been optimizing PyTorch Stable Diffusion, reserving scratch space for the NVIDIA cuDNN and cuBLAS libraries that PyTorch uses; all of these optimizations aim to reduce memory usage. As a result, the Stable Diffusion model can run on cards with as little as 4 GiB.

But that was still above the author's budget, so he began to focus on Apple hardware and Apple-specific optimization.

At first, the author considered the weights: 3.2 GiB, or 1.6 GiB in half precision. To avoid triggering Apple's OOM kill (Out of Memory: when an app's memory usage reaches the iOS per-app upper limit, the system forcibly kills it), he had only about 500 MiB of additional space to work with.

The first question is, what is the size of each intermediate output?

It turns out that most of them are relatively small, under 6 MiB each (2x320x64x64). The framework the author uses (s4nnc) can reasonably pack them into less than 50 MiB for reuse.
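The "under 6 MiB" claim is easy to verify for the quoted shape, assuming FP16 elements:

```python
import math

shape = (2, 320, 64, 64)  # a typical denoiser intermediate output
bytes_per_fp16 = 2

size_mib = math.prod(shape) * bytes_per_fp16 / (1024 ** 2)
print(f"{size_mib} MiB")  # 5.0 MiB -- under the ~6 MiB quoted
```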

It is worth mentioning that the denoiser has a self-attention mechanism that takes its own latent image representation as input. During the self-attention computation there is a batch matrix of size 16x4096x4096, which after applying softmax is about 500 MiB in FP16. The softmax can be done "in place", meaning its output can safely overwrite its input without corrupting it. Fortunately, both Apple's and NVIDIA's low-level libraries provide in-place softmax implementations, whereas higher-level libraries such as PyTorch do not.
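The size of that attention matrix, and the property an in-place softmax provides, can both be sketched in NumPy (an illustration of the idea, not the Metal kernel):

```python
import numpy as np

# Raw size of the (16, 4096, 4096) FP16 attention score tensor:
score_bytes = 16 * 4096 * 4096 * 2
print(score_bytes / 2 ** 20, "MiB")  # 512.0 MiB -- the ~500 MiB in question

def softmax_inplace(x: np.ndarray) -> np.ndarray:
    """Softmax over the last axis that overwrites its input, so no
    second ~500 MiB output buffer is ever allocated."""
    x -= x.max(axis=-1, keepdims=True)  # stabilize, in place
    np.exp(x, out=x)                    # exponentiate, in place
    x /= x.sum(axis=-1, keepdims=True)  # normalize, in place
    return x

scores = np.array([[0.0, 1.0, 2.0]], dtype=np.float32)
result = softmax_inplace(scores)
print(result is scores)  # True: the input's memory was reused
```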

So can it really be done with about 550 MiB of working memory on top of the 1.6 GiB of weights?

On Apple hardware, a common choice for implementing a neural network backend is the MPSGraph framework, so the author first tried implementing all neural network operations with MPSGraph. Peak memory usage at FP16 precision was about 6 GiB, obviously far more than expected. What was going on?

The author analyzed the reasons in detail. First, he was not using MPSGraph in the common TensorFlow-style way: MPSGraph expects you to encode the entire computation graph up front, then feed input/output tensors, let it handle internal allocation, and submit the whole graph for execution.

Instead, the author uses MPSGraph much like PyTorch: as an operation execution engine. To perform an inference task, many compiled MPSGraphExecutables are executed on the Metal command queue, and each of them may hold intermediate allocated memory. If they are submitted all at once, every one of these commands holds its allocations until it finishes executing.

A simple way to solve this is to pace the submissions; there is no need to submit all commands at once. In fact, Metal limits each queue to 64 concurrent submissions. The author tried submitting 8 operations at a time, and peak memory dropped to 4 GiB.
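The pacing idea can be sketched independently of Metal. The `submit`/`wait` callbacks below are hypothetical stand-ins for enqueueing a command buffer and waiting for a batch to complete; nothing here is a real Metal API.

```python
def run_throttled(commands, submit, wait, batch_size=8):
    """Submit commands at most `batch_size` at a time, waiting for each
    batch to finish so its scratch allocations can be released."""
    in_flight = []
    for cmd in commands:
        in_flight.append(submit(cmd))
        if len(in_flight) == batch_size:
            wait(in_flight)          # scratch memory freed here
            in_flight.clear()
    if in_flight:                    # drain the final partial batch
        wait(in_flight)

# Tiny demo: count how many commands (and their allocations) are live.
live, peak = 0, 0

def submit(cmd):
    global live, peak
    live += 1
    peak = max(peak, live)
    return cmd

def wait(batch):
    global live
    live -= len(batch)

run_throttled(range(100), submit, wait, batch_size=8)
print(peak)  # 8, rather than 100 with a single bulk submission
```

Capping the in-flight batch trades a little pipelining for a hard bound on how much intermediate memory can be alive at once.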

However, that's still 2 GiB more than the iPhone can handle.

To compute self-attention with CUDA, the original Stable Diffusion implementation uses a common trick: permutation instead of transposition. This works because cuBLAS can consume permuted strided tensors directly, avoiding the dedicated memory needed to transpose the tensor.

But MPSGraph has no strided tensor support: a permuted tensor is transposed internally anyway, which requires an intermediate allocation. By transposing explicitly, the allocation is handled by the higher-level layer instead, avoiding MPSGraph's internal inefficiency. With this trick, memory usage dropped to close to 3 GiB.
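The difference between a permutation and a materialized transpose can be seen in NumPy terms (an analogy for strided-tensor support, not the Metal behavior itself):

```python
import numpy as np

x = np.ones((2, 8, 4096, 40), dtype=np.float16)  # ~5 MiB, attention-shaped

# Permutation: a strided view over the same storage -- what cuBLAS can
# consume directly, costing no extra memory.
view = x.transpose(0, 2, 1, 3)
print(view.base is x)   # True: same buffer, only the strides changed

# Materialized transpose: a fresh contiguous buffer -- what a backend
# without strided-tensor support must allocate internally.
copy = np.ascontiguousarray(view)
print(copy.base is x)   # False: a second allocation was made
```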

It also turns out that, starting with iOS 16.0, MPSGraph can no longer make optimal allocation decisions for softmax. Even when the input and output tensors point to the same data, MPSGraph allocates an extra output tensor and then copies the result over.

The author found that switching to the Metal Performance Shaders alternative met the requirement perfectly, cutting memory usage to 2.5 GiB without any performance degradation.

MPSGraph's GEMM kernel, on the other hand, requires internal transposition. Explicit transposes did not help here either, since they are not "in-place" operations at the higher level, and for one particular 500 MiB tensor this extra allocation was unavoidable. By switching to Metal Performance Shaders again, the author reclaimed another 500 MiB at a performance cost of about 1%, finally bringing memory usage down to the ideal 2 GiB.


Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.