Can Stable Diffusion surpass algorithms such as JPEG and improve image compression while maintaining clarity?
Text-based image generation models are hugely popular right now, and not just diffusion models in general: the open-source Stable Diffusion model in particular has taken off.
Recently, Swiss software engineer Matthias Bühlmann accidentally discovered that Stable Diffusion can be used not only to generate images, but also to compress bitmap images, with an even higher compression ratio than JPEG and WebP.
Take a photo of a llama as an example: the original image is 768 KB, JPEG compresses it to 5.66 KB, and Stable Diffusion compresses it further to 4.98 KB while retaining more high-resolution detail and introducing fewer compression artifacts. To the naked eye, the result is visibly better than that of the other compression algorithms.
However, this compression method also has flaws: it is not well suited to images of faces or text, and in some cases the reconstruction even contains content that was never in the original image.
Although retraining an autoencoder could achieve compression comparable to Stable Diffusion's, a major advantage of using Stable Diffusion is that someone has already invested millions of dollars in training one for you. Why spend money training another compression model?
Diffusion models are challenging the dominance of other generative models, and the open-source Stable Diffusion model is setting off an artistic revolution in the machine learning community.
Stable Diffusion is built by chaining three trained neural networks: a variational autoencoder (VAE), a U-Net, and a text encoder.
The variational autoencoder encodes and decodes images between image space and a latent space. The latent representation describes the source image (512x512 at 3x8 or 4x8 bit) at a lower resolution (64x64) but higher precision (4x32 bit).
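As a concrete illustration, here is a minimal sketch of this VAE round trip using the Hugging Face diffusers library. The model ID and the file name `llama.png` are assumptions for illustration; the author's own code may differ.

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

# Load only the VAE component of the Stable Diffusion v1.4 checkpoint.
vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae")

image = load_image("llama.png").resize((512, 512))      # hypothetical source image
x = transforms.ToTensor()(image).unsqueeze(0) * 2 - 1   # (1, 3, 512, 512), in [-1, 1]

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()  # (1, 4, 64, 64) float32 latent
    recon = vae.decode(latents).sample            # (1, 3, 512, 512), lossy round trip
```

Note how the latent tensor matches the numbers above: 64x64 spatial resolution with 4 channels of 32-bit floats.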
The VAE learns to encode images into latent space largely through self-supervised training, where input and output are both the source image. As a result, as the model is trained further, the latent space representations of different model versions may look different.
Remapping and interpreting the latent representation from Stable Diffusion v1.4 as a 4-channel color image produces something like the middle image in the figure below, in which the key features of the source image are still visible.
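One plausible way to produce such a preview is sketched below. The exact remapping the author used is not specified; min-max normalization and the RGBA interpretation here are assumptions.

```python
import numpy as np
from PIL import Image

def latent_preview(latents: np.ndarray) -> Image.Image:
    """Remap a (4, 64, 64) float latent to an 8-bit, 4-channel image."""
    lo, hi = latents.min(), latents.max()
    pixels = ((latents - lo) / (hi - lo) * 255).astype(np.uint8)
    # Interpret the 4 latent channels as RGBA purely for visualization.
    return Image.fromarray(pixels.transpose(1, 2, 0), mode="RGBA")
```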
Note that a single VAE round trip (encoding followed by decoding) is not lossless.
For example, after decoding, the name "Anna" on the blue tape is not as clear as in the source image, and its readability is significantly reduced.
The variational autoencoder in Stable Diffusion v1.4 is not very good at representing small text or faces; whether this improves in v1.5 remains to be seen.

The heart of Stable Diffusion is generating new images from short text descriptions using this latent representation. It starts from random noise in latent space and uses the fully trained U-Net to iteratively remove noise from the latent image, outputting its prediction of what it "sees" in that noise, a bit like how we look at clouds and recover shapes or faces from irregular patterns in our minds.

When Stable Diffusion generates images, this iterative denoising is guided by the third component, the text encoder, which feeds the U-Net information about what it should try to see in the noise. The compression task, however, does not need a text encoder, so the experiment only created an encoding of the empty string, which tells the U-Net to perform unguided denoising during image reconstruction.

To use Stable Diffusion as an image compression codec, the algorithm must compress the latent representation produced by the VAE efficiently. Experiments show that downsampling the latent representation, or running it through an existing lossy image compression method, greatly reduces the quality of the reconstructed image. The author found, however, that VAE decoding appears very robust to quantization of the latent representation: scaling, clamping, and remapping the latents from floating point to 8-bit unsigned integers produces only small visible reconstruction errors.

With the latent representation quantized to 8 bits, the image is now represented by 64*64*4*8 bit = 16 kB of data, far smaller than the 512*512*3*8 bit = 768 kB uncompressed source image. Quantizing the latents below 8 bits did not produce better results.

Palettizing and dithering the representation improves compression further. The palettized representation uses a palette of 256 vectors of 4*8 bits together with a Floyd-Steinberg-dithered index map, shrinking the data to 64*64*8 bit + 256*4*8 bit = 5 kB.

Dithering the palettized latents introduces noise, which distorts the decoded result. But since Stable Diffusion is built around removing latent noise, the U-Net can be used to remove the noise introduced by dithering. After 4 iterations, the reconstruction is visually very close to the unquantized version.

Although the amount of data is drastically reduced (the source image is about 155 times larger than the compressed representation), the result is very good, but it also introduces some artifacts (such as a heart-shaped pattern that does not exist in the original image). Interestingly, the artifacts this compression scheme introduces affect image content more than image quality, something to keep in mind for any image compressed this way.

The author also used zlib to losslessly compress the palette and indices; in the test samples, most compression results came in under 5 kB, though the scheme still leaves room for optimization. Sketches of the quantization/palettization step and of the unguided de-dithering step follow below.
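First, the quantize-palettize-dither pipeline. This is a sketch under stated assumptions: the clamp range of [-5, 5] is illustrative, and k-means is one plausible way to build the 256-entry palette (the author's palette construction is not specified).

```python
import zlib
import numpy as np
from sklearn.cluster import KMeans

def quantize(latents: np.ndarray, lo: float = -5.0, hi: float = 5.0) -> np.ndarray:
    """Clamp float latents to [lo, hi] and remap to 8-bit unsigned integers."""
    clamped = np.clip(latents, lo, hi)
    return np.round((clamped - lo) / (hi - lo) * 255).astype(np.uint8)

def build_palette(q: np.ndarray) -> np.ndarray:
    """Cluster the 4-channel latent pixels into a 256-entry palette (256*4*8 bit = 1 kB)."""
    pixels = q.reshape(-1, 4).astype(np.float32)
    return KMeans(n_clusters=256, n_init=4).fit(pixels).cluster_centers_

def dither_indices(q: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Floyd-Steinberg dithering: map each latent pixel to its nearest palette
    entry and diffuse the quantization error to unvisited neighbours."""
    h, w, _ = q.shape
    buf = q.astype(np.float32)
    indices = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            idx = int(np.argmin(((palette - buf[y, x]) ** 2).sum(axis=1)))
            indices[y, x] = idx
            err = buf[y, x] - palette[idx]
            if x + 1 < w:
                buf[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    buf[y + 1, x - 1] += err * 3 / 16
                buf[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    buf[y + 1, x + 1] += err * 1 / 16
    return indices

# Stand-in latent; in practice this comes from the VAE encoder.
latents = np.random.randn(64, 64, 4).astype(np.float32)
q = quantize(latents)                       # 64*64*4*8 bit = 16 kB
palette = build_palette(q)
indices = dither_indices(q, palette)        # 64*64*8 bit + 256*4*8 bit = 5 kB
payload = zlib.compress(palette.astype(np.uint8).tobytes() + indices.tobytes())
print(f"{len(payload)} bytes after zlib")   # most test samples end up under ~5 kB
```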
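Second, the unguided de-dithering step. The sketch below shows the general shape of a few empty-prompt U-Net denoising iterations with diffusers; which timesteps to run and how to scale the noise are assumptions, since the author's exact schedule is not given.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
# A stateless scheduler makes it easier to run only the last few steps.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

@torch.no_grad()
def unguided_denoise(pipe, latents: torch.Tensor, num_iters: int = 4) -> torch.Tensor:
    # An empty-string prompt embedding tells the U-Net to denoise without guidance.
    tokens = pipe.tokenizer(
        "", padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt")
    empty_emb = pipe.text_encoder(tokens.input_ids)[0]

    pipe.scheduler.set_timesteps(50)
    # Dithering noise is mild, so only the last (low-noise) timesteps are run.
    for t in pipe.scheduler.timesteps[-num_iters:]:
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=empty_emb).sample
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```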
To evaluate this compression codec, the author deliberately avoided the standard test images found on the Internet, because those images have likely appeared in Stable Diffusion's training set, and compressing them could give the codec an unfair advantage. To keep the comparison as fair as possible, he used the highest-quality encoder settings in the Python image library and additionally applied lossless compression to the compressed JPG data with the mozjpeg library.

It is worth noting that while Stable Diffusion's results subjectively look much better than JPG- and WebP-compressed images, they are not significantly better on standard metrics such as PSNR or SSIM, just no worse (a sketch of how such a comparison can be run appears at the end of this article). The types of artifacts it introduces are simply less obvious, because they affect image content more than image quality.

That is also what makes this compression method a little dangerous: even though the reconstructed features are high quality, the content may be altered by compression artifacts while still looking very sharp. In one test image, for example, Stable Diffusion as a codec does a much better job of maintaining image quality, even preserving camera grain (something most traditional compression algorithms struggle with), yet the content is still affected by compression artifacts, and fine features such as the shapes of buildings may change.

While one certainly cannot recover more of the ground truth from a JPG-compressed image than from a Stable Diffusion-compressed one, the high visual quality of Stable Diffusion's results can be deceptive, because compression artifacts in JPG and WebP are much easier to spot.

For anyone who wants to reproduce the experiment, the author has open-sourced the code. Finally, the author notes that the experiment described in the article is still quite simple, yet the results are surprisingly good.
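For reference, the PSNR/SSIM comparison mentioned above could be run with scikit-image as follows; the file names are placeholders.

```python
from skimage.io import imread
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

original = imread("original.png")          # hypothetical source image
reconstructed = imread("sd_decoded.png")   # hypothetical decompressed image

psnr = peak_signal_noise_ratio(original, reconstructed)
ssim = structural_similarity(original, reconstructed, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```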