Apple's Large Text-to-Image Model Unveiled: Diffusing Like a Matryoshka Doll, Supporting 1024×1024 Resolution

We are used to Stable Diffusion; now we finally have a Matryoshka-style diffusion model, courtesy of Apple.

In the era of generative AI, diffusion models have become a popular tool for applications such as image, video, 3D, audio, and text generation. However, extending diffusion models to the high-resolution domain still faces significant challenges, because the model must re-encode the entire high-resolution input at every step. Solving these challenges requires deep architectures with attention blocks, which makes optimization harder and consumes more compute and memory.

What can be done? Some recent work has investigated efficient network architectures for high-resolution images, but none of the existing methods demonstrate results beyond 512×512 resolution, and their generation quality lags behind mainstream cascaded or latent methods.

Take OpenAI's DALL-E 2, Google's Imagen, and NVIDIA's eDiff-I as examples: they save computation by learning a low-resolution model plus multiple super-resolution diffusion models, with each component trained separately. Latent diffusion models (LDMs), on the other hand, learn only a low-resolution diffusion model and rely on a separately trained high-resolution autoencoder. In both solutions, the multi-stage pipeline complicates training and inference, often requiring careful tuning of hyperparameters.

In a new paper, Apple researchers propose Matryoshka Diffusion Models (MDM), a new diffusion model for end-to-end high-resolution image generation. The code will be released soon.


Paper address: https://arxiv.org/pdf/2310.15111.pdf

The main idea of the research is to make the low-resolution diffusion process part of high-resolution generation, by performing a joint diffusion process over multiple resolutions with a nested UNet architecture.

The study finds that MDM, together with the nested UNet architecture, enables: 1) a multi-resolution loss, which greatly improves the convergence speed of high-resolution input denoising; and 2) an efficient progressive training schedule, which starts by training a low-resolution diffusion model and gradually adds higher-resolution inputs and outputs on schedule. Experimental results show that combining the multi-resolution loss with progressive training achieves a better trade-off between training cost and model quality.

The study evaluates MDM on class-conditional image generation as well as text-conditional image and video generation. MDM makes it possible to train high-resolution models without cascades or latent diffusion. Ablation studies show that both the multi-resolution loss and progressive training greatly improve training efficiency and quality.

Let’s enjoy the following pictures and videos generated by MDM.
[Sample images and videos generated by MDM]

Method Overview

The researchers introduce MDM, a diffusion model trained end-to-end at high resolution while exploiting the hierarchical structure of the data. MDM first generalizes the standard diffusion model to an extended space, then proposes a dedicated nested architecture and training procedure.

First, let's look at how to generalize the standard diffusion model to the extended space.

Unlike cascaded or latent methods, MDM learns a single diffusion process with a hierarchical structure by introducing a multi-resolution diffusion process in an extended space. The details are shown in Figure 2 below.

[Figure 2: the multi-resolution diffusion process in MDM]

Specifically, given a data point x ∈ R^N, the researchers define time-dependent latent variables z_t = {z_t^1, . . . , z_t^R}, where each z_t^r ∈ R^{N_r} is a latent at resolution r.


The researchers note two advantages of performing diffusion modeling in the extended space. First, since we typically only care about the full-resolution output z_t^R at inference time, all other intermediate resolutions are treated as additional latent variables z_t^r, enriching the complexity of the modeled distribution. Second, the multi-resolution dependencies open opportunities to share weights and computation across the z_t^r, redistributing computation in a more efficient way and enabling efficient training and inference.
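To make the extended space concrete, the multi-resolution latents can be thought of as progressively coarser views of the same image. A minimal NumPy sketch, assuming average pooling as the downsampling operator (the function names and the choice of pooling are illustrative, not the paper's exact implementation):

```python
import numpy as np

def downsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool an (H, W, C) image by an integer factor."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def multi_resolution_latents(x: np.ndarray, num_resolutions: int = 3):
    """Build the tuple z = (z^1, ..., z^R): progressively coarser views of x,
    ordered from lowest to highest resolution (z^R is the full-resolution input)."""
    views = [x]
    for _ in range(num_resolutions - 1):
        views.append(downsample(views[-1], 2))
    return list(reversed(views))

# Example: a 64x64 RGB image yields latents at 16x16, 32x32, and 64x64.
z = multi_resolution_latents(np.zeros((64, 64, 3)), num_resolutions=3)
```

During the forward diffusion process, noise would then be added to every resolution jointly rather than to the full-resolution image alone.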

Next, let's see how the nested architecture (NestedUNet) works.

As in typical diffusion models, the researchers implement MDM with a UNet structure, where residual connections are used in parallel with computational blocks to preserve fine-grained input information. Each computational block contains multiple convolution and self-attention layers. Pseudocode for NestedUNet and a standard UNet is shown below.

[Figure: pseudocode comparing NestedUNet with a standard UNet]
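The nesting idea can be sketched at the shape level: the UNet for resolution r wraps the entire UNet for resolution r−1 as its bottleneck. The sketch below uses identity blocks in place of real convolution/attention layers so it stays runnable; all names are illustrative, not the paper's code.

```python
import numpy as np

def conv_block(x):
    # Placeholder for the conv + self-attention layers; identity keeps the sketch runnable.
    return x

def down(x):
    """Halve spatial resolution via 2x2 average pooling."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def up(x):
    """Double spatial resolution via nearest-neighbour upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet(z, inner=conv_block):
    """A standard UNet: encode, run an inner block at the bottleneck, decode,
    with a residual (skip) connection preserving fine-grained input detail."""
    skip = conv_block(z)
    h = inner(down(skip))
    return up(h) + skip

def nested_unet(zs):
    """NestedUNet: process [z^1 (lowest res), ..., z^R (highest res)].
    The UNet at resolution r uses the whole NestedUNet for resolutions < r
    as its middle block, so most computation can live at low resolutions."""
    if len(zs) == 1:
        return [conv_block(zs[0])]
    *lower, z_top = zs
    out_lower = []

    def inner(h):
        nonlocal out_lower
        out_lower = nested_unet(lower)
        # Fuse the low-resolution output into the bottleneck (addition here;
        # the real model would concatenate/project features).
        return h + out_lower[-1]

    top = unet(z_top, inner=inner)
    return out_lower + [top]
```

Running `nested_unet` on latents at 16×16, 32×32, and 64×64 returns denoised outputs at all three resolutions in a single forward pass, which is the structural property the multi-resolution loss relies on.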

Besides being simpler than other hierarchical methods, NestedUNet also allows computation to be allocated in the most efficient way. As shown in Figure 3 below, the researchers' early exploration found that MDM scales significantly better when most parameters and computation are allocated to the lowest resolution.

[Figure 3: scalability when most parameters and computation are allocated to the lowest resolution]

Finally, let's look at how MDM is trained.

The researchers train MDM at multiple resolutions using a conventional denoising objective, as shown in equation (3) below.

[Equation (3): the multi-resolution denoising objective]
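In spirit, a multi-resolution denoising objective is a (possibly weighted) sum of per-resolution reconstruction errors. A minimal sketch, assuming mean-squared error per resolution and uniform weights by default (the exact weighting in equation (3) may differ):

```python
import numpy as np

def multi_resolution_loss(preds, targets, weights=None):
    """Sum of weighted per-resolution MSEs between the model's denoised
    outputs and the clean targets, one term per resolution r = 1..R."""
    if weights is None:
        weights = [1.0] * len(preds)
    return sum(w * np.mean((p - t) ** 2)
               for w, p, t in zip(weights, preds, targets))
```

Because every resolution contributes a loss term, gradients from the easier low-resolution terms help the high-resolution denoiser converge faster, which matches the convergence benefit the paper reports.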

Progressive training is used here. The researchers can train MDM end-to-end directly with equation (3) above, already demonstrating better convergence than the baseline methods. However, they found that a simple progressive training technique, similar to one proposed in the GAN literature, greatly accelerates the training of high-resolution models.

This training method avoids costly high-resolution training from the start and accelerates overall convergence. In addition, they incorporate mixed-resolution training, which trains samples with different final resolutions simultaneously within a single batch.

Experiments and results

MDM is a general technique that can be applied to any problem where the input dimensions can be progressively compressed. A comparison of MDM with baseline approaches is shown in Figure 4 below.

[Figure 4: comparison of MDM with baseline approaches]

Table 1 gives the comparison results on ImageNet (FID-50K) and COCO (FID-30K).

[Table 1: results on ImageNet (FID-50K) and COCO (FID-30K)]

Figures 5, 6, and 7 below show MDM's results on image generation (Figure 5), text-to-image (Figure 6), and text-to-video (Figure 7). Despite being trained on relatively small datasets, MDM shows strong zero-shot ability to generate high-resolution images and videos.

[Figures 5–7: image generation, text-to-image, and text-to-video results]

Interested readers can read the original text of the paper to learn more about the research content.


Statement
This article is reproduced from 机器之心 (Machine Heart). If there is any infringement, please contact admin@php.cn for deletion.