Small scale, high efficiency: DeepMind launches multi-modal solution Mirasol 3B
One of the main challenges in multi-modal learning is the need to fuse heterogeneous modalities such as text, audio, and video. A multi-modal model must combine signals from different sources, but these modalities have different characteristics and are difficult to combine in a single model. For example, video and text are sampled at very different rates.
Recently, a research team from Google DeepMind decoupled the multi-modal model into several independent, specialized autoregressive models, each processing its input according to the characteristics of the corresponding modality.
Specifically, the study proposes a multi-modal model called Mirasol3B. Mirasol3B consists of time-synchronized autoregressive components for audio and video, plus an autoregressive component for contextual modalities. These contextual modalities are not necessarily aligned in time, but they do arrive in sequential order.
Paper address: https://arxiv.org/abs/2311.05698
Mirasol3B reaches SOTA level in multi-modal benchmarks, outperforming larger models. By learning more compact representations, controlling the sequence length of audio-video feature representations, and modeling based on temporal correspondences, Mirasol3B is able to effectively meet the high computational requirements of multi-modal inputs.
Mirasol3B is an audio-video-text multi-modal model in which autoregressive modeling is decoupled into autoregressive components for temporally aligned modalities (e.g., audio and video) and an autoregressive component for non-temporally aligned contextual modalities (e.g., text). Mirasol3B uses cross-attention weights to coordinate the learning of these components. This decoupling makes the parameter distribution within the model more reasonable, allocates sufficient capacity to the media modalities (video and audio), and keeps the overall model lightweight.
As shown in Figure 1, Mirasol3B consists of two main learning components: the autoregressive components and the input-combining component. The autoregressive component for time-aligned inputs is designed to handle nearly simultaneous multi-modal signals, such as video and audio, combining the inputs per time interval.
The study proposes to segment the temporally aligned modalities into time intervals and learn joint audio-video representations within each interval. Specifically, it introduces a joint feature learning mechanism called the "Combiner", which fuses the modality features within the same time interval into a more compact representation.
"Combiner" extracts a primary spatiotemporal representation from the original modal input and captures the video The dynamic characteristics, combined with its synchronic audio features, the model can receive multi-modal input at different rates and perform well when processing longer videos.
"Combiner" effectively meets the need for modal representation to be both efficient and informative. It can fully cover events and activities in video and other concurrent modalities, and can be used in subsequent autoregressive models to learn long-term dependencies.
To process video and audio signals and accommodate longer video/audio inputs, they are split into small chunks that are roughly synchronized in time, and a joint audio-visual representation is then learned for each chunk through the Combiner. The second component handles the context, i.e., temporally misaligned signals such as global textual information, which are often still sequential. It is also autoregressive and uses the combined latent space as cross-attention input.
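A rough sketch of how the two autoregressive components could be wired together follows, assuming the combined per-segment latents described above and standard Transformer blocks for both components (the layer counts, shapes, and random placeholder tensors are assumptions for illustration only, not the paper's architecture details):

```python
import torch
import torch.nn as nn

dim, num_segments, latents_per_segment = 512, 8, 32

def causal_mask(n: int) -> torch.Tensor:
    # -inf above the diagonal: each position attends only to earlier positions.
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

# Combined audio/video latents: num_segments chunks, each already compressed by
# the Combiner into latents_per_segment tokens (faked here with random tensors).
combined = torch.randn(2, num_segments * latents_per_segment, dim)

# Time-aligned component: autoregressive model over the combined latents.
av_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
av_model = nn.TransformerEncoder(av_layer, num_layers=2)
av_states = av_model(combined, mask=causal_mask(combined.size(1)))

# Contextual component: autoregressive text model whose cross-attention reads
# the combined latent space produced by the time-aligned component.
text_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
text_model = nn.TransformerDecoder(text_layer, num_layers=2)
text_tokens = torch.randn(2, 20, dim)   # embedded text, e.g. a question prefix
out = text_model(text_tokens, memory=av_states, tgt_mask=causal_mask(20))
print(out.shape)                        # torch.Size([2, 20, 512])
```

The point of the sketch is the division of labor: the heavy audio-video modeling runs only over the compact combined latents, while the text component consumes those latents through cross-attention instead of attending to raw frames.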
The model with video and audio has 3B parameters; without audio it has 2.9B. Most of these parameters are used by the audio-video autoregressive model. Mirasol3B typically processes 128-frame videos, and it can also handle longer videos, such as 512 frames.
Thanks to the partitioning and the Combiner architecture, adding more frames, or increasing the size and number of chunks, only increases the parameter count slightly, which addresses the problem that longer videos normally require more parameters and more memory.
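To make the scaling argument concrete, a back-of-the-envelope comparison is sketched below; the per-frame, per-segment, and per-latent token counts are made-up illustrative numbers, not figures reported in the paper:

```python
# Illustrative arithmetic only: the counts below are assumptions chosen to show
# the trend, not values from the Mirasol3B paper.
tokens_per_frame = 16        # hypothetical visual tokens per frame
frames_per_segment = 16      # hypothetical segment length
latents_per_segment = 32     # hypothetical Combiner output size per segment

for num_frames in (128, 512):
    raw_tokens = num_frames * tokens_per_frame
    segments = num_frames // frames_per_segment
    combined_tokens = segments * latents_per_segment
    print(f"{num_frames:4d} frames: {raw_tokens:5d} raw tokens -> "
          f"{combined_tokens:4d} combined tokens over {segments} segments")

# The weights themselves do not depend on num_frames: adding frames only adds
# segments (more activations to process), not new parameters.
```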
The study tested and evaluated Mirasol3B on the standard VideoQA benchmark, long-video VideoQA benchmarks, and audio-video benchmarks.
The results on the VideoQA dataset MSRVTT-QA are shown in Table 1 below. Mirasol3B surpasses the current SOTA models, including much larger models such as PaLI-X and Flamingo.
For long-video question answering, the study evaluated Mirasol3B on the ActivityNet-QA and NExT-QA datasets; the results are shown in Table 2 below:
Finally, the study selected KineticsSound, VGG-Sound, and Epic-Sound for audio-video benchmarking and adopted open-ended generation evaluation. The experimental results are shown in Table 3 below:
Interested readers can read the original text of the paper to learn more about the research content.