


Small scale, high efficiency: DeepMind launches multi-modal solution Mirasol 3B
One of the main challenges in multi-modal learning is the need to fuse heterogeneous modalities such as text, audio, and video. Multi-modal models must combine signals from different sources, but these modalities have very different characteristics and are difficult to handle with a single model; for example, video and text have very different sampling rates.
Recently, a research team from Google DeepMind decoupled the multi-modal model into several independent, specialized autoregressive models, each processing its input according to the characteristics of the corresponding modality.
Specifically, the study proposes a multi-modal model called Mirasol3B. Mirasol3B consists of time-synchronized autoregressive components for audio and video, as well as an autoregressive component for contextual modalities. These contextual modalities are not necessarily aligned in time, but still arrive in sequential order.
Paper address: https://arxiv.org/abs/2311.05698
Mirasol3B reaches SOTA on multi-modal benchmarks, outperforming much larger models. By learning more compact representations, controlling the sequence length of the audio-video feature representations, and modeling temporal correspondences, Mirasol3B can effectively meet the high computational demands of multi-modal inputs.
Method Introduction
Mirasol3B is an audio-video-text multi-modal model in which autoregressive modeling is decoupled into autoregressive components for time-aligned modalities (e.g. audio and video) and an autoregressive component for non-time-aligned contextual modalities (e.g. text). Mirasol3B uses cross-attention weights to coordinate the learning of these components. This decoupling yields a more reasonable parameter distribution within the model, allocating sufficient capacity to the media modalities (video and audio) while keeping the overall model lightweight.
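The coordination between the two components can be pictured as the contextual (text-side) component cross-attending over the latents produced by the audio-video component. The following is a minimal, illustrative sketch of that cross-attention read-out: single-head, with no learned projections, so the dimensions and function names here are our own assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, latents):
    """Text-side query tokens attend over combined audio-video latents.
    Scaled dot-product attention, single head, no learned projections."""
    scores = queries @ latents.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ latents

text_tokens = np.random.randn(16, 64)   # 16 text positions, 64-dim features
av_latents = np.random.randn(32, 64)    # 32 combined audio-video latents
out = cross_attention(text_tokens, av_latents)
print(out.shape)  # (16, 64): one attended vector per text position
```

In the real model the queries and latents would pass through learned projection matrices and multiple heads; the point here is only that the text component reads the audio-video information through attention rather than by concatenating raw features.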
As shown in Figure 1, Mirasol3B consists of two main learning components: an autoregressive component and an input-combination component. The autoregressive component is designed to handle nearly simultaneous multi-modal inputs, such as video and audio, combining these inputs in a timely manner.
The study proposes to segment the time-aligned modalities into time segments and to learn joint audio-video representations within each segment. Specifically, the research introduces a joint feature-learning mechanism called the "Combiner", which fuses modality features within the same time segment to produce a more compact representation.
"Combiner" extracts a primary spatiotemporal representation from the original modal input and captures the video The dynamic characteristics, combined with its synchronic audio features, the model can receive multi-modal input at different rates and perform well when processing longer videos.
"Combiner" effectively meets the need for modal representation to be both efficient and informative. It can fully cover events and activities in video and other concurrent modalities, and can be used in subsequent autoregressive models to learn long-term dependencies.
To process video and audio signals and accommodate longer video/audio inputs, the signals are split into small chunks (roughly synchronized in time), and a joint audio-visual representation is then learned through the "Combiner". The second component handles context, i.e. temporally unaligned signals such as global textual information, which are often still sequential. It is also autoregressive and consumes the combined latent space as its cross-attention input.
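The chunk-then-combine idea can be sketched in a few lines. The real Combiner is a learned module (the paper explores transformer-style variants); the mean-pooling below is only a stand-in to show the shape of the computation: time-aligned features are split into chunks, each chunk's modalities are fused, and every chunk is compressed to a small fixed number of joint tokens. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

def chunk_and_combine(video_feats, audio_feats, num_chunks, out_tokens):
    """Split time-aligned video/audio features into chunks and fuse each
    chunk into a fixed, small number of joint tokens.
    Mean-pooling is a stand-in for the learned Combiner."""
    v_chunks = np.array_split(video_feats, num_chunks, axis=0)
    a_chunks = np.array_split(audio_feats, num_chunks, axis=0)
    combined = []
    for v, a in zip(v_chunks, a_chunks):
        joint = np.concatenate([v, a], axis=0)          # fuse modalities in the chunk
        groups = np.array_split(joint, out_tokens, axis=0)
        combined.append(np.stack([g.mean(axis=0) for g in groups]))
    return np.stack(combined)   # (num_chunks, out_tokens, dim)

video = np.random.randn(128, 64)   # 128 frames, 64-dim features per frame
audio = np.random.randn(128, 64)   # roughly time-synchronized audio features
z = chunk_and_combine(video, audio, num_chunks=4, out_tokens=8)
print(z.shape)  # (4, 8, 64)
```

Whatever the fusion operator, the key property is that the autoregressive model downstream sees `num_chunks × out_tokens` tokens regardless of how many raw frames went in, which is what keeps the sequence length under control.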
The learning component with video and audio has 3B parameters, while the variant without audio has 2.9B; most of the parameters are spent on the audio-video autoregressive model. Mirasol3B typically processes 128-frame videos, but can also handle longer ones, such as 512 frames.
Thanks to the partitioned design and the "Combiner" architecture, adding more frames, or increasing the size and number of chunks, increases the parameter count only slightly, addressing the problem that longer videos normally demand more parameters and more memory.
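Why chunking keeps cost under control can be seen with back-of-the-envelope attention arithmetic (the numbers below are our own illustration, not figures from the paper): full self-attention over T tokens costs on the order of T² pairwise interactions, while attention restricted to C chunks of T/C tokens costs C·(T/C)² = T²/C.

```python
def attention_cost(tokens):
    """Pairwise interactions in full self-attention: O(T^2)."""
    return tokens ** 2

def chunked_cost(tokens, num_chunks):
    """Attention restricted to each chunk: C * (T/C)^2 = T^2 / C."""
    per_chunk = tokens // num_chunks
    return num_chunks * attention_cost(per_chunk)

T = 512 * 8          # e.g. 512 frames x 8 tokens per frame (illustrative)
full = attention_cost(T)
chunked = chunked_cost(T, num_chunks=64)
print(full, chunked, full // chunked)  # chunking is 64x cheaper here
```

Doubling the number of frames doubles T, which quadruples the full-attention cost but only doubles the per-chunk cost if the chunk size is held fixed and the chunk count grows instead.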
Experiments and Results
The study tested and evaluated Mirasol3B on a standard VideoQA benchmark, long-video VideoQA benchmarks, and audio-video benchmarks.
The results on the VideoQA dataset MSRVTT-QA are shown in Table 1 below. Mirasol3B surpasses the current SOTA models, including much larger ones such as PaLI-X and Flamingo.
For long-video question answering, the study evaluated Mirasol3B on the ActivityNet-QA and NExT-QA datasets; the results are shown in Table 2 below:
Finally, the study selected KineticsSound, VGG-Sound, and Epic-Sound for audio-video benchmarking and adopted open-ended generation evaluation. The experimental results are shown in Table 3 below:
Interested readers can read the original text of the paper to learn more about the research content.



