Multimodal self-supervised learning: exploring objective functions, data alignment and model architecture - taking the latest Edinburgh review as an example

Multimodal learning aims to understand and analyze information from multiple modalities, and it has made substantial progress under supervised training regimes in recent years.

However, heavy reliance on data paired with expensive manual annotation hinders model scaling. At the same time, given the availability of large-scale unlabeled data in the real world, self-supervised learning has become an attractive strategy for alleviating the labeling bottleneck.

Building on these two directions, self-supervised multimodal learning (SSML) provides ways to derive supervision from raw multimodal data.


Paper address: https://arxiv.org/abs/2304.01008

Project address: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning

In this review, we provide a comprehensive survey of state-of-the-art SSML techniques, which we classify along three orthogonal axes: objective function, data alignment, and model architecture. These axes correspond to the inherent characteristics of self-supervised learning methods and of multimodal data.

Specifically, we divide the training objectives into instance discrimination, clustering and mask prediction categories. We also discuss the pairing and alignment strategies applied to multimodal input data during training. Finally, we review model architectures, including the design of encoders, fusion modules, and decoders, which are important components of SSML methods.

We review downstream multimodal application tasks, report the performance of state-of-the-art image-text models and multimodal video models, and review practical applications of SSML algorithms in fields such as healthcare, remote sensing and machine translation. Finally, we discuss challenges and future directions for SSML.

1. Introduction

Humans perceive the world through various senses, including vision, hearing, touch and smell. We gain a comprehensive understanding of our surroundings by leveraging complementary information from each modality. AI research has long focused on developing intelligent agents that mimic human behavior and understand the world in a similar way. To this end, the field of multimodal machine learning [1], [2] aims to develop models that can process and integrate data from multiple different modalities. In recent years, multimodal learning has made significant progress, leading to a series of applications in vision and language learning [3], video understanding [4], [5], biomedicine [6], autonomous driving [7] and other fields. More fundamentally, multimodal learning is advancing the long-standing grounding problem in artificial intelligence [8], bringing us closer to more general artificial intelligence.

However, multimodal algorithms often still require expensive manual annotation for effective training, which hinders their scaling. Recently, self-supervised learning (SSL) [9], [10] has begun to alleviate this problem by generating supervision from readily available unannotated data. Self-supervision in unimodal learning is fairly well defined and depends only on the training objective and on whether human annotation is used for supervision. In the context of multimodal learning, however, its definition is more nuanced, because one modality often acts as a supervisory signal for another. Given the goal of scaling up by eliminating the manual annotation bottleneck, the key question in defining the scope of self-supervision is whether cross-modal pairings are freely available.

Self-supervised multimodal learning (SSML) significantly enhances the capabilities of multimodal models by leveraging freely available multimodal data and self-supervised objectives.

In this review, we survey SSML algorithms and their applications. We decompose the various methods along three orthogonal axes: objective function, data alignment, and model architecture. These axes correspond to the characteristics of self-supervised learning algorithms and to the specific considerations required for multimodal data. Figure 1 provides an overview of the proposed taxonomy. Based on the pretext task, we divide the training objectives into instance discrimination, clustering and mask prediction categories. Hybrid approaches that combine two or more of these objectives are also discussed.

Unique to multimodal self-supervision is the problem of multimodal data pairing. Pairings, or more generally alignments, between modalities can be exploited by SSML algorithms as input (e.g., when one modality is used to provide supervision for another), but also produced as output (e.g., when a method learns from unpaired data and induces pairing as a by-product). We discuss the different roles of alignment at two levels: coarse-grained alignment, which is often assumed to be freely available in multimodal self-supervision (e.g., web-crawled images and captions [11]), and fine-grained alignment, which is sometimes induced explicitly or implicitly (e.g., correspondence between caption words and image patches [12]). Additionally, we explore the intersection of objective functions and data alignment assumptions.

We also analyze the design of contemporary SSML model architectures. Specifically, we consider the design space of encoder and fusion modules, comparing modality-specific encoders (without fusion or with late fusion) with unified encoders that use early fusion. We also examine architectures with specific decoder designs and discuss the impact of these design choices.

Finally, we discuss the applications of these algorithms in several real-world domains, including healthcare, remote sensing and machine translation, and examine the technical challenges and social impact of SSML in depth, pointing out potential future research directions. We summarize recent advances in methods, datasets, and implementations to provide a starting point for researchers and practitioners in the field.

Existing review papers focus only on supervised multimodal learning [1], [2], [13], [14], on single-modality self-supervised learning [9], [10], [15], or on a particular sub-area of SSML, such as vision-language pre-training [16]. The most relevant review is [17], but it focuses more on temporal data and overlooks alignment and architecture, which are key considerations in multimodal self-supervision. In contrast, we provide a comprehensive and up-to-date overview of SSML algorithms and a new taxonomy covering algorithms, data, and architecture.


2. Background knowledge

Self-supervision in multi-modal learning

We first delineate the scope of SSML considered in this survey, as the term has been used inconsistently in previous literature. Defining self-supervision in a unimodal context is more straightforward: it can be done by appealing to the label-free nature of the various pretext tasks, for example the well-known instance discrimination [20] or masked prediction [21] objectives. In contrast, the situation in multimodal learning is more complicated because the roles of modality and label become blurred. For example, in supervised image captioning [22], text is usually treated as a label, but in self-supervised multimodal vision and language representation learning [11], text is treated as an input modality.

In the multimodal context, the term self-supervision has been used to refer to at least four situations: (1) Label-free learning from automatically paired multimodal data, such as movies with video and audio tracks [23], or image and depth data from RGBD cameras [24]. (2) Learning from multimodal data in which one modality has been manually annotated, or two modalities have been manually paired, but the annotation was created for a different purpose and can therefore be considered free for SSML pre-training. For example, matching image-caption pairs scraped from the web, as used in the seminal CLIP [11], is actually an example of supervised metric learning [25], [26] in which the pairing is the supervision. However, since both the modalities and the pairings are freely available at scale, it is often described as self-supervised. Such uncurated, incidentally created data is often of lower quality and noisier than purpose-built datasets such as COCO [22] and Visual Genome [27]. (3) Learning from high-quality, purpose-annotated multimodal data (e.g., manually captioned images in COCO [22]), but with a self-supervised style of objective, as in Pixel-BERT [28]. (4) Finally, there are "self-supervised" methods that use a mixture of free and manually labeled multimodal data [29], [30].

For the purpose of this survey, we follow the spirit of self-supervision, which aims to scale up by breaking the manual annotation bottleneck. We therefore include the first two categories and the fourth category of methods, on the grounds that they can be trained on freely available data. We exclude methods that have been demonstrated only on manually curated datasets, even though they apply typical "self-supervised" objectives (e.g., masked prediction) to those datasets.


Figure: Learning paradigms of (a) supervised multimodal learning and (b) self-supervised multimodal learning: self-supervised pre-training without manual annotation (top); supervised fine-tuning on downstream tasks (bottom).

3. Objective function

In this section, we introduce the objective functions used to train three types of self-supervised multimodal algorithms: instance discrimination, clustering and mask prediction. Finally, we also discuss hybrid objectives.

3.1 Instance discrimination

In unimodal learning, instance discrimination (ID) treats each instance in the raw data as a separate class, and the model is trained to distinguish between different instances. In the context of multimodal learning, instance discrimination usually aims to determine whether samples from two input modalities come from the same instance, i.e., whether they are paired. By doing so, it attempts to align the representation spaces of the paired modalities while pushing the representations of different instance pairs further apart. There are two types of instance discrimination objectives, contrastive prediction and matching prediction, depending on how the input is sampled.
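The contrastive variant can be illustrated with a small sketch. The code below computes a symmetric InfoNCE-style loss, a commonly used contrastive objective, over a toy batch of paired embeddings. The function names, the temperature of 0.1, and the two-dimensional "image" and "text" embeddings are all assumptions made for illustration; they are not taken from the review.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cosine_logits(a_embs, b_embs, temperature):
    """Temperature-scaled cosine similarity between every a/b pair."""
    a_unit = [l2_normalize(v) for v in a_embs]
    b_unit = [l2_normalize(v) for v in b_embs]
    return [
        [sum(x * y for x, y in zip(u, w)) / temperature for w in b_unit]
        for u in a_unit
    ]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(a_embs, b_embs, temperature=0.1):
    """Average of the a-to-b and b-to-a contrastive losses."""
    logits = cosine_logits(a_embs, b_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy batch (hypothetical values): row i of each list is one paired instance.
image_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]

aligned = symmetric_contrastive_loss(image_embs, text_embs)
# Rotating the text list breaks the pairing, so the loss should rise.
shuffled = symmetric_contrastive_loss(image_embs, text_embs[1:] + text_embs[:1])
print(aligned < shuffled)
```

In this sketch the paired samples sit on the diagonal of the similarity matrix, and every other sample in the batch serves as a negative. A matching-prediction objective would instead classify each sampled pair as matched or unmatched.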


3.2 Clustering

Clustering methods assume that applying clustering trained end to end will group the data by semantically salient features. In practice, these methods iteratively predict cluster assignments of the encoded representations and use these predictions (also known as pseudo-labels) as supervisory signals to update the feature representations. Multimodal clustering offers the opportunity to learn multimodal representations and also to improve conventional clustering by using each modality's pseudo-labels to supervise the other modalities.
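One step of this cross-modal pseudo-labeling idea can be sketched as follows. The example clusters toy "audio" embeddings with a few k-means iterations and then scores hypothetical "video" classifier logits against the resulting pseudo-labels. The choice of k-means, the helper names, and all numeric values are assumptions for illustration; the review does not prescribe a particular clustering algorithm here.

```python
import math


def nearest_centroid(embedding, centroids):
    """Index of the closest centroid under squared Euclidean distance."""
    dists = [
        sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        for centroid in centroids
    ]
    return dists.index(min(dists))


def update_centroids(embeddings, labels, centroids):
    """Recompute each centroid as the mean of the embeddings assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [e for e, lab in zip(embeddings, labels) if lab == k]
        if not members:
            updated.append(list(old))  # leave an empty cluster where it was
            continue
        updated.append([sum(dim) / len(members) for dim in zip(*members)])
    return updated


def pseudo_labels(embeddings, centroids, iterations=5):
    """Run a few k-means steps and return the final cluster assignments."""
    for _ in range(iterations):
        labels = [nearest_centroid(e, centroids) for e in embeddings]
        centroids = update_centroids(embeddings, labels, centroids)
    return [nearest_centroid(e, centroids) for e in embeddings]


def cross_entropy(logits, targets):
    """Mean cross-entropy of per-sample logits against integer targets."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(logits)


# Hypothetical audio embeddings forming two well-separated groups.
audio_embs = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
audio_labels = pseudo_labels(audio_embs, centroids=[[0.0, 0.0], [5.0, 5.0]])

# Hypothetical logits from a video branch for the same four clips.
video_logits = [[2.0, -1.0], [1.5, 0.0], [-1.0, 2.5], [0.2, 1.8]]

# The audio pseudo-labels act as the supervisory signal for the video branch.
video_loss = cross_entropy(video_logits, audio_labels)
print(audio_labels, round(video_loss, 3))
```

In a full training loop, this loss would update the video encoder, the roles of the two modalities would be swapped, and the clustering would be recomputed as the representations change.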

3.3 Mask prediction

The mask prediction task can be performed either in an auto-encoding manner (similar to BERT [101]) or in an auto-regressive manner (similar to GPT [102]).
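The difference between the two styles lies mainly in how inputs and prediction targets are constructed. The sketch below builds both kinds of training examples from a toy caption. The `[MASK]` placeholder string, the function names, and the fixed mask positions are assumptions for illustration. In a multimodal model, features from the other modality (for example image patches) would typically be supplied as additional context.

```python
MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def mask_for_autoencoding(tokens, positions):
    """BERT-style: hide the chosen positions; context on both sides stays visible."""
    hidden = set(positions)
    inputs = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return inputs, targets


def autoregressive_pairs(tokens):
    """GPT-style: predict each token from the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


caption = ["a", "dog", "chases", "a", "ball"]

# Auto-encoding: the model reconstructs the hidden tokens at positions 1 and 4.
inputs, targets = mask_for_autoencoding(caption, positions=[1, 4])
print(inputs)
print(targets)

# Auto-regressive: one (prefix, next token) example per position after the first.
pairs = autoregressive_pairs(caption)
for prefix, next_token in pairs:
    print(prefix, "->", next_token)
```

The auto-encoding variant predicts only the hidden positions and can attend to both left and right context, whereas the auto-regressive variant predicts every token after the first but sees only the preceding prefix.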


Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.