
[Paper Interpretation] An image-based joint-embedding predictive architecture for self-supervised learning

PHPz | 2023-10-10

1. Brief introduction

This paper demonstrates a method for learning highly semantic image representations without relying on hand-crafted data augmentation. The paper introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of several target blocks in the same image. The core design choices that guide I-JEPA toward semantic representations concern the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks at a sufficiently large scale (15%–20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, the paper finds that I-JEPA is highly scalable when combined with Vision Transformers. For example, the paper trains a ViT-Huge/16 on ImageNet in 38 hours using 32 A100 GPUs and achieves strong downstream performance across a wide range of tasks requiring different levels of abstraction, from linear classification to object counting and depth prediction.

2. Research background

In computer vision, there are two common families of image self-supervised learning methods: invariance-based methods and generative methods.

Invariance-based pre-training optimizes an encoder to produce similar embeddings for two or more views of the same image. Image views are typically constructed using a set of hand-crafted data augmentations, such as random scaling, cropping, and color jittering. These pre-training methods can produce representations with a high semantic level, but they also introduce strong biases that may be harmful for some downstream tasks, or even for pre-training tasks with different data distributions.

Cognitive learning theories suggest that one driving mechanism behind representation learning in biological systems is the adaptation of an internal model to predict responses to sensory input. This idea is at the heart of self-supervised generative methods, which remove or corrupt parts of the input and learn to predict the corrupted content. In particular, mask-denoising methods learn representations by reconstructing randomly masked patches at the pixel or token level. Compared with view-invariance methods, masked pre-training tasks require less prior knowledge and generalize easily beyond the image modality. However, the resulting representations typically have a lower semantic level and lag behind invariance-based pre-training in off-the-shelf evaluations such as linear probing, and in transfer settings with limited supervision on semantic classification tasks. A more involved adaptation mechanism (e.g., end-to-end fine-tuning) is therefore required to reap the full advantages of these methods.

In this work, the paper explores how to improve the semantic level of self-supervised representations without encoding additional prior knowledge about image transformations. To this end, the paper introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA); Figure 3 provides an illustration of the approach. The idea behind I-JEPA is to predict missing information in an abstract representation space: given a context block, predict the representations of several target blocks in the same image, where the target representations are computed by a learned target encoder network.

Compared with generative methods that predict in pixel/token space, I-JEPA uses abstract prediction targets that can eliminate unnecessary pixel-level details, leading the model to learn more semantic features. Another core design choice guiding I-JEPA toward semantic representations is the proposed multi-block masking strategy. Specifically, the paper demonstrates the importance of using an informative (spatially distributed) context block to predict several target blocks (of sufficiently large scale) in an image. Based on extensive empirical evaluation, the paper shows:

I-JEPA learns powerful off-the-shelf semantic representations without using hand-crafted view augmentations (Figure 1). I-JEPA outperforms pixel-reconstruction methods such as MAE on ImageNet-1K linear probing, semi-supervised 1% ImageNet-1K, and semantic transfer tasks.

I-JEPA is competitive with view-invariance pre-training methods on semantic tasks and achieves better performance on low-level vision tasks such as object counting and depth prediction. By using a simpler model with a less rigid inductive bias, I-JEPA is applicable to a wider set of tasks.

I-JEPA is also scalable and efficient. Pre-training ViT-H/14 on ImageNet takes approximately 2400 GPU hours, which is 50% faster than ViT-B/16 pre-trained with iBOT and 140% faster than ViT-L/16 pre-trained with MAE. Predicting in representation space significantly reduces the total computation required for self-supervised pre-training.

Self-supervised learning is an approach to representation learning in which a system learns to capture relationships between its inputs. This objective can be readily described in the framework of energy-based models (EBMs), in which the goal of self-supervision is to assign high energy to incompatible inputs and low energy to compatible inputs. Many existing generative and non-generative self-supervised learning methods can indeed be cast in this framework; see Figure 2.

Joint-Embedding Architectures. Invariance-based pre-training can be cast in the EBM framework using a joint-embedding architecture (JEA); see Figure 2a. The learning objective of a joint-embedding architecture is to output similar embeddings for compatible inputs x and y, and dissimilar embeddings for incompatible inputs. In image-based pre-training, compatible x, y pairs are typically constructed by randomly applying hand-crafted data augmentations to the same input image. The main challenge of JEAs is representation collapse, in which the energy landscape is flat (i.e., the encoder produces a constant output regardless of the input). Over the past few years, several approaches to preventing collapse have been studied, such as contrastive losses that explicitly push apart the embeddings of negative examples, non-contrastive losses that minimize the informational redundancy across embeddings, and clustering-based methods that maximize the entropy of the average embedding. There are also heuristic approaches that leverage an asymmetric architectural design between the x-encoder and the y-encoder to avoid collapse.
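To make one of these collapse-prevention techniques more concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss in PyTorch, which pushes apart the embeddings of non-matching pairs within a batch. The encoder, temperature value, and batch construction are illustrative assumptions, not the formulation of any specific method discussed in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_x: torch.Tensor, z_y: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss for a joint-embedding architecture.

    z_x, z_y: [batch, dim] embeddings of two augmented views of the same images.
    Matching rows are positives; every other row in the batch acts as a negative,
    which explicitly pushes non-matching embeddings apart and rules out the
    trivial constant-output (collapsed) solution.
    """
    z_x = F.normalize(z_x, dim=-1)
    z_y = F.normalize(z_y, dim=-1)
    logits = z_x @ z_y.t() / temperature          # [batch, batch] similarity matrix
    targets = torch.arange(z_x.size(0), device=z_x.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings:
if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(info_nce_loss(z1, z2).item())
```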

Generative Architectures. Reconstruction-based self-supervised learning methods can also be cast in the EBM framework using generative architectures; see Figure 2b. Generative architectures learn to reconstruct a signal y directly from a compatible signal x, using a decoder network conditioned on additional (possibly latent) variables z to facilitate reconstruction. In image-based pre-training, a common approach is to use masking to produce compatible x, y pairs, where x is a copy of the image y with some patches masked. The conditioning variables z then correspond to a set of (possibly learnable) mask and position tokens that specify to the decoder which image patches to reconstruct. As long as the informational capacity of z is low compared to the signal y, these architectures are not prone to representation collapse.

Joint-Embedding Predictive Architectures. As shown in Figure 2c, joint-embedding predictive architectures are conceptually similar to generative architectures; a key difference, however, is that the loss function is applied in embedding space rather than input space. A JEPA learns to predict the embedding of a signal y from a compatible signal x, using a predictor network conditioned on additional (possibly latent) variables z to facilitate prediction. The proposed I-JEPA provides an instantiation of this architecture in the context of images using masking; see Figure 3. In contrast to joint-embedding architectures, a JEPA does not seek representations that are invariant to a set of hand-crafted data augmentations, but rather representations that are predictive of one another when conditioned on the additional information z. However, as with joint-embedding architectures, representation collapse is also a concern for JEPAs; the paper uses an asymmetric architecture between the x- and y-encoders to avoid representation collapse in I-JEPA.

3. Method introduction

The paper now describes the proposed Image-based Joint-Embedding Predictive Architecture (I-JEPA), shown in Figure 3. The overall objective is as follows: given a context block, predict the representations of several target blocks in the same image. The paper uses the Vision Transformer (ViT) architecture for the context encoder, target encoder, and predictor. A ViT consists of a stack of Transformer layers, each composed of a self-attention operation followed by a fully connected MLP. The encoder/predictor architecture is reminiscent of the generative masked autoencoder (MAE) method; a key difference, however, is that I-JEPA is non-generative and its predictions are made in representation space.
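To make the training procedure more concrete, the following PyTorch sketch shows what one I-JEPA-style step could look like, assuming generic `context_encoder`, `target_encoder`, and `predictor` modules and pre-computed context/target patch indices. It follows the description above (prediction and loss in representation space, with the target encoder updated as an exponential moving average of the context encoder), but it is a simplified illustration rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum: float = 0.996):
    """Update the target encoder as an exponential moving average of the context encoder."""
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

def ijepa_step(context_encoder, target_encoder, predictor,
               patches, context_idx, target_idx_list):
    """One I-JEPA-style training step: prediction and loss in representation space.

    patches:         [B, N, D]  patchified (tokenized) image
    context_idx:     LongTensor of patch indices forming the context block
    target_idx_list: list of LongTensors, one per target block
    """
    # 1. Target representations: the EMA target encoder sees the *full* image,
    #    and the patches belonging to each target block are selected from its output.
    with torch.no_grad():
        target_repr = target_encoder(patches)                # [B, N, D']

    # 2. Context representation: the context encoder sees only the context patches.
    context_repr = context_encoder(patches[:, context_idx])  # [B, |ctx|, D']

    # 3. The predictor maps the context representation, plus positional
    #    information identifying each target block, to predicted representations.
    #    The loss is a distance in representation space, not in pixel space.
    loss = 0.0
    for tgt_idx in target_idx_list:
        pred = predictor(context_repr, tgt_idx)              # [B, |tgt|, D']
        loss = loss + F.mse_loss(pred, target_repr[:, tgt_idx])
    return loss / len(target_idx_list)
```

After each optimizer step on the context encoder and predictor, `ema_update` is called so that the target encoder remains a slowly moving, gradient-free copy of the context encoder; this asymmetry is what the article cites as I-JEPA's safeguard against representation collapse.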


4. Image Classification

To demonstrate that I-JEPA learns high-level representations without relying on hand-crafted data augmentation, the paper reports results on various image classification tasks using linear-probing and partial fine-tuning protocols. This section considers self-supervised models pre-trained on the ImageNet-1K dataset; see Appendix A of the paper for implementation details of pre-training and evaluation. All I-JEPA models are trained at resolution 224×224 unless stated otherwise.

ImageNet-1K. Table 1 shows performance on the common ImageNet-1K linear-evaluation benchmark. After self-supervised pre-training, the model weights are frozen and a linear classifier is trained on top using the full ImageNet-1K training set. Compared with the popular masked autoencoder (MAE) and data2vec methods, which also do not rely on extensive hand-crafted data augmentation during pre-training, I-JEPA significantly improves linear-probing performance while using less computation. Additionally, I-JEPA benefits from scale: a ViT-H/16 trained at resolution 448 matches the performance of view-invariance methods such as iBOT without requiring additional hand-crafted data augmentation.
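As a rough illustration of the linear-probing protocol used throughout these evaluations (backbone frozen after self-supervised pre-training, linear classifier trained on top), here is a minimal sketch; the `backbone` module, feature dimension, and data loader are placeholders rather than the paper's evaluation code.

```python
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, loader, feat_dim: int = 1280,
                 num_classes: int = 1000, epochs: int = 10, lr: float = 0.01):
    """Train a linear classifier on top of frozen self-supervised features."""
    backbone.eval()                                   # freeze the pre-trained encoder
    for p in backbone.parameters():
        p.requires_grad_(False)

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)              # e.g. average-pooled patch tokens
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```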


Low-shot ImageNet-1K. Table 2 shows performance on the 1% ImageNet benchmark. These methods adapt a pre-trained model for ImageNet classification using only 1% of the ImageNet labels, roughly 12 or 13 images per class; the model is adapted via fine-tuning or linear probing, whichever works best for each method. With a comparable encoder architecture, I-JEPA outperforms MAE while requiring fewer pre-training epochs. I-JEPA with a ViT-H/14 architecture is comparable to a ViT-L/16 pre-trained with data2vec, at significantly lower computational cost. By increasing the image input resolution, I-JEPA outperforms previous methods, including joint-embedding methods that leverage additional hand-crafted data augmentation during pre-training, such as MSN, DINO, and iBOT.

Transfer learning. Table 3 shows performance on various downstream image classification tasks using linear probes. I-JEPA significantly outperforms previous methods that do not use augmentations (MAE and data2vec) and reduces the gap with the best methods that leverage hand-crafted view invariance during pre-training, even surpassing the popular DINO on CIFAR100 and Places205.

5. Local Prediction Tasks

I-JEPA learns semantic image representations and significantly improves downstream image classification performance over previous methods such as MAE and data2vec. Furthermore, I-JEPA benefits from scale and can close the gap with, and even surpass, view-invariance-based methods that leverage additional hand-crafted data augmentations. In this section, we find that I-JEPA also learns local image features and outperforms view-invariance-based methods on low-level and dense prediction tasks such as object counting and depth prediction.

Table 4 shows linear-probing performance on various low-level tasks. In particular, after pre-training the model weights are frozen and a linear model is trained on top for object counting and depth prediction on the Clevr dataset. Compared with view-invariance methods such as DINO and iBOT, I-JEPA effectively captures low-level image features during pre-training and outperforms them on object counting (Clevr/Count) and, by a large margin, depth prediction (Clevr/Dist).

6. Scalability

Compared with previous methods, I-JEPA is highly scalable in terms of model efficiency. Figure 5 shows semi-supervised performance on 1% ImageNet-1K as a function of GPU hours. I-JEPA requires less computation than previous methods and achieves strong performance without relying on hand-crafted data augmentation. Compared with reconstruction-based methods such as MAE that use raw pixels directly as targets, I-JEPA does introduce some extra overhead by computing targets in representation space (roughly 7% slower per iteration).

Scaling data size. The paper also finds that I-JEPA benefits from pre-training on larger datasets. Table 5 shows transfer-learning performance on semantic and low-level tasks when increasing the size of the pre-training dataset (IN1K vs. IN22K); transfer performance on these conceptually different tasks improves when pre-training on a larger and more diverse dataset.

Scaling model size. Table 5 also shows that I-JEPA benefits from a larger model size when pre-training on IN22K. Compared with the ViT-H/14 model, pre-training a ViT-G/16 significantly improves downstream performance on image classification tasks such as Places205 and INat18. The ViT-G/16 model does not improve performance on low-level downstream tasks; ViT-G/16 uses a larger input patch size, which may be detrimental to local prediction tasks.


7. Predictor Visualizations

The role of the predictor in I-JEPA is to take the output of the context encoder and, conditioned on positional mask tokens, predict the representation of a target block at the location specified by those mask tokens. One question is whether a predictor conditioned on positional mask tokens learns to correctly capture positional uncertainty in the target. To study this question qualitatively, the paper visualizes the outputs of the predictor. After pre-training, the weights of the context encoder and predictor are frozen, and a decoder is trained following the RCDM framework to map the average-pooled predictor outputs back to pixel space. Figure 6 shows decoder outputs for various random seeds. Characteristics that are common across samples represent information contained in the average-pooled predictor representation: the I-JEPA predictor correctly captures positional uncertainty and produces high-level object parts with correct poses (e.g., the back of a bird and the top of a car). Characteristics that vary across samples represent information not contained in the representation; in this case, the I-JEPA predictor discards precise low-level details and background information.


8. Ablations

Predicting in representation space. Table 7 compares low-shot performance on 1% ImageNet-1K when the loss is computed in pixel space versus representation space. The paper conjectures that a key component of I-JEPA is that the loss is computed entirely in representation space, which allows the target encoder to produce abstract prediction targets that eliminate irrelevant pixel-level details. Table 7 makes clear that predicting in pixel space leads to a significant degradation in linear-probing performance.
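The distinction being ablated here can be summarized in code: the two variants differ only in what the predictor output is regressed onto. The sketch below is schematic, with assumed predictor outputs and a generic `target_encoder`; it is not the ablation code itself.

```python
import torch
import torch.nn.functional as F

def loss_pixel_space(pred_pixels, image_patches, target_idx):
    # Generative-style target: reconstruct the raw pixels of the masked patches,
    # which forces the model to spend capacity on low-level detail.
    return F.mse_loss(pred_pixels, image_patches[:, target_idx])

def loss_representation_space(pred_repr, target_encoder, image_patches, target_idx):
    # I-JEPA-style target: regress onto the target encoder's representation of
    # the masked patches, which can abstract away pixel-level detail.
    with torch.no_grad():
        target_repr = target_encoder(image_patches)[:, target_idx]
    return F.mse_loss(pred_repr, target_repr)
```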


Masking strategy. Table 8 ablates the multi-block masking strategy proposed for I-JEPA pre-training by varying the number of target blocks and the scale of the context and target blocks, as illustrated in Figure 4. I-JEPA is trained for 300 epochs under various multi-block settings and compared on the 1% ImageNet-1K benchmark using linear probes. In summary, it is important to predict several relatively large (semantic) target blocks combined with an informative (spatially distributed) context block.


Table 6 performs similar ablations against other masking strategies. The paper compares with a rasterized masking strategy, in which the image is split into four large quadrants and the goal is to use one quadrant as context to predict the other three. It also compares with the traditional block and random masking strategies commonly used in reconstruction-based methods: in block masking, the target is a single image block and the context is the complement of the image; in random masking, the target is a random (possibly non-contiguous) set of image patches and the context is again the complement. Note that in all considered masking strategies there is no overlap between the context and target blocks. The proposed multi-block masking strategy is key to I-JEPA learning semantic representations; even switching to traditional block masking reduces ImageNet performance by more than 24%.
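To make the compared masking strategies more concrete, the sketch below samples a multi-block mask in the style described above: several largish target blocks plus one large context block with all target patches removed, so that context and targets never overlap. The scale ranges shown (targets at roughly 15–20% of the image, a context block covering 85–100%) follow the description in this article, but the helper names, aspect-ratio range, and exact values are illustrative assumptions.

```python
import math
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range=(0.75, 1.5)):
    """Sample a rectangular block of patch indices on a grid_h x grid_w patch grid."""
    scale = random.uniform(*scale_range)
    aspect = random.uniform(*aspect_range)
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round(math.sqrt(area / aspect))))
    w = max(1, min(grid_w, round(math.sqrt(area * aspect))))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return {r * grid_w + c for r in range(top, top + h)
                           for c in range(left, left + w)}

def multiblock_mask(grid_h=14, grid_w=14, num_targets=4,
                    target_scale=(0.15, 0.2), context_scale=(0.85, 1.0)):
    """Multi-block masking: several largish target blocks and one large context
    block, with all target patches removed from the context so they never overlap."""
    targets = [sample_block(grid_h, grid_w, target_scale) for _ in range(num_targets)]
    context = sample_block(grid_h, grid_w, context_scale, aspect_range=(1.0, 1.0))
    for t in targets:
        context -= t
    return sorted(context), [sorted(t) for t in targets]

if __name__ == "__main__":
    ctx, tgts = multiblock_mask()
    print(len(ctx), [len(t) for t in tgts])
```

In the rasterized, block, and random strategies from Table 6, the context is simply the complement of the target region; the multi-block strategy instead keeps a large, spatially distributed context while carving out several semantic-scale targets, which is what the ablation identifies as the key ingredient.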


9. Conclusion

The paper proposes I-JEPA, a method for learning semantic image representations that does not rely on hand-crafted data augmentation. The study shows that, by making predictions in representation space, I-JEPA converges faster than pixel-reconstruction methods and learns representations with a high semantic level. Compared with view-invariance-based methods, I-JEPA points toward learning general-purpose representations with joint-embedding architectures, without relying on hand-crafted view augmentations.

Appendix: see the original paper at https://arxiv.org/abs/2301.08243

