Why is self-supervision effective? The 243-page Princeton doctoral thesis "Understanding Self-supervised Representation Learning" comprehensively explains three types of methods: contrastive learning, language modeling and self-prediction.

PHPz
2023-04-15 08:13:02

Pre-training has emerged as an alternative and effective paradigm for overcoming the shortcomings of task-specific supervised learning: models are first trained on easily available data and then used to solve downstream tasks of interest, with far less labeled data than supervised learning requires.

Pre-training using unlabeled data, i.e. self-supervised learning, is particularly revolutionary and has achieved success in different fields: text, vision, speech, etc.

This raises an interesting and challenging question: Why should pretraining on unlabeled data help seemingly unrelated downstream tasks?

Paper address: https://dataspace.princeton.edu/handle/88435/dsp01t435gh21h

This thesis presents work that proposes and establishes a theoretical framework for investigating why self-supervised learning benefits downstream tasks.

The framework applies to methods based on contrastive learning, autoregressive language modeling and self-prediction. Its core idea is that pre-training helps learn a low-dimensional representation of the data, which in turn helps solve downstream tasks of interest with linear classifiers while requiring less labeled data.
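The core idea above (learn a low-dimensional representation from unlabeled data, then fit only a linear classifier on a small labeled set) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic data, the helper names, and in particular the use of PCA as a stand-in for pre-training. PCA is not one of the objectives the thesis analyzes; it is swapped in only because it needs no labels and keeps the example short.

```python
import numpy as np

def make_data(rng, n, mixing, direction):
    """Synthetic inputs driven by a 2-D latent; the label depends only on the latent."""
    z = rng.normal(size=(n, mixing.shape[1]))
    x = z @ mixing.T + 0.1 * rng.normal(size=(n, mixing.shape[0]))
    y = np.where(z @ direction >= 0, 1.0, -1.0)
    return x, y

def pretrain_representation(unlabeled_x, dim):
    """Stand-in for pre-training: PCA fitted on unlabeled inputs only (an assumption)."""
    mean = unlabeled_x.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(unlabeled_x - mean, full_matrices=False)
    components = vt[:dim]
    return lambda x: (x - mean) @ components.T

def fit_linear_probe(features, labels):
    """Least-squares linear classifier (with bias) on +1/-1 labels."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return lambda f: np.sign(np.hstack([f, np.ones((len(f), 1))]) @ weights)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(20, 2))      # hypothetical map from latent to 20-D input
direction = np.array([1.0, -1.0])      # hypothetical downstream decision direction

unlabeled_x, _ = make_data(rng, 5000, mixing, direction)  # labels are discarded
train_x, train_y = make_data(rng, 50, mixing, direction)  # small labeled set
test_x, test_y = make_data(rng, 2000, mixing, direction)

encode = pretrain_representation(unlabeled_x, dim=2)
probe = fit_linear_probe(encode(train_x), train_y)
accuracy = float(np.mean(probe(encode(test_x)) == test_y))
print(f"linear-probe accuracy with 50 labels: {accuracy:.3f}")
```

The representation is frozen after the first step; only the three probe weights are fitted on labeled data, which is the sense in which the downstream step needs few labels.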

A common theme is formalizing the desirable properties of the unlabeled data distribution that make it useful for constructing self-supervised learning tasks. With an appropriate formalization, it can be shown that approximately minimizing the right pre-training objective extracts the downstream signal implicitly encoded in the unlabeled data distribution.

Finally, it is shown that this signal can be decoded from the learned representation using a linear classifier, which provides a formalization of the transfer of "skills and knowledge" across tasks.

Introduction

In the quest to design intelligent agents and data-driven solutions to problems, the fields of machine learning and artificial intelligence have made tremendous progress over the past decade. Following initial successes on challenging supervised learning benchmarks such as ImageNet [Deng et al., 2009], innovations in deep learning led to models with superhuman performance on many such benchmarks in different domains. Training such task-specific models is certainly impressive and has huge practical value. However, it has an important limitation: it requires large labeled or annotated datasets, which are often expensive to obtain. Furthermore, from an intelligence perspective, one hopes for more general models that, like humans [Ahn and Brewer, 1993], can learn from previous experiences, summarize them into skills or concepts, and use these skills or concepts to solve new tasks with little or no demonstration. After all, babies learn a lot through observation and interaction without explicit supervision. These limitations inspired an alternative paradigm: pre-training.

The focus of this thesis is pre-training with unlabeled data, which is often available in large amounts. The idea of using unlabeled data has long been a point of interest in machine learning, particularly through unsupervised and semi-supervised learning. Its modern adaptation using deep learning is often called self-supervised learning (SSL) and has begun to change the landscape of machine learning and artificial intelligence through ideas such as contrastive learning and language modeling. The idea of self-supervised learning is to construct certain tasks using only unlabeled data, and to train the model to perform well on the constructed tasks. Such tasks typically require the model to encode structural properties of the data by predicting unobserved or hidden parts (or properties) of the input from the observed or retained parts [LeCun and Misra, 2021]. Self-supervised learning has shown generality and utility on many downstream tasks of interest, often with better sample efficiency than solving the tasks from scratch, bringing us one step closer to the goal of general-purpose agents. Indeed, large language models such as GPT-3 [Brown et al., 2020] have recently demonstrated fascinating "emergent behavior" that appears at scale, sparking even more interest in the idea of self-supervised pre-training.
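To make "constructing a task using only unlabeled data" concrete, here is a minimal sketch of a self-prediction task: part of each input is hidden and a model is trained to predict it from the part that remains. The synthetic data, the choice of which coordinates to hide, and the linear least-squares predictor are all assumptions for illustration, not the setup used in the thesis.

```python
import numpy as np

def make_pretext_pairs(unlabeled_x, num_hidden):
    """Split each input into an observed part and a hidden part to be predicted."""
    observed = unlabeled_x[:, :-num_hidden]
    hidden = unlabeled_x[:, -num_hidden:]
    return observed, hidden

def fit_predictor(observed, hidden):
    """Linear least-squares predictor of the hidden part (a deliberately simple model)."""
    design = np.hstack([observed, np.ones((len(observed), 1))])
    weights, *_ = np.linalg.lstsq(design, hidden, rcond=None)
    return lambda obs: np.hstack([obs, np.ones((len(obs), 1))]) @ weights

rng = np.random.default_rng(0)
mixing = rng.normal(size=(20, 2))          # hypothetical shared low-dimensional structure
latent = rng.normal(size=(5000, 2))
unlabeled_x = latent @ mixing.T + 0.1 * rng.normal(size=(5000, 20))  # no labels anywhere

train_obs, train_hid = make_pretext_pairs(unlabeled_x[:4000], num_hidden=10)
test_obs, test_hid = make_pretext_pairs(unlabeled_x[4000:], num_hidden=10)

predict = fit_predictor(train_obs, train_hid)
mse_model = float(np.mean((predict(test_obs) - test_hid) ** 2))
# Baseline that ignores the observed part and always predicts the training mean.
mse_baseline = float(np.mean((test_hid - train_hid.mean(axis=0)) ** 2))
print(f"self-prediction MSE: {mse_model:.4f}, mean baseline MSE: {mse_baseline:.4f}")
```

The predictor can only beat the baseline by exploiting structure shared between the observed and hidden coordinates, which is the kind of structural property such tasks push a model to encode.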

Although self-supervised learning has been empirically successful and continues to show great promise, a good theoretical understanding of how it works, beyond rough intuition, is still lacking. These impressive successes raise interesting questions, because it is unclear a priori why a model trained on one task should help on another, seemingly unrelated task, i.e. why training on task A should help with task B. While a complete theoretical understanding of SSL (and of deep learning in general) is challenging and elusive, understanding this phenomenon at any level of abstraction may help develop more principled algorithms. The research motivation of this thesis is:

Why does training on self-supervised learning tasks (using large amounts of unlabeled data) help solve data-scarce downstream tasks? How can the transfer of "knowledge and skills" be formalized?

Although there is a large body of literature on supervised learning, generalization from SSL task → downstream task is fundamentally different from generalization from training set → test set in supervised learning. In supervised learning for a downstream classification task, for example, a model trained on a training set of input-label pairs sampled from an unknown distribution can be evaluated directly on an unseen test set sampled from the same distribution. This underlying distribution establishes the connection from the training set to the test set. The conceptual connection from SSL task → downstream task is less clear, however, because the unlabeled data used in the SSL task carries no explicit signal about downstream labels. This means that a model pre-trained on an SSL task (e.g., predicting part of the input from the rest) cannot be used directly on a downstream task (e.g., predicting a class label from the input). The transfer of "knowledge and skills" therefore requires an additional training step that uses some labeled data, ideally less than supervised learning from scratch would require. Any theoretical understanding of SSL task → downstream task generalization needs to address two questions: "What is the intrinsic role of unlabeled data?" and "How should pre-trained models be used for downstream tasks?" This thesis targets downstream classification tasks and studies these questions by making distributional assumptions on the unlabeled data and using the idea of representation learning:

(a) (Distribution Assumption) The distribution of unlabeled data implicitly contains information relevant to the downstream classification tasks of interest.

(b) (Representation Learning) A model pre-trained on an appropriate SSL task can encode that signal in its learned representations, which can then be used to solve downstream classification tasks with linear classifiers.

Point (a) says that certain structural properties of the unlabeled data implicitly provide hints about subsequent downstream tasks, and that self-supervised learning can help tease this signal out of the data. Point (b) proposes a simple and empirically effective way to use pre-trained models: leveraging the representations the model has learned. This thesis identifies and mathematically quantifies distributional properties of unlabeled data, showing that good representations can be learned by different SSL methods such as contrastive learning, language modeling, and self-prediction. In the next section, we delve into the idea of representation learning and formally explain why self-supervised learning helps downstream tasks.
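One way to write down the two-step use of a pre-trained model described in point (b) is sketched below. The notation is an assumption chosen for illustration and is not necessarily the notation of the thesis: f is a representation function from a class F, L_SSL is whichever self-supervised objective is used, W is a linear classifier, ℓ is a classification loss, and (x_i, y_i) are n labeled downstream examples.

```latex
% Step 1 (pre-training): learn a representation from unlabeled data only.
\[
  \hat{f} \in \arg\min_{f \in \mathcal{F}} \; L_{\mathrm{SSL}}(f)
\]

% Step 2 (downstream): keep \hat{f} fixed and fit only a linear classifier W
% on the n labeled examples.
\[
  \hat{W} \in \arg\min_{W} \; \frac{1}{n} \sum_{i=1}^{n}
      \ell\bigl( W \hat{f}(x_i),\, y_i \bigr)
\]
```

Written this way, labels enter only in the second step, and the only downstream parameters are the entries of W. This is why the number of labeled examples needed can be smaller than for supervised learning from scratch.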

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it deleted.