
New breakthrough in HCP laboratory of Sun Yat-sen University: using causal paradigm to upgrade multi-modal large models

王林 (forwarded) · 2023-04-12 20:49

Sun Yat-sen University's Human-Cyber-Physical Intelligence Fusion Laboratory (HCP Lab) has produced fruitful results on AIGC and multi-modal large models, with more than ten papers accepted at the recent AAAI 2023 and CVPR 2023, placing it in the first echelon of research institutions worldwide.

One of these works uses causal modeling to significantly improve the controllability and generalization of multi-modal large models during fine-tuning: "Masked Images Are Counterfactual Samples for Robust Fine-tuning".


Paper link: https://arxiv.org/abs/2303.03052

Fine-tuning large pre-trained models on downstream tasks is a popular deep learning paradigm today. The recent success of ChatGPT, a large pre-trained language model, has in particular brought this paradigm wide recognition. Pre-trained on massive data, these large models can cope with the shifting data distributions of real-world environments and therefore show strong robustness in general scenarios.
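
To make the paradigm concrete, here is a minimal sketch of plain downstream fine-tuning of a pre-trained vision model in PyTorch. The backbone, class count, and hyperparameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the standard fine-tuning paradigm: take a pre-trained
# vision backbone and adapt it to a downstream task. Model choice and
# hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace its classification head
# with one sized for the (hypothetical) downstream task of 10 classes.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step of plain downstream fine-tuning."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```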

However, when a pre-trained large model is fine-tuned on downstream data to adapt it to a specific application, that data is in most cases drawn from a single, narrow distribution. Fine-tuning on such data often reduces the model's robustness, making it hard to build applications on top of the pre-trained model. The problem is especially pronounced for vision models: because images are far more diverse than text, downstream fine-tuning degrades the robustness of vision-related pre-trained models particularly severely.

Previous work typically preserves the robustness of the fine-tuned model implicitly at the parameter level, for example through model ensembling. These works, however, neither analyze the underlying reasons why fine-tuning degrades out-of-distribution performance nor explicitly address the loss of robustness after fine-tuning large models.

Building on cross-modal large models, this work analyzes, from a causal perspective, the essential reasons why pre-trained large models lose robustness, and accordingly proposes a fine-tuning method that significantly improves robustness. The method lets the model adapt to downstream tasks while retaining strong robustness, better meeting the needs of practical applications.

Take OpenAI's cross-modal pre-trained model CLIP (Contrastive Language-Image Pre-training), released in 2021, as an example. CLIP is a cross-modal model trained with contrastive image-text learning and serves as the foundation of generative models such as Stable Diffusion. It is trained on massive multi-source data comprising roughly 400 million image-text pairs and, to a certain extent, learns causal relationships that are robust to distribution shifts.
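
As an illustration of how CLIP relates images to text, below is a minimal zero-shot classification sketch using the Hugging Face `transformers` port of the public CLIP weights; the checkpoint name, image file, and prompts are assumptions chosen for the example.

```python
# Minimal sketch of CLIP zero-shot classification via Hugging Face
# transformers. The checkpoint is one public CLIP variant; "cow.jpg" and
# the prompts are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cow.jpg")  # hypothetical local image
prompts = ["a photo of a cow", "a photo of a horse", "a photo of a road"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax over prompts gives class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```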

However, fine-tuning CLIP on narrow downstream data can easily destroy the causal knowledge the model has learned, because the semantic and non-semantic representations of a training image are highly entangled. For example, when transferring CLIP to a downstream "farm" scenario, many of the training images show cows standing in grass. Fine-tuning may then teach the model to rely on the grass, a non-"cow" semantic cue, to predict the image's semantics. But this correlation does not necessarily hold: a cow may also appear on a road. As a result, the fine-tuned model loses robustness, and its outputs in deployment can become highly unstable and uncontrollable.

Drawing on the team's years of experience in building and training large models, this work re-examines the loss of robustness caused by fine-tuning pre-trained models from a causal perspective. Based on causal modeling and analysis, it proposes a fine-tuning method that constructs counterfactual samples via image masking and improves model robustness through learning on the masked images.

Specifically, to break the spurious correlations in downstream training images, this work proposes a class activation map (CAM)-based method that masks and replaces specific regions of an image, manipulating either its non-semantic or its semantic representation to generate counterfactual samples. Through distillation, the fine-tuned model learns to imitate the pre-trained model's representations of these counterfactual samples, which better disentangles the influence of semantic and non-semantic factors and improves adaptability to distribution shifts in downstream domains.
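
The sketch below illustrates this general recipe in highly simplified form; it is not the authors' implementation, and the masking threshold, the source of the CAM, and the cosine feature-distillation loss are assumptions made for illustration.

```python
# Simplified sketch of the idea described above (not the authors' code):
# use a CAM-style saliency map to split an image into semantic and
# non-semantic regions, splice in content from another image to form a
# counterfactual sample, and distill the frozen pre-trained encoder's
# features on that sample into the encoder being fine-tuned.
import torch
import torch.nn.functional as F

def counterfactual_image(image, other_image, cam, keep_semantic=True, thresh=0.5):
    """Replace either the non-semantic (background) or semantic (foreground)
    region, selected by a CAM normalized to [0, 1] with shape (H, W)."""
    cam = cam.unsqueeze(0)                      # (1, H, W), broadcasts over channels
    fg = (cam >= thresh).float()                # high activation = semantic region
    mask = fg if keep_semantic else (1.0 - fg)  # region kept from the original image
    return image * mask + other_image * (1.0 - mask)

def distill_step(student, teacher, images, other_images, cams, optimizer):
    """One training step: match the student's features on counterfactual
    images to the frozen teacher's features (cosine feature distillation)."""
    cf = torch.stack([
        counterfactual_image(img, oth, cam)
        for img, oth, cam in zip(images, other_images, cams)
    ])
    with torch.no_grad():
        target = teacher(cf)                    # frozen pre-trained encoder
    pred = student(cf)                          # encoder being fine-tuned
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```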



Experiments show that this method significantly improves the pre-trained model's performance on downstream tasks while offering a clear robustness advantage over existing fine-tuning methods for large models.

The broader significance of this work is that it partially opens the "black box" that pre-trained large models inherit from the deep learning paradigm and addresses their "interpretability" and "controllability" issues, bringing us closer to the tangible productivity gains that pre-trained large models promise.

Since the advent of the Transformer mechanism, the HCP team at Sun Yat-sen University has spent many years researching large-model technology paradigms, working to improve the training efficiency of large models and to introduce causal models to address their "controllability" problem. Over the years, the team has independently developed multiple large pre-trained models for vision, language, speech, and cross-modal tasks; the "Wukong" cross-modal large model developed jointly with Huawei's Noah's Ark Lab (link: https://arxiv.org/abs/2202.06767) is a representative example.

Team Introduction

The Human-Cyber-Physical Intelligence Fusion Laboratory (HCP Lab) at Sun Yat-sen University conducts systematic research on multi-modal cognition and intelligent computing, robotics and embedded systems, the metaverse and digital humans, and controllable content generation. It engages deeply with application scenarios to build product prototypes, has produced a large number of original technologies, and has incubated entrepreneurial teams. The laboratory was founded in 2010 by Professor Lin Liang, an IAPR Fellow. It has received honors including the first prize of the Science and Technology Award of the China Society of Image and Graphics, the Wu Wenjun Natural Science Award, and a provincial first prize in natural science, and has trained national-level young talents such as Liang Xiaodan and Wang Keze.

