The first domain adaptation strategy for the "Segment Anything" large model is here! The related paper has been accepted to CVPR 2024.
The success of large language models (LLMs) has stimulated interest in exploring foundation models for segmentation in computer vision. These segmentation foundation models are typically used for zero-shot/few-shot image segmentation through prompt engineering. Among them, the Segment Anything Model (SAM) is the most advanced foundation model for image segmentation.
However, recent research shows that SAM is not very robust or generalizable across a variety of downstream tasks: it performs poorly on medical images, camouflaged objects, and natural images with added interference. This may be due to a large domain shift between the training dataset and the downstream test datasets. A very important question, therefore, is how to design a domain adaptation scheme that makes SAM more robust in the real world and across diverse downstream tasks.
There are three main challenges in adapting pre-trained SAM to downstream tasks:
First, the traditional unsupervised domain adaptation paradigm requires access to both the source dataset and the target dataset, which is often infeasible due to privacy and computational cost. Second, updating all weights usually performs better for domain adaptation, but is limited by expensive memory costs. Finally, SAM shows diverse segmentation capabilities for prompts of different types and granularities, so unsupervised adaptation is very challenging when prompt information is lacking for the downstream task.
Figure 1 We use weak supervision to adapt SAM to various downstream tasks
To address the above challenges, we propose a weakly supervised self-training architecture with anchor regularization and low-rank fine-tuning to improve the robustness and computational efficiency of adaptation. Specifically, we first adopt a source-free self-training strategy to avoid dependence on source data. Self-training generates pseudo-labels that supervise model updates, but it is easily misled by incorrect pseudo-labels, so we introduce a frozen source model as an anchor network to regularize the model updates. To further reduce the high computational cost of updating the full model weights, we apply a low-rank weight decomposition to the encoder and backpropagate through the low-rank shortcut path. Finally, to further improve source-free domain adaptation, we introduce weak supervision in the target domain, such as sparse point annotations, which provides stronger adaptation information and is naturally compatible with the prompt encoder in SAM. With weak supervision as prompts, we obtain more local and explicit self-training pseudo-labels. The adapted model shows stronger generalization ability on multiple downstream tasks.
We summarize the contributions of this work as follows:
1. Motivated by the generalization problem of SAM in downstream tasks, we propose a task-agnostic solution that requires no source data and adapts SAM through self-training.
2. We use weak supervision, including box, point, and other labels, to improve the adaptation effect. These weakly supervised labels are fully compatible with SAM's prompt encoder.
3. We conduct extensive experiments on 5 types of downstream instance segmentation tasks to demonstrate the effectiveness of the proposed weakly supervised adaptation method.
- Paper address: https://arxiv.org/pdf/2312.03502.pdf
- Project address: https://github.com/Zhang-Haojie/WeSAM
- Paper title: Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation
The method is introduced in four parts:
- Segment Anything Model
- Adaptive framework based on self-training
- How weak supervision helps achieve effective self-training
- Low-rank weight update
## 1. Segment Anything Model

SAM consists of three main components: an image encoder (ImageEncoder), a prompt encoder (PromptEncoder), and a mask decoder (MaskDecoder). The image encoder is pre-trained with MAE, and the entire SAM is then further trained on the SA-1B training set, which contains 1.1 billion annotations, using a combination of focal loss and Dice loss. At inference time, a test image x is first encoded by the image encoder; then, given a prompt, a lightweight decoder makes predictions at three levels.

## 2. Source-Free Domain Adaptation Self-Training
Figure 2 The proposed self-training architecture with anchor network regularization and contrastive loss regularization

Given an unlabeled target dataset DT = {xi} and a pre-trained segmentation model, we use a student-teacher architecture for self-training. As shown in Figure 2, we maintain three encoder networks: the anchor model, the student model, and the teacher model, where the student and teacher models share weights. Specifically, for each sample xi, we apply a random weak data augmentation to form the input of the anchor and teacher models and a random strong data augmentation to form the input of the student model; the three encoder networks then produce three feature maps. In the decoder network, given a number Np of prompts, such as boxes, points, or coarse masks, a set of instance segmentation masks is inferred.

Based on the above, we elaborate on the three optimization objectives for self-training below.

1) Student-Teacher self-training

We first update the student/teacher model with a self-training objective that uses the same loss function as SAM training. Self-training is widely used in semi-supervised learning and has recently been shown to be very effective for source-free domain adaptation. Specifically, we use the predictions generated by the teacher model as pseudo-labels and supervise the student output with focal loss and Dice loss.

2) Anchor loss for robust regularization

Training the network with the self-training loss alone is susceptible to the accumulation of incorrect pseudo-labels predicted by the teacher network, the so-called confirmation bias. Observations also show that performance degrades after many iterations when only self-training is used. Existing source-free domain adaptation methods often employ additional constraints to prevent the negative effects of self-training, such as encouraging a uniform distribution of predictions.
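As a rough illustration of the student-teacher objective in 1), the sketch below computes a focal loss plus a Dice loss on a flattened mask, using thresholded teacher probabilities as pseudo-labels. It is a minimal pure-Python sketch and not the authors' implementation: the function names, the 0.5 threshold, the `gamma` and `alpha` defaults, and the `focal_weight` parameter are all assumptions made for illustration.

```python
import math


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary mask (flat lists)."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def focal_loss(probs, target, gamma=2.0, alpha=0.25, eps=1e-6):
    """Mean binary focal loss. gamma and alpha are common defaults (assumed)."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)         # clamp to avoid log(0)
        p_t = p if t == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha  # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def self_training_loss(student_probs, teacher_probs, threshold=0.5, focal_weight=1.0):
    """Supervise the student with pseudo-labels derived from the teacher.

    threshold and focal_weight are hypothetical knobs, not values from the article.
    """
    pseudo_labels = [1 if p > threshold else 0 for p in teacher_probs]
    return (focal_weight * focal_loss(student_probs, pseudo_labels)
            + dice_loss(student_probs, pseudo_labels))


teacher = [0.9, 0.8, 0.1, 0.2]
agreeing_student = [0.9, 0.9, 0.1, 0.1]
disagreeing_student = [0.1, 0.1, 0.9, 0.9]
print(self_training_loss(agreeing_student, teacher))
print(self_training_loss(disagreeing_student, teacher))
```

A student that agrees with the teacher's pseudo-labels receives a lower loss than one that contradicts them. In a real training loop the same computation would run on tensors, with the pseudo-labels treated as fixed targets.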
We regularize with an anchor loss which, as shown in Formula 3, minimizes the Dice loss between the anchor model and the student and teacher models, respectively. The frozen anchor model, as knowledge inherited from the source domain, discourages excessive deviation of the self-training-updated model from the source model and can prevent model collapse.

3) Contrastive loss for regularizing the encoder feature space
The above two training objectives operate in the output space of the decoder. The experimental section reveals that updating the encoder network is the most effective way to adapt SAM, so it is necessary to apply regularization directly to the features output by the encoder network. As shown in Figure 3, we crop the features of each instance from the feature maps based on the predicted masks in the anchor and teacher branches. We then define the positive and negative sample pairs for the contrastive loss: positive pairs are constructed from the instance features corresponding to the same prompt in the two branches, and negative pairs from the instance features corresponding to different prompts. The final contrastive loss is computed over these pairs and scaled by a temperature coefficient.

4) Total loss

We combine the above three loss functions into the final source-free adaptation loss.

## 3. Self-training Prompt generation

SAM segmentation requires a prompt input to indicate the target object to be segmented, but prompts may suffer from granularity ambiguity. Prompt engineering can be implemented in a fully automated manner or through human interaction.

1) Fully automatic prompt generation

We first use densely sampled grid points as prompt input and generate initial-stage segmentation masks with the anchor model. We then eliminate masks with low IoU and stability scores and perform non-maximum suppression to obtain the segmentation results. Next, a fixed set of prompts is generated from the final masks and used as prompt input for all three branches. Therefore, the three networks output the same number of masks, with an exact one-to-one correspondence.

2) Weak supervision as prompts

Prompts can be obtained by grid-sampling the image and filtering out low-quality and duplicate masks, as in automatic segmentation.
However, these segmentations are of relatively poor quality, may contain many false positive predictions, and have unclear granularity. The resulting prompts are of uneven quality, which makes self-training less effective. Therefore, drawing on previous weakly supervised domain adaptation work, we propose to use three kinds of weak supervision: bounding boxes (box), sparse point annotations (point), and coarse segmentation polygons (coarse mask). In SAM, these forms of weak supervision match the prompt input exactly, so weak supervision can be seamlessly integrated to adapt SAM.

## 4. Low-rank weight update

The huge encoder network of the foundation model makes updating all model weights extremely difficult. However, many existing studies show that updating the encoder network weights is an effective way to tune pre-trained models.

To update the encoder network more efficiently and at lower cost, we choose a computationally friendly low-rank update method. For each weight θ in the encoder network, we use a low-rank approximation ω = AB and set a compression ratio r. Only A and B are updated via backpropagation to reduce memory usage. During the inference phase, the weights are reconstructed by combining the low-rank approximation with the original weights, i.e., θ = θ + AB.

In the experiments, we provide detailed comparisons with state-of-the-art methods as well as qualitative results. Finally, we analyze the effectiveness of each component and the specific design of the network.

In this work, we evaluate 5 types of downstream segmentation tasks, some of which have significant distribution shifts from SA-1B. The datasets cover clear natural images, natural images with added interference, medical images, camouflaged objects, and robot images, 10 in total.

Data partitioning: Each downstream dataset is divided into non-overlapping training and test sets.
The datasets used to evaluate each type of downstream task are listed in Table 1, along with their training/test splits.
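The low-rank weight update described in part 4 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: the helper names (`init_low_rank`, `merge_weights`, `trainable_params`) are hypothetical, r is treated as the rank of the factorization, A is assumed to have shape d_in × r and B shape r × d_out, and the zero initialization of B is borrowed from the common LoRA convention so that the update starts at zero.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def init_low_rank(d_in, d_out, rank, seed=0):
    """Create factors A (d_in x rank) and B (rank x d_out).

    A is small random noise and B is zero, so A @ B starts as a zero update
    (an assumed initialization, following the usual LoRA convention).
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
    B = [[0.0] * d_out for _ in range(rank)]
    return A, B


def merge_weights(theta, A, B):
    """Reconstruct the inference-time weight: theta + A @ B."""
    delta = matmul(A, B)
    return [[t + d for t, d in zip(theta_row, delta_row)]
            for theta_row, delta_row in zip(theta, delta)]


def trainable_params(d_in, d_out, rank):
    """Number of parameters updated by backpropagation (only A and B)."""
    return rank * (d_in + d_out)


theta = [[1.0, 2.0], [3.0, 4.0]]      # frozen pre-trained weight
A = [[1.0], [2.0]]                    # pretend these were learned
B = [[0.5, -1.0]]
print(merge_weights(theta, A, B))     # theta + A @ B
print(trainable_params(768, 768, 4), "vs", 768 * 768)
```

For a hypothetical 768 × 768 weight with r = 4, only 6,144 values are trained instead of 589,824, which is where the memory saving comes from; the frozen θ is touched only when the weights are merged for inference.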
Segment-Anything model: Due to memory limitations, we adopt ViT-B as the encoder network and use the standard prompt encoder and mask decoder.

Prompt generation: Prompt inputs for both the training and evaluation phases are computed from the instance segmentation ground-truth (GT) masks, simulating human interaction as weak supervision.

Specifically, we extract the box as the minimum bounding box of the entire GT mask. Points are created by randomly selecting 5 positive sample points inside the GT mask and 5 negative sample points outside the mask. Coarse masks are simulated by fitting polygons to the GT masks.

Tables 2, 3, 4, and 5 show the test results on natural images with added interference, clear natural images, medical images, and camouflaged object datasets, respectively. The complete experimental results can be found in the paper. The experiments demonstrate that our scheme outperforms pre-trained SAM and state-of-the-art domain adaptation schemes on almost all downstream segmentation datasets.
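The prompt simulation described above (a box taken from the minimum bounding box of the GT mask, plus 5 positive and 5 negative points) can be sketched in a few lines. This is an illustrative pure-Python sketch that assumes the mask is a nested list of 0/1 values with at least one foreground pixel; the function names, the (x_min, y_min, x_max, y_max) box layout, and the seeded sampling are assumptions, not the authors' code.

```python
import random


def box_from_mask(mask):
    """Tightest (x_min, y_min, x_max, y_max) box around the foreground pixels.

    Assumes the mask contains at least one foreground pixel.
    """
    ys = [y for y, row in enumerate(mask) for value in row if value]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return min(xs), min(ys), max(xs), max(ys)


def points_from_mask(mask, n_pos=5, n_neg=5, seed=0):
    """Sample positive points inside the mask and negative points outside it."""
    rng = random.Random(seed)
    foreground = [(x, y) for y, row in enumerate(mask)
                  for x, value in enumerate(row) if value]
    background = [(x, y) for y, row in enumerate(mask)
                  for x, value in enumerate(row) if not value]
    # Cap the sample size so tiny masks do not raise an error.
    positives = rng.sample(foreground, min(n_pos, len(foreground)))
    negatives = rng.sample(background, min(n_neg, len(background)))
    return positives, negatives


# A 6 x 6 toy GT mask with a 3 x 3 object.
gt_mask = [[0] * 6 for _ in range(6)]
for y in range(1, 4):
    for x in range(2, 5):
        gt_mask[y][x] = 1

print(box_from_mask(gt_mask))
print(points_from_mask(gt_mask))
```

The coarse-mask prompt (polygon fitting) is left out here because it depends on a contour-approximation routine that the article does not specify.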
Some visualization results are shown in Figure 4; more can be found in the paper.
Figure 4 Visualized results of some examples

## 5. Ablation experiments and additional analysis

We analyze the effectiveness of each of the three self-training optimization objectives on the COCO dataset, as shown in Table 7. Table 7 also reports the effect of the proposed method when adapting without any weak supervision.
We analyze the performance differences when training and testing with different types of prompts, as shown in Table 8. The experiments show that our scheme still performs well under cross-prompt conditions.
In addition, we analyze the results of optimizing different modules, including the decoder, LayerNorm, and different fine-tuning schemes and their combinations. The experiments show that fine-tuning the encoder with the LoRA scheme works best.
Although vision foundation models can perform well on segmentation tasks, they can still perform poorly on downstream tasks. We study the generalization ability of the Segment-Anything model on multiple downstream image segmentation tasks and propose a self-training method based on anchor regularization and low-rank fine-tuning. This method does not require access to the source dataset, has a low memory cost, is naturally compatible with weak supervision, and can significantly improve the adaptation effect. Extensive experiments show that our proposed domain adaptation method can significantly improve the generalization ability of SAM under various distribution shifts.