The first diffusion model for object detection: better than Faster R-CNN and DETR, it detects directly from random boxes

WBOY
2023-04-13 08:34:07

Diffusion models, the new state of the art (SOTA) among deep generative models, have surpassed the previous SOTA in image generation, such as GANs, and perform well in many application areas, including computer vision, NLP, molecular graph modeling, and time series modeling.

Recently, Luo Ping's team from the University of Hong Kong and researchers from Tencent AI Lab jointly proposed a new framework, DiffusionDet, which applies the diffusion model to object detection. As far as we know, no earlier work had successfully applied diffusion models to object detection, which makes this the first work to do so.

How well does DiffusionDet perform? Evaluated on the MS-COCO dataset with ResNet-50 as the backbone, DiffusionDet achieves 45.5 AP with a single sampling step, significantly better than Faster R-CNN (40.2 AP) and DETR (42.0 AP), and comparable to Sparse R-CNN (45.0 AP). Increasing the number of sampling steps further improves DiffusionDet to 46.2 AP. DiffusionDet also performs well on the LVIS dataset, achieving 42.1 AP with Swin-Base as the backbone.

  • Paper address: https://arxiv.org/pdf/2211.09788.pdf
  • Project address: https://github.com/ShoufaChen/DiffusionDet

The study observes that traditional object detectors share a drawback: they rely on a fixed set of learnable queries. The researchers then asked: is there a simple way to do object detection that does not even require learnable queries?

To answer this question, the paper proposes DiffusionDet, a framework that detects objects directly from a set of random boxes. It formulates object detection as a denoising diffusion process from noisy boxes to object boxes. This noise-to-box approach requires neither heuristic object priors nor learnable queries, which further simplifies the object candidates and advances the detection pipeline.

As shown in Figure 1 below, the study argues that the noise-to-box paradigm is analogous to the noise-to-image process in denoising diffusion models, a class of likelihood-based models that generate an image by gradually removing noise with a learned denoising model.

[Figure 1]

DiffusionDet solves object detection with a diffusion model: detection is treated as a generative task over the space of bounding-box positions (center coordinates) and sizes (widths and heights) in the image. In the training phase, Gaussian noise controlled by a variance schedule is added to the ground-truth boxes to obtain noisy boxes. These noisy boxes are then used to crop regions of interest (RoI) from the output feature maps of a backbone encoder (such as ResNet or Swin Transformer). Finally, the RoI features are sent to the detection decoder, which is trained to predict the noise-free ground-truth boxes. In the inference phase, DiffusionDet generates bounding boxes by reversing the learned diffusion process, which adjusts the noisy prior distribution to the learned distribution over bounding boxes.

Method Overview

Because a diffusion model generates data samples iteratively, the model f_θ has to be run multiple times during inference. Applying f_θ directly to the raw image at every iteration step would be computationally intractable. The researchers therefore split the model into two parts, an image encoder and a detection decoder. The former runs only once to extract a deep feature representation from the raw input image, and the latter takes this deep feature as its condition, instead of the raw image, to progressively refine the box predictions from the noisy boxes z_t.
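The encoder/decoder split can be illustrated with a small sketch. The names `detect`, `toy_encoder` and `toy_decoder` are hypothetical stand-ins, not DiffusionDet's API; the point is only that the expensive encoder is called once while the lightweight decoder is called once per refinement step.

```python
from typing import Callable, Dict, List, Sequence

Box = List[float]  # (cx, cy, w, h); an assumed box layout


def detect(image, encoder: Callable, decoder: Callable,
           init_boxes: Sequence[Box], num_steps: int) -> List[Box]:
    """Run the encoder once, then refine the boxes num_steps times."""
    features = encoder(image)              # expensive, executed a single time
    boxes = [list(b) for b in init_boxes]
    for step in range(num_steps):
        boxes = decoder(features, boxes, step)  # cheap, repeated every step
    return boxes


# Toy stand-ins (hypothetical) that only count how often they are called.
calls: Dict[str, int] = {"encoder": 0, "decoder": 0}


def toy_encoder(image: List[float]) -> float:
    calls["encoder"] += 1
    return sum(image) / len(image)


def toy_decoder(features: float, boxes: List[Box], step: int) -> List[Box]:
    calls["decoder"] += 1
    # Move every coordinate halfway toward the feature value.
    return [[0.5 * (v + features) for v in box] for box in boxes]


refined = detect([0.2, 0.4, 0.6], toy_encoder, toy_decoder,
                 [[0.0, 0.0, 1.0, 1.0]], num_steps=4)
print(calls)
```

In a real detector the encoder would be the backbone plus feature pyramid and the decoder the detection head; here both are reduced to arithmetic so the call pattern is easy to see.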

The image encoder takes a raw image as input and extracts its high-level features for the detection decoder. The researchers implement DiffusionDet with both convolutional neural networks such as ResNet and Transformer-based models such as Swin. A feature pyramid network is used to generate multi-scale feature maps for both the ResNet and the Swin backbones.

The detection decoder borrows from Sparse R-CNN: it takes a set of proposal boxes as input, crops RoI features from the feature map generated by the image encoder, and sends them to the detection head to obtain box regression and classification results. The detection decoder consists of 6 cascaded stages.

Training

During training, the researchers first construct the diffusion process from ground-truth boxes to noisy boxes, and then train the model to reverse this process. Algorithm 1 below provides the pseudocode of the DiffusionDet training process.

[Algorithm 1: DiffusionDet training]

Ground-truth box padding. In modern object detection benchmarks, the number of instances of interest usually varies from image to image. The researchers therefore first pad the original ground-truth boxes with additional boxes so that the total number of boxes is a fixed number N_train. They explored several padding strategies, such as repeating existing ground-truth boxes, concatenating random boxes, or concatenating image-sized boxes.
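The random-box padding strategy might look roughly like the sketch below. The function names, the normalized (cx, cy, w, h) layout, the Gaussian(0.5, 1/6) sampling and the subsampling branch for images with more than N_train objects are all assumptions made for illustration; the text above only states that boxes are padded to a fixed count.

```python
import random
from typing import List

Box = List[float]  # (cx, cy, w, h) normalized to [0, 1]; an assumed layout


def random_box(rng: random.Random) -> Box:
    """Draw one random box; the Gaussian(0.5, 1/6) choice is an assumption."""
    def clipped(low: float) -> float:
        return min(max(rng.gauss(0.5, 1.0 / 6.0), low), 1.0)
    # Centers may touch 0, widths and heights stay strictly positive.
    return [clipped(0.0), clipped(0.0), clipped(1e-4), clipped(1e-4)]


def pad_boxes(gt_boxes: List[Box], n_train: int, rng: random.Random) -> List[Box]:
    """Return exactly n_train boxes: the ground truth plus random padding."""
    if len(gt_boxes) >= n_train:
        # More objects than slots: keep a random subset (hypothetical handling).
        return [list(b) for b in rng.sample(gt_boxes, n_train)]
    padded = [list(b) for b in gt_boxes]
    padded += [random_box(rng) for _ in range(n_train - len(padded))]
    return padded


rng = random.Random(0)
gt = [[0.30, 0.40, 0.20, 0.25], [0.70, 0.55, 0.10, 0.30]]
padded = pad_boxes(gt, n_train=8, rng=rng)
print(len(padded))
```

The ground-truth boxes stay at the front of the list, so a later matching step can still tell real targets from padding.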

Box corruption. The researchers add Gaussian noise to the padded ground-truth boxes. The noise scale is controlled by ᾱ_t in formula (1) below, which follows a monotonically decreasing cosine schedule over the time steps t. Here z_0 denotes the padded ground-truth boxes and z_t the noisy boxes at step t.

q(z_t | z_0) = N(z_t | √ᾱ_t · z_0, (1 − ᾱ_t) · I)    (1)
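A minimal sketch of the corruption step, assuming the standard forward diffusion form z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard Gaussian. The cosine schedule below, including its offset `s = 0.008`, follows the commonly used formulation and is an assumption here; the helper names are hypothetical.

```python
import math
import random
from typing import List


def alpha_bar(t: int, num_steps: int, s: float = 0.008) -> float:
    """Cumulative signal level of a cosine schedule: 1 at t=0, near 0 at t=num_steps."""
    def f(u: float) -> float:
        return math.cos((u / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)


def corrupt(box: List[float], t: int, num_steps: int,
            rng: random.Random) -> List[float]:
    """Sample a noisy box z_t from a clean box z_0 at time step t."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in box]


rng = random.Random(0)
clean = [0.30, 0.40, 0.20, 0.25]
for t in (0, 250, 500, 1000):
    print(t, round(alpha_bar(t, 1000), 4), corrupt(clean, t, 1000, rng))
```

At t = 0 the box is returned unchanged, and as t approaches the final step the sample becomes almost pure Gaussian noise, which is what lets inference start from random boxes.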

Training loss. The detection decoder takes the N_train corrupted boxes as input and outputs N_train predictions, each consisting of a category classification and box coordinates. A set prediction loss is then applied to this set of N_train predictions.
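As an illustration of what a set prediction loss involves, the sketch below matches predictions to ground-truth boxes one-to-one by minimizing total L1 box distance. This is a simplified stand-in: it uses brute-force search instead of an efficient assignment solver, it ignores classification and IoU terms, and the exact assignment rule used by DiffusionDet is not described in the text above. All names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Box = List[float]  # (cx, cy, w, h); an assumed box layout


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def match(pred_boxes: List[Box], gt_boxes: List[Box]) -> Tuple[float, Tuple[int, ...]]:
    """Return (minimum total cost, assignment).

    assignment[j] is the index of the prediction matched to ground-truth box j.
    Brute force over permutations, so only suitable for tiny examples.
    """
    best_cost, best_assignment = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_distance(pred_boxes[i], gt_boxes[j])
                   for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment


preds = [[0.9, 0.9, 0.1, 0.1], [0.2, 0.2, 0.3, 0.3], [0.5, 0.5, 0.5, 0.5]]
gts = [[0.2, 0.2, 0.3, 0.3], [0.9, 0.9, 0.1, 0.1]]
cost, assignment = match(preds, gts)
print(cost, assignment)
```

Predictions left unmatched (index 2 in this example) would be trained toward a "no object" class in a full loss.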

Inference

The inference process of DiffusionDet is a denoising sampling process from noise to object boxes. Starting from boxes sampled from a Gaussian distribution, the model gradually refines its predictions, as shown in Algorithm 2 below.

[Algorithm 2: DiffusionDet sampling]

Sampling steps. At each sampling step, the random boxes or the estimated boxes from the previous sampling step are sent to the detection decoder to predict category classification and box coordinates. After the boxes of the current step are obtained, DDIM is used to estimate the boxes for the next step.
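The DDIM update can be sketched as follows, assuming the deterministic variant (no extra noise injected) and a decoder that predicts the clean boxes directly. The function name and the example values are hypothetical; `a_t` and `a_next` stand for the cumulative schedule values ᾱ at the current and the next time step.

```python
import math
from typing import List


def ddim_step(z_t: List[float], z0_pred: List[float],
              a_t: float, a_next: float) -> List[float]:
    """Deterministic DDIM update from step t to the next (less noisy) step.

    Requires 0 <= a_t < 1 so the implied noise can be recovered.
    """
    out = []
    for zt, z0 in zip(z_t, z0_pred):
        # Noise implied by the current sample and the predicted clean value.
        eps = (zt - math.sqrt(a_t) * z0) / math.sqrt(1.0 - a_t)
        # Re-noise the prediction to the noise level of the next step.
        out.append(math.sqrt(a_next) * z0 + math.sqrt(1.0 - a_next) * eps)
    return out


noisy = [0.8, -0.3, 1.2, 0.1]      # current noisy box z_t
predicted = [0.5, 0.5, 0.2, 0.2]   # decoder's estimate of the clean box
print(ddim_step(noisy, predicted, a_t=0.2, a_next=0.6))
```

When `a_next` equals 1 the update returns the predicted clean box itself, which corresponds to the final sampling step.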

Box updates. To make inference more consistent with training, the researchers propose a box updating strategy that revives undesired boxes by replacing them with random boxes. Specifically, they first filter out the undesired boxes whose scores fall below a certain threshold, and then concatenate the remaining boxes with new random boxes sampled from a Gaussian distribution.
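A rough sketch of this filter-and-refill step, under the assumption that each box carries a single confidence score and that replacements are drawn from a standard Gaussian. The function name `renew_boxes` and the threshold value are hypothetical.

```python
import random
from typing import List

Box = List[float]  # four box coordinates in the noisy box space (assumed)


def renew_boxes(boxes: List[Box], scores: List[float], threshold: float,
                rng: random.Random) -> List[Box]:
    """Keep boxes scoring at least `threshold`; refill the rest with random boxes."""
    kept = [list(b) for b, s in zip(boxes, scores) if s >= threshold]
    num_new = len(boxes) - len(kept)
    fresh = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(num_new)]
    return kept + fresh  # total box count stays constant


rng = random.Random(0)
boxes = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.2, 0.2], [0.9, 0.1, 0.1, 0.1]]
scores = [0.92, 0.05, 0.61]
renewed = renew_boxes(boxes, scores, threshold=0.5, rng=rng)
print(renewed)
```

Keeping the box count constant means the decoder sees the same mix of plausible and random boxes at every sampling step, similar to what it saw during training.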

Once-for-all. Thanks to the random box design, DiffusionDet can be evaluated with any number of random boxes and any number of sampling steps. By comparison, previous methods rely on the same number of boxes during training and evaluation, and their detection decoder is used only once in the forward pass.

Experimental results

In the experiments, the researchers first demonstrate the once-for-all property of DiffusionDet, and then compare DiffusionDet with previous well-established detectors on the MS-COCO and LVIS datasets.

The main property of DiffusionDet is that it is trained once for all inference settings. Once the model is trained, the number of boxes and the number of sampling steps can be changed at inference time, as shown in Figure 4 below. DiffusionDet achieves higher accuracy with more boxes and/or more refinement steps, at the cost of higher latency. A single DiffusionDet can therefore be deployed to multiple scenarios and reach the desired speed-accuracy trade-off without retraining the network.

[Figure 4]

The researchers compared the object detection performance of DiffusionDet with previous detectors on the MS-COCO and LVIS datasets. Table 1 below shows the comparison on MS-COCO. DiffusionDet without the refinement step achieves 45.5 AP with the ResNet-50 backbone, surpassing established methods such as Faster R-CNN, RetinaNet, DETR and Sparse R-CNN. DiffusionDet also improves steadily as the backbone is scaled up.

[Table 1]

Table 2 below shows the results on the more challenging LVIS dataset. It can be seen that DiffusionDet achieves significant gains by using more refinement steps.

[Table 2]

For more experimental details, please refer to the original paper.

Statement:
This article is reproduced from 51cto.com.