The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work that you would like to share, please feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The first author of the paper is Chen Jiahao, a second-year master's student in the School of Computer Science, Sun Yat-sen University. His research direction is neural rendering and three-dimensional reconstruction, and his supervisor is Professor Li Guanbin. The paper was his first work. The corresponding author of the paper is Professor Li Guanbin from the School of Computer Science and the Human-Machine-Object Intelligent Integration Laboratory of Sun Yat-sen University, a doctoral supervisor and winner of the National Outstanding Youth Fund. The team's main research areas are visual perception, scene modeling, understanding and generation. To date, he has published more than 150 CCF Category A/CAS Area 1 papers, which have been cited more than 12,000 times on Google Scholar. He has won the Wu Wenjun Artificial Intelligence Outstanding Youth Award and other honors.
Since they were proposed, Neural Radiance Fields (NeRF) have received great attention due to their excellent performance in novel view synthesis and three-dimensional reconstruction. Although a lot of work tries to improve the rendering quality or running speed of NeRF, one practical problem is rarely mentioned: if unexpected transient distractors appear in the scene to be modeled, how can we eliminate their impact on NeRF?
In this article, researchers from Sun Yat-sen University, Cardiff University, University of Pennsylvania and Simou Technology studied this problem in depth and proposed a novel paradigm to solve it. By summarizing the advantages and disadvantages of existing methods and broadening the way existing technologies are applied, the method can accurately distinguish static and transient elements in a variety of scenes and improve the rendering quality of NeRF. The paper has also been shortlisted as a CVPR 2024 Best Paper Candidate.
- Paper link: https://arxiv.org/abs/2403.17537
- Project link: https://www.sysu-hcp.net/projects/cv/132.html
Let us understand this work together.
Novel view synthesis is an important task in computer vision and graphics. Given multi-view images and their camera poses, the model needs to generate the image corresponding to a target pose. NeRF has achieved important breakthroughs on this task, but its effectiveness depends on the assumption that the scene is static.
Specifically, NeRF requires that the scene to be modeled remains stationary during capture and that the content of the multi-view images is consistent. In reality, this requirement is difficult to meet. For example, when shooting outdoors, vehicles or passers-by that are not part of the scene may move through the frame at random, and when shooting indoors, an object or shadow may inadvertently block the lens. We call such elements, which do not belong to the scene and exhibit motion or inconsistency, transient distractors. If they are not eliminated, they introduce artifacts into NeRF's rendering results.
The presence of transient distractors (yellow boxes) can lead to a large number of artifacts.
Current methods for handling transient distractors can be roughly divided into two types.
The first type uses existing segmentation models, such as semantic segmentation, to explicitly obtain masks of the distractors, and then masks out the corresponding pixels when training NeRF. Although such methods can produce accurate segmentation results, they are not general: prior knowledge about the distractors (such as object category or an initial mask) must be available in advance, and the model must be able to recognize these distractors.
Unlike the first type, the second type uses heuristic algorithms to handle transient distractors implicitly while training NeRF and requires no prior knowledge. Although such methods are more general, they cannot accurately separate transient distractors from static scene elements because of their design complexity and the highly ill-posed nature of the problem. For example, since the color and texture of a transient pixel are inconsistent across viewing angles, the color residual between the predicted and ground-truth values of that pixel is often larger than the residual of a static pixel when training NeRF. However, high-frequency static details in the scene also produce large residuals because they are difficult to fit. As a result, methods that remove transient distractors by thresholding the residual can easily lose high-frequency static details.
Comparison between existing methods and the Heuristics-Guided Segmentation (HuGS) proposed in this paper. When a static scene is disturbed by transient distractors, (a) segmentation-based methods rely on prior knowledge and suffer from artifacts because they cannot identify unexpected transient objects (such as the pizza); (b) heuristic-based methods are more general but not accurate enough (e.g. the high-frequency static tablecloth texture is lost); (c) HuGS combines their advantages and accurately separates transient distractors from static scene elements, thereby significantly improving the results of NeRF.
Methods based on segmentation models are accurate but not general, while methods based on heuristics are general but not accurate. So, can the two be combined so that each compensates for the other's weaknesses, yielding a method that is both accurate and general?
To this end, the authors of the paper proposed a novel paradigm called Heuristics-Guided Segmentation (HuGS), motivated by the idea of "horses for courses". By combining hand-designed heuristics with a prompt-driven segmentation model, HuGS can accurately differentiate between transient distractors and static elements in a scene without additional prior knowledge.
Specifically, HuGS first uses heuristics to roughly distinguish static and transient elements in the multi-view images and outputs coarse cues, and then uses these coarse cues to guide the segmentation model to generate more accurate segmentation masks. When training NeRF, these masks are used to mask out transient pixels and eliminate the impact of transient distractors on NeRF.
The design idea of HuGS.
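To make the final masking step concrete, here is a minimal sketch of a photometric loss that ignores transient pixels. It is an illustration under assumptions, not the paper's training code: the function name `masked_photometric_loss`, the squared-error form, and the NumPy setting are all hypothetical choices made for this example.

```python
import numpy as np

def masked_photometric_loss(pred_rgb, target_rgb, static_mask):
    """Mean squared color error computed over static pixels only.

    pred_rgb, target_rgb: float arrays of shape (N, 3), colors in [0, 1].
    static_mask: bool array of shape (N,), True for static pixels.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    pred_rgb = np.asarray(pred_rgb, dtype=np.float64)
    target_rgb = np.asarray(target_rgb, dtype=np.float64)
    weights = np.asarray(static_mask, dtype=np.float64)
    per_pixel = np.sum((pred_rgb - target_rgb) ** 2, axis=-1)
    # Transient pixels get weight 0, so they contribute nothing to the loss.
    return float(np.sum(weights * per_pixel) / max(np.sum(weights), 1.0))

pred = np.array([[0.2, 0.2, 0.2], [0.9, 0.1, 0.1]])
target = np.array([[0.2, 0.2, 0.2], [0.1, 0.9, 0.9]])
mask = np.array([True, False])  # second pixel is covered by a distractor
loss = masked_photometric_loss(pred, target, mask)
```

In this toy case the only mismatched pixel is masked out, so the loss no longer pushes the model to explain the distractor.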
In terms of implementation, the authors of the paper chose the Segment Anything Model (SAM) as the segmentation model of HuGS. SAM is currently the most advanced prompt-driven segmentation model; it accepts different types of prompts such as points, boxes, and masks, and outputs the corresponding instance segmentation masks.
As for the heuristics, the authors proposed a combined heuristic after in-depth analysis: a heuristic based on Structure-from-Motion (SfM) is used to capture the high-frequency static details of the scene, while a heuristic based on color residuals is used to capture low-frequency static details. The coarse static masks output by the two heuristics differ from each other, and their union is used to guide SAM toward a more accurate static mask. By seamlessly combining these two heuristics, HuGS can robustly identify various types of static elements across varying texture details.
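As a rough illustration of prompting a segmentation model with points, the sketch below wraps a predictor that exposes `set_image()` and `predict(point_coords=..., point_labels=..., multimask_output=...)`, which is the interface of `SamPredictor` in the `segment_anything` package. The wrapper `static_mask_from_points`, and the choice to pass all static points as a single foreground prompt, are assumptions made for this example; the article does not specify how the authors feed points to SAM.

```python
import numpy as np

def static_mask_from_points(predictor, image, static_points_xy):
    """Prompt a SAM-style predictor with static points and return one mask.

    predictor: object exposing set_image() and predict() in the style of
        segment_anything.SamPredictor (assumed to be constructed elsewhere).
    image: uint8 RGB array of shape (H, W, 3).
    static_points_xy: array of shape (N, 2), pixel coordinates as (x, y).
    (Hypothetical helper for illustration; not the authors' code.)
    """
    points = np.asarray(static_points_xy, dtype=np.float32).reshape(-1, 2)
    if len(points) == 0:
        # No prompt available: treat nothing as confirmed static.
        return np.zeros(image.shape[:2], dtype=bool)
    labels = np.ones(len(points), dtype=np.int32)  # 1 marks a foreground prompt
    predictor.set_image(image)
    masks, _scores, _low_res = predictor.predict(
        point_coords=points,
        point_labels=labels,
        multimask_output=False,
    )
    return masks[0].astype(bool)
```

With the real library, `predictor` would be built from a downloaded checkpoint; that setup is omitted here because the checkpoint path and model variant are not given in the article.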
HuGS flowchart. (a) Given unordered multi-view images of a static scene with transient distractors, HuGS first obtains two types of heuristic information. (b) The SfM-based heuristic uses SfM to distinguish static feature points from transient feature points, and then uses the sparse static feature points as prompts to guide SAM to generate dense static masks. (c) The color residual-based heuristic relies on a partially trained NeRF (i.e., trained for only a few thousand iterations). The color residuals between its predicted images and the real images can be used to generate another set of static masks. (d) The combination of the two different masks ultimately guides SAM to generate (e) an accurate static mask for each image.
SfM-based heuristic algorithm
SfM is a technique that reconstructs three-dimensional structure from two-dimensional images. After extracting 2D features from the images, SfM performs matching and geometric verification on the features and reconstructs a sparse 3D point cloud. SfM is often used to estimate camera poses for NeRF, and the authors of the paper found that it can also be used to distinguish static and transient elements of the scene. Define the number of matches of a two-dimensional feature point as the number of other two-dimensional feature points that correspond to the same three-dimensional point in the point cloud. Then two-dimensional feature points from static areas have more matches than feature points from transient areas. Based on this finding, a threshold can be set on the number of matches to filter out static feature points, and SAM can then be used to convert the static feature points into static masks.
To verify this finding, the authors of the paper collected statistics on the Kubric dataset. As shown in the figure below, there are significant differences in the number of feature point matches across different image areas.
Another visualization shows that reasonable threshold settings can remove transient feature points while retaining static feature points.
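The thresholding step can be sketched as follows. This is a hedged illustration only: the function `filter_static_features` is hypothetical, and normalizing the match count by the number of other images (so that the threshold lies in [0, 1]) is an assumption made for this example rather than something the article states.

```python
import numpy as np

def filter_static_features(match_counts, num_images, threshold):
    """Keep feature points whose normalized match count exceeds a threshold.

    match_counts: int array of shape (N,), the number of other 2D features
        that share the same reconstructed 3D point as each feature.
    num_images: total number of images; used to normalize the counts
        (an assumed normalization, chosen for illustration).
    threshold: value in [0, 1]; higher values are stricter.
    Returns a bool array of shape (N,), True for features treated as static.
    """
    counts = np.asarray(match_counts, dtype=np.float64)
    normalized = counts / max(num_images - 1, 1)
    return normalized > threshold

counts = np.array([150, 80, 3, 0, 199])
keep = filter_static_features(counts, num_images=200, threshold=0.2)
```

Features with only a handful of matches (here the ones with 3 and 0) are discarded, while well-matched features survive and could then serve as point prompts for the segmentation model.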
Left: histogram of the number of matches of feature points from different image areas. The number of matches of static-area feature points is evenly distributed over the [0, 200] interval, while the number of matches of transient-area feature points approaches 0 and is concentrated in the [0, 10] interval. Right: curves of the density of remaining feature points in different image areas after filtering, as the threshold changes. The density of remaining feature points in the entire image and in the static area decreases linearly as the threshold increases, while the density in the transient area decreases exponentially and becomes almost 0 once the threshold exceeds 0.2.
Visualized distribution of the remaining feature points of two images from different viewpoints as the threshold increases. The remaining feature points located in the transient region are gradually removed, while most of the feature points in the static region are retained.
Color residual-based heuristic
While the SfM-based heuristic performs well in most scenes, it cannot capture smooth static textures well, because smooth textures lack salient features and are difficult for SfM's feature extraction algorithm to detect.
To identify low-frequency textures, the authors of the paper introduced a heuristic based on color residuals: first partially train NeRF on the original multi-view images (that is, for only a few thousand iterations) to obtain an underfitted model, and then compute the color residual between the rendered image and the target image. As mentioned in the background introduction, the color residuals of low-frequency static texture areas are smaller than the residuals of other types of areas, so a threshold can be set on the color residuals to obtain a coarse mask of low-frequency static textures.
The mask obtained by color residual can be complemented by the mask obtained by SfM to form a complete result.
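A minimal sketch of the residual thresholding and of the union of the two coarse masks is shown below. The helper names `residual_static_mask` and `combine_static_masks`, the use of the Euclidean norm over RGB channels, and the threshold value are assumptions made for this example; the toy arrays stand in for a rendering from a partially trained model.

```python
import numpy as np

def residual_static_mask(rendered, target, threshold):
    """Mark pixels whose color residual is below a threshold as static.

    rendered, target: float arrays of shape (H, W, 3), colors in [0, 1].
    The residual is the per-pixel Euclidean norm over RGB (assumed metric).
    """
    diff = np.asarray(rendered, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    residual = np.linalg.norm(diff, axis=-1)
    return residual < threshold

def combine_static_masks(sfm_mask, residual_mask):
    """Union of two coarse static masks: static if either heuristic says so."""
    return np.logical_or(sfm_mask, residual_mask)

target = np.zeros((2, 2, 3))
rendered = target.copy()
rendered[0, 1] = [0.5, 0.5, 0.5]  # poorly fitted pixel (large residual)
rendered[1, 0] = [0.5, 0.5, 0.5]  # another poorly fitted pixel
res_mask = residual_static_mask(rendered, target, threshold=0.1)
sfm_mask = np.array([[False, True], [False, False]])  # toy SfM-derived mask
combined = combine_static_masks(sfm_mask, res_mask)
```

In this toy case the pixel at row 0, column 1 has a large residual but is recovered by the SfM-derived mask, which mirrors how the two heuristics are meant to complement each other.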
Combination of the two heuristics, where (a) is the input target image and (d) is the NeRF rendering result after only five thousand iterations. The static mask (b) produced by the SfM-based heuristic captures high-frequency static details (such as the box texture) while missing smooth static parts (such as the white chair back). The static mask (e) derived from the color residual-based heuristic, and the segmentation mask (f) obtained by using it alone to guide SAM, achieve the opposite effect. Their union (c) separates out the transient distractors (i.e. the pink balloons) while covering all static elements.
Shown here are the visualized segmentation process of HuGS in two different real scenes and a comparison of the rendering results of the baseline model Mip-NeRF 360 before and after applying the static masks. With the help of the combined heuristics and SAM, HuGS generates accurate static masks; after the static masks are applied, Mip-NeRF 360 eliminates a large number of artifacts, and the rendering quality of its RGB and depth maps improves significantly.
Qualitative/quantitative rendering result comparison
Shown here are the experimental results of the proposed method on three datasets and two baseline models, together with comparisons against existing methods. Existing methods either fail to eliminate the artifacts caused by transient distractors or erase too much static texture detail. In contrast, the proposed method better preserves static details while effectively eliminating artifacts.
Comparison of qualitative/quantitative segmentation results
The authors of the paper also compared HuGS with existing segmentation algorithms on the Kubric dataset. The results show that even when prior knowledge is provided, existing segmentation models such as semantic segmentation and video segmentation still perform poorly, because none of them is designed for this task. Although existing heuristic-based methods can roughly locate transient distractors, they cannot produce more precise segmentation results. In contrast, by combining heuristics with a segmentation model, HuGS accurately separates transient distractors from static scene elements without additional prior knowledge.
Ablation experiment results
The authors of the paper also verified the contribution of each component of HuGS by removing different components. The results show that the model (b) lacking the SfM-based heuristic does not reconstruct the low-frequency static texture in the blue box well, while the models (c) and (d) lacking the color residual-based heuristic lose the high-frequency static details in the yellow box. In comparison, the full method (f) gives the best numerical metrics and visual results.
The paper proposes a novel heuristics-guided segmentation paradigm that effectively addresses the common problem of transient distractors in real-world NeRF training. By strategically combining the complementary strengths of hand-designed heuristics and state-of-the-art segmentation models, the method achieves highly accurate segmentation of transient distractors in diverse scenes without any prior knowledge. Through carefully designed heuristics, the method robustly captures both high- and low-frequency static scene elements. Extensive experiments demonstrate its advantages over existing methods.