


Method for indoor layout estimation using a panoramic visual self-attention model
1. Research background
This method addresses the task of indoor layout estimation: the input is a 2D image and the output is a 3D model of the scene it depicts. Because directly predicting a 3D model is complex, the task is usually decomposed into predicting three kinds of lines in the 2D image — wall lines, ceiling lines, and floor lines — and then reconstructing the 3D room model from this line information in a post-processing step. The resulting 3D model can then be used in applications such as indoor scene reconstruction and VR house tours. Unlike depth estimation, this method recovers the spatial geometry from estimated wall lines. The advantage is that the reconstructed wall geometry is flatter; the disadvantage is that it cannot recover the geometry of detailed objects in the scene, such as sofas and chairs.
Depending on the input image, methods can be divided into perspective-based and panorama-based approaches. Compared with perspective views, panoramas offer a wider field of view and richer image information. As panoramic capture devices become more widespread, panoramic data is growing rapidly, so many panorama-based indoor layout estimation algorithms are now being actively studied.
Representative algorithms include LayoutNet, HorizonNet, HoHoNet, and LED2-Net. Most of these methods are based on convolutional neural networks, and they predict wall lines poorly in structurally complex regions affected by noise or self-occlusion, producing errors such as discontinuous wall lines and misplaced wall lines. In wall line estimation, focusing only on local features leads to exactly this kind of error; the position of the entire wall line must instead be estimated using the global information in the panorama. CNNs excel at extracting local features, while Transformers are better at capturing global information, so Transformer methods can be applied to indoor layout estimation to improve performance.
Because of its dependence on training data, applying a Transformer pre-trained only on perspective images to panoramic indoor layout estimation gives unsatisfactory results. The PanoViT model first maps the panorama into a feature space, uses a Transformer to learn the panorama's global information in that feature space, and also exploits the apparent geometric structure of the panorama to complete the indoor layout estimation task.
2. Method introduction and results
1. PanoViT network structure
The framework contains four modules: a backbone, a vision transformer encoder, a layout prediction module, and a boundary enhancement module. The backbone maps the panorama into feature space; the vision transformer encoder learns global correlations in that feature space; the layout prediction module converts the features into wall line, ceiling line, and floor line information, from which post-processing recovers the 3D room model; and the boundary enhancement module highlights the role of boundary information in the panorama for layout estimation.
① Backbone module
Directly using a transformer to extract panoramic features does not work well, whereas CNN-based methods have proven effective: CNN features can be used to predict room layouts. PanoViT therefore uses a CNN backbone to extract feature maps of the panorama at different scales and learns the panorama's global information on those feature maps. Experimental results show that applying the transformer in feature space is significantly better than applying it directly to the panorama.
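The idea of a backbone producing feature maps at several scales can be sketched with a stand-in for the CNN: here, average pooling at strides 4/8/16 plays the role of the (real, learned) backbone purely to illustrate the multi-scale shapes involved. All sizes and strides are illustrative assumptions, not PanoViT's actual configuration.

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool an (H, W, C) array with a k x k window and stride k."""
    h, w, c = x.shape
    return x[: h - h % k, : w - w % k].reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def multi_scale_features(pano, strides=(4, 8, 16)):
    """Stand-in for a CNN backbone: return feature maps at several scales.
    A real backbone (e.g. a ResNet) would output learned channels instead."""
    return [avg_pool2d(pano, s) for s in strides]

pano = np.random.rand(512, 1024, 3)   # equirectangular panorama, H x W x C
feats = multi_scale_features(pano)
print([f.shape for f in feats])       # [(128, 256, 3), (64, 128, 3), (32, 64, 3)]
```

The transformer then operates on these smaller feature maps rather than on the half-megapixel panorama itself, which is what makes global attention affordable.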
② Vision transformer encoder module
The Transformer architecture consists of three main parts: patch sampling, patch embedding, and multi-head attention. The input includes both the panorama's feature maps and the original image, with a different patch sampling method for each: uniform sampling for the original image and horizontal sampling for the feature maps. HorizonNet concluded that horizontal features are especially important for wall line estimation; following this conclusion, the feature maps are compressed along the vertical direction during embedding. Recurrent position embedding (Recurrent PE) combines features of different scales, which are processed by the multi-head attention layers to produce a feature vector whose length matches the horizontal extent of the original image. Different decoder heads then output the corresponding wall line distributions.
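The vertical compression step can be illustrated with a minimal sketch: each feature map is squashed along its height into one token per column (HorizonNet-style column features), then resampled along the width so every scale contributes the same number of tokens. The pooling choice (mean) and all sizes are assumptions for illustration.

```python
import numpy as np

def column_tokens(feat, out_width):
    """Compress an (H, W, C) feature map vertically into a sequence of
    column features, one token per horizontal position."""
    h, w, c = feat.shape
    cols = feat.mean(axis=0)                        # (W, C): squash the vertical axis
    # linearly interpolate along width so every scale yields out_width tokens
    idx = np.linspace(0, w - 1, out_width)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, w - 1)
    frac = (idx - lo)[:, None]
    return cols[lo] * (1 - frac) + cols[hi] * frac  # (out_width, C)

feat = np.random.rand(64, 128, 8)    # a mid-scale feature map (hypothetical sizes)
tokens = column_tokens(feat, out_width=1024)
print(tokens.shape)                  # (1024, 8)
```

Each of the 1024 tokens then corresponds to one image column, matching the horizontal resolution at which the wall line distribution is predicted.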
Recurrent position embedding (random cyclic position encoding) exploits the fact that shifting a panorama along the horizontal direction does not change its visual content, so the starting position along the horizontal axis is chosen randomly in each training step. This makes the training process focus on the relative positions between patches rather than their absolute positions.
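The underlying invariance is easy to demonstrate: because an equirectangular panorama wraps around horizontally, a circular shift changes absolute positions while preserving all content, and rolling back by the same offset recovers the original exactly. This sketch applies the shift to a token sequence; where in the pipeline the shift is applied is an assumption here.

```python
import numpy as np

def recurrent_shift(feat_seq, rng):
    """Randomly roll a (W, C) token sequence along the horizontal axis.
    The shift changes absolute positions but preserves all visual content,
    so training emphasizes relative rather than absolute positions."""
    w = feat_seq.shape[0]
    offset = rng.integers(0, w)           # fresh random start each training step
    return np.roll(feat_seq, offset, axis=0), offset

rng = np.random.default_rng(0)
seq = np.arange(12, dtype=float).reshape(6, 2)   # toy sequence of 6 tokens
shifted, offset = recurrent_shift(seq, rng)
# rolling back by -offset recovers the original sequence exactly
assert np.array_equal(np.roll(shifted, -offset, axis=0), seq)
```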
③ Geometric information of the panorama
Making full use of the geometric information in the panorama improves indoor layout estimation performance. The boundary enhancement module in PanoViT emphasizes how to exploit the boundary information in the panorama, and the 3D loss reduces the impact of panoramic distortion.
The boundary enhancement module exploits the linear nature of wall lines: line information in the image is especially important for wall line detection, so boundaries should be highlighted to help the network understand the distribution of lines in the image. Boundary enhancement is performed in the frequency domain: the image's frequency-domain representation is obtained with a fast Fourier transform, sampled there with a mask, and transformed back with the inverse Fourier transform to yield an image with highlighted boundary information. The core of the module is the mask design. Since boundaries correspond to high-frequency information, the mask is first a high-pass filter; it then samples different frequency-domain directions according to the orientations of different lines. This approach is simpler to implement and more efficient than the traditional LSD (line segment detector) method.
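The FFT → mask → inverse-FFT pipeline can be sketched with a plain radial high-pass filter. The cutoff value and the circular mask shape are simplifying assumptions; PanoViT's actual mask additionally samples directionally, which is only indicated by a comment here.

```python
import numpy as np

def enhance_boundaries(img, cutoff=0.1):
    """Highlight edges via a frequency-domain high-pass filter:
    FFT -> suppress low frequencies near the origin -> inverse FFT.
    `cutoff` (hypothetical) sets the fraction of the spectrum radius removed."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    mask = r > cutoff * min(h, w)     # high-pass: keep only high frequencies
    # directional masks (e.g. keeping a horizontal frequency band) could be
    # combined here to emphasize lines of a particular orientation
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

img = np.zeros((64, 64))
img[:32] = 1.0                        # a single horizontal boundary at row 31/32
edges = enhance_boundaries(img)
# the filtered response is strongest near the boundary row, weak in flat regions
print(np.abs(edges[31]).mean() > np.abs(edges[16]).mean())   # True
```

Compared with running a line segment detector, this is a few dense array operations with no per-line bookkeeping, which is where the efficiency claim comes from.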
Previous work computed pixel distances on the panorama as the estimation error. Because of panoramic distortion, pixel distance in the image is not proportional to real distance in the 3D world, so PanoViT uses a 3D loss function that computes the estimation error directly in 3D space.
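Why pixel distance misleads can be shown with a minimal sketch, assuming a known camera height (1.6 m here, a hypothetical value) and the standard equirectangular projection: floor-boundary pixels are lifted to 3D floor points, and the error is measured there. This is an illustration of the principle, not PanoViT's exact loss formulation.

```python
import numpy as np

def floor_boundary_to_3d(u, v, w=1024, h=512, cam_height=1.6):
    """Project floor-boundary pixels of an equirectangular panorama to 3-D
    floor points, assuming a known camera height (hypothetical 1.6 m)."""
    lon = (u / w - 0.5) * 2 * np.pi          # longitude from horizontal position
    depression = (v / h - 0.5) * np.pi       # angle below the horizon (v > h/2)
    d = cam_height / np.tan(depression)      # horizontal distance to the wall base
    return np.stack([d * np.sin(lon),
                     np.full_like(d, -cam_height),
                     d * np.cos(lon)], axis=-1)

def loss_3d(v_pred, v_true, u):
    """Mean Euclidean error measured in 3-D room coordinates."""
    return np.linalg.norm(floor_boundary_to_3d(u, v_pred)
                          - floor_boundary_to_3d(u, v_true), axis=-1).mean()

u = np.arange(1024, dtype=float)
# the same 5-pixel error costs far more 3-D distance near the horizon (v ~ 266)
# than near the bottom of the image (v ~ 460) -- pixel loss cannot express this
near_horizon = loss_3d(np.full(1024, 266.0), np.full(1024, 271.0), u)
near_bottom = loss_3d(np.full(1024, 460.0), np.full(1024, 465.0), u)
print(near_horizon > near_bottom)   # True
```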
2. Model results
Experiments are conducted on the Matterport3D and PanoContext public datasets, using 2D IoU and 3D IoU as evaluation metrics and comparing against SOTA methods. The results show that PanoViT essentially reaches the best level on both datasets and is only slightly behind LED2-Net on a few specific metrics. Comparing visualizations against HoHoNet shows that PanoViT can accurately identify the direction of wall lines in complex scenes. Ablation experiments on the Recurrent PE, boundary enhancement, and 3D loss modules verify the effectiveness of each.
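The 2D IoU metric itself is straightforward: intersection over union of the predicted and ground-truth layout regions. A minimal sketch on binary floor-plan masks (the grid size and rectangles are arbitrary toy values):

```python
import numpy as np

def iou_2d(mask_pred, mask_true):
    """2D IoU between binary layout masks: intersection area / union area."""
    inter = np.logical_and(mask_pred, mask_true).sum()
    union = np.logical_or(mask_pred, mask_true).sum()
    return inter / union

a = np.zeros((100, 100), dtype=bool); a[10:60, 10:60] = True   # predicted footprint
b = np.zeros((100, 100), dtype=bool); b[20:70, 20:70] = True   # ground-truth footprint
print(round(iou_2d(a, b), 4))   # 0.4706
```

3D IoU follows the same idea with voxelized room volumes instead of 2D masks.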
3. How to use in ModelScope
- Open the ModelScope website: https://modelscope.cn/home.
- Search for "Panorama Indoor Frame Estimation".
- Click Quick Use → Online Environment → Quick Experience to open the notebook.
- Paste the sample code from the model homepage, upload a 1024×512 panoramic image, modify the image loading path, and run it to output the wall line prediction results.
The above is the detailed content of "Method for indoor layout estimation using a panoramic visual self-attention model". For more information, please follow other related articles on the PHP Chinese website!
