CVPR 2024 | With the help of neural structured light, Zhejiang University realizes real-time acquisition and reconstruction of dynamic three-dimensional phenomena
Efficient, high-quality reconstruction of dynamic three-dimensional physical phenomena such as smoke is an important problem in scientific research, with broad application prospects in aerodynamic design verification, three-dimensional meteorological observation and other fields. By reconstructing three-dimensional density sequences that change over time, scientists can better understand and verify complex physical phenomena in the real world.
Figure 1 illustrates the importance of observing dynamic three-dimensional physical phenomena for scientific research. The picture shows NFAC, the world's largest wind tunnel, running aerodynamic experiments on a full-scale commercial truck [1].
However, acquiring and reconstructing dynamic three-dimensional density fields quickly and at high quality in the real world is very difficult. First, three-dimensional information cannot be measured directly by common two-dimensional image sensors such as cameras. In addition, rapidly changing dynamic phenomena place high demands on physical acquisition: a complete sampling of a single three-dimensional density field must finish within a very short time window, otherwise the density field itself will have changed. The fundamental challenge is bridging the information gap between the measurement samples and the reconstructed dynamic three-dimensional density field.
Current mainstream research uses prior knowledge to compensate for the lack of information in the measurement samples; the computational cost is high, and reconstruction quality degrades when the prior assumptions are not met. Departing from this mainstream approach, the research team at the National Key Laboratory of Computer-Aided Design and Graphics Systems of Zhejiang University argues that the key to solving the problem lies in increasing the information content of each measurement sample.
The research team uses AI not only to optimize the reconstruction algorithm but also to help design the physical acquisition itself, achieving fully automatic joint optimization of software and hardware driven by a single objective, which fundamentally increases the amount of information about the target object in each measurement sample. By simulating real-world physical optics, the AI decides how to project structured light, how to capture the corresponding images, and how to reconstruct a dynamic three-dimensional density field from the measurement samples. In the end, using only a lightweight hardware prototype containing a single projector and a small number of cameras (1 or 3), the team reduced the number of structured light patterns needed to model a single three-dimensional density field (spatial resolution 128×128×128) to 6, achieving efficient acquisition of 40 three-dimensional density fields per second.
In the reconstruction algorithm, the team innovatively proposed a lightweight one-dimensional decoder that takes the local incident light as part of its input and shares decoder parameters across the footage captured by different cameras, significantly reducing network complexity and increasing computation speed. To fuse the decoding results of different cameras, a structurally simple 3D U-Net fusion network was designed. Reconstructing a single three-dimensional density field takes only 9.2 milliseconds, a speedup of 2 to 3 orders of magnitude over SOTA work, achieving real-time, high-quality reconstruction of the three-dimensional density field. The related paper, "Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination", has been accepted by CVPR 2024, a top international academic conference in computer vision.
Paper link: https://svbrdf.github.io/publications/realtimedynamic/realtimedynamic.pdf
Research homepage: https://svbrdf.github.io/publications/realtimedynamic/project.html
Related work can be divided into two categories according to whether the lighting is controlled during the acquisition process.
The first category, based on uncontrolled lighting, requires no special light source and does not control illumination during acquisition, so its acquisition conditions are looser [2,3]. Since a single-view camera captures only a two-dimensional projection of a three-dimensional structure, it is hard to distinguish different three-dimensional structures at high quality. One remedy is to increase the number of sampled viewpoints, for example with dense camera arrays or light field cameras, which incurs high hardware cost. Another is to keep sparse viewpoint sampling and fill the information gap with various priors, such as heuristic priors, physical rules, or knowledge learned from existing data. When the prior assumptions do not hold in practice, the reconstruction quality of such methods deteriorates; moreover, their computational overhead is too high to support real-time reconstruction.
The second category uses controllable lighting, actively controlling illumination conditions during acquisition [4,5]. Such work encodes the lighting to probe the physical world more actively and relies less on priors, yielding higher reconstruction quality. Depending on whether a single light source or multiple sources are used simultaneously, this work can be further divided into scanning methods and illumination multiplexing methods. For dynamic subjects, the former must either achieve high scanning speed with expensive hardware or sacrifice the completeness of the results to lighten the acquisition burden. The latter significantly improves acquisition efficiency by programming multiple light sources simultaneously; however, for fast-changing density fields, the sampling efficiency of existing methods is still insufficient for high-quality real-time reconstruction [5].
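To make the multiplexing idea concrete, consider a deliberately simplified single-scattering image-formation model (an illustrative sketch, not the paper's exact formulation, and ignoring attenuation and noise): each camera pixel integrates density along its viewing ray, weighted by the locally incident structured light, so K multiplexed patterns yield K measurements per pixel.

```python
# Simplified single-scattering measurement model for illumination
# multiplexing (illustrative sketch; attenuation and noise are ignored).
# Each of K structured-light patterns illuminates the volume; a pixel's
# k-th measurement is the light-weighted sum of the densities sampled
# along its viewing ray.

def multiplexed_measurements(density_along_ray, incident_light):
    """density_along_ray: N density samples on one camera ray.
    incident_light: K patterns x N samples of local incident light.
    Returns K scalar measurements for this pixel."""
    return [
        sum(d * l for d, l in zip(density_along_ray, pattern))
        for pattern in incident_light
    ]

# Toy example: 4 samples along the ray, 2 light patterns.
density = [0.0, 1.0, 0.5, 0.0]
patterns = [
    [1.0, 1.0, 1.0, 1.0],  # uniform illumination
    [0.0, 1.0, 0.0, 1.0],  # alternating stripes
]
print(multiplexed_measurements(density, patterns))  # [1.5, 1.0]
```

Because every pattern illuminates the whole volume at once, each measurement carries information about the full ray, which is why multiplexing beats one-light-at-a-time scanning in acquisition efficiency.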
The work of the Zhejiang University team falls into the second category. Different from most existing work, this research work uses artificial intelligence to jointly optimize physical acquisition (i.e., neural structured light) and computational reconstruction, thereby achieving efficient and high-quality dynamic three-dimensional density field modeling.
Hardware prototype
The research team built a simple hardware prototype (Figure 3) consisting of a single commercial projector (BenQ X3000: 1920×1080 resolution, 240 fps) and three industrial cameras (Basler acA1440-220umQGR: 1440×1080 resolution, 240 fps). The projector cyclically projects six pre-trained structured light patterns while the three cameras capture synchronously, and the dynamic three-dimensional density field is reconstructed from the captured images. The placement of the four devices relative to the subject is the optimal arrangement selected from a range of simulation experiments.
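The quoted rate of 40 density fields per second follows directly from these numbers: at 240 fps cycling through 6 patterns, one full pattern cycle (one density-field measurement) takes 25 ms. A minimal scheduling sketch (the frame-to-pattern mapping is an illustrative assumption):

```python
# Cyclic projection schedule: the projector runs at 240 fps and loops
# over 6 pre-trained structured light patterns, so one complete cycle
# (one 3D density-field measurement) spans 6 projector frames.
PROJECTOR_FPS = 240
NUM_PATTERNS = 6

volumes_per_second = PROJECTOR_FPS // NUM_PATTERNS
cycle_time_ms = 1000.0 * NUM_PATTERNS / PROJECTOR_FPS

def pattern_index(frame):
    """Which pattern is projected on a given projector frame."""
    return frame % NUM_PATTERNS

print(volumes_per_second)                    # 40 density fields per second
print(cycle_time_ms)                         # 25.0 ms per full pattern cycle
print([pattern_index(f) for f in range(8)])  # [0, 1, 2, 3, 4, 5, 0, 1]
```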
Figure 3: Acquisition hardware prototype. (a) Real shot of the hardware prototype, with three white tags on the stage used to synchronize the camera and projector. (b) Schematic diagram of the geometric relationship between camera, projector and subject (top view).
Software processing
The team designed a deep neural network composed of an encoder, decoders and an aggregation module. The weights of the encoder directly correspond to the structured light intensity distributions used during acquisition. The decoder takes the measurements at a single pixel as input, predicts a one-dimensional density distribution along the corresponding camera ray, and the per-ray distributions are then resampled into a three-dimensional density field. The aggregation module fuses the three-dimensional density fields predicted by the decoders of all cameras into the final result. By using trainable structured light and a lightweight one-dimensional decoder, the network can more easily learn the essential relationship between structured light patterns, two-dimensional photographs and three-dimensional density fields, and is less likely to overfit the training data. Figure 4 below shows the overall pipeline, and Figure 5 shows the network architecture.
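A rough sketch of the per-pixel decoder's data flow follows; the layer sizes and the two-layer MLP are illustrative assumptions, not the paper's architecture. The key properties shown are that the decoder maps K pixel measurements plus the resampled local incident light to N density samples on the camera ray, and that the same weights are shared across all pixels of all cameras, which keeps the network lightweight.

```python
# Illustrative per-pixel decoder (NOT the paper's architecture): a tiny
# shared MLP that maps K measurements plus the local incident light to
# N density samples along the camera ray.
import random

random.seed(0)

K, N = 6, 16   # illustrative: 6 measurements in, 16 ray samples out
HIDDEN = 32

def make_layer(n_in, n_out):
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def apply_layer(layer, x, relu=True):
    W, b = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    return [max(0.0, v) for v in out] if relu else out

# Shared decoder weights: one small MLP reused for every pixel of every
# camera (parameter sharing keeps the network lightweight).
layer1 = make_layer(K + K * N, HIDDEN)  # K measurements + local incident light
layer2 = make_layer(HIDDEN, N)          # N density samples along the ray

def decode_pixel(measurements, local_light):
    """measurements: K values; local_light: K patterns x N ray samples."""
    x = list(measurements) + [v for pattern in local_light for v in pattern]
    return apply_layer(layer2, apply_layer(layer1, x), relu=False)

# One pixel's input -> a 1D density distribution along its camera ray.
densities = decode_pixel([0.5] * K, [[1.0] * N for _ in range(K)])
print(len(densities))  # 16
```

Because the decoder operates on one pixel at a time, all pixels can be decoded in parallel, which is consistent with the millisecond-scale reconstruction times reported above.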
Figure 4: Overall acquisition and reconstruction pipeline (a), the resampling from structured light patterns to one-dimensional local incident light (b), and the resampling of the predicted one-dimensional density distributions back to a three-dimensional density field (c). The study starts with a simulated/real three-dimensional density field, onto which the pre-optimized structured light patterns (i.e., the encoder weights) are projected. For each valid pixel in each camera view, all of its measurements and the resampled local incident light are fed to the decoder to predict the one-dimensional density distribution along the corresponding camera ray. All density distributions from one camera are then collected and resampled into a single three-dimensional density field. In the multi-camera case, the predicted density fields of the cameras are fused to obtain the final result.
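The resampling step in Figure 4(c) can be pictured as scattering each ray's predicted density samples into a shared voxel grid. The nearest-voxel splatting below is a deliberately crude stand-in for the paper's resampling, using hypothetical normalized coordinates:

```python
# Crude stand-in for resampling per-ray 1D density distributions into a
# shared voxel grid: each ray sample carries a 3D position; its density
# is splatted into the nearest voxel, and overlapping contributions are
# averaged.

def resample_to_grid(ray_samples, res):
    """ray_samples: list of ((x, y, z), density) with coords in [0, 1).
    Returns a res x res x res nested-list density grid."""
    grid = [[[0.0] * res for _ in range(res)] for _ in range(res)]
    counts = [[[0] * res for _ in range(res)] for _ in range(res)]
    for (x, y, z), d in ray_samples:
        i, j, k = (min(int(c * res), res - 1) for c in (x, y, z))
        grid[i][j][k] += d
        counts[i][j][k] += 1
    for i in range(res):
        for j in range(res):
            for k in range(res):
                if counts[i][j][k]:
                    grid[i][j][k] /= counts[i][j][k]
    return grid

# Two samples fall in the same voxel of a 2x2x2 grid and are averaged.
samples = [((0.1, 0.1, 0.1), 1.0),
           ((0.2, 0.2, 0.2), 3.0),
           ((0.9, 0.9, 0.9), 5.0)]
g = resample_to_grid(samples, 2)
print(g[0][0][0], g[1][1][1])  # 2.0 5.0
```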
Figure 5: Architecture of the three main components of the network: encoder, decoder and aggregation module.
Result display
Figure 6 shows selected reconstruction results of four different dynamic scenes using this method. To generate dynamic water mist, the researchers added dry ice to bottles of water, controlled the flow with valves, and guided the mist to the acquisition device through rubber tubes.

Figure 6: Reconstruction results of different dynamic scenes. Each row visualizes selected reconstructed frames from one water mist sequence. From top to bottom, the number of water mist sources in the scene is 1, 1, 3 and 2 respectively. As indicated by the orange labels at the upper left, A, B and C correspond to the images captured by the three input cameras, and D is a real-shot reference image from a viewpoint similar to that of the rendered reconstruction. The timestamp is shown at the lower left. For full dynamic reconstruction results, please see the paper video.
To verify the correctness and quality of this research, the team compared the method against related SOTA methods on real static objects (Figure 7). Figure 7 also compares reconstruction quality with different numbers of cameras. All reconstruction results are rendered from the same novel, unacquired viewpoint and evaluated quantitatively with three metrics. As Figure 7 shows, thanks to the optimized acquisition efficiency, the reconstruction quality of this method surpasses the SOTA methods.
Figure 7: Comparison of different techniques on real static objects. From left to right: the light-slicing method [4], this method (three cameras), this method (two cameras), this method (single camera), hand-designed structured light with a single camera [5], and the SOTA methods PINF [3] and GlobalTrans [2]. Taking the light-slicing results as the reference, the quantitative errors of all other results are listed at the lower right of the corresponding images, evaluated with three metrics: SSIM/PSNR/RMSE (×0.01). All reconstructed density fields are rendered from non-input views; #v denotes the number of acquired views and #p the number of structured light patterns used.
The team also quantitatively compared the reconstruction quality of different methods on dynamic simulation data. Figure 8 compares reconstruction quality on simulated smoke sequences; for detailed frame-by-frame results, please see the paper video.
Figure 8: Comparison of different methods on simulated smoke sequences. From left to right: ground truth, this method, PINF [3] and GlobalTrans [2]. The first and second rows show renderings from the input view and a novel view respectively. The quantitative errors SSIM/PSNR/RMSE (×0.01) are shown at the lower right of the corresponding images. For the average error over the entire reconstructed sequence, see the paper's supplementary material; for the dynamic reconstruction of the full sequences, see the paper video.
Future Outlook
The research team plans to apply this method to more advanced acquisition equipment (such as light field projectors [6]) for dynamic acquisition and reconstruction. The team also hopes to further reduce the number of structured light patterns and cameras required by capturing richer optical information (such as polarization state). In addition, combining this method with neural representations (such as NeRF) is another direction the team is interested in. Finally, letting AI participate more actively in the design of physical acquisition as well as computational reconstruction, rather than being confined to post-processing software, may offer new ideas for further improving physical sensing capabilities and ultimately achieving efficient, high-quality modeling of various complex physical phenomena.
References:
[1]. Inside the World's Largest Wind Tunnel. https://youtu.be/ubyxYHFv2qw?si=KK994cXtARP3Atwn
[2]. Erik Franz, Barbara Solenthaler, and Nils Thuerey. Global transport for fluid reconstruction with learned self-supervision. In CVPR, pages 1632–1642, 2021.
[3]. Mengyu Chu, Lingjie Liu, Quan Zheng, Erik Franz, Hans-Peter Seidel, Christian Theobalt, and Rhaleb Zayer. Physics informed neural fields for smoke reconstruction with sparse data. ACM Transactions on Graphics, 41(4):1–14, 2022.
[4]. Tim Hawkins, Per Einarsson, and Paul Debevec. Acquisition of time-varying participating media. ACM Transactions on Graphics, 24(3):812–815, 2005.
[5]. Jinwei Gu, Shree K. Nayar, Eitan Grinspun, Peter N. Belhumeur, and Ravi Ramamoorthi. Compressive structured light for recovering inhomogeneous participating media. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(3):1–1, 2013.
[6]. Xianmin Xu, Yuxin Lin, Haoyang Zhou, Chong Zeng, Yaxin Yu, Kun Zhou, and Hongzhi Wu. A unified spatial-angular structured light for single-view acquisition of shape and reflectance. In CVPR, pages 206–215, 2023.