1. Introduction
NeuralRecon is an advanced deep-learning-based 3D reconstruction technology that performs real-time 3D modelling of a scene from a single-camera or multi-camera system. The deep network uses an automatic feature extraction method: features are extracted from the input images and then placed into 3D space to construct a detailed model. With optimized computation, NeuralRecon can reconstruct a 3D scene in real time or near real time, which enables rapid response and prompt action. While NeuralRecon is most promising for indoor scene reconstruction, it can scale to a variety of natural scenes and applications. Thanks to the learning capacity and adaptiveness of its deep learning techniques, the model remains competent across different data distributions.
2. Background
The rising importance of deep learning and other emerging technologies, such as computer vision, brings new opportunities for 3D reconstruction research. Using a monocular video stream as input, the neural-network-based 3D reconstruction technique known as NeuralRecon can achieve real-time, high-quality 3D scene reconstruction. Because it is real-time and highly efficient, NeuralRecon-based 3D reconstruction is now widely used in autonomous driving, augmented reality (AR) and virtual reality (VR), and robotic navigation. In autonomous driving, dense and accurate 3D environment data are a prerequisite for path planning and obstacle avoidance; in AR and VR, high-quality 3D scene reconstruction gives users a remarkable experience; and in robot navigation, real-time 3D reconstruction of the environment is the basis for autonomous movement.
3. Aim
The aim of this research is to implement a tool for efficient real-time 3D scene reconstruction using deep convolutional neural networks, which are trained to extract features directly from images; these features are then fused with GRU (Gated Recurrent Unit) networks to achieve proper fusion of the identified features.
4. Gantt Chart
5. Literature review
Considerable strides have been made in real-time 3D reconstruction in recent years, with different techniques designed to address different parts of the reconstruction process. A popular solution is NeuralRecon, a framework for live, coherent 3D reconstruction from monocular video streams (Lior et al., 2021). Its methodology relies on neural networks that deliver accurate, real-time 3D scene reconstruction while remaining both flexible and accurate.
Previous work has also demonstrated database-assisted object retrieval for robust 3D reconstruction, whereby objects are retrieved from a 3D shape database as the scene is scanned (Li et al., 2015). This method aims to compensate for missing physical observations and to enhance the reconstruction process by providing additional object information.
Furthermore, studies have suggested integrating edge computing into a photo-crowdsourcing framework for real-time 3D reconstruction (Yu et al., 2022). Combining edge computing with this technique improves the speed and efficiency of 3D reconstruction in real-time applications.
Other work discusses the applicability of depth-based methods to RGB images in the context of real-time 3D reconstruction (Watson et al., 2023). Although depth-based fusion remains the backbone, learning-based methods are far more sophisticated and offer the best reconstruction quality, albeit at a higher computational cost.
A number of studies enhance real-time 3D reconstruction via NeuralRecon. Several methods achieve coherent 3D reconstruction from monocular videos and outperform existing methods in both accuracy and speed. Xie et al. (2022) build on this foundation by designing PlanarRecon, a system for discovering and reconstructing 3D planes in real time. Similarly, Ju et al. (2023) improve the quality and completeness of reconstruction by incorporating a monocular depth prior in DG-Recon. Such extensions point towards the possibility of dynamic reconstruction with these models.
In brief, deep neural network frameworks, database-assisted retrieval, edge computing and various fusion strategies are becoming popular in real-time 3D reconstruction technology. Together, these approaches speed up, increase the precision of, and improve the efficiency of real-time 3D reconstruction.
6. Environment and tool requirements
The first consideration is the hardware environment. A GPU was used in the experiments for multi-process training; the GPU we chose is the RTX 4060.
For the programming language we chose Python.
For the system we used Ubuntu. Compared to Windows, Ubuntu has many advantages.
Firstly, Ubuntu is a very secure operating system. It comes with a firewall and virus protection software, and because of its Linux kernel and privilege management, it is less vulnerable to viruses and malware attacks. Windows, although it has been improving its security, has become the target of more malware and viruses because of its wide user base.
Secondly, Ubuntu collects less personal data by default, whereas Windows 10 and later versions collect more user data by default, although users can tweak settings to limit the collection of this data. Meanwhile, Ubuntu typically requires less system resources, especially when using a lightweight desktop environment. This makes running Ubuntu on older hardware much smoother than running the latest version of Windows.
Most importantly for me, Ubuntu has a large and active community that provides users with a wide range of support resources, including forums, documentation, and online tutorials. Windows, while also having a broad user base and professional support, may not be as rich in community support as Ubuntu.
In terms of the software platform and main tools, we used Anaconda to build the development environment and adopted the PyTorch framework, a widely used open-source machine learning library particularly suited to deep learning and computer vision research. We use torchsparse for 3D sparse convolution operations, a specialised tool for processing sparse data. We chose PyCharm as the IDE and CUDA for accelerating deep learning training and image processing.
In addition, different datasets, ScanNet among others, were used for training and evaluation. Data preparation and feature selection mechanisms such as image resizing and format conversion are applied during model training. For this purpose, distributed samplers and data loaders are critical components that load and process the datasets quickly, as sketched below.
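As a minimal sketch of this kind of data loading (the function and argument names are illustrative, and the distributed branch assumes an already-initialised process group; this is not the project's exact code):

```python
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(train_dataset, batch_size=1, num_workers=4, distributed=True):
    """Build a (optionally distributed) data loader for training."""
    if distributed:
        # Each process sees a different shard of the dataset.
        sampler = DistributedSampler(train_dataset, shuffle=True)
        shuffle = False          # the sampler already shuffles
    else:
        sampler = None
        shuffle = True
    return DataLoader(
        train_dataset,
        batch_size=batch_size,
        sampler=sampler,
        shuffle=shuffle,
        num_workers=num_workers,
        pin_memory=True,
    )
```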
7. NeuralRecon Overview
NeuralRecon is an advanced algorithm that produces a real-time 3D reconstruction of an area from monocular video or a sequence of images of a single scene. The motivation behind this approach is primarily the interpretation of a 3D physical space while observing it. To this end, deep learning structures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are employed to efficiently aggregate and weave together deep features from the visual domain, so that more detailed renderings of the 3D space can be generated.
The fundamental idea of this approach is represented by the following stages.
NeuralRecon starts by using a pre-trained deep convolutional network (e.g., MnasMulti) to extract features from the input single-view or multi-view images. Spatial and depth information is extracted at this stage, laying the foundation for the subsequent volume fusion and 3D reconstruction. Once the per-image features are extracted, NeuralRecon fuses the features of the various images into a dense feature volume through a GRU-based fusion module. The goal of this stage is to let the reconstruction process handle the information in the image sequence in a continuous and accurate manner.
Moreover, to speed up computation, the method sparsifies the feature volume and performs calculations only on the voxels that contain actual data. This localises representation and reconstruction to the areas that contribute most to the result, which reduces processing time while preserving reconstruction quality.
In the last step, after feature fusion is finished, NeuralRecon uses a series of convolutional neural network layers to further process the merged volume and finally obtain a precise 3D scene.
8. Dataset
We mainly use four datasets:
ScanNet (Dai et al., n.d.): one of the most widely used large-scale datasets in 3D reconstruction and scene understanding. It includes colour images, depth images and camera pose information for multiple indoor scenes in a single dataset, which fits this project.
7-Scenes (Shotton et al., 2013): It is used to evaluate the performance of NeuralRecon in small indoor scenes.
TUM RGB-D dataset: it is also used for evaluation; it provides RGB-D images and their corresponding ground-truth poses for different scenes and motion modes.
ICL-NUIM dataset: It is used to further test and demonstrate the effectiveness of NeuralRecon. This dataset provides a series of synthetic and real scenes for evaluating the 3D reconstruction algorithm for indoor scenes.
9. Preprocessing
We implemented the pre-processing of the data through three pieces of code.
The code in Figure 1 mainly implements the preprocessing of data collected with ARKit. Its main functions and specific steps are as follows:
Extract image frames from a specified video file, scale each frame to a specified size, and save it to a specified folder.
Use the function to synchronize the internal and external parameters of the camera based on the image frames extracted in the first step, and save the synchronized result.
Use the function to parse the path according to the data source (ARKit) to get the path of the internal and external parameter files.
Load the camera internal parameter (internal parameter matrix K and image size) and adjust the internal parameter matrix according to the original size and target size of the image.
Load the camera position (external parameter), including the position and rotation matrix.
Save the processed camera internal and external parameters separately; each file corresponds to the parameters of one image frame or view.
Select keyframes from all frames according to the thresholds of rotation angle and translation distance (min_angle and min_distance) to ensure that there is enough view change or spatial movement between neighbouring keyframes (a simplified sketch of this step is given after Figure 1).
Successive keyframes are grouped into segments, each containing a fixed number of keyframes.
For each segment, record the segment ID, the contained image ID, and the corresponding camera external and internal parameters.
All the segment information is saved into a pickle file for subsequent 3D reconstruction process.
This preprocessing process prepares the data required for the 3D reconstruction task, including image frames, camera geometry parameters, and selected keyframes, in order to improve the efficiency and accuracy of the reconstruction. It has a significant impact on the subsequent processing of the data.
(Figure 1 Code for processing ARKit data)
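The keyframe selection step described above can be sketched roughly as follows (a simplified version assuming 4x4 camera-to-world pose matrices; the threshold defaults are placeholders, not the project's exact code):

```python
import numpy as np

def select_keyframes(poses, min_angle=15.0, min_distance=0.1):
    """Keep a frame only if it rotates or moves enough relative to the last keyframe.

    poses:        list of 4x4 camera-to-world matrices (numpy arrays).
    min_angle:    rotation threshold in degrees.
    min_distance: translation threshold in metres.
    """
    keyframes = [0]                      # always keep the first frame
    last = poses[0]
    for i, pose in enumerate(poses[1:], start=1):
        # Relative rotation between the last keyframe and the current frame.
        R_rel = last[:3, :3].T @ pose[:3, :3]
        angle = np.degrees(np.arccos(np.clip((np.trace(R_rel) - 1) / 2, -1.0, 1.0)))
        # Translation between the two camera centres.
        distance = np.linalg.norm(pose[:3, 3] - last[:3, 3])
        if angle > min_angle or distance > min_distance:
            keyframes.append(i)
            last = pose
    return keyframes
```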
After completing the above preprocessing, I defined a PyTorch dataset class ScanNetDataset (Figure 2) for loading the ScanNet dataset. It is used to process the data in the 3D reconstruction task, including images, depth maps, and camera intrinsic and extrinsic parameters.
The function build_list will load the metadata from the pickle file that stores the segment information, which describes the scene name, image ID, volume origin, etc. for each segment.
The function __len__ returns the total number of samples in the dataset.
The function read_cam_file reads the camera's intrinsic and extrinsic parameters from file. The intrinsics describe the camera's focal length and principal point, while the extrinsics describe the camera's position and orientation in the world coordinate system.
The read_img function reads an RGB image from a specified path.
The read_depth function reads the depth map from the specified path and performs the necessary preprocessing (such as unit conversion and filtering of excessively deep areas).
The function read_scene_volumes will read the TSDF volume information of the whole scene from the file, which may include multiple scales. A caching mechanism is used for efficiency.
The function __getitem__ will load the corresponding image, depth map, camera parameters and TSDF volume data according to the index idx and return these data packed into a dictionary, optionally applying preprocessing transformations to these data.
The main purpose of this class is to provide a convenient way to load and process data from the ScanNet dataset for 3D reconstruction tasks. With this class, it is easy to use the ScanNet dataset in the PyTorch framework for training, validation and testing.
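A skeletal sketch of such a dataset class is shown below; the method names follow the description above, while the bodies are simplified placeholders rather than the actual implementation:

```python
import pickle
from torch.utils.data import Dataset

class ScanNetDatasetSketch(Dataset):
    """Simplified outline of a ScanNet-style dataset for 3D reconstruction."""

    def __init__(self, datapath, fragment_file, transforms=None):
        self.datapath = datapath
        self.transforms = transforms
        self.metas = self.build_list(fragment_file)
        self.tsdf_cache = {}                      # cache per-scene TSDF volumes

    def build_list(self, fragment_file):
        # Each entry describes one fragment: scene name, image IDs, volume origin, ...
        with open(fragment_file, "rb") as f:
            return pickle.load(f)

    def __len__(self):
        return len(self.metas)

    def __getitem__(self, idx):
        meta = self.metas[idx]
        # The real class reads images, depth maps, intrinsics/extrinsics and the
        # cached TSDF volume of the scene, then packs everything into a dict.
        sample = {"scene": meta["scene"], "fragment_id": idx}
        if self.transforms is not None:
            sample = self.transforms(sample)
        return sample
```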
(Figure 2 Implementation code for processing the ScanNet dataset)
For the final preprocessing we need to do the following steps:
1. Resize the input image to the set target size and adjust the camera intrinsics accordingly to keep the projection relation of the image unchanged (ResizeImage).
2. Convert the image data and the camera intrinsic and extrinsic parameters from NumPy array format to PyTorch Tensor format and make the necessary dimensional adjustments for PyTorch model processing (ToTensor).
3. Generate a single projection matrix by combining the camera's intrinsic and extrinsic matrices. This is a key step in the 3D reconstruction process for mapping 2D image features into 3D space (IntrinsicsPoseToProjection).
4. To improve the generalisation ability and robustness of the model, apply a random linear transformation to the world coordinate system. This data augmentation simulates observing the scene from different viewpoints and locations (RandomTransformSpace).
5. Since the images in the ScanNet dataset may not meet the required aspect ratio, pad the image edges to reach a 4:3 ratio, which facilitates subsequent processing (pad_scannet).
6. Combine the above preprocessing steps through the Compose class to form a complete preprocessing pipeline. In this way, the input data is transformed into the format required for model training or inference through a series of sequential transformations (Compose; a minimal sketch of this pattern is given below).
(Figure 3 Code to achieve final complete preprocessing)
In summary, the entire preprocessing process takes image data, camera parameters and other auxiliary information, applies specific transformations and adjustments, and ultimately converts them into a format suitable for the 3D reconstruction model. The images, depth maps and camera parameters in the ScanNet dataset are converted into the format required for model training, while data augmentation improves the generalisation ability of the model.
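A minimal sketch of the composition pattern follows; only Compose is implemented here, while the project-specific transforms named above are assumed to exist elsewhere:

```python
class Compose:
    """Apply a sequence of sample-level transforms in order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)
        return sample

# Assumed project transforms (not defined here), composed into one pipeline:
# transforms = Compose([
#     ResizeImage((640, 480)),
#     ToTensor(),
#     RandomTransformSpace(voxel_dim, voxel_size),
#     IntrinsicsPoseToProjection(n_views),
# ])
```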
10. Network Design
10.1 Network Architecture Design
At the beginning of the network, the model parameter alpha is taken from the configuration file, this parameter is used to adjust the depth and width of the network and affects the number of channels in subsequent layers. The pixel mean and standard deviation of the image are also extracted from the configuration for subsequent normalisation of the image data to ensure consistency of the input data and numerical stability of the network.
Feature extraction network (backbone2d): we use the MnasMulti class, a variant of MNASNet dedicated to efficient image feature extraction. It adjusts its depth and width according to the alpha parameter.
The 3D reconstruction network (neucon_net) is implemented using the NeuConNet class. This is the core 3D reconstruction module. It is responsible for converting the 2D feature maps extracted from backbone2d into voxel representations in 3D space.
The global fusion network (fuse_to_global) is implemented using the GRUFusion class. This module is used to fuse local voxel features into the global volume, which is especially important in continuous or large-scale 3D scene reconstruction.
Forward propagation logic:
Firstly, the input image data is processed through a normalisation function that ensures that the range of values of the input features remains the same as during training. The normalised images are fed into backbone2d for feature extraction and each image is processed separately to obtain a series of feature maps. These feature maps are then fed into neucon_net for finer-grained decoding to generate sparse coordinates and corresponding voxel data (TSDF values). In non-training mode, if the output contains coordinate data, these locally generated voxel data are fused into a global volume via fuse_to_global for final 3D model reconstruction.
At the end of each forward propagation, losses are computed based on the generated output and the actual labels (if any), and these losses are used for backpropagation and parameter optimisation during model training.
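A rough outline of this forward logic, with module names mirroring the description above (the signatures here are illustrative assumptions, not the project's actual API):

```python
def neuralrecon_forward_sketch(imgs, inputs, backbone2d, neucon_net, fuse_to_global,
                               pixel_mean, pixel_std, training=True):
    """Outline of one forward pass: normalise -> 2D features -> 3D decoding -> global fusion.

    backbone2d, neucon_net and fuse_to_global are assumed callables; their exact
    signatures are simplified for illustration.
    """
    # 1. Normalise every input image with the configured pixel statistics.
    imgs = [(img - pixel_mean) / pixel_std for img in imgs]
    # 2. Extract multi-scale 2D feature maps, one set per image.
    features = [backbone2d(img) for img in imgs]
    # 3. Decode sparse voxel coordinates and TSDF values coarse-to-fine.
    outputs, loss_dict = neucon_net(features, inputs)
    # 4. At inference time, fuse the local fragment into the global TSDF volume.
    if not training and "coords" in outputs:
        outputs = fuse_to_global(outputs, inputs)
    return outputs, loss_dict
```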
In the context of the code:
First we adjust the channel depth of the MNASNet network using the alpha value, a scaling factor used to resize the model. Then we use torchvision.models.mnasnet1_0 to obtain the pre-trained MNASNet model. If alpha equals 1.0, the pre-trained model is loaded directly; otherwise, the MNASNet constructor is used to build a resized network model.
self.conv0 to self.conv2: these layers are extracted directly from the pre-trained MNASNet.
The convolutional layer parameters are preset in the definition of the MNASNet model, and we inherit these settings by reusing layers from the pre-trained model.
self.out1, self.out2, self.out3 are custom convolutional layers which are used to map the extracted features to new output channels. The parameters of these layers are set by these two code implementations:
nn.Conv2d(depths[4], depths[4], 1, bias=False) denotes a 1x1 convolution kernel with stride 1 and no bias term. This setting is typically used for feature fusion or adjusting the number of channels.
nn.Conv2d(final_chs, depths[3], 3, padding=1, bias=False) and nn.Conv2d(final_chs, depths[2], 3, padding=1, bias=False) denote 3x3 convolution kernels with stride 1, padding=1 and no bias term. This setup extracts spatial features while keeping the feature-map size constant, as illustrated in the sketch below.
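A hedged sketch of how these output heads could be declared; the channel values in depths are illustrative MNASNet-style widths, not necessarily the project's exact configuration:

```python
import torch.nn as nn

# Illustrative channel configuration; in the project these values come from
# MNASNet's per-stage widths scaled by alpha.
depths = [32, 16, 24, 40, 80]
final_chs = depths[4]

# 1x1 convolution: channel mixing / adjustment without changing spatial size.
out1 = nn.Conv2d(depths[4], depths[4], 1, bias=False)
# 3x3 convolutions with padding=1: extract spatial features while keeping the
# feature-map resolution unchanged.
out2 = nn.Conv2d(final_chs, depths[3], 3, padding=1, bias=False)
out3 = nn.Conv2d(final_chs, depths[2], 3, padding=1, bias=False)
```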
10.2 Network Optimisation
Network optimisation begins with the configuration of the optimiser. The optimiser is responsible for adjusting the network parameters according to the calculated gradients in order to minimise the loss function. We chose the Adam optimiser, a commonly used optimisation algorithm particularly suited to deep learning tasks with large-scale data and parameters. The Adam optimiser combines the benefits of momentum and adaptive learning rate to help the model converge faster and improve stability during training.
The effectiveness of the optimiser relies heavily on the learning rate setting. The learning rate determines the magnitude of parameter updates. In the code, not only is the initial learning rate set, but the learning rate scheduler is also used to adjust the learning rate. As the training progresses, if the model's performance improvement is found to be stagnant, the parameter tuning will be refined by decreasing the learning rate, which helps the model to approximate more accurately when it is close to the optimal solution, and avoids excessive update steps that lead to the loss function bouncing.
To improve training efficiency, we support distributed training. With the torch.distributed library and the DistributedDataParallel module, the model can be synchronised across multiple processors, speeding up the processing of data and the model update process.
At the heart of the optimisation process is the computation of the loss function and its backpropagation. The loss function measures the difference between the current model output and the objective, and the goal of optimisation is to minimise this loss value. With the backpropagation algorithm, the gradient of the loss with respect to each parameter is computed and the optimiser is then used to update the parameters based on these gradients. This process is repeated in each training batch, progressively fine-tuning the model parameters to improve model performance.
In addition, to support interruption and resumption of model training, as well as adjusting and testing different training configurations, the code implements a save and load function for the model and optimiser states. This makes the training process more flexible and facilitates model debugging and deployment.
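A minimal sketch of such save/load functionality (the file layout and key names are assumptions for illustration, not the project's actual checkpoint format):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    """Persist model and optimiser state so training can be resumed later."""
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    """Restore model and optimiser state from a saved checkpoint."""
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]
```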
In the context of the code
model.parameters() provides all the parameters in the model that need to be trained.
lr is the learning rate, which controls the step size of the weight updates.
betas is the coefficient used to calculate the running average of the gradient and its square.
weight_decay is the weight decay, which is used to add additional regularisation to help prevent model overfitting.
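Putting these pieces together, the optimiser construction might look like the following sketch (the hyperparameter values are placeholders and the small model is a stand-in, not the project's actual configuration):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)        # stand-in for the real network

optimizer = torch.optim.Adam(
    model.parameters(),        # all trainable parameters of the network
    lr=1e-3,                   # initial learning rate (placeholder value)
    betas=(0.9, 0.999),        # running-average coefficients for the gradient and its square
    weight_decay=0.0,          # weight decay for additional regularisation
)
```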
10.3 Gradient Back Propagation
Firstly, the model receives input data such as images or other forms of feature data. The data is passed through the layers of the network, and each layer performs the appropriate computations, such as convolution operations, to produce the final output.
The difference between the model's output and the true (or target) value is calculated through a loss function. This loss value is a scalar that indicates the magnitude of the error in the current model output. Once the loss value is computed, the next step is to compute the gradient of the loss function with respect to the network parameters through the backpropagation algorithm.
Before starting a new round of gradient computation, the previously accumulated gradient information needs to be cleared. This is because the gradients are cumulative, and if they are not cleared, then the new gradients will be added to the old ones, which can lead to incorrect update directions.
To prevent the gradient from becoming too large and causing training instability, the gradient can be trimmed. This step usually restricts the gradient to a reasonable range to ensure training stability.
The calculated gradient is used to update the parameters of the network. This is done by adjusting the parameters in the direction opposite to the gradient, with the exact amount determined by the learning rate. The learning rate is a key hyperparameter that determines the size of the step by which the parameters are updated in each iteration. MultiStepLR reduces the learning rate at predefined epochs (specified by milestones) by multiplying it by gamma.
(learning rate scheduler)
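A minimal sketch of such a scheduler setup (the milestones and gamma below are placeholder values, and the model/optimiser are stand-ins):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(8, 1)                                    # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Multiply the learning rate by gamma once the epoch count passes each milestone.
scheduler = MultiStepLR(optimizer, milestones=[12, 24, 48], gamma=0.5)

for epoch in range(60):
    # ... run one training epoch here (forward, backward, optimizer.step()) ...
    scheduler.step()          # step the scheduler once per epoch
```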
In the context of the code:
Gradient backpropagation is handled in the training loop. This function is responsible for performing a training step that includes forward propagation, loss calculation, backpropagation and gradient trimming.
model.train() sets the model to training mode, enabling Batch Normalisation and Dropout.
optimizer.zero_grad() clears the gradient of all optimised tensors, as the gradient is cumulative by default.
loss.backward() computes the gradient of the loss function with respect to the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) clips the gradient norm to prevent gradient explosion.
optimizer.step() updates the network parameters based on the computed gradients.
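Combined, one training step looks roughly like the sketch below (it assumes the model returns a dictionary of losses, which is an assumption made for illustration):

```python
import torch

def train_step(model, optimizer, batch):
    """One training step: forward pass, loss, backprop, gradient clipping, update."""
    model.train()                                    # enable BatchNorm/Dropout training mode
    optimizer.zero_grad()                            # clear accumulated gradients
    outputs, loss_dict = model(batch)                # forward pass (assumed to return losses)
    loss = sum(loss_dict.values())                   # total loss
    loss.backward()                                  # backpropagate gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip to avoid explosion
    optimizer.step()                                 # update parameters
    return loss.item()
```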
11. Key technologies
11.1 Image feature extraction
Image feature extraction is achieved by means of a pre-trained deep convolutional neural network (CNN), specifically using MnasNet (Tan et al., n.d.), a lightweight but powerful network especially suited for mobile devices and edge computing scenarios. Here, MnasNet is used as a backbone network to extract a high-level feature representation of the input image.
The implementation process is explained in detail below:
The input RGB image is first normalised, where the network subtracts the mean of the pixels from the image and divides it by the standard deviation, which helps to speed up convergence during model training and improves the generalisation of the model.
The normalised image is fed into the MnasNet network, which automatically extracts deep features from the image through a series of convolutional layers, batch normalization layers and non-linear activation layers (e.g. ReLU). MnasNet outputs multi-scale feature maps, each of which captures a different level of abstraction of the image.
The extracted multi-scale features are then used in the subsequent 3D voxel reconstruction process to further process the spatial information and refine the reconstruction of 3D structures by combining with other modules (e.g., GRU Fusion, etc.).
MnasNet, as the core component of image feature extraction, provides the NeuralRecon framework with powerful image comprehension capabilities, and its lightweight and efficient characteristics make real-time 3D reconstruction possible. In addition, by automating the neural architecture search, MnasNet enables efficient use of computational resources, which is especially critical for real-time processing.
11.2 Sparse volume construction
Firstly, an all-zero feature volume feature_volume_all and a counter count are initialised based on the number of voxel coordinates. These two variables store the back-projected 3D features and the number of times each voxel has been observed, respectively.
Perform the following operations for the input image:
Extract the voxel coordinates, origin, image features and projection matrix for the batch.
Convert the voxel coordinates from voxel space to the world coordinate system, then to the camera coordinate system via the projection matrix, and finally project to the image plane to obtain the corresponding image pixel position for each voxel.
Use the grid_sample function to sample from the image features based on the pixel positions to get the corresponding 3D features. This step implements the mapping from 2D features to 3D features.
Next we need to collect the features and implement normalisation. For each voxel, the features obtained by sampling at all viewpoints are accumulated and the total number of times the voxel has been observed is calculated. The accumulated features are then divided by the number of observations to obtain the average feature, which reduces the impact of differences between viewpoints on the reconstruction results.
In order to keep the features consistent, we also need to normalise the depth values of each voxel by subtracting the mean depth value and dividing by the standard deviation.
Finally, the 3D feature volume of each voxel feature and the number of times each voxel has been observed in all views are output.
With the above steps, the function achieves the mapping from 2D image features to 3D sparse volume features, providing rich scene information for 3D reconstruction. This way of fusing information from multiple views can effectively improve the accuracy and robustness of 3D reconstruction.
(Figure 4 Code to implement mapping)
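The back-projection and feature sampling described above can be sketched, for a single view, as follows (shapes and names are assumptions for illustration; the real implementation handles batches, multiple views and fragment origins):

```python
import torch
import torch.nn.functional as F

def back_project_sketch(voxel_world_coords, feature_map, proj_matrix):
    """Sample 2D image features for 3D voxel centres (simplified, single view).

    voxel_world_coords: (N, 3) voxel centres in world coordinates.
    feature_map:        (C, H, W) image feature map.
    proj_matrix:        (3, 4) matrix mapping homogeneous world points to pixel coords.
    """
    C, H, W = feature_map.shape
    N = voxel_world_coords.shape[0]

    # Homogeneous world coordinates -> image plane.
    pts_h = torch.cat([voxel_world_coords, torch.ones(N, 1)], dim=1)   # (N, 4)
    pix = (proj_matrix @ pts_h.T).T                                    # (N, 3)
    depth = pix[:, 2:3].clamp(min=1e-6)
    uv = pix[:, :2] / depth                                            # pixel coords (u, v)

    # Normalise pixel coordinates to [-1, 1] as required by grid_sample.
    grid = torch.empty(N, 2)
    grid[:, 0] = 2.0 * uv[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * uv[:, 1] / (H - 1) - 1.0

    # Treat the N voxels as a 1 x N sampling grid.
    sampled = F.grid_sample(
        feature_map.unsqueeze(0),        # (1, C, H, W)
        grid.view(1, 1, N, 2),           # (1, 1, N, 2)
        align_corners=True,
    )                                    # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).transpose(0, 1)   # (N, C) per-voxel features
```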
The final use is 3D sparse convolution implemented by torchsparse. The image backbone network is a variant of MnasNet, initialized with weights pre-trained from ImageNet. Except for the image backbone network, the entire network uses randomly initialized weights for end-to-end training.
11.3 GRU feature fusion
Before describing GRU fusion, I need to explain the TSDF (truncated signed distance function), a data representation commonly used in 3D reconstruction. It maps each point in 3D space to a value representing the distance from that point to the nearest surface, with the sign indicating whether the point lies inside or outside the surface.
The RGB-to-TSDF network, which consists of a 2D encoder and a 3D decoder with fully convolutional layers, predicts the TSDF. Two consecutive frames are aligned using the predicted TSDF, which is obtained by jointly optimizing an objective function for an estimate of the camera pose. The global TSDF is then created by fusing the aligned TSDFs (Kim, Moon and Lee, 2019).
In the step of GRU fusion the TSDF plays the role of an intermediate representation that integrates depth information from multiple views into a unified 3D spatial volume. Specifically, during the GRU fusion process, the hidden state, which represents the global TSDF volume, is updated by using a convolutional GRU model. Each new image fragment is first converted to its corresponding TSDF representation and then fused with the current global TSDF volume.
Firstly, I adopt a multi-layer structure to gradually refine the TSDF volume prediction, with 3D sparse convolution used at each layer to efficiently process the feature volume. At each level, each voxel of the TSDF volume is predicted by a multilayer perceptron (MLP), yielding an occupancy score (indicating whether the voxel is inside or outside the surface) and a TSDF value (the signed distance from the voxel to the nearest surface) (Newcombe et al., 2011).
To handle sparse representations, a truncation threshold is set to only consider voxels within the TSDF truncation distance. This allows for an efficient representation of the region near the surface and reduces the computational effort.
At each level, sparse TSDF volumes are first predicted, and then these volumes are used as inputs to the GRU fusion module at the next level, which combines the global hidden state (a fraction of the global TSDF volume) with local information about the current segment.
GRU fusion refers to the use of gated recurrent units (GRUs) for feature fusion of time-series data, and is implemented through the ConvGRU network module, which is mainly used for updating the hidden state features, as well as for directly replacing the corresponding voxels in the global TSDF volume, if needed. The process is divided into the following steps:
First, 3D geometric features Glt are taken from the hidden states Hlt-1. These hidden states are obtained using a deep convolutional neural network (we use MnasMulti) and contain geometric information from previous views. These features are then used to construct a sparse 3D feature volume.
A reset gate Rt is then input, which determines how much of the previous information needs to be discarded when updating the hidden state. It is computed using a sparse convolutional layer with a sigmoid activation function.
Meanwhile, the previous hidden state Hlt-1 and the current geometric feature Glt are fed into the update gate Zt with weight matrix Wz. Similar to the reset gate, the update gate is also computed by a sparse convolutional layer with a sigmoid activation function. It decides which information in the state needs to be updated.
Subsequently, the hidden states and current geometric features that have been processed by the reset gate are input, and a candidate hidden state ~Hlt is computed by a sparse convolutional layer with tanh activation.
The final hidden state Hlt is then computed by combining the update gate and the candidate hidden state: it is a weighted sum of the previous hidden state and the candidate hidden state, with the weights controlled by the update gate Zt. If the update gate value is close to 1, the new hidden state is closer to the candidate hidden state; if it is close to 0, the new hidden state retains more of the previous state information.
(Figure 5 Algorithm for calculating final hidden state)
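In equation form, the gating described above corresponds to the standard convolutional GRU update, where * denotes (sparse) convolution, σ the sigmoid function, ⊙ element-wise multiplication, and [·,·] channel-wise concatenation (a reconstruction consistent with the text above, not quoted verbatim from the paper):

```latex
\begin{aligned}
R_t &= \sigma\big(W_r * [H^{l}_{t-1},\, G^{l}_t]\big)\\
Z_t &= \sigma\big(W_z * [H^{l}_{t-1},\, G^{l}_t]\big)\\
\tilde{H}^{l}_t &= \tanh\big(W_h * [R_t \odot H^{l}_{t-1},\, G^{l}_t]\big)\\
H^{l}_t &= (1 - Z_t) \odot H^{l}_{t-1} + Z_t \odot \tilde{H}^{l}_t
\end{aligned}
```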
GRU fusion is achieved through the above operations
(Figure 6 Implement GRU fusion)(Sun et al., 2021a)
The whole GRU fusion process emphasises the importance of fusing features over the time series, enabling the model to gradually refine and fine-tune the 3D reconstruction results while maintaining real-time performance. This approach not only effectively utilises the temporal information in the image sequence, but also significantly improves the efficiency and quality of the 3D reconstruction through incremental updating.
After achieving GRU fusion we need to integrate the global TSDF volume. In the last layer of the coarse-to-fine level, the predicted TSDF volume is further sparsified and then integrated into the global TSDF volume, this step is executed after the whole reconstruction process is finished, which ensures the global consistency and accuracy of the reconstruction results.
11.4 3D surface extraction
Firstly we still have to go through the previous steps to construct a TSDF volume that records the distance from each voxel point to the nearest surface (positive values indicate outside the surface, negative values indicate inside the surface) and limit the range of this value by truncating the distance to simplify the subsequent surface extraction process.
Next we need to extract the 3D surface from the TSDF volume using the Marching Cubes algorithm. Marching Cubes is a classic surface extraction algorithm for volumetric data: it examines each cube cell (formed by 8 neighbouring voxels) in the volume and determines, based on the signs of the TSDF values at the cube's corner points, whether or not the cube crosses the surface of the object (Newman and Yi, 2006).
For the determined shape of the intersection, the algorithm computes the location of the intersection of the cube boundary with the actual surface. This step is usually obtained by linear interpolation based on the TSDF values, ensuring that the positions of the surface vertices reflect the actual geometry as accurately as possible.
Triangular facets are then generated based on the determined intersection configuration and the computed intersection locations. Each facet consists of three vertices whose positions are derived from the above calculations.
Finally, the above process is repeated for each cube in the global TSDF volume, generating a large number of triangular facets. By stitching together the facets generated within all the cubes, a continuous 3D surface model is formed that provides a good approximation of the original scene geometry.
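As an illustration of this extraction step, here is a minimal sketch using scikit-image's Marching Cubes (assuming a dense TSDF volume stored as a NumPy array; the voxel size and origin are placeholders):

```python
import numpy as np
from skimage import measure

def extract_mesh_sketch(tsdf_volume, voxel_size=0.04, origin=(0.0, 0.0, 0.0)):
    """Extract a triangle mesh from a TSDF volume with Marching Cubes.

    tsdf_volume: 3D numpy array of truncated signed distances (zero crossing = surface).
    voxel_size:  edge length of one voxel in metres (placeholder value).
    origin:      world-space position of voxel (0, 0, 0).
    """
    # The surface is the zero level set of the TSDF.
    verts, faces, normals, _ = measure.marching_cubes(tsdf_volume, level=0.0)
    # Convert vertex positions from voxel indices to world coordinates.
    verts = verts * voxel_size + np.asarray(origin)
    return verts, faces, normals
```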
After the surface extraction, some post-processing operations, such as smoothing, removal of small isolated segments, etc., are usually also performed on the generated 3D model to improve the quality and visual effect of the model.
The key advantage of this approach is that it is able to process not only the depth information from a single viewpoint, but also integrate the information from multiple viewpoints from a video sequence to achieve a coherent and real-time 3D reconstruction of a complex scene.
12. Results and analysis
12.1 3D reconstruction results
(Figure 7 Realize 3D reconstruction)
12.2 Evaluate
In terms of reconstruction quality, common traditional methods such as feature-based SLAM methods often rely on strict geometric assumptions and manually tuned parameters, which is not only cumbersome but may also result in poorer reconstruction quality than deep learning-based methods in texture-poor or complex scenes.
Compared to traditional methods, NeuralRecon is usually able to generate high-quality 3D models, especially in texture-rich scenes. Thanks to deep learning, NeuralRecon is better able to handle noisy and incomplete data and can predict hidden surface areas by learning patterns in the dataset.
In terms of runtime, NeuralRecon is designed as a real-time system with a fast runtime, especially on powerful hardware. Its acceleration comes from the coarse-to-fine reconstruction method and efficient GPU acceleration. In contrast, traditional methods may require more computational resources and may not be real-time when dealing with large-scale scenarios.
At the level of resource consumption, NeuralRecon requires stronger GPU hardware support and consumes more computational resources in the training phase due to its deep neural networks. Traditional methods, on the other hand, may have higher memory requirements because they need to store the voxel mesh of the entire scene, but they depend less on the GPU.