Computers, Materials & Continua
Improved Lightweight Deep Learning Algorithm in 3D Reconstruction
1School of Mechanical Engineering, North China University of Water Conservancy and Hydroelectric Power, Zhengzhou, 450045, China
2Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, N9B 3P4, Canada
*Corresponding Author: Tao Zhang. Email: email@example.com
Received: 10 January 2022; Accepted: 04 March 2022
Abstract: Three-dimensional (3D) reconstruction based on structured light is widely used in industrial measurement. To address the high mismatch rate and poor real-time performance caused by factors such as system jitter and noise, a lightweight fringe image feature extraction algorithm based on the You Only Look Once v4 (YOLOv4) network is proposed. First, MobileNetV3 is used as the backbone network to extract features efficiently. Then the Mish activation function and the Complete Intersection over Union (CIoU) loss function are introduced to improve the bounding-box regression loss, which improves the accuracy and real-time performance of feature detection. Simulation results show that the improved model is only 52 MB in size, the mean average precision (mAP) on the fringe image data reaches 82.11%, and the 3D point cloud restoration rate reaches 90.1%. Compared with the existing models, the proposed algorithm has clear advantages and can satisfy the accuracy and real-time requirements of reconstruction tasks on resource-constrained equipment.
Keywords: 3D reconstruction; feature extraction; deep learning; lightweight; YOLOv4
1 Introduction
Optical three-dimensional (3D) measurement technology is one of the most important research directions in optical measurement. As an important 3D measurement method, fringe structured light technology can quickly and accurately obtain 3D point cloud data of the surface of the measured object, and is widely used in quality inspection, cultural relic protection, human-computer interaction, biomedicine and other fields [2,3]. The basic measurement process is as follows: one or a group of structured fringe patterns is projected onto the surface of the object, the camera captures the fringe image modulated by the height of the object, and the phase information carried in the fringes is calculated; the final 3D information is then obtained from the mapping relationships between phase and height and between world coordinates and image pixel coordinates. Techniques such as fringe analysis, phase extraction and phase unwrapping have an important influence on the accuracy of 3D measurement. How to obtain high-precision depth information from the fringe image of the measured object remains the focus and the main difficulty of fringe projection 3D measurement.
Algorithms for obtaining depth information (or the unwrapped phase) from fringe images usually require two main steps: phase extraction, represented by the phase-shifting and Fourier transform methods, and phase unwrapping, represented by spatial and temporal phase unwrapping [5,6]. With the successful application of deep learning in the field of 3D measurement, 3D reconstruction technology [7–9] based on the Convolutional Neural Network (CNN) has been continuously developed. A typical representative is the Region-based Convolutional Neural Network (R-CNN) series of algorithms based on region selection, but these methods take a long time to detect and cannot achieve real-time detection. The Single Shot MultiBox Detector (SSD), which fuses multi-scale detection, is faster but performs poorly on small targets. The YOLO [11–14] series is among the most widely used families of algorithms in deep learning. YOLOv1 has a fixed input size and performs poorly on objects that occupy a relatively small area; YOLOv2 removes the fully connected layer and improves the detection speed; YOLOv3 achieves better detection performance and can effectively detect small targets, but without a significant increase in speed. YOLOv4 is the fourth version of the YOLO series, and its accuracy and speed are significantly improved. However, as neural network models and their parameter counts keep growing, they consume large amounts of computing and storage resources, which makes them difficult to deploy on resource-limited mobile terminals such as mobile phones and tablet computers.
Compressing the parameters of the constructed YOLOv4 model can resolve the conflict between the large network model and the limited storage space. Widely used model parameter compression methods include weight parameter quantization and the Singular Value Decomposition (SVD) method.
Weight parameter quantization reduces resource consumption by reducing the precision of the weights. For example, in common development frameworks [16–18], the activations and weights of neural networks are usually represented by floating-point data. Replacing floating-point data with low-bit fixed-point data, or even with a small set of trained values, helps reduce the bandwidth and storage requirements of the neural network processing system. The disadvantage is that the reduced data precision lowers the classification accuracy, and the compression ratio is difficult to improve further. Peng et al. greatly reduced the model parameters and resource occupation by adding the Ghost module and the Shuffle Conv module, but the accuracy is 0.2% lower than that of the original network. The SVD method reduces resource consumption by reducing the number of weights. A global average pooling algorithm has been proposed in the literature to replace the fully connected layer. GoogLeNet uses this algorithm to reduce the scale of network training, and removing the fully connected layer does not affect the accuracy of image recognition. The recognition accuracy of the algorithm on ImageNet reaches 93.3%. A 1 × 1 convolution kernel has also been proposed in the literature and successfully applied in GoogLeNet and ResNet, where it reduces the number of parameters.
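The two compression ideas described above can be illustrated with a minimal NumPy sketch. It is not the compression pipeline used in this paper: the symmetric int8 scheme, the rank of 32, the random test matrix, and the function names `quantize_int8`, `dequantize` and `svd_compress` are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric uniform quantization of a float array to int8 (assumed scheme)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale


def svd_compress(weights, rank):
    """Factor an (m x n) matrix into (m x rank) and (rank x n) factors."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # absorb singular values into the left factor
    b = vt[:rank, :]
    return a, b


# Hypothetical weight matrix standing in for one layer of a network.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 512)).astype(np.float32)

# Quantization: fewer bits per weight, same number of weights.
q, scale = quantize_int8(w)
quant_error = float(np.max(np.abs(w - dequantize(q, scale))))

# Truncated SVD: fewer weights, same precision per weight.
a, b = svd_compress(w, rank=32)
params_before = w.size
params_after = a.size + b.size

print("int8 storage:", q.nbytes, "bytes vs float32:", w.nbytes, "bytes")
print("max quantization error:", quant_error)
print("parameters before/after SVD:", params_before, params_after)
```

Quantization keeps every weight but stores it in fewer bits, at the cost of a bounded rounding error. Truncated SVD keeps full precision but stores fewer numbers, at the cost of an approximation error that depends on the chosen rank.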
This paper uses the YOLOv4 network model to extract the features of fringe structured light images. Because the features of fringe images are weakened by illumination and noise, the feature extraction network model is improved. The algorithm first replaces the Cross-Stage Partial Darknet53 (CSPDarknet53) network of YOLOv4 with the MobileNetV3 structure to reduce the number of backbone network parameters, and then introduces the Mish activation function and the CIoU loss function to improve the bounding-box regression loss, which improves the generalization of feature extraction.
2 3D Reconstruction Algorithm
2.1 Stripe Structured Light 3D Reconstruction Algorithm
The principle of the fringe structured light 3D reconstruction algorithm is shown in Fig. 1. Assume that a light beam projected by the projection system intersects the reference plane at point B, which is imaged at point C on the camera image plane. When the object is placed, suppose that another light beam intersects the object at point D, which is also imaged at point C on the camera image plane. Point C therefore has two phase values, one before and one after the object is placed, and the height h of point D can be derived from the phase difference.
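The phase-shifting method is one of the commonly used methods in fringe structured light 3D reconstruction. A series of $N$ fringe images $I_n(x, y) = A(x, y) + B(x, y)\cos[\varphi(x, y) - 2\pi n/N]$, $n = 0, 1, \ldots, N-1$, each shifted in phase by $2\pi/N$, is projected onto the reconstruction target. The wrapped phase of the standard phase-shifting method is:

$$\varphi(x, y) = \arctan\frac{\sum_{n=0}^{N-1} I_n(x, y)\sin(2\pi n/N)}{\sum_{n=0}^{N-1} I_n(x, y)\cos(2\pi n/N)}$$

The wrapped phase is discontinuous. With the quadrant determined by the signs of the numerator and denominator, its values lie in the range $(-\pi, \pi]$. The unwrapped phase required in the subsequent 3D reconstruction is obtained by phase unwrapping. Phase unwrapping aims to recover the continuous phase from $\varphi(x, y)$ and to reconstruct the physically continuous phase change by adding or subtracting an appropriate multiple of $2\pi$, thereby eliminating phase jumps. The relationship between the unwrapped phase $\Phi(x, y)$ and the wrapped phase $\varphi(x, y)$ is therefore:

$$\Phi(x, y) = \varphi(x, y) + 2\pi k(x, y)$$

where $k(x, y)$ is the integer fringe order.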
Finally, the mapping between the unwrapped phase and the height is determined and its coefficients are calibrated, so that the phase data of the measured object can be converted into depth data and the 3D topography of the object surface can be obtained.
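The phase extraction and unwrapping steps of this subsection can be sketched on a synthetic one-dimensional profile. The sketch assumes the fringe model $I_n = a + b\cos(\varphi - 2\pi n/N)$ with four steps. The helper names `make_fringes` and `wrapped_phase`, the test phase profile, and the use of NumPy's simple spatial unwrapper `np.unwrap` are illustrative choices, not the method of this paper. The phase-to-height calibration is omitted because its coefficients depend on the measurement system.

```python
import numpy as np


def make_fringes(phase, n_steps=4, a=0.5, b=0.5):
    """Simulate n_steps fringe signals I_n = a + b*cos(phase - 2*pi*n/N)."""
    shifts = 2 * np.pi * np.arange(n_steps) / n_steps
    return np.stack([a + b * np.cos(phase - d) for d in shifts])


def wrapped_phase(fringes):
    """Recover the wrapped phase in (-pi, pi] from N phase-shifted signals."""
    n_steps = fringes.shape[0]
    shifts = 2 * np.pi * np.arange(n_steps) / n_steps
    num = np.tensordot(np.sin(shifts), fringes, axes=1)  # sum_n I_n sin(2*pi*n/N)
    den = np.tensordot(np.cos(shifts), fringes, axes=1)  # sum_n I_n cos(2*pi*n/N)
    return np.arctan2(num, den)  # four-quadrant arctangent


# Hypothetical continuous phase: a linear carrier plus a smooth bump.
x = np.linspace(0.0, 1.0, 512)
true_phase = 6 * np.pi * x + 2.0 * np.exp(-((x - 0.5) / 0.1) ** 2)

fringes = make_fringes(true_phase, n_steps=4)
wrapped = wrapped_phase(fringes)

# Spatial unwrapping: add multiples of 2*pi wherever neighbouring samples jump.
unwrapped = np.unwrap(wrapped)

print("wrapped range:", wrapped.min(), wrapped.max())
print("max unwrapping error:", np.max(np.abs(unwrapped - true_phase)))
```

This simple unwrapper only works when the phase changes by less than $\pi$ between neighbouring samples. Noise and surface discontinuities break that assumption, which is one reason more robust spatial and temporal unwrapping methods exist.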
2.2 YOLOv4 Network
YOLOv4 is mainly composed of a Backbone, a Neck and a Head, as shown in Fig. 2. The Backbone uses the CSPDarknet53 network, which is based on the Darknet53 network of YOLOv3 and draws on the ideas of CSPNet. The Neck is composed of the Spatial Pyramid Pooling Network (SPPNet) structure and the Path Aggregation Network (PANet). SPPNet increases the receptive field of the network, and PANet fuses the deep and shallow features of the Backbone network. In the Head, YOLOv4 uses the YOLOv3 detection head, which performs a 3 × 3 convolution followed by a 1 × 1 convolution to complete the detection.
2.3 Network Model Compression
The YOLOv4 network model is improved in two respects. First, the MobileNetV3 structure replaces the backbone feature extraction network of YOLOv4, and the depthwise separable convolution in MobileNetV3 greatly reduces the number of backbone network parameters. Second, the Mish activation function and the CIoU loss function are introduced to improve the bounding-box regression loss, which improves the generalization of feature extraction.
The YOLOv4 algorithm uses the CSPDarknet53 network as the feature extraction network. It contains 5 residual blocks, built by stacking 1, 2, 8, 8, and 4 residual units respectively. The network has 104 layers in total, of which 72 are convolutional layers, and uses a large number of standard 3 × 3 convolution operations. The calculation therefore consumes a large amount of computing resources, which makes real-time performance difficult to achieve. In addition, as features pass through more convolutional layers, the ability to extract fine local features gradually decreases, which degrades detection performance on small features. It is therefore necessary to improve the YOLOv4 feature extraction network to meet the small-target detection and real-time requirements.
MobileNet uses depthwise separable convolution to split a standard convolution into a depthwise convolution and a 1 × 1 pointwise convolution, and introduces a width multiplier and a resolution multiplier to control the number of model parameters. MobileNetV3 is the third generation of the MobileNet family. It combines the depthwise separable convolution of MobileNetV1, the inverted residuals and linear bottleneck of MobileNetV2, and the Squeeze-and-Excitation (SE) attention mechanism. MobileNetV3 uses neural architecture search (NAS) to search for the network configuration and parameters, and replaces the swish activation function with the cheaper h-swish, which achieves higher accuracy with less computation. For a three-channel input, the depthwise step convolves three 3 × 3 kernels with the input feature map, one kernel per channel, to obtain a feature map whose number of output channels equals the number of input channels. The pointwise step then convolves this feature map with N 1 × 1 kernels to obtain a new N-channel feature map. Compared with CSPDarknet53, MobileNetV3 retains a relatively powerful feature extraction capability while greatly reducing the model size, which makes it more convenient to deploy on mobile terminals in industrial settings. It is also shallower than CSPDarknet53, so it can better extract fine local features and improve detection performance on small targets.
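The parameter saving from depthwise separable convolution can be checked with a short calculation. The sketch below counts weights only (biases and batch normalization are ignored), and the 256-to-512-channel layer is a hypothetical example rather than a layer taken from the network in this paper. It also includes h-swish in its usual form $x \cdot \mathrm{ReLU6}(x + 3)/6$.

```python
import numpy as np


def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in    # one k x k kernel per input channel
    pointwise = c_in * c_out    # c_out kernels of size 1 x 1 x c_in
    return depthwise + pointwise


def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6, a piecewise approximation of swish."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0


# Hypothetical layer: 256 input channels, 512 output channels, 3 x 3 kernels.
std = standard_conv_params(256, 512)
sep = depthwise_separable_params(256, 512)
ratio = sep / std   # equals 1/c_out + 1/k**2

print("standard:", std, "separable:", sep, "ratio:", round(ratio, 4))
print("h_swish([-3, 0, 3]):", h_swish(np.array([-3.0, 0.0, 3.0])))
```

For 3 × 3 kernels the ratio approaches 1/9 as the number of output channels grows, which is the main source of the reduction in backbone size.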
The model is trained with the self-regularized, non-monotonic Mish activation function, which supports effective back-propagation of the training loss and yields better generalization and accuracy while preserving the convergence speed. The calculation formula is:

$$f(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$$

where $x$ is the input of the activation layer and $f(x)$ is the output of the activation layer.
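A minimal NumPy sketch of the Mish activation, $f(x) = x \tanh(\ln(1 + e^{x}))$, is given below. Computing the softplus term with `np.logaddexp` is an implementation choice made here to avoid overflow for large inputs. It is not taken from this paper.

```python
import numpy as np


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    x = np.asarray(x, dtype=np.float64)
    softplus = np.logaddexp(0.0, x)   # ln(1 + e^x), stable for large |x|
    return x * np.tanh(softplus)


xs = np.linspace(-5.0, 5.0, 1001)
ys = mish(xs)

print("mish(0) =", float(mish(0.0)))
print("minimum on [-5, 5]:", ys.min(), "at x =", xs[ys.argmin()])
```

The output dips slightly below zero for moderately negative inputs and returns towards zero for very negative ones, which is the non-monotonic behaviour mentioned above. For large positive inputs it is close to the identity.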
To detect the target more accurately, the training loss is the weighted sum of the bounding-box regression loss, the confidence loss and the classification loss, from which the back-propagated gradient is computed. The calculation formula is:

$$L = L_{box} + L_{conf} + L_{cls}$$

$$L_{box} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} L_{CIoU}$$

where $L$ is the training loss; $L_{box}$ is the bounding-box regression loss; $L_{conf}$ is the target confidence loss; $L_{cls}$ is the category classification loss; $\lambda_{coord}$ is the weight coefficient of the bounding-box regression loss; $S \times S$ is the number of grid cells; $B$ is the number of anchor candidate boxes generated by each grid cell; $\mathbb{1}_{ij}^{obj}$ indicates that there is a target; and $L_{CIoU}$ is the bounding-box loss measured by CIoU.
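The CIoU term penalizes three things at once: low overlap, distance between box centres, and mismatch in aspect ratio. The sketch below implements the commonly published form $1 - \mathrm{IoU} + \rho^2/c^2 + \alpha v$ for a single pair of boxes. The `(x1, y1, x2, y2)` box format, the `eps` guard and the function name `ciou_loss` are assumptions for illustration, not details taken from this paper.

```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU loss between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centres.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v


same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
apart = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
print("identical boxes:", same)
print("disjoint boxes:", apart)
```

Unlike plain IoU, the loss still varies when the boxes do not overlap, because the centre-distance term remains non-zero. This gives a useful gradient early in training.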
The weight coefficient $\lambda_{coord}$ of the bounding-box regression loss sets the proportion of that loss in the overall training loss, and adjusting it can improve the detection accuracy. The confidence loss is calculated from the following quantities: $\lambda_{noobj}$ is the confidence loss weight coefficient; $\mathbb{1}_{ij}^{noobj}$ indicates that there is no target; $\lambda_{c}$ is the loss weight coefficient of each target category; $C_i$ is the confidence of the i-th grid; and $\hat{C}_i$ is the target confidence.
By changing $\lambda_{noobj}$, the weight of the confidence loss in the overall training loss can be adjusted. By changing $\lambda_{c}$, the weight of samples of different categories in the training loss can be set, so that categories with fewer training samples can be accommodated in more complex problems.
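For orientation, a binary cross-entropy confidence loss of the kind commonly used in YOLO-family detectors is sketched below. This is an assumed form and not necessarily the exact expression used in this paper. Here $C_i$ is the predicted confidence of the i-th grid, $\hat{C}_i$ is the target confidence, the indicators $\mathbb{1}_{ij}^{obj}$ and $\mathbb{1}_{ij}^{noobj}$ select candidate boxes with and without a target, $S \times S$ is the number of grid cells, $B$ is the number of candidate boxes per cell, and $\lambda_{noobj}$ weights the no-target term.

```latex
L_{conf} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
             \left[\hat{C}_i \log C_i + (1-\hat{C}_i)\log(1-C_i)\right]
           - \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
             \left[\hat{C}_i \log C_i + (1-\hat{C}_i)\log(1-C_i)\right]
```

Because most candidate boxes contain no target, a weight below 1 on the second sum keeps the background boxes from dominating the loss. A per-category weight can scale the terms of the classification loss in the same way.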
3 Experiment and Result Analysis
To verify the reliability of the algorithm and its effect in actual measurement, a grating projection 3D measurement system consisting of a projector and a camera was built, as shown in Fig. 3. The resolution of the camera (Hikvision MV-CA060-10GC) is 3072 × 2048 and the resolution of the projector (BenQ es6299) is 1920 × 1200. The high-speed vision processor has an i9-10900X CPU (3.7 GHz, 4.5 GHz Turbo) and 64 GB of DDR4 memory, and runs a 32-bit Windows operating system.
The experimental steps are as follows:
(1) Generate sinusoidal grating fringes, using a four-step phase-shifting fringe pattern.
(2) Project the sinusoidal grating fringe pattern onto the homogeneous whiteboard, and collect the grating fringes modulated by the surface of the object.
(3) Use the training data to train the YOLOv4 network model to obtain the mapping between the fringe image and the depth image.
(4) Use the trained network to obtain the depth data of the fringe image.
For deep learning network training, the number of training epochs is uniformly set to 100, the batch size is 16, the initial learning rate is 1e-3, and the initial weights are all set to 1. The training set contains 5012 photos in total. In each round of training, 90% of the photos are used for training and the remaining 10% are used to monitor the training effect in real time. From each training run, the weight file with the lowest loss is selected to compare mAP, model size, and real-time detection speed in Frames Per Second (FPS).
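The training settings above imply the following data split and iteration counts. The sketch is only bookkeeping: the `TrainConfig` class and `split_sizes` function are hypothetical names, and rounding the 90% share down to a whole number of photos is an assumption, since the exact split is not stated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as stated in the text."""
    epochs: int = 100
    batch_size: int = 16
    initial_lr: float = 1e-3
    num_images: int = 5012
    train_fraction: float = 0.9


def split_sizes(cfg):
    """Return (training, validation) photo counts, rounding the training share down."""
    n_train = int(cfg.num_images * cfg.train_fraction)
    n_val = cfg.num_images - n_train
    return n_train, n_val


cfg = TrainConfig()
n_train, n_val = split_sizes(cfg)
steps_per_epoch = math.ceil(n_train / cfg.batch_size)
total_steps = steps_per_epoch * cfg.epochs

print("train/val photos:", n_train, n_val)
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)
```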
As shown in Tab. 1, the model size of standard YOLOv4 is about 220 MB and its FPS is 6.33. After replacing CSPDarknet53 with MobileNetV3, the model size decreases to only 50 MB and FPS increases to 14.35, but mAP drops to 77.48%. It can be concluded that although MobileNetV3 greatly simplifies the network structure, it also reduces mAP considerably. With the improved model proposed in this paper, mAP increases to 82.11%, the model size is 52 MB, and the FPS is 13.67. Although the improvement makes the model slightly larger and the FPS slightly lower than with MobileNetV3 alone, it secures a higher mAP.
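The trade-off described above can be expressed as relative changes computed from the figures quoted in the text. The sketch reads the model sizes as megabytes, which is an assumption consistent with the 52 MB stated in the abstract. The mAP of standard YOLOv4 is not quoted in the text, so only the gain over the MobileNetV3 variant is computed.

```python
# Figures quoted in the text (model size in MB, detection speed in FPS, mAP in %).
baseline = {"size_mb": 220.0, "fps": 6.33}
mobilenet = {"size_mb": 50.0, "fps": 14.35, "map": 77.48}
proposed = {"size_mb": 52.0, "fps": 13.67, "map": 82.11}

# Size reduction and speed-up of the proposed model relative to standard YOLOv4.
size_reduction = 100 * (1 - proposed["size_mb"] / baseline["size_mb"])
speedup = proposed["fps"] / baseline["fps"]

# mAP gained over the plain MobileNetV3 backbone swap, in percentage points.
map_gain = proposed["map"] - mobilenet["map"]

print(f"size reduction vs YOLOv4: {size_reduction:.1f}%")
print(f"speed-up vs YOLOv4: {speedup:.2f}x")
print(f"mAP gain vs MobileNetV3-YOLOv4: {map_gain:.2f} points")
```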
Using the experimental system and the trained deep learning models, 3D reconstruction is performed on an object with a simple shape and an object with a complex shape. Training runs on the high-speed vision processor, and pre-trained weights are used to train the original YOLOv4 network and the improved YOLOv4 model of this paper. Finally, the results of the three models are compared. Fig. 4 shows the input fringe images of the simple-shaped object, the pony spoon, captured at 4 different phases. Fig. 5 shows the final optimal depth image. Tab. 2 lists the 3D reconstruction results of the three models.
Fig. 6 shows the input fringe images of the complex-shaped object, a human face, captured at 4 different phases. Fig. 7 shows the final optimal depth image. Tab. 3 lists the 3D reconstruction results of the three models.
The simulated 3D reconstruction of the two objects shows that the second, more complex model has richer fringe features, from which the phase change is easier to obtain, so its reconstruction accuracy and speed are better than those of the first example. The simulation results of the three algorithms also show that the lightweight YOLOv4 model in this paper is superior to the other two models in terms of average phase error, point cloud restoration rate and running time, but detail reconstruction at the sub-pixel level still needs further research.
4 Conclusion
Based on fringe structured light 3D reconstruction, this paper proposes a fringe image feature extraction algorithm based on a lightweight YOLOv4. The model replaces the CSPDarknet backbone network of YOLOv4 with a lightweight MobileNet network, which simplifies the network structure and improves the real-time performance of detection. It also uses the Mish activation function and the CIoU loss function to improve the bounding-box regression loss, which improves feature detection accuracy and real-time performance. The experimental results show that, compared with the existing 3D reconstruction methods, the depth information calculated by the proposed method is more accurate, which improves the accuracy of 3D measurement from fringe images. The method can therefore be used effectively in fringe projection 3D measurement and better meets the needs of 3D shape measurement in scientific research and practical applications. Future work will study the effectiveness of the proposed method in more experimental scenarios, such as the accuracy of fringe image depth estimation for colored objects, highly reflective objects, and out-of-focus projection. In addition, the generalization ability of the model is a common problem in deep learning, and it is a key issue to address when improving the proposed method.
Acknowledgement: The authors thank Dr. Jinxing Niu for his suggestions. The authors thank the anonymous reviewers and the editor for the instructive suggestions that significantly improved the quality of this paper.
Funding Statement: This work is funded by the Training Plan for Young Backbone Teachers in Colleges and Universities in Henan Province under Grant No. 2021GGJS077.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.