Weakly-Supervised Single-view Dense 3D Point Cloud Reconstruction via Differentiable Renderer

In recent years, addressing ill-posed problems by leveraging prior knowledge contained in databases through learning techniques has gained much attention. In this paper, we focus on complete three-dimensional (3D) point cloud reconstruction based on a single red-green-blue (RGB) image, a task that cannot be approached using classical reconstruction techniques. For this purpose, we used an encoder-decoder framework to encode the RGB information in latent space, and to predict the 3D structure of the considered object from different viewpoints. The individual predictions are combined to yield a common representation that is used in a module combining camera pose estimation and rendering, thereby achieving differentiability with respect to the imaging process and the camera pose, and enabling optimization of the two-dimensional prediction error of novel viewpoints. Thus, our method allows end-to-end training and does not require supervision based on additional ground-truth (GT) mask annotations or ground-truth camera pose annotations. Our evaluation on synthetic and real-world data demonstrates the robustness of our approach to appearance changes and self-occlusions, and shows that it outperforms current state-of-the-art methods in terms of accuracy, density, and model completeness.


Introduction
The inference of underlying object or scene geometry is among the classical goals of computer vision and graphics, and a fundamental prerequisite for numerous applications in entertainment, robotics, navigation, and architecture. Examples include guidance of robot interactions with objects in a scene based on their shape, as well as augmented and virtual reality solutions for gaming, interior design [1], remote collaboration [2][3][4] and teleoperation [5,6]. Geometry reconstruction is also significant for microscopic-scale objects. For example, surface morphology inference based on surface profile reconstruction [7] serves assembly deviation estimation [8] and analysis of the replacement of the actual machining surface [9].
Besides the well-established multi-view approaches, such as multi-view stereo [10], structure-from-motion (SfM) [11], simultaneous localization and mapping (SLAM) [12] and single-view-based three-dimensional (3D) scanning based on structured light systems [13] or laser scanners [14], more approaches are now focusing on learning-based scene representation schemes [15], especially for single-view scenarios. When taken into account, prior knowledge derived from large-scale datasets can yield remarkable reconstruction results from single images [16].

Open Access
Chinese Journal of Mechanical Engineering (2021) 34:93
... the benefits of structured input data, with the limitation of representing surface information with relatively few voxels. Hence, the granularity of the reconstruction result is strongly limited by the computational burden and memory consumption associated with 3D CNNs. Furthermore, considering structured input data in terms of the point connectivity of meshes is more efficient due to the direct consideration of points on the surface; however, it is non-trivial to efficiently integrate the connectivity information in the training process. In turn, unstructured point clouds offer the aforementioned advantage of direct representation of the surface with high granularity, without the need to consider the connectivity between points during training; however, the lack of any grid structure and the required permutation invariance must be considered within point cloud specific architectures and loss definitions [37][38][39][40]. Key challenges include the generation of dense point clouds to avoid incomplete object representation, which has a high computational burden and high memory requirements. The reconstruction quality of single image-based approaches depends heavily on the available training data. In general, impressive single image-based reconstruction results have been obtained using large datasets of ground truth annotations. Obtaining perfect 3D computer-aided design (CAD) models as ground truth data for real-world environments is highly challenging; therefore, several approaches have focused on weakly supervised [25,41,42] or unsupervised [43,44] learning to reduce the need to acquire 3D ground truth data for explicit supervision. However, neural scene representation and rendering, as applied in Ref. [43], does not represent the 3D structure well, thereby limiting the quality of 3D structure recovery from a small number of observations. The structure-aware scene representation network presented in Ref. [44] encodes both geometry and appearance; however, the applied ray marcher cannot accommodate surfaces with holes and boundaries of self-occluding structures, as commonly encountered in the 'chair' object category. Nevertheless, these approaches show potential, particularly for multi-view observations.
The key proposition of this paper is that an accurately predicted shape should provide reasonable depth estimates from any viewpoint. For this purpose, we take the depth maps as supervision signals and propose a novel weakly supervised approach to reconstruct a dense 3D point cloud. Given the input of a single red-green-blue (RGB) image, we use an encoder-decoder architecture to first encode the RGB information in a latent representation and then predict the 3D structure of the considered object from different viewpoints. Then, we combine these individual 3D structure predictions into a common coordinate system to reconstruct the point clouds, and further synthesize the depth maps from novel viewpoints to optimize the two-dimensional (2D) prediction error.
Most optimization processes [25,42,45,46] rely on the availability of ground truth data for novel viewpoint poses. For instance, Navaneet et al. [45] and Lin et al. [46] specified the viewpoint poses for CAD models. Tulsiani et al. [25] and Gwak et al. [42] trained models based on viewpoint pose annotations. Developing setups for low-cost object digitization without the requirement for expensive annotations or calibration requires that these restrictions be overcome. Therefore, we designed a differentiable rendering module and combined it with a pose estimation network to identify the poses for novel viewpoints. The rendering module is capable of handling the appearance changes and self-occlusions that may occur from certain viewpoints, and can estimate the camera poses even with large baselines, which makes it possible to randomly set the novel viewpoints.

Structure Estimation
In an initial step, we aim to derive a dense 3D point cloud representation from a single RGB image acquired from an arbitrary view. For this purpose, we leverage the potential of deep learning for generative 3D modelling. Key challenges include efficient and accurate 3D representation of the considered object, as well as the design of a pipeline that allows end-to-end learning without requiring annotated data. The proposed pipeline is shown in Figure 1.
To meet these challenges, we use an encoder-decoder architecture that encodes information from the input image in a latent scene representation that, in turn, allows 3D structure predictions $D_v$ ($v \in V$) from $V$ different fixed viewpoints. These view-dependent structures $D_v$ correspond to an image-based representation of 3D point cloud coordinates $\mathbf{x}_i = (x_i, y_i, z_i)^T$ according to the respective viewpoint $v$. Representing point clouds in terms of multi-channel images allows for the use of 2D convolutions instead of memory-intensive 3D convolutional operations for calculation of the volumetric structure. During training, the encoder learns the latent scene representation, and the decoder learns to generate 3D structures $D_v$ from that representation. Finally, the structure images $\mathbf{x}_i = (x_i, y_i, z_i)^T$ predicted for different viewpoints $v \in V$ are fused into a single 3D point cloud $p_i \in P_V$ by transforming the respective view-centric point cloud coordinates into a common world coordinate system (WCS) according to

$$p_i = R_v^{-1}\left(K^{-1}\mathbf{x}_i - t_v\right), \qquad (1)$$

where $K$ denotes the camera calibration matrix, and the views are specified based on pairs of rotations and translations $(R_v, t_v)$. Thus, applying $(R_v, t_v)$ shifts a point from the world coordinate frame to the view-centric coordinate frame of view $v$, and the inverse transform is then applied to transfer points from view-specific coordinate frames to the global coordinate system.
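The fusion step in Eq. (1) can be illustrated with a minimal NumPy sketch. The function name `fuse_views` and the storage convention for the structure images (each pixel holding pixel coordinates scaled by depth, so that $K^{-1}\mathbf{x}_i$ gives a view-centric point) are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def fuse_views(structures, K, poses):
    """Fuse per-view structure images into one world-frame point cloud.

    structures: list of (H, W, 3) arrays holding x_i = (u*z, v*z, z) per pixel
                (assumed convention, not stated in the text).
    K:          (3, 3) camera calibration matrix.
    poses:      list of (R_v, t_v) pairs mapping world -> view coordinates.
    """
    K_inv = np.linalg.inv(K)
    fused = []
    for x_img, (R, t) in zip(structures, poses):
        x = x_img.reshape(-1, 3).T               # 3 x M image-based coordinates
        cam = K_inv @ x                          # view-centric coordinates
        world = R.T @ (cam - t.reshape(3, 1))    # R^{-1} = R^T for a rotation
        fused.append(world.T)
    return np.concatenate(fused, axis=0)         # (sum of M, 3) fused cloud
```

Because every view contributes H x W points, the fused cloud grows linearly with the number of fixed viewpoints, which is consistent with the remark that more structures give denser point clouds.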
Note that training the StructureCNN does not rely on ground truth annotations of 3D structures $D_v$ or 3D shapes $P_V$ for direct supervision as required in the approach of Lin et al. [46]. Instead, we jointly train the structure network and a component that optimizes 2D projection errors and the camera pose prediction.

Optimization Based on 2D Projections from Multiple Views
The 3D point cloud reconstruction obtained by fusing the multi-view structure predictions from the aforementioned structure network is noisy and needs further optimization. Further optimization of our point cloud avoids the need for novel viewpoint pose annotations by integrating a pose estimation network into the designed differentiable rendering module.

Differentiable Rendering Module
The renderer represents the forward imaging process of a camera. In our pipeline, the renderer takes the reconstructed point cloud $P_V$ as the input to render depth images $\hat{D}_n$ for novel views $(R_n, t_n)$, which are then used for 2D projection optimization to minimize the depth errors $L_N = \sum_{n=1}^{N} \|D_n - \hat{D}_n\|_1$. Here, the image coordinates $\hat{\mathbf{x}}_i$ of the individual points of the common point cloud under the view $(R_n, t_n)$ are obtained according to

$$\hat{\mathbf{x}}_i = K\left(R_n p_i + t_n\right). \qquad (2)$$

This process can be inverted. Given the depth information and respective image coordinates, the points of the surface parts visible in a particular view can be reconstructed by back projection of the 2D depth maps.
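A minimal sketch of the forward projection and the depth loss described above is given here. It assumes a perspective camera (division by depth to obtain pixel coordinates); the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_points(P, K, R, t):
    """Project world points P (M, 3) into a view: x_hat = K (R p + t).

    Returns pixel coordinates (u, v) and the depth z of every point.
    A perspective division by z is assumed here.
    """
    cam = K @ (R @ P.T + t.reshape(3, 1))   # 3 x M homogeneous image coordinates
    z = cam[2]
    return cam[0] / z, cam[1] / z, z

def depth_l1_loss(depth_gt, depth_rendered):
    """L_N: sum over novel views of the L1 norm of D_n - D_hat_n."""
    return sum(np.abs(d - d_hat).sum() for d, d_hat in zip(depth_gt, depth_rendered))
```

With an orthographic camera the division by `z` would be dropped, so the projection function above should be read as one possible instance of the imaging model rather than the only one.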
Unlike the approach of Fan et al. [35], wherein the number of points in the point cloud is fixed and predefined, our approach allows for the generation of dense, unordered points whose number varies across objects. As shown in Figure 2(a), many points may project onto the same pixel, and the resulting discretization may reduce the image quality. Lin et al. [46] used a pseudo-renderer to resolve this issue by upsampling the depth image by a factor U, such that points are projected onto isolated pixels (Figure 2(b)). This results in high memory consumption. Moreover, as the point clouds are dense, different points may still be projected onto the same pixels. The optimization does not consider these pixels, and their gradients do not contribute to the outputs, resulting in more outliers. Furthermore, the discretization error of the 2D projections of the point clouds, caused by rounding to the depth image resolution [46], means that the model cannot differentiate among view poses.
To obtain better point cloud optimization results, we only store the point with minimum depth $\hat{z}_i = \min_{j \in s} \hat{z}_j$ in the respective image $\hat{\mathbf{x}}_i = (\hat{x}_i, \hat{y}_i, \hat{z}_i)$ in cases of $s$ projected points per pixel (Figure 2(c)). In other words, we only consider visible aspects in the respective views. All projected pixels contribute to the optimization process, which reduces the influence of outliers on the results. Furthermore, to also achieve differentiability with respect to the viewpoint, we compute the ground-truth depth value $d_i$ corresponding to the rendered depth value $\hat{d}_i = \hat{z}_i$ at location $(\hat{x}_i, \hat{y}_i)$ by bilinear interpolation (Eq. (3)). As shown in Figure 3, $d_i$ is approximated by two linear interpolations, $d_d$ and $d_u$, where $d_{11}$, $d_{12}$, $d_{21}$, and $d_{22}$ are the depth values of the local four-pixel neighborhood on the ground truth. The bilinear sampling at location $(\hat{x}_i, \hat{y}_i)$ is differentiable with respect to the camera pose $(R_n, t_n)$, and the reconstructed point $p_i$ enters via Eq. (2), such that the framework is differentiable with respect to point cloud generation and viewpoint pose prediction, and can be trained end-to-end.
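The two ingredients of this renderer, minimum-depth selection per pixel and bilinear sampling of the ground-truth depth, can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function names are hypothetical, pixels are indexed by flooring the projected coordinates, and the assignment of the upper and lower rows to $d_u$ and $d_d$ is a guess, since Figure 3 is not reproduced here.

```python
import numpy as np

def render_min_depth(u, v, z, H, W):
    """Keep only the nearest projected point per pixel (inf marks empty pixels)."""
    ui = np.floor(u).astype(int)
    vi = np.floor(v).astype(int)
    inside = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H) & (z > 0)
    depth = np.full((H, W), np.inf)
    # Unbuffered in-place minimum: repeated pixel indices are all taken into account.
    np.minimum.at(depth, (vi[inside], ui[inside]), z[inside])
    return depth

def bilinear_sample(depth_gt, x, y):
    """Interpolate the ground-truth depth at a continuous location (x = column, y = row)."""
    H, W = depth_gt.shape
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    ax, ay = x - x0, y - y0
    # Two linear interpolations along x (assumed to correspond to d_u and d_d) ...
    d_u = (1 - ax) * depth_gt[y0, x0] + ax * depth_gt[y0, x0 + 1]
    d_d = (1 - ax) * depth_gt[y0 + 1, x0] + ax * depth_gt[y0 + 1, x0 + 1]
    # ... followed by one linear interpolation along y.
    return (1 - ay) * d_u + ay * d_d
```

The sampled value is a smooth function of the continuous location, so in an automatic-differentiation framework the gradient can flow back through the projected coordinates to the camera pose, whereas rounding to the nearest pixel would block it.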

Camera Pose Estimation Network
The renderer designed above accepts arbitrary camera poses $(R_n, t_n)$, so we can estimate the poses of novel views. For this purpose, we integrate a pose estimation network into the rendering module. Here, we use a convolutional network (PoseCNN) that takes the depth maps $D_n$ ($n \in N$) at $N$ novel viewpoints as input and estimates their respective poses $(R_n, t_n)$. This allows us to avoid dependence on pre-defined depth maps with pose annotations, as in practice only depth maps with unknown poses are likely to be available for supervision.
As illustrated in Figure 4, we take the depth maps $D_f$ with known poses as references, to train the PoseCNN and estimate the poses $(R_n, t_n)$ in $N$ novel views. In theory, a single reference $D_f$, even with an unknown pose, suffices to train PoseCNN: we can take the local coordinate system of $D_f$ as the WCS and estimate the other camera poses with respect to $D_f$. As a larger number of references $D_f$ improves the pose estimation accuracy (see the experimental results in Section 4.2.2), we use eight reference views $D_f$ in this paper.
The point cloud $\hat{P}_N$ fused with accurately estimated camera poses is expected to align with the ground-truth point cloud $P_N$; that is, the Euclidean distance between them should be very small. There are 3D metrics for comparing point clouds, such as the Chamfer distance [35], which determines the distance from each point to its nearest neighbor in the other point cloud. To avoid costly 3D-based optimization using such computationally intensive 3D metrics, we instead require that the rendered depth map $\hat{D}_f$ be consistent with $D_f$. The 2D optimization based on minimizing the $L_1$ loss between $D_f$ and $\hat{D}_f$ is more efficient. Furthermore, the 3D metrics are invalid when there are dramatically different appearances between views, as shown in Figure 5(a), in which two views with opposite orientations capture completely different aspects of the scene. Here, the 3D optimization will incorrectly estimate the novel view, which will instead largely coincide with the reference view, as shown in Figure 5(b). The proposed 2D optimization is effective for this situation and robust to appearance changes and self-occlusion, as verified experimentally (see Section 4.2.2).

Implementation Details
We used the most recent and relevant research results of Lin et al. [46] and Navaneet et al. [45], based on their state-of-the-art single-view point cloud reconstruction methods, as the baseline and prepared identical datasets to allow comparison with our proposed method. The details of the experimental setup and qualitative and quantitative results are as follows.

Data Preparation
(1) Synthetic dataset: ShapeNetCore [47] contains about 55 object categories, from which a subset of 3D models is used for experimental evaluations. For each 3D model, we rendered 24 RGB images (64 × 64 × 3) with azimuth angle steps of 15° and an elevation angle of 30°, 100 depth maps $D_n$ (64 × 64) at random novel viewpoints, and eight 3D structures $D_v$ at fixed viewpoints (i.e. the eight corners of a central cube). Using more structures $D_v$ yields denser inferred point clouds. The depth maps $D_n$ at the supervised views should capture as much detail as possible, leaving almost no occluded areas across the fields of view.
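The viewpoint layout described above (24 azimuth steps of 15° at 30° elevation, plus the eight corners of a cube) can be enumerated with a short sketch. The axis convention (z up), the unit camera distance, and the cube half-width are assumptions made for illustration; the text does not specify them.

```python
import numpy as np
from itertools import product

def camera_position(azimuth_deg, elevation_deg, radius=1.0):
    """Camera centre on a sphere around the object (z-up convention assumed)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return radius * np.array([np.cos(el) * np.cos(az),
                              np.cos(el) * np.sin(az),
                              np.sin(el)])

# 24 RGB input views: azimuth 0, 15, ..., 345 degrees at 30 degrees elevation.
rgb_views = [camera_position(az, 30.0) for az in range(0, 360, 15)]

# Eight fixed structure viewpoints at the corners of a cube centred on the object.
cube_corners = [np.array(c) for c in product((-1.0, 1.0), repeat=3)]
```

The 15° step covers the full circle exactly (24 × 15° = 360°), which matches the stated number of RGB renderings per model.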
(2) Real-world dataset: Pix3D [48] contains real images and corresponding 3D CAD models. We selected four categories for our experiment, i.e. 'bed', 'chair', 'desk', and 'sofa', rendered $D_n$, and generated ground-truth point clouds based on the CAD models. Additionally, we tested the 'chair' object category from the Stanford Online Products dataset [49].

Network Architecture
We designed 2D convolutional neural networks. As shown in Figure 6, StructureCNN and PoseCNN share the same encoder architecture. The encoder consists of four convolution layers having 96, 128, 192, and 256 channels, and three fully connected layers having 2048, 1024, and 512 neurons. For StructureCNN, the decoder consists of three fully connected layers with 1024, 2048, and 4096 neurons. The feature maps are rescaled by nearest neighbor interpolation, followed by convolution layers. Batch normalization and rectified linear unit (ReLU) layers were added between the convolution layers. The filter size was fixed at 3 × 3. The stride was 2 in the encoder and 1 in the decoder. PoseCNN used two fully connected layers with 64 and 7 neurons, respectively. The seven outputs comprise a quaternion and the x, y, z position of the viewpoint. The PoseCNN can predict the viewpoints of the depth maps scattered in supervised views, which facilitates the training of the StructureCNN. After training, the point cloud of an object can be inferred by feeding a single RGB image into the StructureCNN.
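As a quick consistency check of the encoder dimensions, the following sketch traces the feature-map sizes for a 64 × 64 input through four stride-2 convolutions. It assumes 'same' padding, which the text does not state; the helper names are hypothetical.

```python
import math

def conv_out(size, stride, kernel=3, padding="same"):
    """Spatial output size of a convolution ('same' padding is an assumption)."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

def encoder_shapes(input_size=64, channels=(96, 128, 192, 256), stride=2):
    """Feature-map shapes (H, W, C) after each encoder convolution."""
    shapes, size = [], input_size
    for ch in channels:
        size = conv_out(size, stride)
        shapes.append((size, size, ch))
    return shapes

shapes = encoder_shapes()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]  # input width of the first FC layer
pose_outputs = 4 + 3                                       # quaternion + (x, y, z) position
```

Under this assumption the last feature map is 4 × 4 × 256, i.e. 4096 values, which coincides with the width of the last decoder fully connected layer and would allow a reshape back to a 4 × 4 × 256 map before upsampling.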

Training Paradigm
Because the viewpoints inferred in the initial training iterations are often inaccurate, the learned point cloud would be meaningless; learning the two jointly from scratch is therefore susceptible to local minima. Following the suggestions of prior work [34], we employed a two-step training paradigm. First, we trained PoseCNN with eight fixed viewpoints taken as references, using the loss in Eq. (4), in which the primed term with subscript $v$ corresponds to the rendered result. Then, we trained StructureCNN based on estimated novel viewpoint poses. An RGB image randomly selected from the 24 views was used as input for each iteration. In Eq. (4), the loss is defined as the $L_1$ distance. We used TensorFlow to implement our framework, with a learning rate of 0.0001 and the Adam optimizer.

Experimental Design
The experiments mainly addressed three questions: (1) the accuracy and robustness of the viewpoint pose prediction, (2) the performance of the point cloud reconstruction for a single object category, and (3) the generality of the proposed framework to multiple and unseen categories.
For the first two questions, we trained and tested the network on the 'chair' category. For the third question, we trained and tested the network on multiple categories. All of the datasets were randomly split into training and test sets (80% and 20%, respectively). We also tested unseen categories.

Accuracy of the Pose Prediction
We used the eight fixed views as a reference to estimate 10 random novel viewpoints. Table 1 shows the averaged results of the test split. The camera orientation was represented by a quaternion. The error is the angle between the optical axes of the camera for the estimated and GT results. The largest error was 0.340°. According to the following results, the pose prediction was sufficiently accurate to guarantee point cloud reconstruction accuracy.
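The error measure described above, the angle between the optical axes of the estimated and GT cameras, could be computed from quaternions along the following lines. The quaternion ordering (w, x, y, z), the choice of the camera z-axis as the optical axis, and the function names are assumptions for illustration.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3 x 3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def optical_axis_angle_deg(q_est, q_gt, axis=(0.0, 0.0, 1.0)):
    """Angle in degrees between the optical axes given by two orientations."""
    axis = np.asarray(axis)
    a = quat_to_rot(q_est) @ axis
    b = quat_to_rot(q_gt) @ axis
    cos_angle = np.clip(np.dot(a, b), -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cos_angle))
```

Note that this measure ignores any roll about the optical axis: two cameras that look in the same direction but are rotated about that direction yield an error of zero.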

Robustness of the Viewpoint Pose Prediction
We used eight fixed views to evaluate the robustness of the viewpoint pose prediction and the impact of the number of reference views on the results. Figure 7(a) shows the eight fixed views; view 3 was selected as the reference. The estimated poses are listed in Table 2. Relative to view 3, the other seven views all exhibit some degree of appearance change; for view 6 in particular (Figure 7(b)), the appearance is completely different from that of the reference. Besides the appearance changes, every image has self-occlusions caused by the arms or legs of the chair. The orientations are all estimated accurately, indicating that the proposed renderer can not only differentiate among viewpoints, but is also robust to appearance changes and occlusions.
The accuracy of the results shown in Table 2 was lower than that of those in Table 1, indicating that the number of reference views affects the pose estimation accuracy. Figure 8 shows the training process for view 8 pose estimation according to the number of reference views. The learning rate was 0.001. Figure 8 shows the first and final 50 iterations of the training process. The Y-axis is the angle between the optical axes of the estimated and GT results, where a greater number of reference views will improve convergence speed and accuracy. However, using a very high number of views is redundant and expensive; thus, the proposed framework used eight reference views.
The robustness and effectiveness of the viewpoint prediction allowed the reference and novel views to be set flexibly, without considering appearance changes or occlusions. Figure 9 shows the 3D point clouds generated for the chair test split. The reconstruction errors are defined by the point-wise 3D Euclidean distance using Eq. (5), which represents the 3D shape similarity [50]; $\hat{P}$ and $P$ are the generated and ground-truth point clouds, respectively. Table 3 lists the errors $E$, scaled by a factor of 100; our results are more accurate.
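Since Eq. (5) is not reproduced here, the following is only one common form of a point-wise 3D Euclidean error: the mean distance from each point of one cloud to its nearest neighbor in the other. The function name and the brute-force nearest-neighbor search are illustrative assumptions, not the exact metric used for Table 3.

```python
import numpy as np

def nearest_neighbor_error(src, dst):
    """Mean Euclidean distance from each point in src (M, 3) to its nearest point in dst (N, 3)."""
    diff = src[:, None, :] - dst[None, :, :]   # (M, N, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=2)        # (M, N) pairwise distances
    return dist.min(axis=1).mean()
```

Evaluating the function in both directions (generated to ground truth, and ground truth to generated) separates two failure modes: the first direction penalizes outlier points, while the second penalizes missing surface regions. The brute-force search uses O(MN) memory, so a spatial index would be preferable for very dense clouds.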

Point Cloud Reconstruction for a Single Object Category
Although the network of Lin et al. [46] is pretrained based on the GT 3D structures, the training process does not consider pixels with more than one projection, which leads to outlier points, as shown in Figure 10(a). Navaneet et al. [45] also calculated the gradient of each pixel for optimization and obtained fewer outliers. However, they considered the masks as 2D observations and failed to resolve the concavity or finer details, as shown in Figure 10(b). We successfully generated these structures.

Figure 10
The results of outliers and concavity

Training/testing on Multiple Categories Using ShapeNetCore
The categories included 'airplane', 'bed', 'bench', 'bus', 'chair', and 'rifle'. The qualitative and quantitative results of the test split are shown in Figure 11 and Table 4.
For convex objects, such as a bus, the results of Navaneet et al. [45] are comparable to our own. For concave objects and finer details, such as the arms on chairs and rifles, our network is more effective.

Testing Out-of-category in ShapeNetCore
The ability to generalize prior learning from seen categories to unseen categories is important for an intelligent agent. Relative to the training set, the motorbike and car are completely novel categories; as there are few training instances with similar shapes, they were used for the out-of-category tests. The results are shown in Figure 12 and Table 5. Many finer structures were resolved, such as the wheel of the motorbike and the tailgate of the car and, compared to Lin et al. [46], there were fewer outliers. Navaneet et al. [45] simply reconstructed the bounding boxes of the objects. We were also largely able to reconstruct these structures.

3D Reconstruction Using Pix3D and Stanford Online Products Datasets
To deal with real images, the proposed framework was further fine-tuned using the Pix3D dataset. We assumed an orthographic camera with a default intrinsic matrix $K$. The qualitative results for the bed, chair, desk, and sofa categories are illustrated in Figure 13, and the quantitative results are presented in Table 6. Although items on the bed and desk heavily occluded the objects, we still effectively captured the finer details and concave structures.
As there are no CAD models in the Stanford Online Products dataset, we cannot generate ground-truth point clouds, so we instead manually selected some untruncated instances for qualitative tests. The results are illustrated in Figure 14.
Based on the above experimental results, the proposed framework offers only a slight advantage for single-category reconstruction. For multiple classes, the advantages of our framework are obvious. In particular, for out-of-category and real images, we successfully reconstructed the concavity and finer structures. Furthermore, across different experimental settings, including single, multiple, and unseen categories of rendered and real-world data, the error rates were similar, at 1.61, 1.571, 2.200, and 1.51. The accuracy was higher for the multiple-category case than for the single-category case. Overall, the visual and quantitative results demonstrate that the proposed framework has better generalization ability for synthetic and real-world domains.

Conclusions
We introduced an approach for complete 3D point cloud reconstruction from a single RGB image.
(1) We combined an encoder-decoder framework, for generative structure prediction from a single RGB image, and an optimization framework based on a differentiable renderer module, whereby the training is supervised through 2D observations in novel views. (2) By adding a pose estimation network, the renderer is designed to be differentiable for both point cloud reconstruction and viewpoint pose prediction, which allows end-to-end training and avoids the need for viewpoint pose, structure, or mask annotations in the datasets. (3) Experimental results for synthetic and real-world datasets demonstrated that our approach is robust to appearance changes and self-occlusions, and shows superior accuracy, density, model completeness, and generalization potential compared to state-of-the-art methods.