3.1 Synthetic Dataset with Simulated Bone Cutting
We collected a dataset \(\mathcal {D}\) containing depth captures of a cadaver leg with an intact femur surface. A commercial depth camera, the RealSense D415, was used to be consistent with the original work [8]. The camera can acquire depth data with less than 2% error at up to 90 frames per second, which is sufficient for real-time applications [20]. The dataset includes 9334 depth captures of lab scenes. Each depth capture is represented by a \(160\times 160\times 4\) matrix: the first three channels in the last dimension are the (x, y, z) coordinates of a sampled pixel, and the last channel is a binary label that denotes whether the point belongs to the femur surface.
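To make the capture layout concrete, a minimal sketch of unpacking one capture is shown below; the on-disk format and file name are assumptions, as the text does not specify them.

```python
import numpy as np

# Hypothetical loader: one capture stored as a .npy array (file layout assumed;
# only the 160 x 160 x 4 shape and channel meaning come from the text).
capture = np.load("capture_0000.npy")       # shape (160, 160, 4)
xyz = capture[..., :3]                      # per-pixel (x, y, z) coordinates
femur_mask = capture[..., 3].astype(bool)   # binary label: pixel on the femur surface
femur_points = xyz[femur_mask]              # (M, 3) segmented femur surface points
```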
Sawbones were used to generate realistic femur cutting patterns. Figure 1 shows the overall workflow. The original femur surface \(f_o\) was first captured and manually segmented from the depth frame. After cutting the femur around the condyle following a conventional procedure for OCD removal, the modified femur surface \(f_m\) was captured again. To account for a possible change in spatial pose, \(f_o\) was registered to \(f_m\), and the depth value \(\tilde{z}\) of points falling in an annotated cutting area A was interpolated. \(\tilde{z}\) was then subtracted from the depth value z of \(f_m\) in the cutting area (i.e., \(\Delta z_A = z - \tilde{z}\)), and the result was normalised. To ensure a smooth connection between the modified and unmodified surfaces, zeros were padded around the edge of \(\Delta z_A\). The padded 3D variation was fitted by Clough-Tocher 2D interpolation to obtain a cutting pattern f. The same procedure was repeated 20 times to generate enough patterns with different cutting shapes and depth variations.
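A minimal sketch of this pattern-extraction step is given below using SciPy's CloughTocher2DInterpolator. The toy surfaces, the normalisation convention, and the zero-padded ring around A are assumptions standing in for the real registered depth data.

```python
import numpy as np
from scipy.interpolate import CloughTocher2DInterpolator, griddata

rng = np.random.default_rng(0)

# Toy stand-ins for the registered surfaces; real data comes from the depth frames.
xy_o = rng.uniform(0.0, 1.0, (500, 2))                  # (x, y) samples of f_o
z_o = 0.1 * xy_o[:, 0]                                  # f_o depth: a gentle slope
xy_m = rng.uniform(0.2, 0.8, (200, 2))                  # pixels inside the cut area A
z_m = 0.1 * xy_m[:, 0] - 0.05 * np.exp(-20.0 * ((xy_m - 0.5) ** 2).sum(axis=1))  # f_m with a dent

z_tilde = griddata(xy_o, z_o, xy_m, method="linear")    # interpolate f_o depth at cut pixels
delta_z = z_m - z_tilde                                 # depth change inside A (z - z_tilde)
delta_z /= np.abs(delta_z).max()                        # normalise (exact convention assumed)

# Pad zeros on a ring around A so the pattern connects smoothly to the intact
# surface, then fit the continuous cutting pattern f by Clough-Tocher interpolation.
theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
ring = 0.5 + 0.45 * np.column_stack([np.cos(theta), np.sin(theta)])
pattern_f = CloughTocher2DInterpolator(
    np.vstack([xy_m, ring]),
    np.concatenate([delta_z, np.zeros(len(ring))]),
    fill_value=0.0,
)
print(pattern_f(0.5, 0.5))                              # pattern depth at the dent centre
```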
The intact femur points p were segmented from \(\mathcal {D}\) according to their binary labels. K (\(=1\)–3 in our trials) rectangular areas of arbitrary size and location were selected on the femur surface, to which the collected deformation patterns f were mapped and scaled by an arbitrary maximum intrusion depth (0–15 mm). The original and deformed point clouds were separately resampled into \(N\times 3\) point clouds and concatenated into an \(N\times 6\) array. Since the runtime of the later reconstruction is proportional to the number of points N, we chose \(N=2500\) as a compromise between speed and surface representation quality.
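The deformation step can be sketched as follows. The rectangle-sampling heuristics, units, and the toy pattern are assumptions, but the flow (K random rectangles, scaling by a 0–15 mm intrusion depth, resampling to \(N=2500\), concatenation into \(N\times 6\)) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
pattern_f = lambda u, v: np.sin(np.pi * u) * np.sin(np.pi * v)   # toy stand-in for a fitted pattern f
pts = np.column_stack([rng.uniform(0.0, 0.1, (4000, 2)),
                       np.full(4000, 0.5)])                      # toy femur patch, units in metres

def apply_cuts(points, k, rng):
    """Carve k random rectangular cuts into the surface (rectangle heuristics assumed)."""
    p = points.copy()
    lo, hi = points[:, :2].min(axis=0), points[:, :2].max(axis=0)
    for _ in range(k):
        centre = rng.uniform(lo, hi)                             # arbitrary location
        half = rng.uniform(0.1, 0.3) * (hi - lo)                 # arbitrary size
        inside = np.all(np.abs(p[:, :2] - centre) <= half, axis=1)
        uv = (p[inside, :2] - (centre - half)) / (2.0 * half)    # map rectangle to unit square
        p[inside, 2] -= rng.uniform(0.0, 0.015) * pattern_f(uv[:, 0], uv[:, 1])  # 0-15 mm intrusion
    return p

def resample(points, n=2500, rng=rng):
    """Resample a point cloud to exactly n points."""
    return points[rng.choice(len(points), n, replace=len(points) < n)]

deformed = apply_cuts(pts, k=int(rng.integers(1, 4)), rng=rng)   # K = 1-3 cuts
sample = np.hstack([resample(pts), resample(deformed)])          # N x 6 training sample
```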
3.2 Network Architecture
The surface reconstruction network is shown in Figure 2. The encoder borrows the PointNet design of sequential shared multilayer perceptrons (MLPs) followed by a max pooling layer, so that the loss expression is invariant to the order of the input points. Note that, for our purpose, no input or feature transformation (T-Net) is included, since it is designed to increase classification accuracy [19] and we want to keep the spatial consistency between input and output. After encoding the deformed points \(p'\), the most critical point among the N points is selected for every latent dimension. The obtained latent feature vector is then passed to the decoder, which consists of three fully convolutional layers, to recover the \(N\times 3\) points \(\hat{p}\); a sketch follows Eq. (1). The reconstruction loss is defined as the chamfer distance (CD) between the output \(\hat{p}\) and the ground truth (GT) intact points p:
$$\begin{aligned} \mathcal {R} = CD(\hat{p}, p)=\frac{1}{|\hat{p}|}\sum _{x\in \hat{p}}\min _{y\in p}||x-y||_2^2 + \frac{1}{|p|}\sum _{y\in p}\min _{x\in \hat{p}}||x-y||_2^2. \end{aligned}$$
(1)
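As a rough sketch of the architecture described above (layer widths, the latent size, and the use of plain linear layers in the decoder are assumptions; the text fixes only the overall structure), a minimal PyTorch version could look like:

```python
import torch
import torch.nn as nn

class SurfaceNet(nn.Module):
    """PointNet-style encoder (no T-Net) plus a three-layer decoder (widths assumed)."""
    def __init__(self, n_points=2500, latent=1024):
        super().__init__()
        self.n_points = n_points
        self.encoder = nn.Sequential(            # shared per-point MLPs as 1x1 convolutions
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, latent, 1),
        )
        self.decoder = nn.Sequential(            # three layers recovering the N x 3 points
            nn.Linear(latent, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_points * 3),
        )

    def forward(self, p_deformed):               # p_deformed: (B, N, 3)
        feat = self.encoder(p_deformed.transpose(1, 2))   # (B, latent, N)
        code = feat.max(dim=2).values            # max pool: critical point per latent dim
        return self.decoder(code).view(-1, self.n_points, 3)

p_hat = SurfaceNet()(torch.randn(2, 2500, 3))    # reconstructed points, (2, 2500, 3)
```

Because no T-Net re-orients the input, the encoder-decoder pair maps input coordinates to output coordinates directly, which is the spatial consistency the design above calls for.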
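Eq. (1) maps directly onto a batched implementation, for instance with torch.cdist (a sketch; the batch handling is an assumption):

```python
import torch

def chamfer_distance(p_hat, p):
    """Symmetric chamfer distance of Eq. (1); p_hat: (B, N, 3), p: (B, M, 3)."""
    d = torch.cdist(p_hat, p) ** 2                            # (B, N, M) squared pairwise distances
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)
```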
To ensure spatial consistency, the reconstructed \(\hat{p}\) and the GT p are fed to a pretrained PointNetLK to obtain a relative transformation T. We define a spatial regularisation loss \(\mathcal {S}\) as the mean squared error (MSE) between T and the identity matrix. The overall cost function is defined as (where \(\lambda\) is a weight factor):
$$\begin{aligned} \mathcal {C} = \mathcal {R} + \lambda ~\mathcal {S}. \end{aligned}$$
(2)
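Combining Eqs. (1) and (2), the overall cost could be computed as below, reusing chamfer_distance from the sketch above. Here pointnetlk stands for the pretrained registration network; its interface (returning a \(4\times 4\) homogeneous transform per batch item) is an assumption, and \(\lambda =0.1\) is an illustrative value only.

```python
import torch
import torch.nn.functional as F

def total_cost(p_hat, p, pointnetlk, lam=0.1):
    """Overall cost C = R + lambda * S of Eq. (2); lam=0.1 is a placeholder value."""
    recon = chamfer_distance(p_hat, p).mean()          # R: chamfer distance of Eq. (1)
    # The registration net's weights are frozen (pretrained), but gradients must
    # still flow through it to p_hat, so no torch.no_grad() here.
    T = pointnetlk(p_hat, p)                           # assumed to return (B, 4, 4)
    eye = torch.eye(4, device=T.device).expand_as(T)
    spatial = F.mse_loss(T, eye)                       # S: deviation of T from identity
    return recon + lam * spatial
```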