Automatic Bone Surface Restoration for Markerless Computer-Assisted Orthopaedic Surgery

An automatic markerless knee tracking and registration algorithm has been proposed in the literature to avoid the marker insertion required by conventional computer-assisted knee surgery, resulting in a shorter and less invasive surgical workflow. However, that algorithm considers intact femur geometry only, whereas modification of the bone surface is inevitable during intra-operative intervention. The resulting mismatched correspondences degrade the reliability of the registered target pose. To solve this problem, this work proposes a supervised deep neural network that automatically restores the surface of the processed bone. The network was trained on a synthetic dataset that combines real depth captures of a model leg with simulated realistic femur cutting. According to the evaluation on both synthetic data and real-time captures, surface reconstruction effectively improves the registration quality. The improvement in tracking accuracy, however, is evident only on the test data, indicating the need for future enhancement of the dataset and network.


Introduction
Osteochondral defect (OCD) is a common joint disease that causes great pain and discomfort in patients. 62% of the patients requiring arthroscopic intervention for knee pain report an OCD problem [1]. For severe defects that cannot be treated conservatively, synthetic implant replacement is an effective option. The potential of computer assistance in improving surgical outcome has been well recognised. Following the pre-operative plan displayed by the navigation system, the implant can be placed in high congruency with the surrounding anatomy [2]. It is vital to dynamically localise the target knee in a spatial coordinate system, so that the computer-generated model initially registered onto the patient can be updated accordingly for the surgeons' reference. Conventionally, optical markers are rigidly inserted into the target bones so that the target movement can be inferred from marker tracking. Despite the high tracking precision, marker preparation and insertion lead to extra human-induced errors [3], a longer surgical workflow [4,5] and, most importantly, higher invasiveness that may cause infection, nerve injury, and bone fracture [6,7].
Recent research effort has been devoted to automatic femur tracking for markerless knee surgery navigation [8][9][10]. These methods use a pre-trained convolutional neural network to segment the femur points from the RGBD captures of a commercial depth camera, and then register the segmented points to a 3D reference model created from pre-operative scanning to obtain the real-time spatial pose. However, these works only consider the intact target geometry: after inevitable surgical interventions such as bone drilling for lesion removal, the segmented surface points deviate from the pre-scanned intact surface. The increased spatial mismatch degrades the registration quality (i.e., fitness and inlier matching error) and accuracy (i.e., the registered spatial pose compared to the ground-truth pose), making the overall markerless tracking less reliable.
To compensate for the spatial inconsistency caused by femur processing, we propose a supervised surface restoration network based on the PointNet backbone. The network aims to improve the geometric consistency between the segmented surface and the intact model, while retaining the spatial pose of the input point cloud. Since modifying a large number of bone surfaces and capturing the pairwise images with the same target-in-camera pose is highly costly and tedious, we developed a synthetic dataset in which simulated realistic surface deformation is applied to the collected camera captures. The performance of our trained network was evaluated on both the test dataset and a model knee physically sampled by a depth camera in real time.

Femoral Surface Processing in OCD Treatment
With the degradation of articular cartilage caused by overload or cyclical episodes on the knee joint, the subchondral bone may collapse, causing a lesion to form in the smooth joint surface [11]. OCD mainly affects the femoral condyle and/or patellofemoral articulation of the knee [12]. During OCD treatment, the affected area needs to be removed and processed into a regular shape (e.g., an ellipse) to be compatible with an optimal implant chosen from a library built through statistical modelling of a real OCD dataset [2].

Learning-Based Surface Reconstruction
With the fast development of deep learning and the availability of large 3D datasets, many generative models have been proposed to reconstruct geometries from implicitly learned structural features [13]. Most of the proposed networks are based on a vanilla architecture [14]: an encoder first extracts the latent features of the input data. Then, the encoded features are passed to a decoder to generate the desired outputs [15][16][17]. The encoder usually consists of fully convolutional layers, followed by optional pooling, activation and/or fully connected layers. The decoder contains either deconvolutional layers or fully connected layers for upsampling.
3D geometry can be represented as volumetric voxels, surface meshes, point clouds or multi-view images. Among them, point cloud-based learning offers high flexibility and efficiency in computation and memory. PointNet, based on the same encoder-decoder design, is regarded as the most popular backbone for point cloud-based learning [18]. It adopts three distinctive designs: a symmetric function to deal with unordered input, a local and global information aggregation layer, and a joint alignment network. PointNet has inspired many extensions and variants. One example is PointNetLK, a network trained for point cloud registration [19]. By integrating a modified Lucas & Kanade (LK) algorithm into the PointNet framework, the spatial transformation can be estimated within a differentiable learning framework.
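The role of the symmetric function can be illustrated with a minimal NumPy sketch. It is not the PointNet implementation: the layer sizes, the random weights and the helper names `shared_mlp` and `global_feature` are assumptions made only for illustration. The same small MLP is applied to every point, and a max over the point axis yields a global feature that does not depend on the point order.

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same MLP (with ReLU) to every point independently."""
    h = points
    for w, b in zip(weights, biases):
        h = np.maximum(h @ w + b, 0.0)
    return h

def global_feature(points, weights, biases):
    """Per-point features followed by max pooling over the point axis."""
    return shared_mlp(points, weights, biases).max(axis=0)

rng = np.random.default_rng(0)
# Hypothetical layer sizes (3 -> 16 -> 32), chosen only for this example.
weights = [rng.normal(size=(3, 16)), rng.normal(size=(16, 32))]
biases = [rng.normal(size=16), rng.normal(size=32)]

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]  # same points, different order

feat_a = global_feature(cloud, weights, biases)
feat_b = global_feature(shuffled, weights, biases)
print(np.allclose(feat_a, feat_b))
```

Because the maximum is taken independently for each latent dimension, each entry of the global feature is contributed by a single point of the cloud.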

Synthetic Dataset with Simulated Bone Cutting
We collected a dataset D containing depth captures of a cadaver leg with an intact femur surface. A commercial depth camera, the Realsense D415, was used for consistency with the original work [8]. The camera can acquire depth data with less than 2% error at up to 90 frames per second, which is adequate for real-time applications [20]. The dataset includes 9334 depth captures of lab scenes. Each depth image is represented by a 160 × 160 × 4 matrix. The first three channels in the last dimension are the (x, y, z) coordinates of a sampled pixel, and the last channel is a binary label that denotes whether the point belongs to the femur surface.
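As a minimal sketch of how such a frame can be unpacked, the following NumPy snippet separates the coordinate channels from the label channel. The function name `femur_points` and the synthetic frame are assumptions for illustration; no real capture is used.

```python
import numpy as np

def femur_points(frame):
    """Return the (x, y, z) coordinates of pixels labelled as femur."""
    if frame.shape != (160, 160, 4):
        raise ValueError("expected a 160 x 160 x 4 frame")
    xyz = frame[..., :3].reshape(-1, 3)        # first three channels
    is_femur = frame[..., 3].reshape(-1) > 0.5  # binary label channel
    return xyz[is_femur]

# Synthetic stand-in for one depth capture (values are arbitrary).
rng = np.random.default_rng(0)
frame = np.zeros((160, 160, 4), dtype=np.float32)
frame[..., :3] = rng.uniform(0.2, 0.8, size=(160, 160, 3))
frame[60:100, 70:110, 3] = 1.0                  # a 40 x 40 patch labelled femur

points = femur_points(frame)
print(points.shape)
```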
Sawbones were used to generate realistic femur cutting patterns. Figure 1 shows the overall workflow. The original femur surface $f_o$ was first captured and manually segmented from the depth frame. After the femur was cut around the condyle following a conventional procedure for OCD removal, the modified femur surface $f_m$ was captured again. To account for a possible change in spatial pose, $f_o$ was registered to $f_m$, and the depth values $z$ of the points falling in an annotated cutting area $A$ were interpolated. The depth values $z'$ of $f_m$ in the cutting area were then subtracted from $z$ (i.e., $z_A = z - z'$) and the result was normalised. To ensure a smooth connection between the modified and unmodified surface, zeros were padded around the edge of $z_A$. The padded 3D variation was fitted by Clough-Tocher 2D interpolation to obtain a cutting pattern $f$. The same procedure was repeated 20 times to generate enough patterns with different cutting shapes and depth variations.
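The pattern-fitting step can be sketched with SciPy's `CloughTocher2DInterpolator`. This is only an illustration under stated assumptions: the helper `fit_cutting_pattern`, the mapping of the cutting area to the unit square, the border sampling density and the synthetic elliptical cut are not taken from the paper, and the depth difference is normalised by its maximum absolute value.

```python
import numpy as np
from scipy.interpolate import CloughTocher2DInterpolator

def fit_cutting_pattern(xy, z_intact, z_cut, n_border=40, margin=0.1):
    """Fit a normalised depth-variation pattern on the unit square."""
    z_a = z_intact - z_cut                  # z_A = z - z'
    z_a = z_a / np.max(np.abs(z_a))         # normalise (assumes a non-zero cut)
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    # Map the cutting area into the interior of the unit square.
    uv = margin + (1.0 - 2.0 * margin) * (xy - lo) / (hi - lo)
    # Zero padding along the border for a smooth connection to the intact surface.
    t = np.linspace(0.0, 1.0, n_border)
    zeros, ones = np.zeros_like(t), np.ones_like(t)
    border = np.unique(np.concatenate([
        np.stack([t, zeros], axis=1), np.stack([t, ones], axis=1),
        np.stack([zeros, t], axis=1), np.stack([ones, t], axis=1),
    ]), axis=0)
    sites = np.vstack([uv, border])
    values = np.concatenate([z_a, np.zeros(len(border))])
    return CloughTocher2DInterpolator(sites, values, fill_value=0.0)

# Synthetic elliptical cut, up to 4 mm deeper than the intact surface.
rng = np.random.default_rng(0)
xy = rng.uniform([-10.0, -6.0], [10.0, 6.0], size=(2000, 2))
r2 = (xy[:, 0] / 10.0) ** 2 + (xy[:, 1] / 6.0) ** 2
xy, r2 = xy[r2 <= 1.0], r2[r2 <= 1.0]
z_intact = np.full(len(xy), 300.0)
z_cut = z_intact + 4.0 * (1.0 - r2)         # cut surface lies farther from the camera

pattern = fit_cutting_pattern(xy, z_intact, z_cut)
centre = pattern(0.5, 0.5)
corner = pattern(0.0, 0.0)
```

With this sign convention the pattern is negative where material has been removed, and it returns to zero at the border of the unit square.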
The intact femur points p were segmented from D according to their binary labels. K (= 1 to 3 in our trials) rectangular areas of arbitrary size and location were selected on the femur surface, onto which the collected deformation patterns f were mapped and scaled by an arbitrary maximum intrusion depth (0 to 15 mm). The original and deformed point clouds were separately resampled into N × 3 point clouds and concatenated into an N × 6 array. Since the time of the later reconstruction is proportional to the number of points N, we chose N = 2500 as a compromise between speed and surface representation quality.
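The sample generation described above can be sketched as follows. Several details are assumptions for illustration only: the stand-in pattern (a product of sines instead of a collected pattern f), the range of rectangle sizes, the deformation acting along the camera z axis, and the random resampling scheme.

```python
import numpy as np

N_POINTS = 2500  # N chosen in the text

def stand_in_pattern(u, v):
    """Hypothetical pattern on the unit square that vanishes on its border."""
    return np.sin(np.pi * u) * np.sin(np.pi * v)

def deform(points, pattern, rng, max_depth_mm=15.0):
    """Apply K (1 to 3) rectangular cuts that push points along +z."""
    out = points.copy()
    lo, hi = points[:, :2].min(axis=0), points[:, :2].max(axis=0)
    for _ in range(rng.integers(1, 4)):           # K in {1, 2, 3}
        size = rng.uniform(0.2, 0.6, size=2) * (hi - lo)
        corner = lo + rng.uniform(size=2) * (hi - lo - size)
        uv = (points[:, :2] - corner) / size      # rectangle -> unit square
        inside = np.all((uv >= 0.0) & (uv <= 1.0), axis=1)
        depth = rng.uniform(0.0, max_depth_mm)    # maximum intrusion depth
        out[inside, 2] += depth * pattern(uv[inside, 0], uv[inside, 1])
    return out

def resample(points, n, rng):
    """Randomly resample to exactly n points (with replacement if too few)."""
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]

# Flat synthetic "femur" patch in millimetres, 300 mm from the camera.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.linspace(0, 60, 80), np.linspace(0, 40, 60))
intact = np.stack([gx.ravel(), gy.ravel(), np.full(gx.size, 300.0)], axis=1)

deformed = deform(intact, stand_in_pattern, rng)
sample = np.concatenate(
    [resample(intact, N_POINTS, rng), resample(deformed, N_POINTS, rng)], axis=1
)
print(sample.shape)
```

Because the two clouds are resampled separately, the rows of the N × 6 array are not point-to-point correspondences, which is compatible with a set-based loss such as the chamfer distance.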

Network Architecture
The surface reconstruction network is shown in Figure 2. The encoder borrows the PointNet design of sequential multilayer perceptrons (MLPs) followed by a max pooling layer, which makes the loss expression independent of the order of the input points. Note that, for our purposes, no transformation (based on T-Net) is applied to the input or the features: it is designed to increase classification accuracy [19], whereas we want to keep the spatial consistency between input and output. After the deformed points $p'$ are encoded, the most critical point among the N points is selected for every latent dimension. The resulting latent feature vector is then passed to the decoder, which consists of three fully convolutional layers, to recover the N × 3 points $\hat{p}$. The reconstruction loss is defined as the chamfer distance (CD) between the output $\hat{p}$ and the ground truth (GT) intact points $p$. To ensure spatial consistency, the reconstructed $\hat{p}$ and the GT $p$ are fed to a pretrained PointNetLK to obtain a relative transformation $T$. We define a spatial regularisation loss $S$ as the mean-squared error (MSE) between $T$ and an identity matrix. The overall cost function is defined as (where $\lambda$ is a weight factor):
$L = \mathrm{CD}(\hat{p}, p) + \lambda S$ (2)
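The loss terms can be sketched in NumPy under stated assumptions. The chamfer distance below uses one common symmetric form, the mean squared nearest-neighbour distance in both directions; the exact variant used for training is not specified here. The transformation is taken to be a 4 × 4 homogeneous matrix, and the function names are hypothetical. In training, the transformation would come from the pretrained PointNetLK, which this sketch replaces with a hand-made matrix.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (n x 3) and b (m x 3)."""
    # Brute-force pairwise squared distances; memory grows as n * m.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def spatial_regulariser(transform):
    """MSE between a 4 x 4 relative transformation and the identity."""
    return np.mean((transform - np.eye(4)) ** 2)

def total_loss(pred, target, transform, weight=0.001):
    """Reconstruction loss plus the weighted spatial regularisation loss."""
    return chamfer_distance(pred, target) + weight * spatial_regulariser(transform)

rng = np.random.default_rng(0)
target = rng.uniform(size=(256, 3))
pred = target + np.array([0.0, 0.0, 0.01])   # reconstruction shifted along z

T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 0.01]                  # stand-in for the estimated transform

cd = chamfer_distance(pred, target)
loss = total_loss(pred, target, T)
```

The regulariser is zero only when the reconstructed cloud needs no rigid correction to align with the intact points, so it penalises reconstructions that drift away from the input pose.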

Network Training and Evaluation
The networks were trained within the PyTorch framework. The full synthetic dataset M was randomly divided into training, validation and testing groups with a ratio of 6:2:2. The Adam optimiser was used with a descending learning rate starting from 0.0005. The reconstruction network was first trained for 300 epochs to minimise the reconstruction loss only. Then, the network was further trained with the regularisation provided by a PointNetLK registration network well-trained on the ModelNet40 dataset. The trained registration network achieved performance comparable to traditional iterative closest point (ICP) on our dataset (for registration between $\hat{p}$ and $p$, the mean square difference between the network output and the ICP output is 0.0008 on average). The weight factor of the regularisation loss was set to 0.001. The training took 17 hours on an Nvidia Tesla P100-PCIe graphics processing unit (GPU) until the validation error stopped decreasing.
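A random 6:2:2 split can be sketched as below. The helper `split_indices`, the seed and the sample count of 1000 are assumptions for illustration; the size of M is not stated here.

```python
import numpy as np

def split_indices(n_samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into train/validation/test groups."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(ratios[0] * n_samples))
    n_val = int(round(ratios[1] * n_samples))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

# Hypothetical dataset size, used only to demonstrate the split.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```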

Registration Test on Dataset
The performance of the proposed network was first evaluated on the test dataset. The points of the reference geometry were scanned by a highly precise commercial scanner (HDI 3D scanner, LMI Technologies Inc.). After RANSAC global alignment [21], the femur pose $T_{def}$ was computed by standard ICP registration (Open3D library [22]) between the reference points and the deformed points, with or without reconstruction. The correspondence distance threshold was set to 5 mm. We evaluated the registration reliability in terms of fitness (i.e., the number of inlier correspondences divided by the total number of target points) and the root-mean-squared error (RMSE) between the registered inlier correspondences. The spatial tracking error induced by geometric changes was defined as the difference between the registered poses $T_{def}$ and the GT poses $T_{undef}$ obtained between the undeformed frames and the reference. As shown in Figure 3, reconstruction improves the registration fitness from 76.29±6.50% to 87.55±5.06% and reduces the RMSE from 2.26±0.17 mm to 1.96±0.11 mm. Both the 3D positional and rotational accuracy are significantly improved (p-values < 0.05 by a Kruskal-Wallis one-way analysis of variance test).
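The two quality measures can be illustrated with a brute-force NumPy sketch. This is not the Open3D implementation: the function name is hypothetical, the captured points are assumed to be already aligned to the reference, and the fitness is normalised by the number of captured points, which is one reading of the definition above.

```python
import numpy as np

def registration_quality(points, reference, threshold=5.0):
    """Return (fitness, inlier RMSE) for aligned points against a reference."""
    d2 = np.sum((points[:, None, :] - reference[None, :, :]) ** 2, axis=-1)
    nearest = np.sqrt(d2.min(axis=1))       # distance to the closest reference point
    inlier = nearest <= threshold           # correspondences within the threshold
    fitness = inlier.mean()
    rmse = np.sqrt(np.mean(nearest[inlier] ** 2)) if inlier.any() else 0.0
    return fitness, rmse

# Reference: flat 20 x 20 grid with 1 mm spacing.
gx, gy = np.meshgrid(np.arange(20.0), np.arange(20.0))
reference = np.stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)], axis=1)

# Capture: same grid lifted by 1 mm, with a 5 x 5 "hole" pushed 8 mm away.
points = reference.copy()
points[:, 2] = 1.0
hole = (gx.ravel() < 5) & (gy.ravel() < 5)
points[hole, 2] = 8.0

fitness, rmse = registration_quality(points, reference)
print(fitness, rmse)
```

In this toy case the 25 hole points fall outside the 5 mm threshold, which lowers the fitness without affecting the inlier RMSE. This is the kind of mismatch that surface restoration is meant to remove.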
We additionally ran a spatial consistency check by registering the reconstructed point clouds from multiple single-view captures to the same pre-scanned reference. As shown in Figure 4, the merged frames are consistent with the model, indicating that the 3D geometric features of the input were learned by the network and used for the surface restoration.

Test on Real-Time Captures
The combined markerless segmentation, reconstruction and registration workflow was then tested on a model knee (i.e., a drilled femur sawbones held by a metal leg). An optical marker was pinned into the leg and tracked by an optical tracker (Fusiontrack 500, Atracsys LLC) to provide the GT pose $T_{gt}$ for evaluation. The difference between the markerless femur pose $T$ registered from the restored surface and the $T_{gt}$ obtained by marker-based tracking was defined as the tracking error:
$T_{err} = T \, (T_{gt})^{-1}$ (3)
The pose registered from the raw segmented points without reconstruction was also evaluated as a reference.
The Python inference takes 0.05 s per frame for the real-time reconstruction. As shown in Figure 5, the reconstruction improves the registration fitness from 75.56±6.36% to 81.69±10.97% and reduces the registration RMSE from 2.40±0.10 mm to 2.07±0.30 mm. However, the improvement in 3D spatial pose is not obvious: the Kruskal-Wallis test shows no significant difference.
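The tracking error of Equation (3) can be computed as in the following NumPy sketch. Poses are assumed to be 4 × 4 homogeneous matrices. Summarising the error as a translation norm and a rotation angle is an assumption made for illustration, and the helper names and example poses are hypothetical.

```python
import numpy as np

def tracking_error(T, T_gt):
    """Return T_err = T (T_gt)^-1 with its translation norm and rotation angle."""
    T_err = T @ np.linalg.inv(T_gt)
    translation = np.linalg.norm(T_err[:3, 3])
    # Rotation angle from the trace of the 3 x 3 rotation block.
    cos_angle = np.clip((np.trace(T_err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return T_err, translation, np.degrees(np.arccos(cos_angle))

def rigid_transform(axis, angle_deg, translation):
    """Build a 4 x 4 rigid transform from a rotation about x or z."""
    c, s = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    rotations = {
        "x": [[1, 0, 0], [0, c, -s], [0, s, c]],
        "z": [[c, -s, 0], [s, c, 0], [0, 0, 1]],
    }
    T = np.eye(4)
    T[:3, :3] = rotations[axis]
    T[:3, 3] = translation
    return T

# Hypothetical poses: the markerless pose is off by 2 degrees and 1 mm.
T_gt = rigid_transform("z", 30.0, [100.0, 50.0, 400.0])
offset = rigid_transform("x", 2.0, [1.0, 0.0, 0.0])
T = offset @ T_gt

T_err, trans_err, rot_err = tracking_error(T, T_gt)
print(trans_err, rot_err)
```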

Discussion
While the camera can only provide 2.5D partial views at most, the geometry modification should be consistent across viewpoints to match a common 3D geometry. Learning 3D features is essential for such depth frame-based modification. The network implicitly learns the structural features of the femur from a modified point cloud, and selects the ones relevant to the original frames through supervision.
The performance degradation from the test dataset to the actual capture is noticeable. We suspect that this is due to the domain gap between the synthetic surface and the physically modified surface captured by a depth camera. In practice, the image sampling quality is affected by the geometry variation, capturing angle and working distance. The effect of a shape modification on the actual capture cannot be directly superimposed. For example, a hole in a flat surface will look different from the same hole near an edge in the actual capture. Besides, the training dataset contains only one model geometry, which is slightly different from the shape of the sawbones actually tested. The decoder is prone to overfitting and may not fully generalise to a new geometry.

Conclusions
In this work, we explore the possibility of adopting a deep neural network for automatic femur surface restoration to improve the reliability of markerless tracking after surgical intervention. We train the network on a collected dataset with simulated realistic surface modification. While the registration quality is effectively increased, the improvement in tracking accuracy is evident on the test data but not in the physical setup. In the future, we will further improve the realism of the synthetic deformation, involve more femur geometries (e.g., given by data-driven statistical shape models), and tune the network architecture for better performance.

Figure 1. Workflow of applying a collected realistic cutting pattern to the original dataset D to generate the synthetic modified dataset M

Figure 3. Comparison of registration quality and 3D spatial error obtained with default and reconstructed frames

Figure 4. Comparison between a reconstructed model fused from 20 frames (yellow) and the pre-scanned femur surface (blue)