Sparse Autoencoder-based Multi-head Deep Neural Networks for Machinery Fault Diagnostics with Detection of Novelties

Supervised fault diagnosis typically assumes that all types of machinery failure are known. In practice, however, unknown types of defect, i.e., novelties, may occur, and detecting them is a challenging task. In this paper, a fault diagnostic method is developed for both diagnostics and detection of novelties. To this end, a sparse autoencoder-based multi-head Deep Neural Network (DNN) is presented to jointly learn a shared encoding representation for both unsupervised reconstruction and supervised classification of the monitoring data. The detection of novelties is based on the reconstruction error. Moreover, the computational burden is reduced by directly training the multi-head DNN with the rectified linear unit activation function, instead of performing the pre-training and fine-tuning phases required for classical DNNs. The proposed method is applied to a benchmark bearing case study and to experimental data acquired from a delta 3D printer. The results show that its performance is satisfactory in both detection of novelties and fault diagnosis, outperforming other state-of-the-art methods. The proposed method can thus not only diagnose known types of defects, but also detect unknown ones.


Introduction
With the objective of increasing the availability and reducing the operation and maintenance cost of mechanical systems, Prognostics and Health Management (PHM) approaches have been receiving more and more attention [1][2][3]. Fault diagnostics is one of the fundamental tasks of PHM, which aims at detecting and diagnosing machinery failure using model-based or data-driven approaches [4]. In the era of Industry 4.0, since mechanical systems are getting more and more complex, it is very difficult and expensive to develop the physics-based degradation models required for model-based approaches. In contrast, the increased availability of data collected from multiple monitoring sensors and the growing ability of artificial intelligence algorithms to process data have brought great potential for the development of advanced data-driven approaches [5]. Data-driven approaches are typically based on the development of an empirical classification model trained on monitoring data. Chine et al. [6] proposed a fault diagnostics approach for photovoltaic systems based on Artificial Neural Networks (ANNs). Malik et al. [7] used Empirical Mode Decomposition (EMD) for feature extraction, and an ANN was trained using the extracted features for gearbox fault diagnostics. He et al. [8] extracted statistical features from monitoring signals and developed Support Vector Machines (SVMs) for fault diagnostics of a 3D printer. Li et al. [9] utilized wavelet packet decomposition and SVM for the diagnostics of machinery faults of high-voltage circuit breakers. Liu et al. [10] proposed a bearing diagnostics approach which combined EMD and an auto-regressive model to extract features from vibration signals and used random forests to build an effective classification model. Hu et al. [11] developed a wind turbine bearing fault diagnosis method based on multimasking EMD and fuzzy c-means clustering.
Yang et al. Chin. J. Mech. Eng. (2021) 34:54
Other approaches, such as k-nearest neighbor [12], naïve Bayes [13], linear discriminant analysis [14], fuzzy Petri nets [15] and extreme learning machines [16], have also been successfully developed for fault diagnostics.
Although these intelligent fault diagnostics approaches have shown great progress, there are still some limitations. On the one hand, they rely on the identification of handcrafted features requiring expert knowledge or computationally demanding feature selection methods. On the other hand, the shallow learning architecture of these approaches leads to poor performance for complex classification problems. To address these difficulties, Deep Learning (DL) methods have recently been extensively applied for machinery fault diagnostics [17][18][19][20]. DL methods are expected to automatically provide high-level representations by using neural networks with multiple layers of non-linear transformations, without requiring human-designed and labor-intensive analyses of the data [21,22]. Jia et al. [23] fed the frequency spectra of vibration signals into stacked AutoEncoders (AEs)-based Deep Neural Networks (DNNs) for rotating machinery diagnostics. Chen et al. [24] proposed Sparse AutoEncoder (SAE) and deep belief network for fault diagnosis of bearings. Lu et al. [25] employed denoising AEs for fault diagnostics of rotating machinery components. Shao et al. [26] presented a deep belief network by stacking multiple restricted Boltzmann machines for fault diagnostics of induction motors. Wang et al. [27] employed Convolutional Neural Networks (CNNs) for fault diagnostics of motors, where the input is the time-frequency map of the vibration signal. Jiang et al. [28] presented a method incorporating multiscale learning into the traditional CNN architecture for defect identification of wind turbine gearboxes. Chang et al. [29] proposed a concurrent CNN composed of parallel convolution layers with multi-scale kernels for fault diagnosis of wind turbine bearings. Yuan et al. [30] investigated the development of different Recurrent Neural Networks (RNNs), including vanilla RNN, Long Short-Term Memory (LSTM) and gated recurrent unit, for fault diagnostics and prognostics of aero engines. Chen et al. [31] utilized multi-scale CNNs to extract features that are then fed to an LSTM for bearing fault diagnostics. Echo State Networks (ESNs), another type of RNN characterized by high training efficiency, were successfully developed for fault diagnostics in Refs. [32] and [33]. Zhang et al. [34] proposed a novel approach named deep hybrid state network, integrating sparse AE and double-structure ESN, for fault diagnosis of 3-D printers.
Even though these studies have outperformed other state-of-the-art fault diagnostics methods, the occurrence of novel conditions, which is common in real applications since the available monitoring data typically cannot cover all the possible types of defects and operating conditions, is seldom considered.
In this context, the objective of the present work is to develop a fault diagnostics method with the following characteristics: (i) it does not require the application of feature selection and extraction techniques; (ii) it can detect novel conditions that have not been recorded in the available dataset; and (iii) it can accurately diagnose known normal and faulty conditions.
We consider a SAE-based method as a possible solution for this objective. An AE is a neural network with a symmetrical architecture, composed of an "encoder" and a "decoder" network. The "encoder" network transforms the large-dimensional input data into a small set of features and the "decoder" network reconstructs the data from the extracted features [35]. Many natural signals show the sparsity property, i.e., their useful patterns occur sparsely, and the sparse modeling of signals has proven to be effective for extracting the inherent low-dimensional features [36]. SAE, a variant of AE, employs a sparsity penalty to encourage the extraction of discriminative features, which prevents the AE from simply copying the inputs and makes the features more representative for classification [37,38]. Since a SAE tends to reconstruct poorly data that differ from those used for its training, the reconstruction error of the input data is expected to be an indicator for novelty detection [39]. A SAE can be transformed into a SAE-based DNN for diagnostics [40] by: (i) pre-training multiple 1-hidden-layer SAEs and stacking them to build a multi-layer stacked SAE, and (ii) taking only the multi-layer encoder network of the stacked SAE and adding a classification layer on it to build a DNN, then fine-tuning the DNN using input-output data. The "pre-training" and "fine-tuning" phases are used mainly because directly training the DNN suffers from gradient vanishing/exploding problems caused by the commonly adopted tanh or sigmoid nonlinear activation functions [35].
Various studies have demonstrated the success of DNNs for machinery fault diagnostics [23, 41-46]. However, they did not consider the detection of previously unseen conditions, and most of them construct the DNN through pre-training and fine-tuning, which requires considerable computational effort. In Ref. [39], Principi et al. proposed a novelty detection approach based on the reconstruction error of a stacked AE. However, the stacked AE is trained using normal data only, instead of data including normal and multiple known faulty conditions, which is the common situation in fault diagnostics problems.
To further explore the capability of AE and AE-based DNNs for both novelty detection and fault diagnostics, we propose a SAE-based multi-head DNN for addressing the two problems. The multi-head DNN uses an encoder network to jointly learn a shared encoding representation, based on which a decoder network and a classification module are employed for unsupervised reconstruction and supervised classification, respectively. The novelty detection is realized by comparing the reconstruction error with a pre-defined threshold. In addition, the Rectified Linear Unit (ReLU) activation function [35], which is widely used for the training of CNNs to relieve the gradient vanishing/exploding problems, is adopted for the multi-head DNN to make it possible to directly train the DNN instead of following the conventional way of pre-training and fine-tuning.
The proposed method is validated by using two case studies about machinery fault diagnostics. The first case is from a benchmark considering bearings with different types of defects operating under different loads. The second one is a real experiment in which the monitoring data of a delta 3D printer with different types of defects are collected. The performance of the proposed method is compared to that of other commonly used novelty detection and fault diagnostics methods.
The remainder of this paper is organized as follows. Section 2 presents the problem statement. The proposed fault diagnostics method is illustrated in Section 3. Section 4 shows the applications of the proposed method to a benchmark bearing diagnostics case study and to experimental data of a delta 3D printer. Finally, conclusions are drawn in Section 5.

Problem Statement
The objective of this work is to develop a machinery fault diagnostics method that is able to identify unknown faulty conditions and to diagnose the known normal and faulty conditions among $C$ different classes. We assume that measurements of $S$ signals, collected during the operation of the machinery under all the $C$ already known conditions, are available. For ease of notation, we assume that all the signals are collected at the same sampling rate. Let $x_s^k(\tau)$, $\tau = 1, \ldots, T_k$, be the $\tau$-th data point collected in the $s$-th signal, $s \in \{1, \ldots, S\}$, under the $k$-th condition, $k \in \{1, \ldots, C\}$, where $T_k$ is the time at which the last data point under the $k$-th condition is collected.
Each signal $x_s^k$ is segmented into pieces containing $M$ data points using a non-overlapping window, so that $N_k = \lfloor T_k / M \rfloor$ pieces are obtained. Then, all the $S$ signals belonging to the same piece are gathered together as a sample. Finally, all the available $N = \sum_{k=1}^{C} N_k$ samples are lumped together to form a dataset of $N$ input-output pairs $(x_i, y_i)$, $i = 1, \ldots, N$, where the output $y_i \in \{1, \ldots, C\}$ is the class label of the sample.
The proposed method receives as input a test sample collected from the equipment under test and is required to identify whether the equipment operates in any of the already known conditions. If not, a novel condition is detected; otherwise, the class $y^{TEST} \in \{1, \ldots, C\}$ is diagnosed.
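The segmentation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `segment_signals` is hypothetical, and concatenating the $S$ pieces of a window into one flat vector is an assumption about how the signals are "gathered together".

```python
from typing import List, Sequence


def segment_signals(signals: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Cut S equally sampled signals into non-overlapping windows of `window`
    points and concatenate the S pieces of each window into one sample."""
    if window <= 0:
        raise ValueError("window must be positive")
    length = min(len(s) for s in signals)
    n_pieces = length // window  # floor(T / M); trailing points are dropped
    samples: List[List[float]] = []
    for i in range(n_pieces):
        start = i * window
        sample: List[float] = []
        for s in signals:  # assumption: pieces are concatenated channel by channel
            sample.extend(s[start:start + window])
        samples.append(sample)
    return samples


# Toy example: S = 2 signals, T = 7 points, M = 3 gives floor(7 / 3) = 2 samples
toy = [[0, 1, 2, 3, 4, 5, 6], [10, 11, 12, 13, 14, 15, 16]]
samples = segment_signals(toy, window=3)
```

Each returned sample has $S \cdot M$ entries; the label $y_i$ would simply be the index of the condition under which the signals were recorded.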

Sparse Autoencoder
An autoencoder is an unsupervised neural network with a symmetrical structure [35], as shown in Figure 1.
The input $D$-dimensional sample $x$ is transformed into its hidden representation $a = [a_1, a_2, \ldots, a_{D_1}]$ from the input layer to the hidden layer, known as the "encoder":

$$a = \sigma(W_1 x + b_1), \tag{1}$$

where $\sigma$, $W_1$, $b_1$ are the activation function, the weight matrix and the bias vector of the encoder, respectively.
Then, the decoder, i.e., the network from the hidden layer to the output layer, reconstructs the input $x$ to $\hat{x}$ based on the feature vector $a$:

$$\hat{x} = \sigma(W_2 a + b_2), \tag{2}$$

where $W_2$, $b_2$ are the decoder weight matrix and the bias vector, respectively. As a variant of the autoencoder, a sparse autoencoder encourages the extraction of discriminative features by adding a sparsity restriction to the network training [37]. Given the available input samples $x_i$, $i = 1, \ldots, N$, the training objective of the SAE is to minimize the following cost function:

$$J_{SAE} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^2 + \lambda \lVert W \rVert_2^2 + \beta R_{sparse}, \tag{3}$$

where the first term is the reconstruction error, the second term is the L2 regularization term in which $W$ is the SAE weight matrix, $R_{sparse}$ is the sparsity regularization, and $\lambda$ and $\beta$ are coefficients that control the importance of the corresponding terms. It has been found that constraining hidden neurons to be inactive most of the time makes them respond to different patterns lying in the data, i.e., the extracted features $a$ are discriminative [37]. Let $p_j$ be the average activation of the $j$-th hidden neuron of the SAE hidden layer, considering all the input samples:

$$p_j = \frac{1}{N} \sum_{i=1}^{N} a_{i,j}, \tag{4}$$

where $a_{i,j}$ is the $j$-th element of the $i$-th hidden representation $a_i$, $j = 1, \ldots, D_1$. The sparsity regularization in Eq. (3), $R_{sparse}$, is calculated using the Kullback-Leibler (KL) divergence function to measure whether $p_j$ is close to a desired small sparsity proportion $p$:

$$R_{sparse} = \sum_{j=1}^{D_1} KL(p \,\|\, p_j) = \sum_{j=1}^{D_1} \left[ p \log \frac{p}{p_j} + (1 - p) \log \frac{1 - p}{1 - p_j} \right]. \tag{5}$$

The KL function is zero when all $p_j$ are equal to $p$ and increases when they diverge.
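To make the sparsity term concrete, the following sketch computes the average activations and the KL-based penalty for a small batch of hidden representations. It is an illustrative assumption-laden example: the function names are hypothetical, and the clipping of $p_j$ into $(0, 1)$ is added here only for numerical safety (the source does not say how activations outside that range, which ReLU can produce, are handled).

```python
import math
from typing import List, Sequence


def average_activations(hidden: Sequence[Sequence[float]]) -> List[float]:
    """p_j: mean activation of hidden neuron j over all N samples."""
    n = len(hidden)
    width = len(hidden[0])
    return [sum(row[j] for row in hidden) / n for j in range(width)]


def kl_sparsity(hidden: Sequence[Sequence[float]], p: float, eps: float = 1e-8) -> float:
    """Sum over hidden neurons of KL(p || p_j) for a target sparsity proportion p."""
    total = 0.0
    for p_j in average_activations(hidden):
        # Assumption: clip p_j into (0, 1) so that both logarithms are defined.
        p_j = min(max(p_j, eps), 1.0 - eps)
        total += p * math.log(p / p_j) + (1.0 - p) * math.log((1.0 - p) / (1.0 - p_j))
    return total


# Penalty vanishes when every neuron has average activation equal to p ...
on_target = kl_sparsity([[0.05, 0.05], [0.05, 0.05]], p=0.05)
# ... and is positive when the neurons are active far more often than p.
too_active = kl_sparsity([[0.5, 0.5]], p=0.05)
```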

Multi-head Deep Neural Network
We propose a SAE-based multi-head DNN for both novelty detection and diagnostics of known conditions. The multi-head DNN consists of three modules: an encoder, a decoder and a classification module, as shown in Figure 2. The encoder extracts high-level representations from input data using multiple layers of non-linear transformations. The extracted representations are shared by the decoder and the classification module as their input. The decoder aims at reconstructing the input data whereas the classification module predicts the label of the input data. More specifically, the construction of the multi-head deep neural network includes:
(1) Encoder The encoder receives the sample $x$ as the input and extracts a high-level representation of $x$ using $L$ hidden layers, $L \in \mathbb{N}^{+}$. Let $a^{(1)}, \ldots, a^{(L)}$ be the hidden representations extracted from the corresponding hidden layers. Their dimensions $D_1, \ldots, D_L$ are set as follows: (i) $D_1$ is typically set larger than the dimension of the input layer, $D$, to obtain a sparse-overcomplete representation at the first hidden layer, which has been shown to be able to extract independent basis functions for input data [47]; (ii) the dimension $D_l$, $l = 2, 3, \ldots, L$, of each remaining hidden layer should be smaller than $D_{l-1}$ to obtain a compressed representation. The ReLU activation function is employed in the input layer and all the hidden layers. The sparsity regularization defined in Eq. (5) is applied to all the hidden layers to encourage the extraction of discriminative features, and the L2 regularization is employed to constrain the weights of the encoder.
(2) Decoder The structures of the decoder and the encoder, i.e., the number and dimensions of hidden layers, are exactly symmetrical. The decoder aims at recovering the input $x$ using $a^{(L)}$. We use the ReLU activation function in all the hidden layers and the sigmoid activation function in the output layer for reconstruction. The L2 regularization is used to constrain the weights of the decoder.
(3) Classification module The classification module employs a softmax layer with $C$ neurons, representing the different conditions, to solve the $C$-class classification problem. Given the representation $a^{(L)}$, the softmax layer gives a vector $[\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_C]$ as the output. The $k$-th unit of the output, $\hat{y}_k$, $k \in \{1, 2, \ldots, C\}$, is typically regarded as a number proportional to the conditional probability that the machinery is in the $k$-th condition given sample $x$:

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{c=1}^{C} e^{z_c}}, \tag{6}$$

where $0 \le \hat{y}_k \le 1$, $\sum_{k=1}^{C} \hat{y}_k = 1$ and $z_k$ is the $k$-th output unit before applying the softmax activation function:

$$z_k = w_k^{\top} a^{(L)} + b_k, \tag{7}$$

where $w_k$ and $b_k$ are the weights and bias of the $k$-th neuron of the softmax layer.
To prevent overfitting, in the classification module we employ dropout regularization on the hidden layer $a^{(L)}$. Dropout randomly sets to zero a proportion $p_{drop}$ of the hidden neurons during forward and backpropagation [35]. Therefore, the following equation is used for computing $z_k$ considering the dropout, instead of Eq. (7):

$$z_k = w_k^{\top} \left( a^{(L)} \circ r \right) + b_k, \tag{8}$$

where $\circ$ is the element-wise multiplication operator and $r \in \mathbb{R}^{D_L}$ is a "masking" vector of Bernoulli random variables with probability $p_{drop}$ of being 0. Gradients are backpropagated only through the unmasked neurons.
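The classification head with dropout can be illustrated with a small, dependency-free sketch. The names `softmax` and `classify` are hypothetical, and the sketch deliberately omits the rescaling of surviving activations that many dropout implementations apply, since the source does not mention it.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Softmax over pre-activations z, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(a_L: Sequence[float],
             weights: Sequence[Sequence[float]],
             biases: Sequence[float],
             p_drop: float = 0.0,
             rng: Optional[random.Random] = None) -> List[float]:
    """Mask a_L with a Bernoulli vector r, compute z_k = w_k . (a_L * r) + b_k,
    and return the softmax outputs."""
    rng = rng or random.Random()
    # r_d is 0 with probability p_drop and 1 otherwise
    mask = [0.0 if rng.random() < p_drop else 1.0 for _ in a_L]
    masked = [a * r for a, r in zip(a_L, mask)]
    z = [sum(w * a for w, a in zip(w_k, masked)) + b_k
         for w_k, b_k in zip(weights, biases)]
    return softmax(z)


# Toy head: D_L = 2 features, C = 2 classes, identity weights, no dropout
probs = classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], p_drop=0.0)
```

At test time one would call `classify` with `p_drop=0.0`, so that the full representation $a^{(L)}$ reaches the softmax layer.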
To be associated with the neurons of the softmax layer, each label $y_i$ in the training set $(x_i, y_i)$, $i = 1, \ldots, N$, is transformed into a one-hot $C$-dimensional vector $[y_{i,1}, y_{i,2}, \ldots, y_{i,C}]$, where

$$y_{i,k} = \begin{cases} 1, & \text{if } y_i = k, \\ 0, & \text{otherwise}, \end{cases} \qquad k = 1, \ldots, C. \tag{9}$$

The training objective of the multi-head DNN over the training set $(x_i, y_i)$, $i = 1, \ldots, N$, is to minimize a cost function (Eq. (10)) that combines four terms: the reconstruction error; the L2 regularization term, in which $W$ is the weight matrix of the whole multi-head DNN; the sparsity regularization $R_{sparse}^{(j)}$ applied on the $j$-th hidden layer of the encoder; and the cross-entropy loss measuring the performance of the classification module. $\lambda$, $\beta$, $\eta_1$ and $\eta_2$ are coefficients controlling the importance of the corresponding terms.
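For orientation, one plausible way of writing the cost function with the four terms listed in the text is given below. This is a sketch under stated assumptions rather than the authors' exact Eq. (10): in particular, attaching $\eta_1$ to the reconstruction term and $\eta_2$ to the cross-entropy term, the $1/N$ averaging, and the hat notation $\hat{y}_{i,k}$ for the softmax output of sample $i$ are assumptions made here for illustration.

```latex
J = \eta_1 \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^2
  + \lambda \lVert W \rVert_2^2
  + \beta \sum_{j=1}^{L} R_{sparse}^{(j)}
  - \eta_2 \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C} y_{i,k} \log \hat{y}_{i,k}
```

Under this reading, a large $\eta_1$ would compensate for reconstruction errors that are numerically much smaller than the cross-entropy loss, which is in line with the later remark that the coefficients are set from the magnitude ratio of the terms.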

Overview of the Proposed Method
The flow chart of the proposed fault diagnostics method using the multi-head DNN is shown in Figure 3. The original monitoring signals are segmented into data samples (Section 2). Then a multi-head DNN is built and trained using data samples collected from the $C$ already known classes associated with normal and faulty conditions. Given a test data sample, the multi-head DNN first identifies whether it belongs to the $C$ already known classes. If not, an unknown condition is reported; otherwise, the multi-head DNN diagnoses the label of the test sample among the $C$ already known classes.
With respect to the novelty detection, a test sample $x^{TEST}$ is detected as unknown when the magnitude of its reconstruction error is larger than a certain threshold $\delta$:

$$\lVert x^{TEST} - \hat{x}^{TEST} \rVert^2 > \delta. \tag{11}$$

In this work, we set the threshold as:

$$\delta = Q_3 + 1.5\,(Q_3 - Q_1), \tag{12}$$

where $Q_1$ and $Q_3$ are the 25th and 75th percentiles of $\lVert x_i - \hat{x}_i \rVert^2$, $i = 1, \ldots, N$, i.e., the reconstruction errors over the training set. Notice that $\delta$ is the upper whisker in the box-plot method [48].
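The box-plot threshold is simple enough to sketch directly. The helper names below are hypothetical, and the percentile convention (linear interpolation between order statistics) is an assumption, since the source does not state which convention was used.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0..100) with linear interpolation between order statistics.
    Assumption: the source does not specify the interpolation convention."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def novelty_threshold(train_errors: Sequence[float]) -> float:
    """Upper box-plot whisker of the training reconstruction errors: Q3 + 1.5 * IQR."""
    q1 = percentile(train_errors, 25)
    q3 = percentile(train_errors, 75)
    return q3 + 1.5 * (q3 - q1)


def is_novel(error: float, delta: float) -> bool:
    """A test sample is flagged as unknown when its reconstruction error exceeds delta."""
    return error > delta


# Toy training errors: Q1 = 2, Q3 = 4, so delta = 4 + 1.5 * 2 = 7
delta = novelty_threshold([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because the threshold depends only on training-set errors, no sample from an unknown class is needed to set it.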

Experimental Evaluations
The proposed method has been verified on data collected from a bearing fault benchmark, before its application to the experimental data acquired on a delta 3D printer. All computations have been performed on an Intel Core i5-5200 CPU at 2.2 GHz with 4 GB RAM in a Python 3.6 environment.

Evaluation of the Proposed Method Using Benchmark Data
We consider the benchmark bearing diagnostics dataset provided by the Case Western Reserve University, which contains vibration data collected from an experimental rig with defective bearings operating under four different loads [49]. During the experiment, besides the normal condition, three different kinds of fault, i.e., inner race fault, outer race fault (at 6 o'clock) and ball fault, were introduced to the drive-end bearing of the motor, each with fault diameters of 0.18 mm, 0.36 mm and 0.54 mm. Table 1 shows the detailed description of the dataset. The collected vibration signal of each condition is segmented into samples using a non-overlapping fixed-length time window containing 1024 data points. We apply the Fast Fourier Transform (FFT) to each sample to get the 1024 Fourier amplitude coefficients. Since the coefficients are symmetric, only the first 512 coefficients are used for each sample. The last column of Table 1 lists the number of samples obtained for each bearing condition.
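The feature extraction step (amplitude spectrum, first half of the coefficients) can be sketched as follows. For the sake of a dependency-free example a naive $O(n^2)$ discrete Fourier transform is used instead of an FFT; it yields the same coefficients, only more slowly. The function name is hypothetical, and no amplitude scaling is applied because the source does not describe one.

```python
import cmath
import math
from typing import List, Sequence


def amplitude_spectrum(sample: Sequence[float]) -> List[float]:
    """Return the first len(sample) // 2 DFT amplitude coefficients of a real sample.
    The second half mirrors the first for real-valued input, so it is discarded."""
    n = len(sample)
    half = n // 2
    amplitudes: List[float] = []
    for k in range(half):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(sample))
        amplitudes.append(abs(coeff))
    return amplitudes


# Toy sample: a cosine completing 2 cycles over 8 points concentrates its energy at k = 2
toy_sample = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
toy_features = amplitude_spectrum(toy_sample)
```

With a 1024-point window this produces the 512-dimensional input vector used in the case study; in practice an FFT routine would be the natural replacement for the inner sum.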
We assume that the normal condition, inner race faults and ball faults are already known conditions (classes 1-7), and that the outer race faults (classes 8-10) are unknown conditions. The training set is composed of 70% of the data randomly selected from each of classes 1-7. The remaining data of classes 1-7 and all the data of classes 8-10 are used as the test set. During the training of the multi-head DNN, 5% of the training data is randomly selected as the validation set to prevent overfitting of the model. The developed multi-head DNN is formed by an encoder with an input layer of $D = 512$ neurons and $L = 3$ hidden layers of $D_1 = 600$, $D_2 = 100$, $D_3 = 10$ neurons, a symmetric decoder, and a classification module with a softmax layer of 7 neurons associated with the known classes 1-7 in the training set. The hyperparameters of the multi-head DNN are set as follows: $p = 0.05$, $\lambda = 1 \times 10^{-7}$, $\beta = 1$, $\eta_1 = 80$, $\eta_2 = 1$ and $p_{drop} = 0.3$.
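As a quick sanity check of the architecture just described, the layer shapes can be enumerated programmatically. This sketch rests on two assumptions not spelled out in the source: that "symmetric decoder" means the encoder layer sizes traversed in reverse, and that every layer is fully connected with a bias vector. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int]  # (inputs, outputs) of one fully connected layer


def layer_shapes(input_dim: int, hidden: List[int], n_classes: int) -> Dict[str, List[Shape]]:
    """List the (in, out) shapes of encoder, mirrored decoder and softmax head."""
    enc_dims = [input_dim] + hidden
    dec_dims = enc_dims[::-1]  # assumption: decoder mirrors the encoder sizes
    return {
        "encoder": list(zip(enc_dims[:-1], enc_dims[1:])),
        "decoder": list(zip(dec_dims[:-1], dec_dims[1:])),
        "classifier": [(hidden[-1], n_classes)],
    }


def count_parameters(shapes: Dict[str, List[Shape]]) -> int:
    """Weights plus biases over all layers."""
    return sum(n_in * n_out + n_out
               for layers in shapes.values()
               for n_in, n_out in layers)


bearing_shapes = layer_shapes(512, [600, 100, 10], 7)
bearing_params = count_parameters(bearing_shapes)
```

Both heads start from the same 10-dimensional representation $a^{(3)}$, which is what allows the shallow baselines discussed later to reuse it as input.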
(1) Novelty detection With respect to novelty detection, the objective is to isolate samples of unknown classes from those of known classes in the test set. The box plot of reconstruction errors on the different datasets is shown in Figure 4. The threshold for novelty detection, $\delta$, is set to $5.25 \times 10^{-4}$ in this case study based on the reconstruction errors on the training set, as described in Eq. (12). Notice that the majority of reconstruction errors of known class test samples are below the threshold whereas those of unknown class test samples are above the threshold, i.e., the known and unknown class test samples are well separated.
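In order to evaluate the novelty detection performance of the proposed method quantitatively, we denote test samples of known (classes 1-7) and unknown (classes 8-10) classes as positive and negative, respectively, and compute the True Positive Rate (TPR), the True Negative Rate (TNR) and the $F_1$ score:

$$TPR = \frac{TP}{TP + FN}, \tag{13}$$

$$TNR = \frac{TN}{TN + FP}, \tag{14}$$

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}, \tag{15}$$

where $TP$, $FN$, $TN$ and $FP$ are the numbers of true positives, false negatives, true negatives and false positives, respectively. The TPR and TNR are the proportions of correctly classified positive and negative samples, respectively. The $F_1$ score is widely used as a performance metric for binary classification models; its value ranges in $[0, 1]$, where a larger value indicates better performance.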
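The three metrics follow directly from the confusion counts. The sketch below uses the convention of the text (known class = positive, unknown class = negative); the function name and the toy labels are illustrative only.

```python
from typing import Dict, Sequence


def novelty_metrics(is_known_true: Sequence[bool], is_known_pred: Sequence[bool]) -> Dict[str, float]:
    """TPR, TNR and F1 with 'known class' as the positive label."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(is_known_true, is_known_pred):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and not pred:
            tn += 1
        else:
            fp += 1
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "TNR": tn / (tn + fp) if tn + fp else 0.0,
        "F1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
    }


# Toy example: 3 known samples (one missed) and 2 unknown samples (one missed)
truth = [True, True, True, False, False]
pred = [True, True, False, False, True]
scores = novelty_metrics(truth, pred)
```

Here `pred[i]` would be `not is_novel(error_i, delta)` for the proposed method, i.e., a sample counts as "known" when its reconstruction error stays at or below the threshold.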
The result of novelty detection of the proposed method is shown in Table 2. The proposed method has been compared with two popular one-class learning methods, one-class SVM and isolation forest. One-class SVM aims at constructing a smooth boundary around the majority of the probability mass of the data [50,51]. Since one-class SVM is a shallow model which prefers low-dimensional input data, the $D_3 = 10$ dimensional feature vectors $a_i^{(3)}$, $i = 1, \ldots, N$, extracted by the encoder from the training set are used as its input. The Radial Basis Function (RBF) kernel is used for one-class SVM, and the parameter $\nu$, i.e., the assumed proportion of negative samples in the training set, is set to 0.01, which has been optimized with the objective of maximizing the $F_1$ score by trial-and-error considering as possible options $\{0, 0.01, \ldots, 0.2\}$. The isolation forest employs decision trees for novelty detection: each tree is constructed by randomly splitting features, and anomalous data produce significantly shorter paths in the trees. The isolation forest is also fed with the $D_3 = 10$ dimensional feature vectors $a_i^{(3)}$, $i = 1, \ldots, N$, extracted by the encoder. Its parameter $N_{tree}$, the number of trees, is set to 40, which has been optimized with the objective of maximizing the $F_1$ score by trial-and-error considering as possible options $\{10, 20, \ldots, 100\}$.
As shown in Table 2, the proposed method provides a satisfactory TNR of 98.53%, which means that nearly all the samples of the unknown classes are correctly detected, and a TPR of 88.57%, which means that most of the samples of known classes can be identified. The TPR of one-class SVM is large whereas its TNR is only 4.14%, indicating that most of the samples of unknown classes cannot be detected. The isolation forest gets the smallest TPR, indicating that only about 61.32% of the samples of known classes can be identified. Moreover, the proposed method gets a larger $F_1$ score, which also indicates that its performance is better than that of one-class SVM and isolation forest.
(2) Fault classification The fault classification performance of the developed multi-head DNN is evaluated considering the "accuracy" of classification on the known class samples in the test set, which computes the proportion of correctly classified samples:

$$Accuracy = \frac{TP + TN}{TP + FN + TN + FP}. \tag{16}$$

Table 3 shows the obtained results. The fault classification performance of the proposed method is satisfactory, characterized by a 100% accuracy. The proposed method has been compared with three state-of-the-art fault diagnostics methods based on the use of a SVM, a 1-hidden-layer ANN and a k-nearest neighbor (kNN) model. Since SVMs, ANNs and kNNs are shallow models which prefer low-dimensional input data, the $D_3 = 10$ dimensional feature vectors $a_i^{(3)}$, $i = 1, \ldots, N$, extracted by the encoder are used as their input. The RBF kernel is employed for the SVM and the regularization parameter of the SVM is set to 1, which is optimized by considering as possible options $\{0.1, 0.2, \ldots, 5\}$. The number of hidden neurons of the ANN has been optimized by trial-and-error considering as possible options $\{6, 7, \ldots, 50\}$. An ANN model with layers of (10, 10, 7) neurons and ReLU activation functions has been selected. The number of reference neighbors of kNN is set to 3, which has been optimized by trial-and-error considering as possible options $\{1, 2, \ldots, 5\}$. The accuracies of the SVM, ANN and kNN are 100%, 99.85% and 100%, respectively, which are comparable with that of the proposed method. These results confirm that the proposed method is accurate on the classification problem and that the features extracted by the multi-head DNN are discriminative enough to help shallow models achieve good performance.

(3) Selection of hyperparameters and activation function
The proposed method has six hyperparameters. The coefficients $\lambda$, $\beta$, $\eta_1$, $\eta_2$ are used to balance the values of the terms in the cost function (Eq. (10)) and are set by calculating the magnitude ratio of the terms in the training phase. The dropout proportion $p_{drop}$ is used to prevent overfitting of the classification module and is typically suggested to be around 0.5. We selected $p_{drop} = 0.3$ by experience, since it does not influence the classification accuracy much.
The sparsity proportion $p$ is the most sensitive hyperparameter for SAE-based DNNs. During model training, a validation set formed by 5% of the training data is randomly selected, and $p$ is selected based on the performance over the validation set considering the possible options $\{0.05, 0.1, 0.2, 0.3, 0.4\}$. Table 4 reports the performance of the proposed method on the validation set for the different values of $p$. The value $p = 0.05$, which gives the smallest reconstruction error and the largest classification accuracy, is selected.
In addition, we have investigated the use of a more recent activation function, Leaky ReLU, in the proposed method instead of ReLU. Leaky ReLU is a variant of ReLU which maps negative inputs to small negative values instead of zeros. However, with Leaky ReLU, the proposed method gets an $F_1$ score of 0.74, a TNR of 66.74%, a TPR of 80.40% and a classification accuracy of 99.93%. Compared with the results obtained with ReLU (Tables 2 and 3), the performance of Leaky ReLU is comparable regarding the classification accuracy, but much worse with respect to novelty detection. A possible reason is that the inputs of the multi-head DNN are FFT amplitude coefficients, which are positive. Leaky ReLU keeps negative values during the computation, which weakens the data reconstruction and leads to poorer novelty detection.

Experiment Evaluation for 3D Printer Diagnostics
In this section, the proposed method is applied to diagnose the faults of a delta 3D printer (SLD-BL600-6) [52]. The extruder nozzle of the delta 3D printer was controlled to perform a predefined circular movement with a radius of 75 mm. A multi-channel attitude sensor was mounted on the moving platform to monitor its 3-axial angular acceleration, vibration acceleration and magnetic field intensity (Figure 5).
Faults due to the wear of joint bearings and synchronous belts were considered as faulty conditions. The faults of joint bearings were introduced by loosening the screw of each joint bearing by half a turn, i.e., 0.35 mm. The faults of synchronous belts were injected by slackening each belt by the length of two teeth, i.e., 3 mm. In each fault condition, we consider exclusively the fault of one joint bearing or one synchronous belt. As listed in Table 5, 15 faulty conditions are simulated in total, including faults of 12 joint bearings and 3 synchronous belts. Printing tests were performed under these faulty conditions and we found that the printing quality of the 3D printer was seriously affected. Figure 6 shows examples of the normal mode and faulty mode of the joint bearing and synchronous belt, respectively. The 9-channel monitoring data were collected under the normal and all the faulty conditions at a sampling frequency of 100 Hz. For each condition, an experiment was performed to collect monitoring data for 20 circular movements, each channel of which contains 32400 data points. The data were then divided into 253 samples using a non-overlapping fixed-length time window, each channel of which contains 128 data points. The experiment was repeated 3 times; therefore, 759 samples were collected for each condition. For each sample, the FFT was applied to each of its channels to get the 128 Fourier amplitude coefficients. Considering the symmetry of the coefficients, the first 64 coefficients are used for each channel. Then, the coefficients of all the channels are concatenated to form a 576-dimensional vector, which is used as the representation of the sample.
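The sample counts and the input dimension quoted above follow from simple integer arithmetic, which the short check below reproduces. All numbers are taken from the text; only the variable names are introduced here.

```python
points_per_channel = 32400   # data points per channel in one experiment
window = 128                 # points per non-overlapping window
channels = 9                 # monitored channels
repeats = 3                  # experiment repetitions per condition

samples_per_run = points_per_channel // window        # floor(32400 / 128)
samples_per_condition = samples_per_run * repeats     # samples collected per condition
coeffs_per_channel = window // 2                      # first half of the FFT amplitudes
input_dim = coeffs_per_channel * channels             # concatenated feature vector length

print(samples_per_run, samples_per_condition, coeffs_per_channel, input_dim)
```

The 16 data points left over after the last full window of each run (32400 minus 253 times 128) are discarded by the non-overlapping segmentation.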
We assume normal and joint bearing faults are already known conditions (classes 1-13), and the synchronous belt faults (classes [14][15][16] are unknown conditions. The training set is composed of 70% of the data randomly selected from classes 1-13, respectively. The remaining data of classes 1-13 and all the data of classes 14-16 are used as the test set. During the training of the multihead DNN, 5% of the training data is randomly selected as the validation set to prevent overfitting. The developed multi-head DNN is formed by an encoder with an input layer of D = 576 neurons and L = 3 hidden layers of D 1 = 600 , D 2 = 100 , D 3 = 30 neurons, a symmetric decoder, and a classification module with a softmax layer of 13 neurons associated with the known 1-13 classes in the training set. The hyperparameters of the multi-head DNN are set as follows: p = 0.05 , = 1 × 10 −7 , β = 1 , η 1 = 80 , η 2 = 1 and p drop = 0.3. (1) Novelty detection The box plot of reconstruction errors on different datasets is shown in Figure 7. The threshold for novelty detection, δ , is set to 1.60 × 10 −4 based on the reconstruction errors on the training set, as described in Eq. (12). The known and unknown class test samples are well separated by the threshold, since the majority of reconstruction errors of known class test samples are smaller than the threshold whereas those of unknown class test samples are larger than it. Table 6 shows the result of novelty detection of the proposed method and those of the one-class SVM and isolation forest. The D 3 = 30 dimensional feature vectors a (3) i , i = 1, . . . , N , extracted by the encoder from the training set are used as the input of one-class SVM and isolation forest. A one-class SVM model with the RBF kernel and ν = 0.01 has been selected. The parameter ν has been optimized to maximize the F 1 score by trial-anderror considering as possible options {0, 0.01, . . . , 0.2} ,   respectively. 
The isolation forest parameter $N_{\mathrm{tree}}$ is set to 50, which has been optimized by trial-and-error to maximize the $F_1$ score, considering $\{10, 20, \ldots, 100\}$ as possible options. Both the TNR and the TPR of the proposed method are larger than 90%, which means that most of the samples of both the unknown and the known classes are correctly identified. The TPR of the one-class SVM is 98.68% whereas its TNR is 32.67%, indicating that more than half of the samples of the unknown classes cannot be detected. The isolation forest suffers from a similar problem: only 58.04% of the samples of the unknown classes are detected. Moreover, the $F_1$ score of the proposed method is larger than that of the one-class SVM, indicating that its performance is better.
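The TPR, TNR and $F_1$ score discussed above can be computed from reconstruction errors and a threshold as in the following sketch. It assumes that the known classes form the positive class, which matches how TPR and TNR are used in the text, and that a sample is accepted as known when its error is below $\delta$. The function name `novelty_metrics` and the error values are made up for illustration and are not taken from the experiments.

```python
import numpy as np


def novelty_metrics(errors, is_known, delta):
    """TPR, TNR and F1 for threshold-based novelty detection.

    Assumed convention: 'known' is the positive class, and a sample is
    predicted as known when its reconstruction error is below delta.
    """
    errors = np.asarray(errors, dtype=float)
    is_known = np.asarray(is_known, dtype=bool)
    pred_known = errors < delta

    tp = np.sum(pred_known & is_known)        # known samples accepted
    fn = np.sum(~pred_known & is_known)       # known samples flagged as novel
    tn = np.sum(~pred_known & ~is_known)      # unknown samples flagged as novel
    fp = np.sum(pred_known & ~is_known)       # unknown samples missed

    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(tpr), float(tnr), float(f1)


# Made-up reconstruction errors, for illustration only.
errors = [0.5e-4, 0.9e-4, 1.2e-4, 1.9e-4, 2.5e-4, 3.0e-4, 1.1e-4, 4.0e-4]
is_known = [True, True, True, True, False, False, False, False]
tpr, tnr, f1 = novelty_metrics(errors, is_known, delta=1.60e-4)
print(tpr, tnr, f1)  # 0.75 0.75 0.75
```

Under this convention, a high TPR combined with a low TNR, as reported for the one-class SVM, corresponds to a detector that accepts almost every sample as known and therefore misses most novelties.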
(2) Fault classification
Table 7 shows the obtained results. The fault classification accuracy of the proposed method is 97.56%. The proposed method has been compared with fault diagnostic methods based on an SVM, a 1-hidden-layer ANN and a kNN model. As in Section 4.1, the $D_3 = 30$ dimensional feature vectors $a_i^{(3)}$, $i = 1, \ldots, N$, extracted by the encoder are used as the input of the SVM, ANN and kNN. The RBF kernel is employed for the SVM, and its regularization parameter is set to 1, selected from the possible options $\{0.1, 0.2, \ldots, 5\}$. The number of hidden neurons of the ANN has been optimized by trial-and-error considering $\{6, 7, \ldots, 80\}$ as possible options, and an ANN model with layers of (30, 30, 13) neurons and ReLU activation functions has been selected. The number of reference neighbors of the kNN is set to 3, which has been optimized by trial-and-error considering $\{1, 2, \ldots, 5\}$ as possible options. The accuracy of the SVM is 97.56%, the same as that of the proposed method. The accuracies of the ANN and kNN are 96.98% and 96.92%, respectively, which are slightly lower. These results confirm that the classification performance of the proposed method is not lower than that of the other commonly used diagnostic methods.
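The trial-and-error selection of the number of neighbors can be illustrated with the sketch below. It is not the authors' procedure: the text does not state on which data the candidate values were compared, so a 70/30 hold-out split is assumed here, and synthetic 30-dimensional Gaussian clusters stand in for the encoder features. The helper `knn_predict` is a hypothetical plain-NumPy kNN.

```python
import numpy as np


def knn_predict(train_x, train_y, query_x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_train).
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    n_classes = int(train_y.max()) + 1
    # Count the votes per class for each query and pick the most frequent label.
    return np.array(
        [np.bincount(train_y[idx], minlength=n_classes).argmax() for idx in nearest]
    )


# Synthetic stand-in for the 30-dimensional encoder features of 13 known classes.
rng = np.random.default_rng(0)
n_classes, dim, per_class = 13, 30, 40
centers = rng.normal(scale=2.0, size=(n_classes, dim))
y = np.repeat(np.arange(n_classes), per_class)
x = centers[y] + rng.standard_normal((y.size, dim))

# Assumed hold-out split used to compare the candidate values of k.
order = rng.permutation(y.size)
split = int(0.7 * y.size)
train, val = order[:split], order[split:]

scores = {}
for k in range(1, 6):                            # candidate options {1, 2, ..., 5}
    pred = knn_predict(x[train], y[train], x[val], k)
    scores[k] = float(np.mean(pred == y[val]))

best_k = max(scores, key=scores.get)             # first k with the highest accuracy
print(best_k, scores[best_k])
```

The same loop structure applies to the other candidate grids mentioned above, with the classifier and the hyperparameter swapped accordingly.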

Conclusions
This paper addresses the problem of fault diagnostics with novelty detection capability by means of SAE-based multi-head DNNs. The proposed method jointly performs two tasks with a single model: i) data reconstruction for novelty detection and ii) classification for diagnostics, where the features shared by the two tasks are automatically extracted from high-dimensional data. Furthermore, the use of the ReLU activation function reduces the computational burden by allowing direct training of the DNN, instead of the conventional training procedure of pre-training followed by fine-tuning.
The fault diagnostics method has been verified on two case studies. The results obtained in the case studies show that: i) the proposed method can be applied for novelty detection in machinery for which multiple conditions are already known, and performs significantly better than the one-class SVM and the isolation forest; ii) the diagnostic accuracy of the proposed method is satisfactory and not lower than that of the SVM-based and ANN-based fault diagnostic methods.
The hyperparameters of the multi-head DNN are currently set by trial-and-error and experience. In future studies, we will focus on designing a systematic strategy for hyperparameter setting to further facilitate the application of the proposed method. Furthermore, the proposed method detects novelties based on the reconstruction error, which is effective but indirect. A more direct way is to extract highly representative features of the available data, which should be compact within a clear boundary but still separable between the different known types of defects, so that the boundary itself can be employed for the detection of novelties. Therefore, the design of an advanced cost function for representative feature extraction will be considered in our future work.