Model Parameter Transfer for Gear Fault Diagnosis under Varying Working Conditions

Gear fault diagnosis technologies have developed rapidly and have been implemented effectively in many engineering applications. However, varying working conditions degrade diagnostic performance and make gear fault diagnosis (GFD) increasingly challenging. In this paper, a novel model parameter transfer (NMPT) method is proposed to boost the performance of GFD under varying working conditions. Building on the previous transfer strategy that controls the empirical risk of the source domain, the method integrates the strengths of multi-task learning with the idea of transfer learning (TL). It acquires transferable knowledge by minimizing the discrepancies between the separating hyperplanes of one specific working condition (target domain) and another (source domain), and then transfers both commonality and specialty parameters over tasks, so that source domain samples can assist the target GFD task when sufficient labeled samples from the target domain are unavailable. To implement NMPT, insufficient target domain features and abundant source domain features with supervised information are fed into the NMPT model to train a robust classifier for the target GFD task. Experiments indicate that NMPT is expected to be a valuable technique for boosting practical GFD performance under various working conditions. The proposed method provides a transfer learning-based framework for handling the problem of insufficient training samples in the target task caused by variable operating conditions.


Introduction
Gears are used extensively in transmission systems because of their large velocity ratio, strong bearing capacity, compactness and high efficiency [1][2][3][4]. Gear fault diagnosis (GFD) has therefore become one of the most important research topics in both the industrial and academic communities for ensuring the safe and efficient operation of gear transmission systems. With the development of sensing methods (e.g., vibration, rotor speed, acoustic signal and others), data-driven methods [5,6], which analyze measured data without requiring a deep understanding of the mechanical drive system, have become increasingly attractive and have proved valid for gear fault diagnosis. Generally, a data-driven method has two steps: (1) constructing a classification model based on sampled data, and (2) using the well-trained model to predict the mechanical fault type. In many existing studies, the fault diagnostic task is treated as a pattern recognition problem, which is usually composed of two technical processes: (1) feature extraction, and (2) fault recognition. The purpose of feature extraction is to obtain low-dimensional fault descriptors from high-dimensional vibration data. Many advanced signal processing methods have been proposed to provide recognizable features, such as wavelet transform (WT) [7], principal component analysis (PCA) [8], singular value decomposition (SVD) [9] and empirical mode decomposition (EMD) [10]. Then, conventional machine learning methods (e.g., extreme learning machine, support vector machine and neural network) are employed to build a gear fault diagnostic model. However, these conventional methods usually work for GFD under constant speed and load conditions, and thus generalize poorly when facing variable working conditions.
In practice, gears often work under time-varying operating conditions. For example, the running states of gas turbines or wind power generators change frequently during operation, and the operating parameters of the planetary gearbox vary correspondingly, so the features extracted in one time period may differ from those in the next. More importantly, conventional machine learning methods require the training data and test data to be independent and identically distributed (IID), an assumption that such variation undermines. Recently, these problems have attracted intensive attention from researchers. For example, Song et al. [11] developed a new singular value decomposition interpolation (SVDI) based signal processing method, in which the time-domain and frequency-domain characteristic matrices extracted from vibration signals under discrete working conditions were first decomposed into singular vectors, rotation matrices and characteristic means with SVD, and these three parts were then interpolated to reconstruct the target eigenmatrix for data augmentation. Han et al. [12] utilized empirical mode decomposition (EMD) to decompose vibration signals into several intrinsic mode functions (IMFs), and extracted feature vectors consisting of time domain indexes, frequency domain indexes, an energy domain characteristic parameter and the fractal box dimension from the selected IMFs, in order to capture the dynamic features of the vibration signal accurately and improve the robustness of the feature vectors under different loads for GFD. Meanwhile, Zhao et al. [13] designed a method based on the synchrosqueezing transform (SST) and a deep convolutional neural network (DCNN) for gearbox fault classification under varying operating conditions, where a new index, the envelope time-frequency representation (TFR), was calculated by using SST, and the DCNN was then adopted to mine the underlying features of the TFRs and determine the fault type of the planetary gearbox automatically.
In general, most of these methods can achieve good results by exploring advanced feature extraction methods or building a complex network classifier, but they normally rely on a sufficiently large labeled training dataset, and their performance can degrade when data are insufficient. In many real-world applications, however, only a small number of labeled samples are likely to be available for training, which greatly hinders the adoption of these methods.
Therefore, how to train a robust model with high accuracy from limited labeled data is important. Recently, transfer learning (TL), a fast-growing field of machine learning, has been emerging due to its knowledge transfer ability [14]. Fortunately, although the amount of labeled target data (termed the target domain, TD) may be small, plenty of relevant data can still be obtained in industrial machinery from another time period (e.g., under another speed and load) or from adjacent components (termed the source domain, SD). By utilizing TL, useful information can be extracted from an existing or previous task to boost the learning efficiency of the target task. Model parameter transfer (MPT), one of the transfer learning architectures, is an effective tool for transferring shared parameters or prior distributions of hyperparameters. Most of these approaches have been designed for multitask learning (MTL). For example, Lawrence et al. [15] succeeded in learning parameters from multiple tasks through a shared Gaussian process (GP) prior. Bonilla et al. [16] proposed a GP-based model to learn the shared model knowledge over tasks. Schwaighofer et al. [17] succeeded in learning multiple tasks by combining the hierarchical Bayesian (HB) framework with GP. Besides, Evgeniou et al. [18] proposed a new algorithm that draws on the HB idea to solve multitask learning within the support vector machine (SVM) framework. All these methods can be easily modified for TL. Strictly speaking, MTL tries to learn different tasks jointly and simultaneously, while TL aims to improve the performance of the TD task with the help of knowledge extracted and stored from SD data. A comparison between MTL and TL is shown in Figure 1. Intuitively, we may minimize the difference between the parameters of the TD and SD classification hyperplanes to transfer the knowledge obtained from SD, so that a robust GFD model with better performance in TD can be obtained.
According to the above analysis, a novel model parameter transfer (NMPT) approach is developed to assist target gear fault identification using source domain data. It aims at extracting and transferring the shared characteristic parameters of the hyperplane to address the problems of insufficient labeled training samples and non-IID data between the source and target domains. Specifically, on the basis of controlling the empirical risk of the source domain, the proposed method integrates the advantages of conventional MPT and TL: (a) the least squares support vector machine (LSSVM) based MPT can characterize the shared and domain-specific parameters over tasks; and (b) the idea of TL is introduced to extract transferable knowledge and to minimize the distributional discrepancies between the source and target domains. The novelties and main contributions of this paper can be summarized as follows:
• Based on controlling the empirical risk of source domain features in the LSSVM framework, an improved TL model is proposed by further minimizing the discrepancies of the separating hyperplanes between the source and target domains, and then transferring both shared and domain-specific parameters over tasks so that source domain data can assist the target diagnostic task;
• The model parameter transfer idea is introduced to the area of gear fault diagnosis, which provides a new approach to gear fault diagnosis under variable working conditions, especially when sufficient training data from the target domain are not available.
The rest of this paper is organized as follows. In Section 2, the theoretical background is briefly presented. Section 3 concentrates on introducing details of the proposed NMPT method and then gives the whole framework of GFD. Section 4 illustrates the experimental study and proves that NMPT can achieve good results in GFD under variable working conditions. Finally, some conclusions drawn from this paper are listed in Section 5.

Theoretical Background
This study builds the NMPT model for GFD under the LSSVM framework. Therefore, this section briefly reviews the fundamental theory of LSSVM and its extension to MTL.

Least Squares Support Vector Machine (LSSVM)
First, the basic principle of training an SVM-based model for a classification problem is to find the optimal separating hyperplane $f = w \cdot \varphi(x) + b$ in a reproducing kernel Hilbert space (RKHS) [19]. According to the structural risk minimization (SRM) principle, the optimal $w$ and $b$ can be obtained by minimizing the following function:
$$R = \frac{1}{2}\|w\|^2 + C \cdot R_{emp} \tag{1}$$
where $C$ is a positive real regularization parameter, $w$ is the weight vector defining the orientation of the separating hyperplane, $R$ represents the structural risk, and $R_{emp}$ denotes the loss function that controls the error of the separating hyperplane $f$ on the training data. Different choices of $R_{emp}$ lead to different forms of SVM. By utilizing the squared error function, the SRM problem in LSSVM is to compute the optimal separating hyperplane from the feature vector $x$ and its label $y \in \{-1, +1\}$ by minimizing the following function under an equality constraint:
$$\min_{w, b, e} \; \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad y_i\left(w \cdot \varphi(x_i) + b\right) = 1 - e_i, \quad i = 1, \dots, N \tag{2}$$
where $e_i$ is the error term, $\varphi(\cdot)$ denotes a transform function that maps the input features $x$ into the RKHS, $b$ is a bias term, and $N$ is the total number of training samples. A classification hyperplane $f = w \cdot \varphi(x) + b$ is then constructed for this task.
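The LSSVM training step described above reduces to one linear system in the dual variables. The following is a minimal sketch of that standard dual solution, assuming the squared errors are weighted by C/2 and an RBF kernel is used; the function names (`rbf_kernel`, `lssvm_fit`, `lssvm_predict`) and the hyperparameter defaults are illustrative choices, not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel between the rows of A and the rows of B (assumed kernel choice).
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def lssvm_fit(X, y, C=10.0, gamma=1.0):
    # Solve the KKT system of the LSSVM classifier:
    # [0   y^T          ] [b    ]   [0]
    # [y   Omega + I / C] [alpha] = [1],  Omega_ij = y_i * y_j * K(x_i, x_j)
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * rbf_kernel(X, X, gamma) + np.eye(n) / C
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, Lagrange multipliers alpha

def lssvm_predict(X_train, y_train, b, alpha, X_new, gamma=1.0):
    # Decision function f(x) = sum_i alpha_i * y_i * K(x, x_i) + b, then take the sign.
    K = rbf_kernel(X_new, X_train, gamma)
    return np.sign(K @ (alpha * y_train) + b)
```

Because the constraints are equalities, training needs only one call to a linear solver rather than a quadratic program, which is the practical difference from a standard SVM.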

Multi-Task LSSVM (MTLSSVM)
Given $m$ learning tasks, MTL aims to learn all tasks simultaneously rather than individually. For each task $i \in \{1, \dots, m\}$, we have $n_i$ training samples $\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$, so the total number of training samples is $N = \sum_{i=1}^{m} n_i$. Based on the regularization framework and the hierarchical Bayesian framework, some researchers assumed that every $w_i$ can be rewritten as $w_i = w_0 + v_i$, where $w_0$ (playing the role of a mean vector) and $v_i$ carry the information of commonality and specialty over tasks, respectively [20,21]. That is to say, when the $m$ learning tasks are similar to each other, the vectors $v_i$ tend to be "small"; otherwise, the vector $w_0$ tends to be "small". To this end, the following optimization problem, which is similar to LSSVM for a single task, is solved to estimate all $v_i$ as well as $w_0$ simultaneously: where $C$ and $\lambda$ are positive real regularization parameters. These previous works on LSSVM and MTLSSVM are not oriented to a target task that suffers from insufficient training data or non-IID training and testing data. Nevertheless, it is valuable to derive useful information from these existing models to enhance the TD task. Therefore, unlike single-task learning and multitask learning, the proposed NMPT utilizes SD data (related to but different from TD) to solve target domain problems with a specific structure, which is introduced in the following section.

Proposed NMPT Framework for GFD
The proposed NMPT method, which transfers the knowledge of the classification hyperplane from SD to TD, is presented in this section.

Basic Definition
Given SD and TD, the purpose of NMPT can be described as follows: under the LSSVM framework, NMPT aims to improve the performance of the TD classification model $f_t = w_t \cdot x_t + b_t$ by using the knowledge from the source domain classification model $f_s = w_s \cdot x_s + b_s$, where the SD and TD are different but similar in some aspects. The training data are set as follows:
$$D_s = \{(x_j^s, y_j^s)\}_{j=1}^{N_s}, \qquad D_t = \{(x_i^t, y_i^t)\}_{i=1}^{N_t} \tag{4}$$
where $D_s$ and $D_t$ are the SD and TD labeled data, respectively; $x_j^s$ and $y_j^s$ denote the $j$th feature vector and corresponding label of the SD data; $x_i^t$ and $y_i^t$ denote the $i$th feature vector and corresponding label of the TD data; $N_s$ and $N_t$ represent the numbers of SD and TD samples, and in this paper $N_t \ll N_s$.

NMPT Architecture
In this section, the proposed NMPT approach is discussed. As mentioned above, the method mainly utilizes the labeled data from SD and TD to solve the target GFD problem. First, inspired by the multitask LSSVM framework [21,22], we assume that the parameters $w_t$ and $w_s$ from both tasks can each be separated into two parts:
$$w_t = w_0 + v_t, \qquad w_s = w_0 + v_s \tag{5}$$
where $w_0$ is the shared parameter, and $v_s$ and $v_t$ are the domain-specific parameters of the SD and TD tasks, respectively. Then, based on the previous transfer strategy that controls the empirical risk of the source domain, we want to extract the knowledge from $w_s$ and further transfer it to $w_t$. Because sufficient training data can prevent the model from overfitting, the parameter $w_0$ from $w_s$ is set as one item of transferred knowledge. In addition, by minimizing the term $\mu\|v_t - v_s\|^2$ during the optimization process, we can also recognize and apply the knowledge of $v_s$ learned from SD. Hence, to achieve this goal, an extension of LSSVM to the transfer learning case is built as follows: where $w_0$ and $\mu\|v_t - v_s\|^2$ are the transfer learning terms, and $C_s$, $C_t$, $\lambda$ and $\mu$ are positive real regularization parameters. An illustration of the NMPT architecture is presented in Figure 2.
Because fewer labeled target training samples tend to degrade the performance of the corresponding classification model, the decision boundary with parameter $w_t$ from the target task could suffer from this problem. However, by utilizing the knowledge of $w_s$ from the source domain, the NMPT architecture can ensure a relatively small generalization error on the target domain by focusing on two goals: (1) learning a more accurate $w_0$ for the target domain; (2) reducing the difference between the model parameters by minimizing $\mu\|v_t - v_s\|^2$ (see the purple line in Figure 2). These two goals make the source domain model applicable to the target domain and ensure the leading role of $D_t$ in building the classification model for the target task. In addition, by comparing Eq. (2) with Eq. (6), we find that the NMPT model tries to make the separating hyperplane of SD suitable for the TD classification task in two ways on the basis of the SRM principle: one is to minimize the margin discrepancies of the training data between SD and TD to adjust the separating hyperplane, and the other is to control the loss function on SD data simultaneously. Both improvements contribute to good generalization capability on TD. The solving process of the NMPT optimization problem (cf. Eq. (6)) is as follows. First, the Lagrangian function for Eq. (6) is built as: where $a_i$ is a Lagrange multiplier. Then, according to the Karush-Kuhn-Tucker (KKT) conditions, the conditions for optimality are obtained as: where $v_t$ and $v_s$ can be derived as: By eliminating $w_0$, $v_t$, $v_s$ and $e_i$ through substitution, one linear system can be obtained as follows: represents the kernel function, and each element of $\Omega$ is defined as: The best-fit values of the parameters $a$, $b_t$ and $b_s$ can finally be worked out, and the corresponding decision function can then be constructed as follows:
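To make the roles of the shared parameter $w_0$, the domain-specific parameters $v_s$ and $v_t$, and the coupling term $\mu\|v_t - v_s\|^2$ concrete, the sketch below solves an NMPT-style problem in the primal with a linear kernel. The exact weighting of the terms is an assumption and is not taken from Eq. (6): the objective used here is ½‖w0‖² + (λ/2)(‖vs‖² + ‖vt‖²) + (μ/2)‖vt − vs‖² + (Cs/2)Σe_s² + (Ct/2)Σe_t², with LSSVM equality constraints. The function names and default values are hypothetical.

```python
import numpy as np

def nmpt_linear_fit(Xs, ys, Xt, yt, Cs=1.0, Ct=10.0, lam=1.0, mu=1.0):
    # Unknowns are stacked as theta = [w0, vs, vt, bs, bt].
    # With labels in {-1, +1}, the LSSVM error e = 1 - y * f(x) satisfies e^2 = (y - f(x))^2,
    # so the assumed objective is a regularized least-squares problem in theta.
    d = Xs.shape[1]
    ns, nt = len(ys), len(yt)
    # Source rows use w0 + vs and bs; target rows use w0 + vt and bt.
    Zs = np.hstack([Xs, Xs, np.zeros((ns, d)), np.ones((ns, 1)), np.zeros((ns, 1))])
    Zt = np.hstack([Xt, np.zeros((nt, d)), Xt, np.zeros((nt, 1)), np.ones((nt, 1))])
    # Quadratic form of the regularizers (assumed weighting, see lead-in).
    I = np.eye(d)
    R = np.zeros((3 * d + 2, 3 * d + 2))
    R[:d, :d] = I
    R[d:2 * d, d:2 * d] = (lam + mu) * I
    R[2 * d:3 * d, 2 * d:3 * d] = (lam + mu) * I
    R[d:2 * d, 2 * d:3 * d] = -mu * I
    R[2 * d:3 * d, d:2 * d] = -mu * I
    # Normal equations of the quadratic objective.
    M = R + Cs * Zs.T @ Zs + Ct * Zt.T @ Zt
    rhs = Cs * Zs.T @ ys + Ct * Zt.T @ yt
    theta = np.linalg.solve(M, rhs)
    w0, vs, vt = theta[:d], theta[d:2 * d], theta[2 * d:3 * d]
    return w0, vs, vt, theta[3 * d], theta[3 * d + 1]

def nmpt_predict_target(X, w0, vt, bt):
    # Target-domain decision function f_t(x) = (w0 + vt) . x + bt.
    return np.sign(X @ (w0 + vt) + bt)
```

Under this assumed objective, increasing `mu` pulls `vt` toward `vs`, so the target hyperplane leans more heavily on the source domain; a kernelized version would instead solve the dual linear system that the text derives from the KKT conditions.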

Complete Process of NMPT Model for Gear Fault Diagnosis
In the proposed framework, intrinsic time-scale decomposition (ITD) is first used to decompose a vibration signal into a set of proper rotation components (PRCs). Then, the energy of each PRC is calculated to reduce the dimensionality and construct the feature vectors. By formulating and solving the NMPT optimization problem (cf. Eq. (6)) with the learned fault representations, the parameters of the NMPT model (including $w_0$, $v_s$, $v_t$, $b_s$ and $b_t$) can be learned simultaneously. Finally, the target data are fed into NMPT to output the predicted fault categories. Figure 3 gives the overall framework of the proposed NMPT-based GFD.

Descriptions of Experimental Simulator and Datasets
For experimental verification, the testing platform, a drivetrain dynamics simulator (DDS), is shown in Figure 4. It includes a driving motor, a speed regulator, a planetary gearbox, a reduction gearbox, a brake device and a brake regulator. During data collection, different speeds and loads are set through the speed regulator and the brake regulator, respectively. Seven vibration sensors (model: 608A11, sampling frequency: 5120 Hz) are installed in total: one is mounted on the surface of the motor to measure its z-axial vibration signal (F1), three are mounted on the planetary gearbox (F2) and three on the reduction gearbox (F3). In addition to the healthy gear (Healthy, C1), there are four different types of gear faults: a small piece of material breaking away from a tooth (Chipped, C2), a tooth fracturing at its root (Missing, C3), cracks emerging at the tooth root (Cracked, C4) and the loss of material from the contacting surface of a tooth (Worn, C5). The descriptions of the fault types and the different experimental conditions are shown in Table 1.

Feature Extraction
Intrinsic time-scale decomposition (ITD), proposed by Frei et al. [23], is a time-frequency analysis method that adaptively decomposes a given vibration signal $X$ into a series of proper rotation components (PRCs) and a monotonic trend signal (the remaining baseline signal) with low end effects and high efficiency, which can be described as:
$$X = H_1 + H_2 + \cdots + H_p + L_p \tag{13}$$
where $p$ denotes the final decomposition level, $H_i$ is the $i$th PRC, and $L_p$ is the remaining baseline signal. However, the PRCs obtained with ITD are too complex to be used directly as fault vectors for fault classification. Thus, the energies of the first six levels of PRCs are calculated to reduce the dimensionality of the PRCs and to construct the fault features.
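The feature construction step can be sketched as follows, assuming the PRCs have already been computed by an ITD routine (not shown here) and that the energy of a component is the sum of its squared samples. The paper does not state its energy formula or any normalization, so both are assumptions, and `prc_energy_features` is a hypothetical helper name.

```python
import numpy as np

def prc_energy_features(prcs, n_levels=6):
    # prcs: array of shape (p, T), row i holding the (i+1)-th PRC of one vibration signal.
    prcs = np.asarray(prcs, dtype=float)
    if prcs.ndim != 2 or prcs.shape[0] < n_levels:
        raise ValueError("expected at least n_levels PRCs, one per row")
    # Energy of each of the first n_levels components: sum of squared samples (assumed definition).
    return np.sum(prcs[:n_levels] ** 2, axis=1)
```

Each vibration segment is thereby reduced to a six-dimensional vector, which is the input the classifier consumes.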

Experimental Study
In this part, the diagnostic performance of the proposed NMPT is first analyzed; then, in order to further demonstrate the superiority of NMPT, it is also compared with other methods: testing samples from the target domain are also arranged, and there is no overlap between the training and testing samples in the target domain. Therefore, the total size of the training set is 50 for LSSVM and 550 for the remaining methods; the total size of the testing set is 500. In order to quantitatively describe the domain differences, the Kullback-Leibler (KL) divergence is calculated by: where KL(·||·) represents the KL divergence between $D_s$ and $D_t$. Table 2 shows the descriptions of the datasets (from DA1 to DA10) as well as their corresponding KL divergences. It shows that the KL indexes of all the datasets are larger than zero, which means that differences between SD and TD do exist. The signals that come from the same axis have relatively small KL divergence compared with those from different axes (e.g., transferring among different rotating speeds: DA1/DA3/DA4 vs DA2, different loads: DA5/DA7/DA8 vs DA8). Meanwhile, the KL divergence between nonadjacent mechanical components is larger than that between adjacent ones (DA10 vs DA9). First, Figures 5, 6, 7 and 8 give the visualized results of the separating hyperplanes on four source domain datasets with three different fault types, including varying speeds (DA3), changing loads (DA7) and adjacent mechanical parts (DA9 and DA10), to show the effectiveness of NMPT in minimizing the discrepancies of the classification hyperplanes between SD and TD caused by operating conditions. Here, all datasets share the same target domain.
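The KL-based domain difference measure mentioned above can be estimated, for instance, from histograms of a single feature. The sketch below is one possible estimator: the binning, the smoothing constant `eps`, the direction KL(source || target) and the function names are all assumptions, since the estimator used for Table 2 is not specified in the text.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Discrete KL(p || q) = sum_k p_k * log(p_k / q_k); eps avoids division by zero in empty bins.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def domain_kl(source_feat, target_feat, bins=20):
    # Histogram both domains on a common range so that the bins are comparable.
    source_feat = np.asarray(source_feat, dtype=float)
    target_feat = np.asarray(target_feat, dtype=float)
    lo = min(source_feat.min(), target_feat.min())
    hi = max(source_feat.max(), target_feat.max())
    ps, _ = np.histogram(source_feat, bins=bins, range=(lo, hi))
    pt, _ = np.histogram(target_feat, bins=bins, range=(lo, hi))
    return kl_divergence(ps, pt)
```

A value of zero means the two histograms coincide, and larger values indicate a larger shift between the source and target feature distributions.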
Comparing the original classification hyperplanes shown in Figure 5(a), Figure 6(a), Figure 7(a), Figure 8(a) and Figure 9 reveals that different working conditions produce different hyperplanes, which could easily cause erroneous diagnoses on the target task when source domain samples are used directly as auxiliary training data. In contrast, NMPT tries to generalize the distinguishing ability from the source domain to the target domain, as shown in Figure 5(b), Figure 6(b), Figure 7(b) and Figure 8(b). Among them, Figure 5(b) and Figure 6(b) demonstrate similar results, which indicates that the proposed model is relatively more robust when transferring from source domains with different speeds or loads than from adjacent mechanical components.
Then, the performance of the NMPT strategy for GFD on DA1 to DA10 is presented as confusion matrices in Figures 10, 11 and 12. In each confusion matrix, the rows and columns show the actual and predicted fault types, respectively. The diagnostic accuracy of each fault type is shown in the diagonal cells, and the misclassification rates are listed in the off-diagonal cells. From Figures 10, 11 and 12 and Table 2, we can find that: (1) Even though there are relatively large domain differences between SD and TD in some datasets (e.g., DA9 and DA10), the NMPT model can still learn a precise classifier for the target task (e.g., Figure 12(a) and (b)); (2) The NMPT model shows very similar GFD accuracies among varying loads (from DA5 to DA8), and a similar conclusion holds for changing speeds (from DA1 to DA4), which verifies the robustness of NMPT to the sensor axis factor. Meanwhile, the best performance of NMPT under different loads occurs with different sensor axes (DA6), whereas transferring within the same axis improves performance in the cases of varying rotating speeds (DA1 & DA3); (3) The best classification performance occurs when the source and target data come from the same gearbox (from DA1 to DA8), where the best classification accuracy of NMPT reaches 98.8% (DA1 & DA3). Besides, using motor data to assist the fault recognition of the reduction gearbox performs worse than transferring between the reduction gearbox and the planetary gearbox; (4) Comparing the accuracy and error rates across all datasets shows that many factors affect the model performance, among which the mechanical component that contributes the source data is the most crucial.
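A confusion matrix of the kind described above, with rows for actual classes, columns for predicted classes and per-class rates in the cells, can be computed as in the following sketch. Class labels are assumed to be integers from 0 to `n_classes - 1` (for example C1 to C5 mapped to 0 to 4), and `confusion_matrix_rates` is a hypothetical helper name.

```python
import numpy as np

def confusion_matrix_rates(y_true, y_pred, n_classes):
    # Count matrix: rows are actual classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    # Normalize each row so the diagonal holds per-class accuracy
    # and the off-diagonal cells hold misclassification rates.
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
```

Row normalization keeps the per-class accuracies comparable even when the test set is not perfectly balanced across fault types.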
In general, the classification accuracy of NMPT is always over 94%. Therefore, the NMPT model can avoid overfitting in GFD under various working conditions by making reasonable use of abundant labeled data from another working condition or adjacent components.
After investigating the classification performance of NMPT on all datasets, it is still meaningful to compare NMPT with other methods. Table 3 lists the comparison results from DA1 to DA10, which are calculated over all categories. Among them, the classification performance of the LSSVM model is the lowest, mainly for two reasons: (a) the LSSVM model is trained using only the insufficient target domain samples, which inevitably hinders its generalization performance according to the principle of structural risk minimization; [...] NMPT can make the best use of source domain samples to improve the diagnostic model for the target task. Compared with the other models, NMPT achieves the highest accuracy on all datasets (with the highest diagnostic accuracy being 98.8%), which demonstrates the superiority of NMPT in utilizing source domain signals to assist GFD in the target domain and provides a practical method for improving GFD performance.

Conclusions
(1) For GFD problems under variable working conditions, an NMPT-based strategy is presented, which utilizes ITD to construct fault characteristics for model parameter transfer. Experimental results indicate that the proposed method can achieve 97.16% diagnostic precision when the energies of the first six levels of PRCs are used as feature vectors.
(2) The visualization results verify that NMPT can generalize the distinguishing ability from the source domain to the target domain, which is beneficial for GFD under various working conditions.
(3) With regard to diagnostic performance, the NMPT model shows strong robustness under different working conditions. Meanwhile, the influence of working conditions on the GFD results is ordered as: rotating speed < load < location.
(4) The proposed model parameter transfer strategy shows better performance than other popular methods, because NMPT can further minimize the discrepancy between the two decision boundaries over tasks. Thus, the proposed strategy is expected to be an effective and feasible tool for solving GFD problems with few labeled target training data.
(5) In the future, we could explore the relationships between the KL indicator, working condition factors and GFD results to improve the universality of the NMPT model.