Model Parameter Transfer for Gear Fault Diagnosis under Varying Working Conditions


Gear fault diagnosis technologies have developed rapidly and have been effectively implemented in many engineering applications. However, varying working conditions degrade diagnostic performance and make gear fault diagnosis (GFD) increasingly challenging. In this paper, a novel model parameter transfer (NMPT) method is proposed to boost the performance of GFD under varying working conditions. Building on the previous transfer strategy that controls the empirical risk of the source domain, the method integrates the strengths of multi-task learning with the idea of transfer learning (TL). It acquires transferable knowledge by minimizing the discrepancy between the separating hyperplanes of one specific working condition (target domain) and another (source domain), and then transfers both commonality and specialty parameters across tasks, so that source domain samples can assist the target GFD task when sufficient labeled samples from the target domain are unavailable. To implement NMPT, the scarce target domain features and the abundant source domain features, together with their supervised information, are fed into the NMPT model to train a robust classifier for the target GFD task. Experiments indicate that NMPT is a promising technology for improving practical GFD performance under varying working conditions. The proposed method provides a transfer learning-based framework for handling the shortage of training samples in the target task caused by variable operating conditions.

1 Introduction

Gears are used extensively in transmission systems because of their large velocity ratio, strong bearing capacity, compactness and high efficiency [1–4]. Gear fault diagnosis (GFD) has therefore become one of the most important research topics in both the industrial and academic communities for ensuring the safe and efficient operation of gear transmission systems. With the development of sensing methods (e.g., vibration, rotor speed, acoustic signal and others), data-driven methods [5, 6], which analyze measured data without requiring a deep understanding of the mechanical drive system, have become increasingly attractive and have proved valid in the field of gear fault diagnosis. Generally, a data-driven method has two steps: (1) constructing a classification model from sampled data, and (2) using the trained model to predict the mechanical fault type. In many existing studies, the fault diagnostic task is treated as a pattern recognition problem, which usually comprises two technical processes: (1) feature extraction, and (2) fault recognition. The purpose of feature extraction is to obtain low-dimensional fault descriptors from high-dimensional vibration data. Many advanced signal processing methods have been proposed to provide recognizable features, such as wavelet transform (WT) [7], principal components analysis (PCA) [8], singular value decomposition (SVD) [9], empirical mode decomposition (EMD) [10], etc. Then, conventional machine learning methods (e.g., extreme learning machine, support vector machine and neural network) are employed to build a gear fault diagnostic model. However, these conventional methods are usually designed for GFD under constant speed and load conditions, and thus generalize poorly under variable working conditions.
In practice, gears often work under time-varying operating conditions. For example, the running states of gas turbines or wind power generators change frequently during operation, and the operating parameters of the planetary gearbox vary correspondingly, so the features extracted in one time period might differ from those in the next. More importantly, conventional machine learning methods require the training data and test data to be independent and identically distributed (IID), a requirement that such varying conditions violate. Recently, these problems have attracted intensive attention from researchers. For example, Song et al. [11] developed a new singular value decomposition interpolation (SVDI) based signal processing method, in which the time-domain and frequency-domain characteristic matrices extracted from vibration signals under discrete working conditions were first decomposed into singular vectors, rotation matrices and characteristic means with SVD, and these three parts were then interpolated to reconstruct the target eigenmatrix for data augmentation. Han et al. [12] utilized empirical mode decomposition (EMD) to decompose vibration signals into several intrinsic mode functions (IMFs), and extracted feature vectors consisting of time domain indexes, frequency domain indexes, an energy domain characteristic parameter and the fractal box dimension from the selected IMFs, in order to capture the dynamic features of the vibration signal accurately and improve the robustness of the feature vectors under different loads for GFD. Meanwhile, Zhao et al. [13] designed a method based on the synchrosqueezing transform (SST) and a deep convolutional neural network (DCNN) for gearbox fault classification under varying operating conditions, where a new index, the envelope time-frequency representation (TFR), was calculated using SST, and the DCNN was then adopted to learn the underlying features of the TFRs and determine the fault type of the planetary gearbox automatically. In general, most of these methods achieve good results by exploring advanced feature extraction methods or building a complex network classifier, but they normally rely on a sufficiently large labeled training dataset, and their performance can degrade when data are insufficient. In many real-world applications, however, only a small number of labeled samples can be collected for training, which greatly hinders the adoption of these methods.

Therefore, training a robust and accurate model with limited labeled data is important. Recently, transfer learning (TL), a fast-growing field of machine learning, has emerged owing to its knowledge transfer ability [14]. Fortunately, although the amount of labeled target data (termed the target domain, TD) may be small, plenty of relevant data can still be obtained in the machine industry from another time period (e.g., under another speed and load) or from adjacent components (termed the source domain, SD). With TL, useful information can be extracted from an existing or previous task to boost the learning efficiency of the target task. Model parameter transfer (MPT), one of the transfer learning architectures, is an effective tool for transferring shared parameters or prior distributions of hyperparameters. Most of these approaches are designed for multitask learning (MTL). For example, Lawrence et al. [15] succeeded in learning parameters from multiple tasks through a shared Gaussian process (GP) prior. Bonilla et al. [16] proposed a GP-based model to learn the shared model knowledge over tasks. Schwaighofer et al. [17] succeeded in learning multiple tasks by combining a hierarchical Bayesian (HB) framework with GP. Besides, Evgeniou et al. [18] proposed a new algorithm that draws on the HB idea to solve multitask learning in the framework of the support vector machine (SVM). All these methods can be easily modified for TL. Strictly speaking, MTL tries to learn different tasks jointly and simultaneously, while TL aims to improve the performance of the TD task with the help of knowledge extracted and stored from SD data. A comparison between MTL and TL is shown in Figure 1. Intuitively, we may minimize the difference in the parameters of the classification hyperplane between TD and SD to transfer the knowledge obtained from SD, so that a robust GFD model with better performance in TD can be obtained.

Figure 1 Comparison of multitask learning (MTL) and transfer learning (TL): (a) MTL; (b) TL

According to the above analysis, a novel model parameter transfer (NMPT) approach is developed to assist target gear fault identification using source domain data. It aims at extracting and transferring the shared characteristic parameters of the hyperplane to address the problems of insufficient labeled training samples and non-IID data between the source and target domains. Specifically, on the basis of controlling the empirical risk of the source domain, the proposed method integrates the advantages of conventional MPT and TL: (a) the least squares support vector machine (LSSVM) based MPT can characterize the shared and domain-specific parameters over tasks; and (b) the idea of TL is introduced to extract transferable knowledge and to minimize the distributional discrepancies between the source and target domains. The novelties and main contributions of this paper can be summarized as:

  • Based on controlling the empirical risk of source domain features in LSSVM framework, an improved TL model is proposed by further minimizing the discrepancies of separating hyperplanes between source and target domains, and then transferring both shared and domain-specific parameters over tasks to make use of source domain data to assist target diagnostic task;

  • The model parameter transfer idea is innovatively introduced to the area of gear fault diagnosis, which provides a new idea for gear fault diagnosis under variable working conditions, especially when sufficient training data from target domain are not available.

The rest of this paper is organized as follows. In Section 2, the theoretical background is briefly presented. Section 3 concentrates on introducing details of the proposed NMPT method and then gives the whole framework of GFD. Section 4 illustrates the experimental study and proves that NMPT can achieve good results in GFD under variable working conditions. Finally, some conclusions drawn from this paper are listed in Section 5.

2 Theoretical Background

This study is going to leverage the NMPT model under LSSVM framework for GFD. Therefore, in this section, the fundamental theory of LSSVM as well as its improvement for MTL are briefly reviewed.

2.1 Least Squares Support Vector Machine (LSSVM)

First, the basic principle of training an SVM-based model for a classification problem is to find the optimal separating hyperplane (f = w*φ(x) + b) in a reproducing kernel Hilbert space (RKHS) [19]. According to the structural risk minimization (SRM) principle, the optimal w and b can be obtained by minimizing the following function:

$$\min \, R = \frac{1}{2}\left\| {\varvec{w}} \right\|^{2} + C \times R_{{{\text{emp}}}} ,$$

where C is a positive real regularization parameter, w is the weight vector defining the orientation of the separating hyperplane, R represents the structural risk, and Remp denotes the loss function that controls the error of the separating hyperplane f on the training data; different choices of Remp lead to different forms of SVM. Using the squared error function, the SRM problem in LSSVM is to compute the optimal separating hyperplane from the feature vectors x and their labels y ∈ {−1, +1} by solving the following constrained minimization problem:

$$\begin{aligned} \min_{{\varvec{w}},b,e} \quad & J({\varvec{w}},e) = \frac{1}{2}\left\| {\varvec{w}} \right\|^{2} + \frac{C}{2}\sum\limits_{i = 1}^{N} {e_{i}^{2} } , \\ {\text{s.t.}} \quad & y_{i} \{ {\varvec{w}}^{\text{T}} \varphi ({\varvec{x}}_{i} ) + b\} = 1 - e_{i} ,\quad i = 1,2, \ldots ,N, \end{aligned}$$

where ei is the error variable of the ith sample, φ(·) denotes a mapping that takes the input features x into the RKHS, b is a bias term, and N is the total number of training samples. A classification hyperplane f = w*φ(x) + b is then constructed for this task.
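For illustration, the LSSVM problem above can be solved in closed form: eliminating w and ei through the optimality conditions leaves a linear system in b and the Lagrange multipliers ai. The following Python sketch is a minimal example under stated assumptions. The function names (`rbf_kernel`, `lssvm_fit`, `lssvm_predict`), the RBF width `gamma`, the value of `C` and the synthetic two-class data are all hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def lssvm_fit(X, y, C=10.0, gamma=1.0):
    """Solve [[0, y^T], [y, Omega + I/C]] [b; a] = [0; 1] for the bias b and multipliers a."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, gamma)  # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / C
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]  # multipliers a, bias b


def lssvm_predict(X_train, y_train, a, b, X_new, gamma=1.0):
    """Decision rule sign(sum_i a_i y_i K(x_i, x) + b)."""
    scores = rbf_kernel(X_new, X_train, gamma) @ (a * y_train) + b
    return np.where(scores >= 0, 1, -1)


# Synthetic, well-separated two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
a, b = lssvm_fit(X, y)
pred = lssvm_predict(X, y, a, b, X)
train_accuracy = float(np.mean(pred == y))
```

Because the equality constraints replace the inequality constraints of the standard SVM, training reduces to one call to a linear solver, at the cost of losing sparsity in the multipliers.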

2.2 Multi-Task LSSVM (MTLSSVM)

Given m learning tasks, MTL aims to learn all tasks simultaneously rather than individually. For each task i ∈ {1, 2, …, m}, we have ni training samples \(\left\{ {{\varvec{x}}_{i,j} ,y_{i,j} } \right\}_{j = 1}^{{n_{i} }}\), so the total number of training samples is \(N = \sum\nolimits_{i = 1}^{m} {n_{i} }\).

Based on the regularization framework and hierarchical Bayesian framework, some researchers assumed that all wi can be rewritten as wi = w0 + vi, where w0 (playing the role of mean vector) and vi carry the information of commonality and specialty over tasks [20, 21], respectively. That is to say, when m learning tasks are analogous to each other, the vectors vi tend to be “small”, otherwise, the vector w0 tends to be “small”. To this end, the following optimization problem which is similar to LSSVM for single task is solved to estimate all vi as well as w0 simultaneously:

$$\begin{gathered} \mathop {\min }\limits_{{{\varvec{w}},e,d}} \begin{array}{*{20}c} {} \\ \end{array} J({\varvec{w}}_{0} ,\left\{ {{\varvec{v}}_{i} } \right\}_{i = 1}^{m} ,\left\{ {{\varvec{e}}_{i} } \right\}_{i = 1}^{m} ) \hfill \\ \, = \frac{1}{2}\left\| {{\varvec{w}}_{0} } \right\|^{2} + \frac{1}{2} \times \frac{\lambda }{m}\sum\limits_{i = 1}^{m} {\left\| {{\varvec{v}}_{i} } \right\|^{2} } + \frac{C}{2}\sum\limits_{i = 1}^{m} {{\varvec{e}}_{i}^{{\text{T}}} {\varvec{e}}_{i} } , \hfill \\ {\text{s.t.,}}\begin{array}{*{20}c} {} \\ \end{array} {(}{\varvec{w}}_{0} { + }{\varvec{v}}_{i} {)}^{\text{T}} {\varvec{Z}}_{i} + b_{i} {\varvec{y}}_{i} { = }{\mathbf{1}}_{{n_{i} }} - {\varvec{e}}_{i} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} i = 1,2, \cdots ,m, \hfill \\ \end{gathered}$$

where C and λ are positive real regularized parameters, \({\varvec{b}} = \left\{ {b_{1} ,\;b_{2} , \cdots ,\;b_{m} } \right\}^{\text{T}} ,\) \({\varvec{e}}_{i} = \left\{ {e_{i,1} ,\;e_{i,2} , \cdots ,\;e_{{i,n_{i} }} } \right\}^{\text{T}} ,\) \({\varvec{Z}}_{i} = \left\{ {\varphi ({\varvec{x}}_{i,1} )y_{i,1} } \right.,\) \(\varphi ({\varvec{x}}_{i,2} )y_{i,2} , \cdots ,\;\varphi ({\varvec{x}}_{{i,n_{i} }} )y_{{i,n_{i} }} \} ,\) \({\varvec{y}}_{i} = \left\{ {y_{i,1} ,\;y_{i,2} , \cdots ,\;y_{{i,n_{i} }} } \right\}^{\text{T}} .\)

These previous LSSVM and MTLSSVM formulations are not designed for a target task that suffers from insufficient training data or from non-IID training and testing data. Nevertheless, it is valuable to derive useful information from these existing models to enhance the TD task. Therefore, unlike single-task learning and multitask learning, the proposed NMPT uses SD data (related to but different from TD) to solve target domain problems with a specific structure, which is introduced in the following section.

3 Proposed NMPT Framework for GFD

This section presents the proposed NMPT method, which transfers the knowledge of the classification hyperplane from SD to TD.

3.1 Basic Definition

Given SD and TD, the main purpose of NMPT can be described as follows: under the LSSVM framework, NMPT aims to improve the performance of the TD classification model ft = wt*xt + bt by using the knowledge from the source domain classification model fs = ws*xs + bs, where SD and TD are different but similar in some aspects. The training data are set as follows:

$$\begin{gathered} Ds = \left\{ {({\varvec{x}}_{j}^{s} ,y_{j}^{s} )} \right\},j = 1,2, \cdots ,Ns, \hfill \\ Dt = \left\{ {({\varvec{x}}_{i}^{t} ,y_{i}^{t} )} \right\},i = 1,2, \cdots ,Nt, \hfill \\ \end{gathered}$$

where Ds and Dt are the SD and TD labeled data, respectively; \({\varvec{x}}_{j}^{s}\), \(y_{j}^{s}\) denote the jth feature vector and the corresponding label of the SD data; \({\varvec{x}}_{i}^{t}\), \(y_{i}^{t}\) denote the ith feature vector and the corresponding label of the TD data; Ns and Nt are the numbers of SD and TD samples, respectively. In this paper, Nt ≪ Ns.

3.2 NMPT Architecture

In this section, the proposed NMPT approach is discussed. As mentioned above, the method mainly uses the labeled data from SD and TD to solve the target GFD problem. First, inspired by the multitask LSSVM framework [21, 22], we assume that the parameters wt and ws from the two tasks can each be separated into two parts:

$${\varvec{w}}_{{\text{t}}} = {\varvec{w}}_{0} + {\varvec{v}}_{{\text{t}}} ,\quad {\varvec{w}}_{{\text{s}}} = {\varvec{w}}_{0} + {\varvec{v}}_{{\text{s}}} ,$$

where w0 is the shared parameter, and vs and vt are the domain-specific parameters of the SD and TD tasks, respectively. Then, based on the previous transfer strategy that controls the empirical risk of the source domain, we want to extract the knowledge in ws and further transfer it to wt. Because sufficient training data help prevent the model from overfitting, the parameter w0 from ws is taken as one piece of transferable knowledge. In addition, by minimizing the term \(\mu \left\| {{\varvec{v}}_{t} - {\varvec{v}}_{s} } \right\|^{2}\) during the optimization process, the knowledge of vs learned from SD can also be recognized and applied. To achieve this goal, an extension of LSSVM to the transfer learning case is built as follows:

$$\begin{aligned} \min_{{\varvec{w}}_{0} ,{\varvec{v}}_{t} ,{\varvec{v}}_{s} ,b,e} \quad & J({\varvec{w}}_{0} ,{\varvec{v}}_{t} ,{\varvec{v}}_{s} ,e) = \frac{1}{2}\left\| {{\varvec{w}}_{0} } \right\|^{2} + \frac{1}{2} \times \frac{\lambda }{2}\left( {\left\| {{\varvec{v}}_{t} } \right\|^{2} + \left\| {{\varvec{v}}_{s} } \right\|^{2} } \right) + \frac{Ct}{2}\sum\limits_{i = 1}^{Nt} {e_{i}^{2} } + \frac{Cs}{2}\sum\limits_{i = Nt + 1}^{Ns + Nt} {e_{i}^{2} } + \mu \left\| {{\varvec{v}}_{t} - {\varvec{v}}_{s} } \right\|^{2} , \\ {\text{s.t.}} \quad & y_{i}^{t} \{ ({\varvec{w}}_{0} + {\varvec{v}}_{t} )^{\text{T}} \varphi ({\varvec{x}}_{i}^{t} ) + b_{t} \} = 1 - e_{i} ,\quad i = 1,2, \cdots ,Nt, \\ & y_{j}^{s} \{ ({\varvec{w}}_{0} + {\varvec{v}}_{s} )^{\text{T}} \varphi ({\varvec{x}}_{j}^{s} ) + b_{s} \} = 1 - e_{Nt + j} ,\quad j = 1,2, \cdots ,Ns, \end{aligned}$$

where w0 and \(\mu \left\| {{\varvec{v}}_{t} - {\varvec{v}}_{s} } \right\|^{2}\) are the transfer learning terms, and Cs, Ct, λ and μ are positive real regularization parameters. A schematic diagram of NMPT is presented in Figure 2.

Figure 2 Schematic diagram of NMPT

Because a small amount of labeled target training data tends to degrade the corresponding classification model, the decision boundary with parameter wt learned from the target task alone could suffer from this problem. However, by utilizing the knowledge of ws from the source domain, the NMPT architecture can ensure a relatively small generalization error on the target domain by focusing on two goals: (1) learning a more accurate w0 for the target domain; and (2) reducing the difference between the model parameters by minimizing \(\mu \left\| {{\varvec{v}}_{t} - {\varvec{v}}_{s} } \right\|^{2}\) (see the purple line in Figure 2). These two goals make the source domain model applicable to the target domain while ensuring the leading role of Dt in building the classification model for the target task. In addition, comparing Eq. (2) with Eq. (6) shows that, on the basis of the SRM principle, the NMPT model tries to make the separating hyperplane of SD suitable for the TD classification task in two ways: one is to minimize the margin discrepancies of the training data between SD and TD to adjust the separating hyperplane; the other is to control the loss function on SD data at the same time. Together, these two improvements promote good generalization on TD.

The NMPT optimization problem (cf. Eq. (6)) is solved as follows:

First, the Lagrangian function for Eq. (6) is built as:

$$\begin{gathered} L({\varvec{w}}_{0} ,{\varvec{v}}_{t} ,{\varvec{v}}_{s} ,b,e,a) \hfill \\ = \frac{1}{2}\left\| {{\varvec{w}}_{0} } \right\|^{2} + \frac{1}{2} \times \frac{\lambda }{2}\left( {\left\| {{\varvec{v}}_{t} } \right\|^{2} + \left\| {{\varvec{v}}_{s} } \right\|^{2} } \right) + \frac{Ct}{2}\sum\limits_{i = 1}^{Nt} {e_{i}^{2} } \hfill \\ { + }\frac{Cs}{2}\sum\limits_{{i = Nt{ + }1}}^{{Ns{ + }Nt}} {e_{i}^{2} } { + }\mu \left\| {{\varvec{v}}_{t} - {\varvec{v}}_{s} } \right\|^{2} \hfill \\ \, - \sum\limits_{i = 1}^{Nt} {a_{i} } \left\{ {y_{i}^{t} {\{ (}}{\varvec{w}}_{0} { + }{\varvec{v}}_{t} {)}^{\text{T}} \varphi ({\varvec{x}}_{i}^{t} ) + b_{t} {\} } - 1 + e_{i} \right\} \hfill \\ \, - \sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} } \left\{ {y_{i}^{s} {\{ (} {\varvec{w}}_{0} { + }{\varvec{v}}_{s} {)}^{\text{T}} \varphi ({\varvec{x}}_{i}^{s} ) + b_{s} {\} }} - 1 + e_{i} \right\}, \hfill \\ \end{gathered}$$

where ai is a Lagrange multiplier. Then, according to Karush–Kuhn–Tucker (KKT) conditions, the solutions for optimality are yielded as:

$$\begin{gathered} \frac{{\partial L}}{{\partial {\varvec w}_{0} }} = 0 \to {\varvec w}_{0} = \sum\limits_{{i = 1}}^{{Nt}} {a_{i} y_{i}^{t} \varphi ({\varvec x}_{i}^{t} )} + \sum\limits_{{i = Nt + 1}}^{{Nt + Ns}} {a_{i} y_{i}^{s} \varphi ({\varvec x}_{i}^{s} )} , \hfill \\ \frac{{\partial L}}{{\partial {\varvec v}_{t} }} = 0 \to \frac{\lambda }{2}{\varvec v}_{t} + 2\mu ({\varvec v}_{t} - {\varvec v}_{s} ) - \sum\limits_{{i = 1}}^{{Nt}} {a_{i} y_{i}^{t} \varphi ({\varvec x}_{i}^{t} )} = 0, \hfill \\ \frac{{\partial L}}{{\partial {\varvec v}_{s} }} = 0 \to \frac{\lambda }{2}{\varvec v}_{s} + 2\mu ({\varvec v}_{s} - {\varvec v}_{t} ) - \sum\limits_{{i = Nt + 1}}^{{Nt + Ns}} {a_{i} y_{i}^{s} \varphi ({\varvec x}_{i}^{s} )} = 0, \hfill \\ \frac{{\partial L}}{{\partial b_{t} }} = 0 \to \sum\limits_{{i = 1}}^{{Nt}} {a_{i} y_{i}^{t} } = 0, \hfill \\ \frac{{\partial L}}{{\partial b_{s} }} = 0 \to \sum\limits_{{i = Nt + 1}}^{{Nt + Ns}} {a_{i} y_{i}^{s} } = 0, \hfill \\ \frac{{\partial L}}{{\partial e_{i} }} = 0 \to a_{i} = Ce_{i} , \hfill \\ \frac{{\partial L}}{{\partial a_{i} }} = 0 \to \left\{ {\begin{array}{*{20}c} {y_{i}^{t} \{ ({\varvec w}_{0} + {\varvec v}_{t} )^{{\text{T}}} \varphi ({\varvec x}_{i}^{t} ) + b_{t} \}- 1 + e_{i} = 0} \\ {(i = 1,2, \ldots , Nt)} \\ {y_{i}^{s} \{ ({\varvec w}_{0} + {\varvec v}_{s} )^{{\text{T}}} \varphi ({\varvec x}_{i}^{s} ) + b_{s} \}- 1 + e_{i} = 0} \\ {(i = Nt + 1,Nt + 2, \ldots , Nt + Ns),} \\ \end{array} } \right. \hfill \\ \end{gathered}$$

where vt and vs can be derived as:

$${\varvec{v}}_{t} = \frac{{\left( {1 + \frac{4\mu }{\lambda }} \right){\varvec{w}}_{0} - \sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} y_{i}^{s} \varphi ({\varvec{x}}_{i}^{s} )} }}{{\frac{\lambda }{2} + 4\mu }} = \frac{{\frac{4\mu }{\lambda }\left( {\sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} \varphi ({\varvec{x}}_{i}^{t} )} + \sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} y_{i}^{s} \varphi ({\varvec{x}}_{i}^{s} )} } \right){ + }\sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} \varphi ({\varvec{x}}_{i}^{t} )} }}{{\frac{\lambda }{2} + 4\mu }}, {\varvec{v}}_{s} = \frac{{\left( {1 + \frac{4\mu }{\lambda }} \right){\varvec{w}}_{0} - \sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} \varphi ({\varvec{x}}_{i}^{t} )} }}{{\frac{\lambda }{2} + 4\mu }} = \frac{{\frac{4\mu }{\lambda }\left( {\sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} \varphi ({\varvec{x}}_{i}^{t} )} + \sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} y_{i}^{s} \varphi ({\varvec{x}}_{i}^{s} )} } \right){ + }\sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} y_{i}^{s} \varphi ({\varvec{x}}_{i}^{s} )} }}{{\frac{\lambda }{2} + 4\mu }}.$$
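As an intermediate step, added here for clarity and using only the optimality conditions above, the closed form of vt can be obtained by first adding the conditions for vt and vs, which cancels the coupling term, and then substituting the resulting expression for vs back into the condition for vt:

$$\begin{aligned} \frac{\lambda }{2}({\varvec{v}}_{t} + {\varvec{v}}_{s} ) &= \sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} \varphi ({\varvec{x}}_{i}^{t} )} + \sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} y_{i}^{s} \varphi ({\varvec{x}}_{i}^{s} )} = {\varvec{w}}_{0} \;\Rightarrow\; {\varvec{v}}_{s} = \frac{2}{\lambda }{\varvec{w}}_{0} - {\varvec{v}}_{t} , \\ \left( {\frac{\lambda }{2} + 4\mu } \right){\varvec{v}}_{t} &= \frac{4\mu }{\lambda }{\varvec{w}}_{0} + \sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} \varphi ({\varvec{x}}_{i}^{t} )} . \end{aligned}$$

Dividing the second line by λ/2 + 4μ and inserting the expression for w0 gives the second form of vt; the result for vs follows by exchanging the roles of the two domains. Consequently, wt = w0 + vt weights the shared sum by 1 + (4μ/λ)/(λ/2 + 4μ) and the target-only sum by 1/(λ/2 + 4μ).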

By eliminating w0, vt, vs and ei through substitution, one linear system can be obtained as follows:

$$\left[ {\begin{array}{*{20}c} {\varvec{0}} & {{\varvec{Y}}_{1} } \\ {\varvec{Y}} & {\varvec{\varOmega}} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {\varvec{b}} \\ {\varvec{a}} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\varvec{0}} \\ {\overline{{\varvec{I}}}} \\ \end{array} } \right],$$

where \({\varvec{a}} = \left[ {a_{1} ,a_{2} , \cdots ,a_{Nt} ,a_{Nt + 1} , \cdots ,a_{Nt + Ns} } \right]^{\text{T}} ,\) \({\varvec{b}} = \left[ {b_{t} ,b_{s}} \right]^{\text{T}} ,\) \({\varvec{Y}}_{1} = [y_{1}^{t} ,y_{2}^{t} , \cdots ,y_{Nt}^{t} ,y_{1}^{s} ,y_{2}^{s} , \cdots ,y_{Ns}^{s} ],\) \({\varvec{I}} = [1,1, \cdots ,1]_{(Nt + Ns) \times 1} ,\) \({\varvec{0}} = \left[ {0,0} \right],\) Y = blockdiag(yt, ys), \({\varvec{y}}_{t} = [y_{1}^{t} ,y_{2}^{t} , \cdots ,y_{Nt}^{t} ]^{\text{T}} ,\) \({\varvec{y}}_{s} = [y_{1}^{s} ,y_{2}^{s} , \cdots ,y_{Ns}^{s} ]^{\text{T}} ,\) Ω is an (Nt + Ns) × (Nt + Ns) symmetric matrix \({\varvec{\varOmega}}{ = }\Omega_{0} + \Omega_{1} + \frac{1}{C}{\varvec{I}}_{Nt + Ns} ,\) Ω1 = blockdiag(Ωt, Ωs), K represents the kernel function, and the elements of Ω are defined as:

$$\begin{gathered} \Omega_{0ij} = \left( {1 + \frac{{\frac{4\mu }{\lambda }}}{{\frac{\lambda }{2} + 4\mu }}} \right)y_{i} y_{j} K({\varvec{x}}_{i} ,{\varvec{x}}_{j} ),\quad y_{i} ,y_{j} \in {\varvec{Y}}_{1} ,\quad {\varvec{x}}_{i} ,{\varvec{x}}_{j} \in \left[ {{\varvec{x}}_{1}^{t} ,{\varvec{x}}_{2}^{t} , \cdots ,{\varvec{x}}_{Nt}^{t} ,{\varvec{x}}_{1}^{s} ,{\varvec{x}}_{2}^{s} , \cdots ,{\varvec{x}}_{Ns}^{s} } \right], \hfill \\ \Omega_{tij} = \frac{1}{{\frac{\lambda }{2} + 4\mu }}y_{i}^{t} y_{j}^{t} K({\varvec{x}}_{i}^{t} ,{\varvec{x}}_{j}^{t} ),\quad i,j \in \left[ {1,Nt} \right], \hfill \\ \Omega_{sij} = \frac{1}{{\frac{\lambda }{2} + 4\mu }}y_{i}^{s} y_{j}^{s} K({\varvec{x}}_{i}^{s} ,{\varvec{x}}_{j}^{s} ),\quad i,j \in \left[ {1,Ns} \right]. \hfill \\ \end{gathered}$$

The best fit values of parameters a, bt and bs can be finally worked out, then the corresponding decision function can be constructed as follows:

$$y = {\text{sgn}}\left[ {\left( {1 + \frac{{\frac{4\mu }{\lambda }}}{{\frac{\lambda }{2} + 4\mu }}} \right) \times \left( {\sum\limits_{i = 1}^{Nt} {a_{i} y_{i}^{t} K({\varvec{x}}_{i}^{t} ,{\varvec{x}})} + \sum\limits_{i = Nt + 1}^{Nt + Ns} {a_{i} y_{i}^{s} K({\varvec{x}}_{i}^{s} ,{\varvec{x}})} } \right) + \frac{1}{{\frac{\lambda }{2} + 4\mu }}\sum\limits_{j = 1}^{Nt} {a_{j} y_{j}^{t} K({\varvec{x}}_{j}^{t} ,{\varvec{x}})} + b_{t} } \right].$$
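The solving procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nmpt_fit`, `nmpt_predict`), all hyperparameter values and the synthetic two-class data are hypothetical. The sketch places 1/Ct and 1/Cs on the diagonal of Ω for target and source rows, following the objective in Eq. (6); setting Ct = Cs = C recovers the single-C form written above. Labels are binary (±1), as in the formulation.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nmpt_fit(Xt, yt, Xs, ys, Ct=10.0, Cs=1.0, lam=1.0, mu=1.0, gamma=1.0):
    """Assemble and solve the NMPT linear system for a, b_t and b_s."""
    nt, ns = len(yt), len(ys)
    n = nt + ns
    X = np.vstack([Xt, Xs])                 # target samples first, then source
    y = np.concatenate([yt, ys])
    c1 = 1.0 / (lam / 2.0 + 4.0 * mu)       # weight of the domain-specific blocks
    c0 = 1.0 + (4.0 * mu / lam) * c1        # weight of the shared block
    G = np.outer(y, y) * rbf_kernel(X, X, gamma)
    omega = c0 * G                          # Omega_0
    omega[:nt, :nt] += c1 * G[:nt, :nt]     # Omega_t
    omega[nt:, nt:] += c1 * G[nt:, nt:]     # Omega_s
    # Assumption: per-domain penalties Ct and Cs on the diagonal.
    omega += np.diag(np.concatenate([np.full(nt, 1.0 / Ct), np.full(ns, 1.0 / Cs)]))
    Y = np.zeros((n, 2))                    # blockdiag(y_t, y_s)
    Y[:nt, 0] = yt
    Y[nt:, 1] = ys
    A = np.block([[np.zeros((2, 2)), Y.T], [Y, omega]])
    rhs = np.concatenate([np.zeros(2), np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    return {"a": sol[2:], "bt": sol[0], "bs": sol[1], "X": X, "y": y,
            "nt": nt, "c0": c0, "c1": c1, "gamma": gamma}


def nmpt_predict(model, X_new):
    """Target-domain decision function built from the shared and target-specific sums."""
    K = rbf_kernel(X_new, model["X"], model["gamma"])
    ay = model["a"] * model["y"]
    nt = model["nt"]
    scores = model["c0"] * (K @ ay) + model["c1"] * (K[:, :nt] @ ay[:nt]) + model["bt"]
    return np.where(scores >= 0, 1, -1)


rng = np.random.default_rng(0)


def make_domain(n_per_class, shift):
    """Two Gaussian blobs; `shift` mimics a change of working condition."""
    neg = rng.normal([-2.0, 0.0], 0.5, (n_per_class, 2)) + shift
    pos = rng.normal([2.0, 0.0], 0.5, (n_per_class, 2)) + shift
    X = np.vstack([neg, pos])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y


shift = np.array([0.5, 0.5])
Xs, ys = make_domain(100, 0.0)      # abundant source data
Xt, yt = make_domain(5, shift)      # scarce labeled target data
Xte, yte = make_domain(50, shift)   # held-out target test data
model = nmpt_fit(Xt, yt, Xs, ys)
accuracy = float(np.mean(nmpt_predict(model, Xte) == yte))
```

In this sketch, a larger μ pulls vt and vs together, so the target classifier relies more heavily on the source hyperplane, while a larger Ct relative to Cs gives the few target samples more influence on the fit.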

3.3 Complete Process of NMPT Model for Gear Fault Diagnosis

In the proposed framework, intrinsic time-scale decomposition (ITD) is first used to decompose a vibration signal into a set of proper rotation components (PRCs). Then, the energy of each PRC is calculated to reduce dimensionality and construct feature vectors. By formulating and solving the NMPT optimization problem (cf. Eq. (6)) with the learned fault representations, the parameters of the NMPT model (including w0, vs, vt, bs and bt) are learned simultaneously. Finally, the target data are fed into NMPT to output the predicted fault categories. Figure 3 gives the overall proposed framework for NMPT-based GFD.

Figure 3 The whole framework of our method for GFD

4 Experiment and Discussion

4.1 Descriptions of Experimental Simulator and Datasets

For experimental verification, the testing platform, a drivetrain dynamics simulator (DDS), is shown in Figure 4. It includes a driving motor, speed regulator, planetary gearbox, reduction gearbox, brake device and brake regulator. During data collection, different speeds and loads are set through the speed regulator and brake regulator, respectively. There are altogether 7 vibration sensors (model: 608A11, sampling frequency: 5120 Hz) in the structure: one is mounted on the surface of the motor to measure the z-axial vibration signal of the motor (F1), three are mounted on the planetary gearbox (F2) and three on the reduction gearbox (F3). Besides the healthy gear (Healthy, C1), there are four types of gear faults: a small piece of material breaking away from a tooth (Chipped, C2), a tooth fracturing at the root (Missing, C3), cracks emerging at the tooth root (Cracked, C4) and the loss of material from the contacting surface of a tooth (Worn, C5). The fault types and experimental conditions are described in Table 1.

Figure 4 SpectraQuest's drivetrain dynamics simulator: (a) Photograph of the system; (b) Structure diagram of the system

Table 1 Gear fault type and working conditions

4.2 Experimental Results and Analysis

4.2.1 Feature Extraction

Intrinsic time-scale decomposition (ITD), proposed by Frei et al. [23], is a time-frequency analysis method that adaptively decomposes a given vibration signal X into a series of proper rotation components (PRCs) and a monotonic trend signal (the remaining baseline signal) with low end effects and high efficiency, which can be described as:

$$X = H^{1} + H^{2} + \cdots + H^{p} + L^{p} ,$$

where p denotes the final decomposition level, Hi is the ith PRC, Lp is the remaining baseline signal.

Nevertheless, the PRCs obtained with ITD are too complex to be used directly as fault vectors for fault classification. Thus, the energies of the first six PRCs are calculated to reduce the dimensionality of the PRCs and to construct the fault features.
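The energy-based feature construction can be sketched as follows. This is an illustrative example only: the text does not give the exact energy formula, so the sketch assumes the common definition of energy as the sum of squared samples of each component. The function name `prc_energy_features` is hypothetical, and the input is assumed to be an array of PRCs already produced by some ITD implementation (ITD itself is not implemented here).

```python
import numpy as np


def prc_energy_features(prcs, n_levels=6):
    """Return the energies of the first n_levels components.

    prcs: array of shape (p, n_samples), one proper rotation component per row.
    Energy is assumed to be the sum of squared samples of each component.
    """
    prcs = np.asarray(prcs, dtype=float)
    if prcs.ndim != 2 or prcs.shape[0] < n_levels:
        raise ValueError("expected at least n_levels components of shape (p, n_samples)")
    return np.sum(prcs[:n_levels] ** 2, axis=1)


# Stand-in components: row k is a constant signal of amplitude k + 1 with 4 samples.
demo_prcs = np.arange(1, 8, dtype=float)[:, None] * np.ones((7, 4))
features = prc_energy_features(demo_prcs)
```

Each signal segment is thus reduced from six full-length components to a six-dimensional feature vector, which is the input size seen by the classifier.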

4.2.2 Experimental Study

In this part, the diagnostic performance of the proposed NMPT is first analyzed, then, in order to further demonstrate the superiority of NMPT, it is also compared with other methods:

  • LSSVM (non-transfer): Least squares support vector machine;

  • MTLSSVM (non-transfer): Multi-Task LSSVM;

  • TCA [24]: Transfer component analysis;

  • DSM [25]: Domain selection machine;

  • ELSSVM [26]: Enhanced LSSVM

For a fair comparison, all kernel-based methods use the radial basis function (RBF) as the kernel function. In this study, 2000 sampled data points of the original vibration signal under each specific working condition were fed into the ITD model for feature extraction. In both the source and target domains, each gear fault category contains 200 samples under any chosen working condition. The experimental datasets are set as follows: for LSSVM, 10 samples of each fault type are selected from the target domain; for MTLSSVM and the transfer strategies, the same 10 target domain samples and 100 source domain samples of each fault type are used. Moreover, 100 testing samples of each fault type from the target domain are also arranged, with no overlap between the training and testing samples in the target domain. Therefore, with five gear conditions, the total size of the training set is 50 for LSSVM and 550 for the other methods, and the total size of the testing set is 500. To quantitatively describe the domain differences, the Kullback-Leibler (KL) divergence is calculated by:

$${\text{KL}}(Ds,Dt) = \frac{{\text{KL}(Ds||Dt) + {\text{KL}}(Dt||Ds)}}{2},$$

where KL( ·|| ·) represents the KL divergence between Ds and Dt. Table 2 describes the datasets (DA1 to DA10) and lists their corresponding KL divergences. The KL indexes of all the datasets are larger than zero, which means that differences between SD and TD do exist. Signals that come from the same axis have a relatively small KL divergence compared with those from different axes (e.g., transferring among different rotating speeds: DA1/DA3/DA4 vs DA2; different loads: DA5/DA7/DA8 vs DA6). Meanwhile, the KL divergence between nonadjacent mechanical components is larger than that between adjacent ones (DA10 vs DA9).
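A symmetrized KL divergence of this form can be estimated from samples as sketched below. The text does not state how the distributions of Ds and Dt are estimated, so this example makes loud assumptions: each domain is summarized by a one-dimensional feature sample, densities are approximated by histograms over shared bins, and a small constant `eps` smooths empty bins. The function name `symmetric_kl` and the synthetic data are hypothetical.

```python
import numpy as np


def symmetric_kl(xs, xt, bins=20, eps=1e-10):
    """Average of KL(p||q) and KL(q||p) for two 1-D samples, using shared-bin histograms."""
    xs = np.asarray(xs, dtype=float).ravel()
    xt = np.asarray(xt, dtype=float).ravel()
    lo = min(xs.min(), xt.min())
    hi = max(xs.max(), xt.max())
    p, _ = np.histogram(xs, bins=bins, range=(lo, hi))
    q, _ = np.histogram(xt, bins=bins, range=(lo, hi))
    # eps avoids log(0) and division by zero in empty bins (an assumed smoothing choice).
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)


rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, 2000)
target_near = rng.normal(0.2, 1.0, 2000)   # mild shift in working condition
target_far = rng.normal(2.0, 1.0, 2000)    # strong shift
kl_same = symmetric_kl(source, source)
kl_near = symmetric_kl(source, target_near)
kl_far = symmetric_kl(source, target_far)
```

Averaging the two directions makes the index independent of which domain is called the source, which is convenient when the same pair of datasets is used in both transfer directions.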

Table 2 Specific tests in experimental section

First, Figures 5, 6, 7 and 8 visualize the separating hyperplanes on four source domain datasets with three different fault types, covering varying speeds (DA3), changing loads (DA7) and adjacent mechanical parts (DA9 and DA10), to show the effectiveness of NMPT in minimizing the discrepancies of classification hyperplanes between SD and TD caused by operating conditions. Here, all datasets share the same target domain. Comparing the original classification hyperplanes in Figure 5(a), Figure 6(a), Figure 7(a), Figure 8(a) and Figure 9 shows that different working conditions produce different hyperplanes, which could easily cause erroneous diagnoses on the target task if source domain samples were used directly as auxiliary training data. In contrast, NMPT generalizes the discriminative ability from the source domain to the target domain, as shown in Figure 5(b), Figure 6(b), Figure 7(b) and Figure 8(b). Among them, Figure 5(b) and Figure 6(b) show similar results, which indicates that the proposed model is relatively more robust when transferring from source domains with different speeds or loads than from adjacent mechanical components.

Figure 5 Classification hyperplane of DA3-SD: (a) Original DA3-SD; (b) After model parameter transfer

Figure 6 Classification hyperplane of DA7-SD: (a) Original DA7-SD; (b) After model parameter transfer

Figure 7 Classification hyperplane of DA9-SD: (a) Original DA9-SD; (b) After model parameter transfer

Figure 8 Classification hyperplane of DA10-SD: (a) Original DA10-SD; (b) After model parameter transfer

Figure 9 The original classification hyperplane of [S2,L1,F3]-z

Figure 10 Confusion matrix of NMPT on transferring datasets with varying speeds

Figure 11 Confusion matrix of NMPT on transferring datasets with different loads

Figure 12 Confusion matrix of NMPT on transferring datasets from adjacent components

Then, the performance of the NMPT strategy for GFD on tests DA1 to DA10 is presented as confusion matrices in Figures 10, 11 and 12. In each confusion matrix, the rows and columns show the actual and predicted fault types, respectively. The diagnostic accuracy of each fault type is shown in the diagonal cells, and the misclassification rates are listed in the off-diagonal cells. From Figures 10, 11, 12 and Table 2, we can find that:

(1) Even though relatively large domain differences exist between SD and TD in some datasets (e.g., DA9 and DA10), the NMPT model can still learn a precise classifier for the target task (e.g., Figure 12(a) and (b));

(2) The NMPT model investigated in this study shows very similar GFD accuracies among varying loads (DA5 to DA8), and a similar conclusion holds for changing speeds (DA1 to DA4), which verifies the robustness of NMPT to the sensor axis factor. Meanwhile, the best performance of NMPT under different loads occurs with different sensor axes (DA6), whereas transferring within the same axis improves performance in the cases of varying rotating speeds (DA1 and DA3);

(3) The best classification performance occurs when the source and target data come from the same gearbox (DA1 to DA8); among these cases, the best classification accuracy of NMPT reaches 98.8% (DA1 and DA3). Besides, using motor data to assist the fault recognition of the reduction gearbox performs worse than transferring between the reduction gearbox and the planetary gearbox;

(4) Comparing the accuracies and error rates across all datasets shows that many factors can affect the model performance; among them, the mechanical component that contributes the source data is the most crucial.

In general, the classification accuracy of NMPT is always over 94%. Therefore, the NMPT model can avoid overfitting in GFD under various working conditions by making reasonable use of abundant labeled data from another working condition or from adjacent components.
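The confusion matrices discussed above follow the convention that rows are actual fault types and columns are predicted fault types, with per-class accuracies on the diagonal. A minimal Python sketch of that row-normalized construction is given below; the labels and predictions are made-up toy values rather than results from this study, and the function name is hypothetical.

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes):
    """Rows: actual class, columns: predicted class, entries: rates per actual class."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Guard against classes that never occur in y_true.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy example with three fault types (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 2]
cm = row_normalized_confusion(y_true, y_pred, n_classes=3)
per_class_accuracy = np.diag(cm)  # diagonal cells: accuracy of each fault type
overall_accuracy = np.mean(np.array(y_true) == np.array(y_pred))
print(cm)
print(per_class_accuracy, overall_accuracy)
```

Each row sums to one, so the off-diagonal entries in a row can be read directly as the misclassification rates for that actual fault type.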

After investigating the classification performance of the NMPT method on all datasets, it is still meaningful to compare NMPT with other methods. Table 3 lists the comparison results for DA1 to DA10, calculated over all fault categories. Among the methods, the LSSVM model has the lowest classification performance, mainly for two reasons: (a) the LSSVM model is trained using only the insufficient target domain samples, which inevitably hinders generalization according to the principle of structural risk minimization; and (b) the standard LSSVM model lacks any mechanism for transferring knowledge among domains, whereas NMPT can make the best use of source domain samples to improve the diagnostic model for the target task. Compared with the other models, NMPT achieves the highest accuracy on all datasets (with the highest diagnostic accuracy of 98.8%), which demonstrates the superiority of NMPT in utilizing source domain signals to assist GFD in the target domain and provides a practical method for improving GFD performance.
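To make the LSSVM baseline more concrete, the sketch below trains a binary least squares SVM with an RBF kernel by solving its standard linear system. It is a generic textbook-style formulation (written in the function-estimation form on ±1 labels, which is equivalent to the classifier form for binary labels), not the authors' implementation and not the NMPT model. The hyperparameters `gamma` and `sigma`, the toy data and all function names are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(x, y, gamma=10.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] for labels y in {-1, +1}."""
    n = x.shape[0]
    a = np.zeros((n + 1, n + 1))
    a[0, 1:] = 1.0
    a[1:, 0] = 1.0
    a[1:, 1:] = rbf_kernel(x, x, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y.astype(float)))
    sol = np.linalg.solve(a, rhs)
    return sol[0], sol[1:]  # bias b, dual coefficients alpha

def lssvm_predict(x_train, alpha, b, x_new, sigma=1.0):
    """Sign of the decision function sum_i alpha_i K(x_new, x_i) + b."""
    return np.sign(rbf_kernel(x_new, x_train, sigma) @ alpha + b)

# Toy two-class problem with only 10 training samples per class (hypothetical data).
rng = np.random.default_rng(0)
x_train = np.vstack([rng.normal(-2.0, 0.5, (10, 2)), rng.normal(2.0, 0.5, (10, 2))])
y_train = np.array([-1] * 10 + [1] * 10)
b, alpha = lssvm_fit(x_train, y_train)
x_test = np.array([[-2.0, -2.0], [2.0, 2.0]])
print(lssvm_predict(x_train, alpha, b, x_test))
```

Because the classifier is determined entirely by the few training points in the linear system, any mismatch between those points and the true target distribution carries straight into the decision function, which is the limitation reason (a) describes.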

Table 3 Total GFD accuracies from test DA1 to DA10

5 Conclusions

(1) For GFD problems under variable working conditions, the structure of an NMPT-based strategy is presented, which utilizes ITD technology to construct fault characteristics for model parameter transfer. Experimental results indicate that the proposed method can achieve 97.16% diagnostic precision when the energies of the first six levels of PRCs are set as feature vectors.

(2) The visualization results verify that NMPT can generalize the distinguishing ability from the source domain to the target domain, which is beneficial for GFD under various working conditions.

(3) With regard to diagnostic performance, the NMPT model shows strong robustness under different working conditions. Meanwhile, the influence of working conditions on the GFD results is ordered as: rotating speed < load < location.

(4) The proposed model parameter transfer strategy shows better performance than other popular methods, because NMPT can further minimize the discrepancy between the two decision boundaries over tasks. Thus, the proposed strategy is expected to be an effective and feasible tool for solving GFD problems with fewer labeled target training data.

(5) In the future, we could explore the relationships between the KL indicator, working condition factors and GFD results to improve the universality of the NMPT model.



Keywords

  • Gear fault diagnosis

  • Model parameter transfer

  • Intrinsic time-scale decomposition

  • Least squares support vector machine

  • Multi-task LSSVM

  • Drivetrain dynamics simulator


  1. F Shen, C Chen, R Q Yan, et al. A fast multi-tasking solution: NMF-theoretic co-clustering for gear fault diagnosis under variable working conditions. Chinese Journal of Mechanical Engineering, 2020, 33: 16.

  2. X H Jin, Y Sun, J H Shan, et al. Fault diagnosis and prognosis for wind turbines: An overview. Chinese Journal of Scientific Instrument, 2017, 38(5): 1041-1053. (in Chinese)

  3. L M Wang, Y M Shao. Crack fault classification for planetary gearbox based on feature selection technique and K-means clustering method. Chinese Journal of Mechanical Engineering, 2018, 31: 4.

  4. R N Liu, B Y Yang, E Zio, et al. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mechanical Systems and Signal Processing, 2018, 108: 33-47.

  5. J Yu, Y He. Planetary gearbox fault diagnosis based on data-driven valued characteristic multigranulation model with incomplete diagnostic information. Journal of Sound and Vibration, 2018, 429: 63-77.

  6. Z Gao, C Cecati, S X Ding. A survey of fault diagnosis and fault-tolerant techniques—Part I: Fault diagnosis with model-based and signal-based approaches. IEEE Transactions on Industrial Electronics, 2015, 62(6): 3757-3767.

  7. R Q Yan, R X Gao, X F Chen. Wavelets for fault diagnosis of rotary machines: A review with applications. Signal Processing, 2014, 96(PART A): 1-15.

  8. S J Deng, L W Tang, X T Zhang. Gear fault diagnosis based on an adaptive neighborhood incremental PCA-LPP manifold learning algorithm. Journal of Vibration and Shock, 2017, 36(14): 111-132. (in Chinese)

  9. M Zeng, Y Yang, J S Cheng, et al. µ-SVD based denoising method and its application to gear fault diagnosis. Journal of Mechanical Engineering, 2015, 51(3): 95-103. (in Chinese)

  10. S Park, S Kim, J Choi. Gear fault diagnosis using transmission error and ensemble empirical mode decomposition. Mechanical Systems and Signal Processing, 2018, 108: 262-275.

  11. T Song, Y L Wang, M F Zhao, et al. Fault diagnosis for rotating machineries under variable operation conditions based on SVDI. Journal of Vibration and Shock, 2018, 37(19): 211-216. (in Chinese)

  12. D Y Han, N Zhao, P M Shi. Gear fault feature extraction and diagnosis method under different load excitation based on EMD, PSO-SVM and fractal box dimension. Journal of Mechanical Science and Technology, 2019, 33(2): 487-494.

  13. D Z Zhao, T Y Wang, F L Chu. Deep convolutional neural network based planet bearing fault classification. Computers in Industry, 2019, 107: 59-66.

  14. S J Pan, Q Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(10): 1345-1359.

  15. N D Lawrence, J C Platt. Learning to learn with the informative vector machine. Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, July 4-8, 2004: 65-72.

  16. E V Bonilla, K M A Chai, C K I Williams. Multi-task Gaussian process prediction. Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008: 153-160.

  17. A Schwaighofer, V Tresp, K Yu. Learning Gaussian process kernels via hierarchical Bayes. Proceedings of the 18th Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 13-18, 2004: 1209-1216.

  18. T Evgeniou, M Pontil. Regularized multi-task learning. Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004: 109-117.

  19. L Chen, S Zhou. Sparse algorithm for robust LSSVM in primal space. Neurocomputing, 2018, 275: 2880-2891.

  20. R Q Yan, F Shen, C Sun, et al. Knowledge transfer for rotary machine fault diagnosis. IEEE Sensors Journal, 2020, 20(15): 8374-8393.

  21. S Xu, X An, X Qiao, et al. Multi-task least-squares support vector machines. Multimedia Tools and Applications, 2014, 71(2): 699-715.

  22. C A Micchelli, M Pontil. Kernels for multi-task learning. Proceedings of the 18th Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 13-18, 2004: 921-928.

  23. M G Frei, I Osorio. Intrinsic time-scale decomposition: time–frequency–energy analysis and real-time filtering of non-stationary signals. Proceedings of the Royal Society A Mathematical Physical and Engineering Sciences, 2007, 463(2078): 321-342.

  24. S J Pan, I W Tsang, J T Kwok, et al. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 2011, 22(2): 199-210.

  25. L X Duan, D Xu, S F Chang. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012: 1338–1345.

  26. C Chen, F Shen, R Q Yan. Enhanced least squares support vector machine-based transfer learning strategy for bearing fault diagnosis. Chinese Journal of Scientific Instrument, 2017, 38(1): 33-40. (in Chinese)


Not applicable.


Supported by National Natural Science Foundation of China (Grant No. 51835009).

Author information


RY and JX designed the experiment; CC and FS analyzed the data; all authors wrote and improved the paper. All authors read and approved the final manuscript.

Authors’ information

Chao Chen received his B.Sc. and M.Sc. degrees from Jiangsu University in 2011 and 2014, respectively. He is now pursuing his Ph.D. degree in the School of Instrument Science and Engineering, Southeast University. His main research interest is machine fault diagnosis.

Fei Shen received his B.Sc. and M.Sc. degrees from Southeast University in 2014 and 2016, respectively. He is now pursuing his Ph.D. degree in the School of Instrument Science and Engineering, Southeast University. His main research interest is machine fault diagnosis.

Jiawen Xu is currently an associate researcher in the School of Instrument Science and Engineering, Southeast University.

Ruqiang Yan received his B.Sc. and M.E. degrees from the University of Science and Technology of China in 1997 and 2002, respectively, and received his Ph.D. degree in 2007 from the University of Massachusetts, Amherst. He is now a professor and Ph.D. supervisor at Xi’an Jiaotong University. His main research interests include machine condition monitoring and fault diagnosis, signal processing, and wireless sensor networks.

Corresponding author

Correspondence to Ruqiang Yan.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Chen, C., Shen, F., Xu, J. et al. Model Parameter Transfer for Gear Fault Diagnosis under Varying Working Conditions. Chin. J. Mech. Eng. 34, 13 (2021).
