A Fast Multi-tasking Solution: NMF-Theoretic Co-clustering for Gear Fault Diagnosis under Variable Working Conditions

Most gear fault diagnosis (GFD) approaches suffer from inefficiency when facing with multiple varying working conditions at the same time. In this paper, a non-negative matrix factorization (NMF)-theoretic co-clustering strategy is proposed specially to classify more than one task at the same time using the high dimension matrix, aiming to offer a fast multi-tasking solution. The short-time Fourier transform (STFT) is first used to obtain the time-frequency features from the gear vibration signal. Then, the optimal clustering numbers are estimated using the Bayesian information criterion (BIC) theory, which possesses the simultaneous assessment capability, compared with traditional validity indexes. Subsequently, the classical/modified NMF-based co-clustering methods are carried out to obtain the classification results in both row and column tasks. Finally, the parameters involved in BIC and NMF algorithms are determined using the gradient ascent (GA) strategy in order to achieve reliable diagnostic results. The Spectra Quest’s Drivetrain Dynamics Simulator gear data sets were analyzed to verify the effectiveness of the proposed approach.


Introduction
In those large-scale rotating machines, wear and tear always comes out in the teeth surface of driving gears if the pressure is not even or some extra impurities are mingled in the lubricating oil. Health monitoring technology of mechanical components has been proved to be effective at discovering early abrasion and reducing the failure rate [1][2][3][4][5]. As one of the major tasks in health monitoring, gear fault diagnosis (GFD) aims to assess the current gear state based on the obtained measurement data, then to inform users to take proper actions [6]. A GFD procedure generally consists of three main processes: (1) Data acquisition: data are collected from sensors to monitor the health status of gears; (2) Feature extraction: some feature extraction algorithms, such as wavelet transform (WT) [7] and least squares support vector machine (LSSVM) [8], are carried out based on the prior knowledge to provide recognizable features; (3) Fault recognition: classifiers are built to obtain gear faults with the analysis of the extracted features.
Clustering technology, one of unsupervised fault recognition approaches, has experienced long term development from partition based clustering to graph theory based clustering as listed in Table 1. Most of these algorithms were applied to fault diagnosis of rotating machinery. For instance, Yuwono et al. [9] combined particle clustering with a Hidden Markov Model (HMM) for bearing fault diagnosis; Pacheco et al. [10] classified gear fault severities using rough set theory. These researches have provided effective clustering applications related to machine fault diagnosis. However, they have a non-negligible limitation: each feature vector is treated as independent and uncorrelated unit in these clustering methods. In fact, strong correlation exists between machine working conditions and targeted diagnostic tasks. There is no doubt that clustering analysis in one dimension will lose significant information hidden in another dimension. To overcome this limitation, a co-clustering strategy for variable working conditions GFD is proposed in this paper. Coclustering was firstly utilized in the biology and medical domains since this concept was mentioned by Mirkin [11] in 1997. The joint clustering of genes shapes and locations promoted the discovery of genetic structure sequence [12]. Subsequently, co-clustering has been expanded to other fields such as text analysis [13] and search engines [14], etc.
In this study, co-clustering applications for gear fault diagnosis have been developed. Compared with previous GFD based on joint clustering methods, two highlights in this paper can be obtained: (1) when one varying working condition (such as rotating speed or load) and one diagnostic task (such as fault severity) are jointed in the same matrix, their correlations are extracted, and the classification accuracy of the latter can adjust with the range of the former; (2) when two diagnostic tasks (such as fault severity and fault type) are jointed in the same matrix, they can be classified at the same time, which improves the diagnosis speed compared with independent GFD strategy and offer a fast multi-tasking solution.
The remainder of this paper is organized as follows. Section 2 presents a brief summary about the applicability of co-clustering. It also describes the principle and basic framework of co-clustering to solve the GFD problem. Section 3 addresses the preparatory work of GFD, especially a short-time Fourier transform (STFT)-based feature extraction method. In Section 4, the co-clustering numbers are estimated based on the Bayesian information criterion (BIC). Then in Section 5, the traditional and modified NMF-theoretic co-clustering process is discussed in detail. To assess these algorithms, the gradient ascend algorithm is also implemented for parameters regulation in Section 6. Section 7 concentrates on the varying working condition GFD experiments using the Drivetrain Dynamics Simulator (DDS) system, which especially shows the superiority of co-clustering compared with classical clustering strategy such as X-means algorithm. Conclusions are drawn in Section 8 with discussions on the future GFD application based on joint clustering.

Co-clustering Framework of GFD
Traditional clustering can be defined as: dataset X exists in a limited data space, which can be represented with a n × d matrix, is composed of n elements: The purpose of general clustering is to segment dataset X into p categories: Different from general clustering, co-clustering could be defined as: dataset X exists in a limited data space, which can be represented with a m × n × d matrix, is composed of m × n elements: The purpose of co-clustering is to segment dataset X into p and q categories in horizontal and vertical axis, respectively: C h k (k = 1, 2, · · · , p) and C v k (k = 1, 2, · · · , q).
where i ∈ {1, 2, . . . , n} , j ∈ {1, 2, . . . , m}. Currently, one challenge for fault diagnosis of gears is that they are often operated under the varying working conditions. To explain the influence of varying working conditions on the diagnosis performance, a typical example is given in Figure 1 to compare the classification results between classical and co-clustering strategies. In the extracted feature matrix, different colors represent different feature value. C1-C5 represent different fault categories; D1-D5 represent different working conditions. R1-R9 represent the 1st clustering result; r1-r5 represent the 2nd clustering result. It can be seen that the classical clustering results may be distorted due to the fact that different working conditions have direct influence on the extracted features, and this makes difference between final clustering categories (R1-R9) and real fault categories (C1-C5). On the contrast, the co-clustering can reflect the actual diagnosis results (R1-R5), which shows the robustness of co-clustering in GFD under interference environment.
Notice that, co-clustering is not limited in the two dimensions. Theoretically, it can be generalized to higher dimension (n ≥ 3), thus giving an idea to solve
Graph theory-based ACODF [23], etc. the more complex problem such as the gear fault diagnosis problem under multiple working conditions. Figure 2 explains the mechanism of co-clustering GFD models under two working condition factors, such as varying rotating speed and varying load. In Model (a), the 2D classification matrix is structured in two dimensions: row for rotating speed and column for GFD task. In Model (b), the 3D classification matrix is structured in three dimensions: length for rotating speed, width for load and height for GFD task. Structured high dimension matrix is classified in each scale at the same time, which offers an idea for GFD under more than one working condition. According to the description above, a co-clustering framework is constructed for gear fault diagnosis, which is shown in Figure 3. The main process can be divided into four sub-frames.
• Feature extraction sub-frame: as the input of this model, the gear vibration signals are collected using several tri-axial accelerometers installed in monitored mechanical equipment, which may be operated in varying working condition environment.  Then, the feature vectors {F } are obtained using the short-time Fourier transform (STFT) approach, aiming to gain differentiable time-frequency features for effective co-classifier performance; • Clustering number estimation sub-frame: the Bayesian information criterion (BIC) strategy is adopted to characterize the distribution character of all feature vectors {F } and then estimate their co-clustering numbers k & l in row and column, respectively; • Co-clustering sub-frame: given co-clustering numbers, the conventional as well as modified NMFbased co-clustering classifiers are put into practice to build the varying working condition GFD models and get the classification results in various tasks; • Parameter regulation sub-frame: aiming to those adjustable parameters involved in BIC and the NMF algorithm, such as the weight factor and the transitional dimension d , the gradient ascent (GA) algorithm is implemented to find the optimal values, which reaches a reliable diagnostic accuracy.
More details of these four sub-frames will be given from Section 3 to Section 6.

STFT-based Feature Extraction
The performance of feature extraction algorithms depends critically on the characteristic of the input gear fault signals as well as the running environment of equipment. Because of the non-stationary property of the vibration signal, general algorithms, such as Fourier transform (FT), are not very useful in this regard [24]. As a kind of time-frequency analysis method, the main idea of short-time Fourier transform (STFT) is to execute Fourier transform in short sequential signals, cut by a sliding temporal window function γ (t) , such as a Hanning or Hamming window [25]. When analyzing the non-stationary vibration signal s(t), (t = 1, 2, . . . , t 0 ) , we supposed that it can be approximated as smooth among the window function γ (t) . Therefore, the STFT of s(t) is calculated by timefrequency units STFT (t, w) and is given by where τ means the position of window function γ (s) ; w represents the frequency parameter of STFT. Since the short time Fourier spectrum reflects a distribution of energies among all frequencies and temporal intervals, we have to seek those 'meaningful' values from the STFT graph. Without any known knowledge of the running equipment, an effective method is to find each maximum in the partitions of the STFT spectrogram to represent the feature of sub-window. Figure 5 gives an example of extracted features in the chipped gear in Figure 4, including 10 × 10, 40 × 40 and 80 × 80 features. Generally, the row & column divided dimensions depends on the non-stationary degree of signal in time and frequency domain, respectively. This figure indicates that more details are emerged in high dimension features compared with low dimension features. However, the increase of dimension will bring the expansion of computational load as well as time consuming. Therefore, a middle dimension is suitable if the definition and dimension are both acceptable, such as the 40 × 40 features in Figure 5 (2020) 33:16 where st ij represents coefficient at the location ( i, j ) int the sub-matrix of the STFT graph; The dimension N will be updated in the real GFD experiments in Section 7. Finally, the feature vectors from STFT are obtained and are given by

BIC-based Clustering Number Estimation
In most practical gear fault diagnosis applications, it is difficult to know the estimated number of clusters in advance, but the input clustering numbers always have direct influence in final clustering results. An optimal numbers estimation strategy based on the cluster validity indexes is found in historical literatures, including: the Calinski-Harabasz index [26], the Davies-Bouldin index [27], the weighted inter-intra index [28] and the in-group proportion index [29]. However, two drawbacks exist in the cluster validity indexes-based estimation strategy: (1) The estimation precision depends on selected clustering algorithm, dataset as well as the validity index. For instance, both using the IGP index, the affinity propagation (AP) method has more credible optimal clustering-number compared with the k-means strategy on account of the randomness of the latter [30]. (2) It is hard to apply the cluster validity indexesbased estimation strategy to 2D or higher dimension clustering like co-clustering because these validity indexes are mainly calculated according to the distance algorithms between different classes or within the same classes.
To find optimal numbers of co-clustering, an estimation algorithm based on Bayesian information criterion (BIC) is proposed. BIC is a statistical method which represents the descriptive power of a model to dataset [31], including: (1) the posterior likelihood of data estimation L ; (2) The model complexity |Θ| . The computational formula of BIC is given by where means the weight factor; N is the totality of samples. In clustering, the posterior likelihood of data estimation L is represented using the ratio of the mutual information entropy between after-clustering I(S * ; F * ) and before-clustering I(S; F ) . In Eq. (6), the entire meaning of L is that a good clustering must maintain original information entropy as possible as it can, But in 2D co-clustering, the BIC model requires to take row and column clustering into consideration at the same time. So we redefine the parameters as follows:

Direction
Sample length Sample size Clustering number According to these definitions, the BIC model complexity in co-clustering can be re-expressed as Substituting Eq. (7) and Eq. (8) into Eq. (6), we get Eq. (9) as follows: Further, this Bayesian information criterion can be extended to 3D field and is given by where p is the clustering number of the 3rd classification; q is the size of the 3rd dimension. Based on the description of theory above, the details of BIC algorithm is given as: Input: a. For single variable working condition GFD, the sample matrix C is structured by where x 1 · · · x 40 T represents the STFT values in continuous approximate one minute, which depends on the change speed of working condition; i ∈ {1, 2, . . . , n} represents the random samples collected from different fault type and n is the number of sample. b. For double variable working condition GFD, the sample matrix C is structured by  Process: a. Select an matrix construction method to build C matrix according to the number of running environments; b. Initialize the row clustering number k = 1 , the column clustering number l = 1 , the weight factor = 0.5; c. For k from 1 to m , l from 1 to n: where p s, f means the joint probability distribution between row and column; p(s) means the probability distribution in row vector; p f means the probability distribution in column vector. d. Search the max value of BIC during the whole k and l domain: Output: a. For single variable working condition GFD, it outputs the number of final clustering numbers k and l; b. For double variable working condition GFD, it outputs the number of final clustering numbers k , l , and p.

NMF Theory
Non-negative factorization (NMF) is a kind of efficient data compression strategy, aiming to describe the highdimensional data set using few base vectors with the help of the non-negative theory [32]. Different from the global (12) characteristics of vector quantization (VQ) and principle component analysis (PCA) theory, NMF offers a good description about the local features, so specializing in searching the small scale information. Non-negative factorization has been investigated for feature extraction and recognition of rolling element bearing fault [33]. However, the unique advantage of co-clustering has not been explored for 2D or higher dimensions application. The basic NMF problem is stated as the following equation: where C n×m is a n × m non-negative matrix; the basis matrix W and the gains matrix H are factorized from C n×m ; d < (n × m)/(n + m) represents the reduced rank. Therefore, C n×m can be linearly estimated by the sub-vectors W n×d and H d×m . In order to obtain matrix W and H , a large number of cost functions are defined to quantify the degree of approximation. In our strategy, the Euclidean distance is chosen as the cost function.
The purpose of NMF is to find the W and H , which possesses the smallest cost function: min where ⊗ is the element-wise multiplication, is the element-wise division; r represents the iteration.

Classical NMF-based Co-clustering
Recently, the clustering application based on NMF has attracted much attention. Particularly, KIM, etc., explored the effective combination between cluster and NMF [34]. This paper extends its application from single cluster to co-clustering, aiming to solve the varying work condition or multi-tasks problem. Rely on the computational W ∈ R n×r and H ∈ R r×m above, two objective functions J k & J l are defined as follows: where C r = [c 1 , c 2 , . . . , c k ] r ∈ R m×k and C 2 = [c 1 , c 2 , . . . , c l ] c ∈ R n×l represent the centroid matrix in row and column, respectively; The element c j , j ∈ [1, 2, . . . , k] of C r means the cluster centroid of the jth cluster in Task I (16)  (2020) 33:16 and the element c j , j ∈ [1, 2, . . . , l] of C c means the cluster centroid of the jth cluster in Task II; B r & B c denote clustering assignment in Task I and Task II respectively. In Task I, B r ij = 1 means the ith sample belongs to the jth cluster, otherwise B r ij = 0 , and so is Task II. The purpose of co-clustering is to find sparse matrix B r ij and B c , which has only one in each row, with others being zero. Taking Task I and Task II as example, we redefine C r and C c as Finally, a second order NMF is applied in J k and J l , aiming to factorize W to W (M r ) T and N r , to factorize H T to H T (M c ) T and N c . Therefore, the B r and B c matrix is obtained according to the second order NMF result. After that, the classifications in row and column are obtained from the B r and B c matrix.

Modified NMF-based Co-clustering
As described in Section 5.1, non-negative matrix C is factorized into two sub-matrices W and H in conventional non-negative factorization. Although the physical meanings of W and H are clear: they represent the (20) decomposition values in row and column respectively and promote the classification effect of co-clustering, the relation between two directions is still ill-defined. Hence, the typical NMF is improved and is given by where L k×l is named as 'the relation matrix', the value L ij represents the link between the ith cluster in Task I and the jth cluster in Task II; k is the clustering number in the row vector and l is the clustering number in the column vector. In modified NMF, the cost function and the update functions can be re-written as where ⊗ is the element-wise multiplication, is the element-wise division; r represents the iteration; W + is the generalized inverse of W : WW + W = W ; H + is the generalized inverse of H : HH + H = H. By introducing the matrix L , the W and H are not required to be orthogonal in modified NMF strategy. Therefore, it expands the optional range of W and H and improves the factorization performance, which will be proved in Section 7.

GA-based Parameter Regulator
Three parameters need to be designed in the co-clustering-based GFD strategy, including: (1) the feature-dimension N in STFT; (2) the weight factor in BIC algorithm; (3) the transitional dimension d in traditional NMFbased co-clustering, which are listed in Table 2.
Among these three parameters, the feature-dimension N in STFT can be decided from the GFD experiments. and d will be adjusted using the gradient ascent (GA) regulatory mechanism [35,36], whose fundamental is shown in Figure 6. The main idea of gradient ascent algorithm is to follow the fastest changing direction to find the maximum of diagnostic accuracy. In Figure 6, four different initial points of ( , d) are listed as example, where the approximation curves ① and ② reach the global optimum while ③ and ④ are limited in the local optimum.
Meanwhile, three concepts are included in the gradient ascent model: where y r i and y c i represents the label of the ith sample; zer(·) means the zero sign function. Notice that, the validity function can only be obtained in those training samples, whose classification labels are known.
The optimal and d/min(m, n) is gained according to the training samples and is used in others, called testing samples. It should be noted that in real GFD application, when performing GA algorithm: (1) If the step length is too large, the optimal parameter result might be skipped. But if the step length is too small, the iteration speed will be slow and cause too large computational load. (2) It is easy for the GA algorithm to be deep in the local optimum rather than the global optimum, which relies on the initial location of , d/min(m, n) . Therefore, it is necessary to take these two factors into consideration to balance the computational accuracy and the time consumption.

DDS Experimental System
The Spectra Quest's Drivetrain Dynamics Simulator (DDS) was used in this study for experimental verification, as shown in Figure 7. This system is composed of six units including: (1) speed regulator; (2) the driving motor; (3) the planetary gearbox; (4) the reduction gearbox; (5) brake device; (6) brake regulator. The faults occur in those gears in planetary & reduction gearboxes, under varying rotating speed and load conditions, which are adjusted using the speed regulator and the brake regulator, respectively. Four types of gear faults are studied: (1) root cracks; (2) missing teeth; (3) chipped teeth; (4) surface wear. The purpose of GFD is to classify these faults through 7 vibration sensors (3-axis for planetary gearbox; 3-axis for reduction gearbox; 1-axis for driving motor). In addition, in order to put the co-clustering methods into effect, we define different levels in four task sets listed in Table 3, including: (1) the fault type task (C1-C5); (2) the fault severity task (D1-D4); 3) the speed regulator task (E1-E5); (4) the load regulator task (F1-F5). Specially, the rotating speed curve and the load curve in Figure 8 was also conducted.

Experimental Setup
The varying rotating speed and load are designed using the regulator curves in Figure 8 for the experiment. In order to enlarge the data analytical ability of algorithms, 10 repeat collections were implemented to increase the sample points to 10 times (5120 × 50× 10) in each group, which are segmented by the 2.5 s sub-signals. Therefore, the sample number for fault type recognition can be gained in row clustering task (fault type) and column clustering tasks (rotating speed and load), and are listed in Table 4A and 4B.
In the fault type recognition experiments, several tests are compared using the models as follows: (1) X-means clustering; (2) Guassian Mixture Model (GMM)

Experimental Results and Discussion
Experiments are carried out, and the column task in Table 4 is considered to improve the effectiveness of row task in co-clustering method. In order to assess the quality of models, the Pr (C i ) and Re (C i ) indexes of clustering results are calculated and given by where the Pr (C i ) index reflects the ability that a model identifies correct samples while the Re (C i ) index reflects the ability that a model finds all correct samples. At first, we observed the fault type recognition results using the NMF-based co-clustering strategy under varying rotating speed and load environment, which are listed in Table 5. This table indicates that the diagnostic accuracy of co-clustering strategy in various load conditions (97.8%) is better than that in various rotating speeds (32) Pr Corrected classfied C i samples All classfied C i samples , Figure 7 The Drivetrain Dynamics Simulator (DDS) system: a the whole physical map of DDS system; b the internal structure image of DDS system; c transmission ratio of reduction gearbox Torque regulator (N·m) < 1.83 1.83-5.49 5.49-9.14 9.14-12.80 > 12.80 (2020) 33:16 (96.3%). This can be explained by two possible reasons: (1) The classify boundary of the latter is more clear than the former since that the rotating speed presents a gradual change characteristic from 0 Hz to 40 Hz while the brake load jumps from 0 to 14.63 N·m; (2) Changing the rotating speed has a stronger interference effect than just changing the load to collected vibration signal, which have been verified in our previous study. Secondly, to illustrate clustering performance in the varying rotating speed model, the NMF-based co-clustering results are listed in Table 6 (k = 6; l = 6) and Figure 9, where 200 samples are tested in each category. Some details can be seen here: (1) the misclassification cases always appear between 'Health' and 'Surface' or between 'Root' , 'Missing' and 'Chipped' because the time domain features of the former are more similar but the frequency domain features of the latter are alike. For example, the 'Chipped' type is easy to be classified as the 'Missing' type if the crack of chipped tooth is large enough; (2) The 6th category (R2 and R3) occurs in the C2 type when the number of clustering is set as 6, which means the discrete ability and the inconsistency exists inside the 'Root' samples. Interestingly, although the differences exist in local precision and recall index of different categories, the total precision and recall are very nearly the same in these two tables.   Especially, we reclassify the fault types of testing samples in Table 4 using the modified NMF-based co-clustering strategy. Here the normalized relation matrix L k×l (k = l = 5) can be gained and given by Notice that, the element l ij represents the relevance between the ith category in row task and the jth category in column task. In the L k×l matrix, the l ij rises up to approximate 7 in 'Missing' & 'Chipped' fault type with the increase of rotating speed. The effect of speed regulator on different fault types presents the following rank: Miss-ing≈Chipped>Root>Health≈Surface. However, there is a less link between load and fault types, seeing from the L k×l matrix under changing load environment. In order to further verify the necessity for modified NMF, Table 7 gives the evaluation indexes of this approach. It shows that an improved precision occurs in the varying rotating speed dataset but makes no difference in the varying load dataset. That is because the L k×l matrix in modified strategy cuts off the link between W and H in the former, which promotes the separation ability between row and column task. As can be seen in Table 7, the middle dimension of W and H is not limited in a single value, like traditional NMF method does, thus improving both the flexibility of selected dimension and the GFD precision (97.0% and 97.8%). Also, it can be known that the recognition performance of various loads (100%) is superior to rotating speed (95.3%) on account of the continuity of speed regulator.

Table 4 Sample number for (A) rotating speed fault recognition and (B) load fault recognition
Finally, concentrating on the varying working condition GFD, the traditional clustering strategies, including the K-means and the GMM methods, and co-clustering approaches were compared using two selected indexes: the precision and the time consumption. The performance comparison results are listed in Table 8. According to this table, although the time complexity of single algorithm of 1D clustering is smaller than joint strategy (4.923 s < 7.991 s), the accumulation of computational load for two tasks is larger than co-clustering proposed (4.923 s + 4.856 s > 7.991 s). Meanwhile, the co-clustering have an apparent precision increase in varying working condition GFD, about 12.51% in varying rotating speed and 7.00% in varying load. Therefore, this experiment proves the superiority of co-clustering in gear fault diagnosis under variable working conditions.

STFT Dimension Adjustment Experiments
During the STFT dimension adjustment experiments, we adjusted the feature dimension N from 10 to 80 one by one, and then the diagnostic precisions of Task I as well as the time consumptions of co-clustering model were observed and were drawn in Figure 10. It can be seen that the diagnostic accuracy increases from 78.76% to 97.54% when the feature dimension N increases from 10 to 80, meanwhile, the computational load indicates an exponential increase from approximately 16.78 s to 48.45 s. It can be seen from Figure 10 that N = 42 is considered as an appropriate dimension, in which the diagnosis accuracy is satisfactory enough (97.02%), while the time consumption keeps at a low level (25.12 s). Although the precision will continue to improve up to 97.54% if we  continuously increase the feature dimension, but computational cost will also increase quickly.

BIC Algorithm Experiments
Based on the dataset from Drivetrain Dynamics Simulator (DDS) system, we tried to adjust the number of cluster in row and column, respectively. Generally, the search range of clustering number is from 2 to 10: 2 ≤ k ≤ 10; 2 ≤ l ≤ 10. Figure 11 illustrates an example of the BIC results in varying load GFD experiments when the clustering number changes from 2 to 10. It can be seen that the peak of BIC value (− 8124) exists at the point (5,5), which means that the optimal co-clustering results happens when the row and the column clustering numbers both equal to five. According to the real dataset and the standard labels in Table 3, the BIC-based estimation algorithm satisfies the requirements of practical gear fault diagnosis applications. For further study on the BIC estimation algorithm, the BIC method was compared with a kind of traditional selfadapting classification estimation algorithm: X-means, which is an improved strategy of K-means. The estimations of clustering number in BIC as well as X-means algorithm are listed in Table 9. On one hand, for the BIC algorithm, with the increase of clustering number k, the diagnostic precision presents an increasing trend from 60.0% to 96.9% first, then it begins to slightly decrease when the clustering number k is larger than 5. On the other hand, by comparing the estimation results between X-means and BIC algorithm, we find the estimation of clustering number using the X-means is much larger than the normal (k = 12) and also its diagnostic precision is only 84.1%, which proves the superiority of BIC-based clustering number estimation.

GA Parameter Regulator Experiments
As shown in Figure 4, the gradient ascent algorithm was used in the varying working condition datasets to obtain the weight factor and the transitional dimension d . The regulator results of the gradient ascent algorithm are listed in Table 10. Here the final ( , d ), the number of iterations as well as the precision of row task are observed using 4 initial ( , d ) values: (0.5, 500), (0.5, 200), (0.7, 500), (0.7, 200). After gradient ascent, it could be seen that four initial values all reached the global optimum in varying load experiments. But in varying rotating speed experiments, only two points reach the global optimum, which achieves 99.3% diagnostic accuracy. In addition, it can be found that the less iterations occur when the distance between initial and final ( , d ) point is short. Therefore, these tests prove that the performance of GA algorithm depends on the selected initial point to a great extent.

Conclusions
A NMF-theoretic co-clustering strategy is presented in this paper to offer a fast multi-tasking solution to solve the gear fault diagnosis problem under variable working conditions. Here the time-frequency features are extracted from the STFT spectrogram, and are utilized to structure the 2D matrix for joint clustering. Experiments indicate that 97.02% diagnostic precision can be achieved when the STFT dimension is set as 42. Meanwhile, seeing from the results of the BIC-based optimal clustering number estimation, they are close to the practical categories, no matter in varying rotating speed or varying load dataset. After NMF, row and column clustering task can be identified at the same time, with approximately 10% improved accuracy and less time cost compared with those single task clustering algorithms, such as X-means and GMM algorithm. There is an internal connection in most of gear failure signals. The proposed co-clustering strategy has better performance than independent clustering strategy because the modified NMF helps to provide a relation matrix, which shows a strong correlation between different rotating speeds and fault types. Therefore, the NMFbased co-clustering has a good potential to apply in the gear fault diagnosis of large-scale rotating machines under varying working conditions. In the future, co-clustering with higher dimension will probably apply in the more complex working conditions or more diagnostic tasks to improve the gear fault diagnosis performance.

Authors' Contributions
RY designed the experiment, FS and CC analyzed the data. All authors read and approved the final manuscript.