This section details DFAWNet, covering FWConv, DHT, ISF, and the end-to-end fault diagnosis network. The core design motivation is illustrated in Figure 2, and the overall fault diagnosis framework of DFAWNet is shown in Figure 5.

### Fused Wavelet Convolution

As a DL-based implementation of the wavelet transform, FWConv addresses the problems of scale learning and basis selection through a learnable scale and an adaptive fusion of various wavelet bases. The learnable scale locates the appropriate frequency band as the corresponding loss is optimized, while the adaptive fusion of various wavelet bases improves the feature extraction ability. The whole structure of FWConv is shown in Figure 6.

The first step is to realize the wavelet transform with a learnable scale. As in Ref. [19], the wavelet transform can be converted into a convolution with learnable parameters. Different from Ref. [19], the translation parameter *b* is replaced in this study by the stride parameter of the 1-D convolution. Consequently, an improved single-parameter wavelet convolution is realized:

$$W_{a} = x * \psi_{a} ,$$

(9)

where \(a\) is the learnable parameter of the filter kernel \(\psi\), and \(W_{a}\) denotes the wavelet coefficients computed at the scale \(a\). Different channels correspond to different scales. For convenience, the wavelet coefficients in the *c*th channel are denoted as \(W_{c}\) and the corresponding wavelet basis as \(\psi_{c}\).
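As an illustration of Eq. (9), the following minimal sketch computes a strided 1-D wavelet convolution for one scale. The real Morlet-like kernel, its length of 33 samples, the centre frequency of 5, and the function names are assumptions made for this example only; the source does not specify them here. Because the kernel is symmetric, the sliding inner product used below coincides with a true convolution.

```python
import math

def morlet_kernel(scale, length=33):
    """Assumed real Morlet-like wavelet, sampled on a symmetric grid and dilated by `scale`."""
    half = length // 2
    kernel = []
    for k in range(-half, half + 1):
        t = k / scale
        kernel.append(math.exp(-0.5 * t * t) * math.cos(5.0 * t) / math.sqrt(scale))
    return kernel

def wavelet_conv(x, scale, stride=1, length=33):
    """Slide the scaled kernel over x without padding; the stride plays the role of translation b."""
    psi = morlet_kernel(scale, length)
    out = []
    for start in range(0, len(x) - length + 1, stride):
        out.append(sum(x[start + j] * psi[j] for j in range(length)))
    return out

# Toy input: a pure sinusoid of 256 samples.
x = [math.sin(2 * math.pi * 0.05 * n) for n in range(256)]
w_a = wavelet_conv(x, scale=4.0, stride=2)
```

In a trainable layer, `scale` would be the single learnable parameter per channel, and the kernel would be regenerated from it at every forward pass.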

The second step is to alleviate the problem of basis selection. Different wavelet bases \(\psi\) are designed for extracting various features. In addition, a single wavelet basis usually cannot meet the requirement of fault diagnosis [28]. Thus, a simple idea is to prepare a set of wavelet bases and select important ones via the data-driven mechanism.

However, discarding the unselected wavelet bases stops the gradient backpropagation through them, which makes network learning unstable.

A more suitable way to implement basis selection is to generate *C* new wavelet bases \(\psi^{\prime}\) from the original \(\psi\) by linear weighting, which we call fusion:

$$\psi_{i}^{\prime } = \sum\limits_{n = 1}^{{C_{{\text{o}}} }} {p_{i,n} } \psi_{n} ,\quad i = 1, \ldots ,C,$$

(10)

where \(\psi^{\prime}\) denotes new wavelet bases; *C* is the number of generated new wavelet bases; \(C_{{\text{o}}}\) is the number of original wavelet bases \(\psi\); the weight \(p_{i}\) of the *i*th new wavelet basis satisfies \(\sum\limits_{n = 1}^{{C_{{\text{o}}} }} {p_{i,n} } = 1\).
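The fusion in Eq. (10) is a row-wise weighted sum of the original kernels. The sketch below uses three toy kernels of length 3 and hand-picked weights; both are illustrative values rather than anything taken from the source.

```python
def fuse_bases(weights, bases):
    """weights: C rows of C_o weights, each summing to 1; bases: C_o kernels of equal length."""
    fused = []
    for p_i in weights:
        if abs(sum(p_i) - 1.0) > 1e-9:
            raise ValueError("each weight row must sum to 1")
        fused.append([sum(p * psi[k] for p, psi in zip(p_i, bases))
                      for k in range(len(bases[0]))])
    return fused

# Toy original bases (C_o = 3) and weights for C = 2 fused bases.
bases = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.5], [0.0, 2.0, 0.0]]
weights = [[1.0, 0.0, 0.0],      # one-hot row: plain selection of the first basis
           [0.25, 0.5, 0.25]]    # soft row: a blend of all three bases
fused = fuse_bases(weights, bases)
```

The one-hot row reproduces the first basis exactly, while the soft row keeps every original basis in the computation so that all of them continue to receive gradients.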

When the weight \(p_{i}\) is one-hot, fusion reduces to wavelet basis selection. As a generalized definition, the weight \(p_{i}\) should be high for the selected wavelet basis \(\psi_{i}\) and low for the bases that differ from it:

$$p_{{i,n}} = {\text{Softmax}}( - t\parallel \psi _{i} - \psi \parallel _{2} ),\quad n = 1,2, \ldots C_{{\text{o}}} ,$$

(11)

where \({\text{Softmax}}( \cdot )\) guarantees \(\sum\limits_{n = 1}^{{C_{{\text{o}}} }} {p_{i,n} } = 1\); and *t* is a temperature parameter that increases from 1 to \(T = 10^{4}\) as training proceeds.
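The effect of the temperature in Eq. (11) can be checked numerically. The sketch below assumes that the distance is taken between the selected basis \(\psi_{i}\) and each original basis \(\psi_{n}\), and that the Softmax runs over \(n\); the toy kernels and function names are hypothetical.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fusion_weights(selected, bases, t):
    """Softmax over n of -t * ||selected - bases[n]||_2, computed in a numerically stable way."""
    logits = [-t * l2_distance(selected, psi_n) for psi_n in bases]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

bases = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.5], [0.0, 2.0, 0.0]]
soft = fusion_weights(bases[0], bases, t=1.0)    # early training: smooth blend
hard = fusion_weights(bases[0], bases, t=1e4)    # late training: effectively one-hot
```

At \(t = 1\) every basis contributes, with the selected one weighted most; at \(t = 10^{4}\) the weights collapse onto the selected basis, which recovers hard selection at the end of training.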

The basis \(\psi_{i}\) is selected according to an importance index. Since different wavelet bases characterize different features, a good importance index should be high for a wavelet basis that differs strongly from the others. Thus, based on the KL-divergence [29], the importance index can be defined as:

$$\begin{gathered} I_{n} = \frac{1}{{C_{{\text{o}}} }}\sum\limits_{m = 1}^{{C_{{\text{o}}} }} {D_{{{\text{KL}}}} } (p_{n} ||p_{m} ) \hfill \\ \, = \frac{1}{{C_{{\text{o}}} }}\sum\limits_{m = 1}^{{C_{{\text{o}}} }} {\sum\limits_{l = 1}^{{C_{{\text{o}}} }} {p_{n,l} } } {\text{log}}\frac{{p_{n,l} }}{{p_{m,l} }},\quad n = 1,2, \cdots ,C_{{\text{o}}} . \hfill \\ \end{gathered}$$

(12)
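Eq. (12) averages the KL-divergence from one weight distribution to all the others. A minimal sketch follows; the small constant `eps` is an assumption added for numerical safety and is not part of the source formula, and the three distributions are made-up values.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pl * math.log((pl + eps) / (ql + eps)) for pl, ql in zip(p, q))

def importance_index(weights):
    """I_n = (1 / C_o) * sum_m D_KL(p_n || p_m) for every row p_n."""
    c_o = len(weights)
    return [sum(kl_divergence(p_n, p_m) for p_m in weights) / c_o
            for p_n in weights]

# Two similar distributions and one that differs strongly from both.
p = [[0.8, 0.1, 0.1],
     [0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8]]
index = importance_index(p)
```

The third row, which is far from the other two, receives the largest index, matching the intent that bases unlike the rest are ranked as more important.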

Based on the importance index, we select *C* important bases and generate their corresponding weights *p*. We then generate the new bases and realize the fused wavelet convolution:

$$\begin{gathered} W_{i} = {\text{FWConv}}(x,\psi^{\prime}_{i} ) = x*\psi^{\prime}_{i} \hfill \\ \, = x*\left(\sum\limits_{n = 1}^{{C_{{\text{o}}} }} {p_{i,n} } \psi_{n} \right),\quad i = 1, \cdots ,C. \hfill \\ \end{gathered}$$

(13)

### Dynamic Hard Thresholding

As a DL-based implementation of thresholding, DHT is proposed to address the problem of threshold function design in traditional thresholding methods. Similar to the traditional threshold function, the feature discriminator can give the decision to either keep or remove values of coefficients. Then a reparameterization trick module translates the decision into an optimizable hard thresholding operation. The overall structure of DHT is shown in Figure 7.

According to Eq. (3), the key point of hard thresholding is to determine which coefficients should be kept (feature) and which should be removed (noise). In DL, this is equivalent to a binary classification problem. Thus, the thresholding can be formulated as:

$$\hat{W} = W \odot H,\quad H = \left\{ {\begin{array}{*{20}r} \hfill {0,\quad \phi < 0.5,} \\ \hfill {1,\quad \phi \ge 0.5,} \\ \end{array} } \right.\quad \phi \in (0,1),$$

(14)

where \(H \in {\mathbb{R}}^{L}\) is the operation of removing or keeping and *L* is the length of the wavelet coefficients; \(\phi\) is the output of the feature discriminator.
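Eq. (14) amounts to an element-wise mask. The short sketch below applies it to four made-up coefficients and discriminator outputs.

```python
def hard_threshold(coeffs, phi):
    """Keep a coefficient when the discriminator output is at least 0.5, otherwise zero it."""
    mask = [1.0 if p >= 0.5 else 0.0 for p in phi]
    return [w * h for w, h in zip(coeffs, mask)], mask

coeffs = [0.9, -0.1, 0.4, -1.2]
phi = [0.8, 0.2, 0.5, 0.49]
kept, mask = hard_threshold(coeffs, phi)
```

Note that the decision depends only on \(\phi\), not on the coefficient magnitude: the largest coefficient in this toy example is removed because its score is below 0.5.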

The first step is to design the feature discriminator. According to Ref. [30], the design of a threshold function should consider both the inter-scale and the intra-scale dependency of the coefficients. For the intra-scale dependency, the local information of neighboring coefficients should be considered. The intra-scale feature is therefore extracted by a simple convolution:

$$[\iota_{0} ,\iota_{1} ] = {\text{conv}}(W),$$

(15)

where \(\iota_{0} \in {\mathbb{R}}^{L}\) is output for the decision to keep the coefficient; \(\iota_{1} \in {\mathbb{R}}^{L}\) is output for the decision to remove the coefficient; \({\text{conv}}( \cdot )\) is a 1-D convolution.

As for the inter-scale dependency, the global average value is used to represent the characteristic of each scale. Then the dependency of different scales is extracted by the fully connected layer:

$$[\gamma_{0} ,\gamma_{1} ] = {\text{fc}}\left( {\frac{1}{L}\sum\limits_{i = 1}^{L} {W_{ci} } } \right),$$

(16)

where \(\gamma_{0} \in {\mathbb{R}}^{L}\) is output for the decision to keep the coefficient; \(\gamma_{1} \in {\mathbb{R}}^{L}\) is output for the decision to remove the coefficient.

Then we obtain the final output of the feature discriminator with the \({\text{Softmax}}( \cdot )\):

$$\phi_{i} = {\text{Softmax}}(\iota_{i} + \gamma_{i} ),\quad i = 0,1,$$

(17)

where \(\phi_{0} \in {\mathbb{R}}^{L}\) represents the decision to keep the coefficients (used as \(\phi\) in Eq. (14)); \(\phi_{1} \in {\mathbb{R}}^{L}\) represents the decision to remove the coefficients.
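Eq. (17) is a two-class Softmax applied position by position to the summed intra-scale and inter-scale outputs. The sketch below takes those outputs as given lists of length \(L\), as stated in the definitions above; the convolution and fully connected layers that produce them are not reproduced, and the numbers are illustrative.

```python
import math

def keep_probability(iota_keep, iota_remove, gamma_keep, gamma_remove):
    """Return phi_0, the per-position probability of keeping a coefficient."""
    phi_keep = []
    for a0, a1, g0, g1 in zip(iota_keep, iota_remove, gamma_keep, gamma_remove):
        z0, z1 = a0 + g0, a1 + g1
        peak = max(z0, z1)                      # subtract the maximum for stability
        e0, e1 = math.exp(z0 - peak), math.exp(z1 - peak)
        phi_keep.append(e0 / (e0 + e1))
    return phi_keep

phi_0 = keep_probability(iota_keep=[2.0, 0.0, -1.0],
                         iota_remove=[0.0, 0.0, 1.0],
                         gamma_keep=[0.5, 0.5, 0.5],
                         gamma_remove=[0.5, 0.5, 0.5])
```

With equal inter-scale terms, the decision is driven by the intra-scale outputs alone, so the three positions lean towards keep, undecided, and remove respectively.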

The next step is to translate the feature decision \(\phi_{0}\) into an optimizable hard thresholding operation *H*. In the inference stage, this feature decision could be directly converted to hard operation by logical judgment. However, in the training phase, directly turning the decision into the hard thresholding operation will result in the loss of gradient for the feature discriminator and probabilistic randomness (probabilistic randomness of the decision is helpful for training when the discriminator is not well trained). These problems can be tackled by the Gumbel-Softmax reparameterization trick [31]:

$$h_{i} = {\text{Softmax}}((\log (\phi _{i} ) + g_{i} )/\varepsilon ),\quad i = 0,1,$$

(18)

$$H = {\text{re}}((1 - \mathop {\arg \max }\limits_{i} (h_{i} )) - h_{0} ) + h_{0},$$

(19)

where \(g_{0}\) and \(g_{1}\) are i.i.d. samples drawn from the Gumbel\((0,1)\) distribution; \(\varepsilon\) is a parameter that controls the smoothness of the distribution \(h_{i}\) and is set to 0.66 as this is a binary classification; \((1 - \mathop {\arg \max }\limits_{i} (h_{i} ))\) returns the sampling result by Gumbel-Softmax; \({\text{re}}( \cdot )\) denotes the reparameterization trick that keeps the gradient backpropagation for \(h_{0}\). Consequently, in the training stage, \(H\) can be denoted as:

$$H = \left\{ {\begin{array}{*{20}l} {h_{0} > 0.5 \Leftrightarrow 1 - \mathop {\arg \max }\limits_{i} (h_{i} ),\;{\text{forward,}}} \hfill \\ {h_{0} ,\;{\text{backward,}}} \hfill \\ \end{array} } \right.$$

(20)

where forward denotes the feed-forward stage; backward denotes the backpropagation stage.
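The forward pass of Eqs. (18) to (20) can be sketched as follows. Only the sampling and the resulting hard mask are shown; the backward behaviour requires an automatic differentiation framework, where a common way to realize \({\text{re}}( \cdot )\) is to treat the difference between the hard and soft values as a constant. The clamping constant, the seeded generator, and the function names are assumptions of this example.

```python
import math
import random

def gumbel_sample(rng):
    """Draw one Gumbel(0, 1) sample via the inverse transform."""
    u = max(rng.random(), 1e-12)    # guard against log(0)
    return -math.log(-math.log(u))

def gumbel_softmax_mask(phi_keep, eps=0.66, rng=None):
    """Return the hard mask H used in the forward pass and the soft value h_0."""
    rng = rng or random.Random(0)
    hard, soft = [], []
    for p in phi_keep:
        p = min(max(p, 1e-12), 1.0 - 1e-12)          # keep both logs finite
        z0 = (math.log(p) + gumbel_sample(rng)) / eps
        z1 = (math.log(1.0 - p) + gumbel_sample(rng)) / eps
        peak = max(z0, z1)
        e0, e1 = math.exp(z0 - peak), math.exp(z1 - peak)
        h0 = e0 / (e0 + e1)
        soft.append(h0)
        hard.append(1.0 if h0 > 0.5 else 0.0)        # equals 1 - argmax_i(h_i)
    return hard, soft

hard, soft = gumbel_softmax_mask([0.9, 0.5, 0.1, 0.7], rng=random.Random(42))
```

Because of the Gumbel noise, a coefficient with a keep probability of 0.9 is still occasionally removed during training, which is the randomness the text describes as helpful while the discriminator is immature.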

Note that dynamic hard thresholding is executed by the optimizable operation *H*. This provides a new way to constrain the denoising process, namely through the denoising ratio. The denoising ratio *r* is defined as the proportion of wavelet coefficients that are set to zero in the forward propagation:

$$r = 1 - \frac{\sum H }{L}.$$

(21)

A denoising ratio loss \(L_{{{\text{DR}}}}\) is designed for controlling the denoising ratio:

$$L_{{{\text{DR}}}} = (r_{{{\text{actual}}}} - r_{{{\text{target}}}} )^{2} ,$$

(22)

where \(r_{{{\text{actual}}}}\) is the actual denoising ratio during the training phase; \(r_{{{\text{target}}}}\) is the desired denoising ratio set in advance.
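Eqs. (21) and (22) translate directly into two small functions. The mask and the target ratio below are illustrative values.

```python
def denoising_ratio(mask):
    """Proportion of coefficients set to zero: r = 1 - sum(H) / L."""
    return 1.0 - sum(mask) / len(mask)

def denoising_ratio_loss(mask, r_target):
    """Squared gap between the actual and the target denoising ratio."""
    return (denoising_ratio(mask) - r_target) ** 2

mask = [1.0, 0.0, 0.0, 1.0]            # half of the coefficients are removed
r = denoising_ratio(mask)
loss = denoising_ratio_loss(mask, r_target=0.25)
```

In training, the mask would be the straight-through \(H\), so that this loss can push gradients back into the feature discriminator.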

A large denoising ratio implies a strong denoising capability. However, an excessive denoising ratio may eliminate some useful information. Since residual connection can preserve the original signal to stabilize the noise reduction process, the DHT is finally implemented as:

$$\hat{W} = {\text{DHT}}(W) = W \odot H + W,$$

(23)

where the extra identity mapping empirically stabilizes the denoising process and reduces the sensitivity of the hyperparameter setting.
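A small numerical check makes the effect of the identity mapping in Eq. (23) concrete; the values are made up for illustration.

```python
def dht_residual(coeffs, mask):
    """W * H + W: kept coefficients are doubled, removed ones pass through unchanged."""
    return [w * h + w for w, h in zip(coeffs, mask)]

out = dht_residual([0.5, -0.25, 1.0], [1.0, 0.0, 1.0])
```

With the residual connection, coefficients judged as noise are no longer zeroed but are attenuated relative to the kept ones by a factor of two, which is consistent with the softer behaviour described above.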

### Index-Based Soft Filtering

As a DL-based implementation of index-based filtering, ISF attempts to address the problems of filter optimization and index designing. Firstly, an index-based loss is constructed for filter optimization. Then the soft filtering selection module selects the optimal filter from those optimized filters based on an adaptive index. The structure of ISF is shown in Figure 8.

Firstly, we design the loss for filter optimization. According to Ref. [27], SK can help in designing more sophisticated filters. Thus, a simple index-based loss can be designed based on Eq. (6):

$$L_{{{\text{SK}}}} = - \frac{P(e)L}{C}\sum\limits_{i = 1}^{C} {\frac{{\sum\limits_{l = 1}^{L} {\tilde{W}_{i} } (l)^{4} }}{{(\sum\limits_{l = 1}^{L} {\tilde{W}_{i} } (l)^{2} )^{2} }}} ,$$

(24)

where \(\tilde{W}\) is the envelope of \(\hat{W}\); \(P( \cdot )\) is a cosine function that decays from 1 to 0 according to the current epoch *e*.
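The loss in Eq. (24) can be sketched as below. The envelopes are taken as given inputs, and the quarter-period cosine schedule \(P(e) = \cos (\pi e/2e_{\max })\) is only one possible choice that decays from 1 to 0; the source does not state the exact form, so `cosine_decay` and `max_epoch` are assumptions.

```python
import math

def cosine_decay(epoch, max_epoch):
    """Assumed schedule: 1 at epoch 0, decaying to 0 at max_epoch."""
    return math.cos(0.5 * math.pi * epoch / max_epoch)

def sk_loss(envelopes, epoch, max_epoch):
    """Negative mean kurtosis of the channel envelopes, weighted by P(e).

    Envelopes must not be identically zero, otherwise the ratio is undefined.
    """
    c = len(envelopes)
    length = len(envelopes[0])
    total = 0.0
    for env in envelopes:
        fourth = sum(v ** 4 for v in env)
        second = sum(v ** 2 for v in env)
        total += fourth / (second ** 2)
    return -cosine_decay(epoch, max_epoch) * length / c * total

# A flat envelope (kurtosis 1) and an impulsive envelope (kurtosis 4).
envelopes = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 2.0]]
early = sk_loss(envelopes, epoch=0, max_epoch=100)
late = sk_loss(envelopes, epoch=100, max_epoch=100)
```

Minimizing this loss rewards impulsive envelopes early in training, and its weight vanishes towards the end so that the diagnosis loss dominates.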

The index-based loss with the coefficient \(P(e)\) enables fast filter optimization via the prior SP knowledge in the early stages of training. Combined with a task-based diagnosis loss, the data-driven mechanism finetunes the filter design in the later stages. After optimization, there are *C* wavelet filters corresponding to outputs in *C* channels.

The next step is to select the optimal filter from the optimized *C* filters. Traditional index-based filtering often employs only the optimal filter and discards the others, which is defined as hard filter selection in this paper. Each filter has a corresponding output. Thus, filter selection is equivalent to selecting the corresponding output:

$$W_{h} = \hat{W} \odot \omega ,$$

(25)

where \(\omega \in {\mathbb{R}}^{C}\); hard selection \(\omega_{i}\) for each channel satisfies \(\omega_{i} \in \{ 0,1\}\) and \(\sum\limits_{i = 1}^{C} {\omega_{i} } = 1\).

However, deep learning performs well because it can compose different features of one layer [32]. Instead of hard filter selection, we propose soft filter selection to keep all features and implicitly select the optimal filter. In terms of the constraint on \(\omega_{i}\), soft filter selection is a generalization of Eq. (25). Soft selection \(\hat{\omega }_{i} \in (0,1)\) guarantees that all channels can be combined in the next layer. Furthermore, the condition \(\sum\limits_{i = 1}^{C} {\hat{\omega }_{i} } = 1\) is removed for a more flexible channel selection. In soft selection, a high value of \(\hat{\omega }_{i}\) should correspond to a channel containing more fault features. This is exactly the role of the index in the traditional index-based filtering method. The problem then becomes one of designing an appropriate index.

As energy is widely used for constructing a frequency band selection index in wavelet transform, we construct the new index based on the energy. Since the *N* types of wavelet bases in FWConv lead to *N* energy characteristics, the channels would ideally be divided into *N* groups, with relative indices calculated within each group. However, FWConv fuses important wavelet bases dynamically, so grouped calculation is complicated and resource-consuming. A simpler idea is to calculate a local relative index using convolution. Based on the channel energy, the index (soft selection) can be formulated as:

$$\hat{\omega } = {\text{sigmoid}}({\text{conv}}(E)),$$

(26)

where \({\text{sigmoid}}( \cdot )\) represents the activation function that scales output value into \((0,1)\); \(E = [E_{1} , \cdots ,E_{C} ] \in {\mathbb{R}}^{C}\) represents the channel energy; \(\hat{\omega } \in {\mathbb{R}}^{C}\) is the index for each channel.

Finally, the ISF is defined as follows:

$$O = {\text{ISF}}(\hat{W}) = \hat{W} \odot \hat{\omega } + \hat{W},$$

(27)

where \(O \in {\mathbb{R}}^{C \times L}\) are the output features; the extra identity mapping stabilizes the training of the previous part (FWConv and DHT).
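Eqs. (26) and (27) can be sketched as follows. The channel energy is assumed here to be the sum of squared coefficients, and the three-tap kernel over the channel axis with zero padding is a hypothetical stand-in for the learned convolution.

```python
import math

def channel_energy(features):
    """Assumed definition: E_c is the sum of squared coefficients in channel c."""
    return [sum(v * v for v in ch) for ch in features]

def soft_selection(energy, kernel, bias=0.0):
    """sigmoid(conv(E)) along the channel axis, zero padded to keep C outputs."""
    half = len(kernel) // 2
    omega = []
    for c in range(len(energy)):
        acc = bias
        for j, k in enumerate(kernel):
            idx = c + j - half
            if 0 <= idx < len(energy):
                acc += k * energy[idx]
        omega.append(1.0 / (1.0 + math.exp(-acc)))
    return omega

def isf(features, kernel):
    """O = W_hat * omega_hat + W_hat, applied channel by channel."""
    omega = soft_selection(channel_energy(features), kernel)
    out = [[v * w + v for v in ch] for ch, w in zip(features, omega)]
    return out, omega

features = [[1.0, -1.0], [2.0, 0.0], [0.0, 0.0]]      # C = 3 channels, L = 2
out, omega = isf(features, kernel=[-0.5, 1.0, -0.5])
```

The kernel compares each channel's energy with that of its neighbours, so the middle channel, which has the highest energy locally, obtains the largest weight while no channel is discarded.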

### End-to-End Denoising Fault-Aware Wavelet Network Architecture

As shown in Figure 5, DFAWNet is composed of FWConv, DHT, ISF, and a general CNN classifier. We refer to the combination of FWConv, DHT, and ISF as the denoising fault-aware wavelet (DFAW) module. In this paper, the CNN classifier is a relatively shallow 1-D structure modified from Ref. [33].

Firstly, with the noisy raw vibration signal *x* as the input, the DFAW module extracts more discriminative and robust features based on wavelet denoising and index-based filtering:

$$O = {\text{DFAW}}_{\nu } (x),$$

(28)

where \(\nu\) denotes learnable parameters of the DFAW module. Then the features are fed into the rest of the DFAWNet, a model \(g_{\theta } ( \cdot )\) parameterized by \(\theta\), to predict the health state \(\hat{y}\):

$$\hat{y} = g_{\theta } (O),$$

(29)

where \(g_{\theta } ( \cdot )\) represents the classifier consisting of 1-D convolutional, BN, ReLU (a nonlinear activation function), max pooling, and FC layers. The detailed structure is shown in Figure 5.

For a diagnosis task with \(N_{{{\text{class}}}}\) categories, the cross entropy loss is:

$$L_{{{\text{cls}}}} = - \sum\limits_{i = 1}^{{N_{{{\text{class}}}} }} {y_{i} } {\text{log}}(\hat{y}_{i} ),$$

(30)

where \(y_{i}\) is the label of the *i*th class.

Considering that there are two extra loss functions from DHT and ISF, the total loss for the overall DFAWNet is:

$$L = L_{{{\text{cls}}}} + \alpha L_{{{\text{DR}}}} + \beta L_{{{\text{SK}}}} ,$$

(31)

where *α*, *β* are trade-off parameters, which can be determined by grid search and other hyper-parameter search methods [34].
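The loss combination in Eqs. (30) and (31) is sketched below for a single sample. The auxiliary loss values and the trade-off parameters are placeholder numbers chosen for illustration, and `eps` is an added safeguard against taking the logarithm of zero.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L_cls = -sum_i y_i * log(y_hat_i) for a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_pred))

def total_loss(l_cls, l_dr, l_sk, alpha, beta):
    """L = L_cls + alpha * L_DR + beta * L_SK."""
    return l_cls + alpha * l_dr + beta * l_sk

l_cls = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
loss = total_loss(l_cls, l_dr=0.0625, l_sk=-2.5, alpha=1.0, beta=0.1)
```

Since \(L_{{{\text{SK}}}}\) is negative by construction, the total loss can fall below the classification loss; only its minimizer matters for training.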

The DFAWNet parameters are estimated end-to-end by solving the following supervised classification problem:

$$\nu^{*} ,\theta^{*} = \mathop {\arg \min }\limits_{\nu ,\theta } L(\hat{y},y).$$

(32)

As shown in Figure 5, the fault diagnosis framework of DFAWNet consists of three steps: (1) Segment the acquired signals into fixed-length samples and divide them into a training set and a test set. (2) Train the DFAWNet with the training set. (3) With the trained network, predict the health state of the samples in the test set. As the proposed method uses only raw vibration signals as inputs, this is an end-to-end fault diagnosis framework.