Multi-Scale Convolutional Gated Recurrent Unit Networks for Tool Wear Prediction in Smart Manufacturing
Chinese Journal of Mechanical Engineering volume 34, Article number: 53 (2021)
Abstract
As an integrated application of modern information technologies and artificial intelligence, Prognostic and Health Management (PHM) is important for machine health monitoring. Prediction of tool wear is one of the representative applications of PHM technology in modern manufacturing systems and industry. In this paper, a multi-scale Convolutional Gated Recurrent Unit network (MCGRU) is proposed to process raw sensory data for tool wear prediction. At the bottom of MCGRU, six parallel and independent branches with different kernel sizes form a multi-scale convolutional neural network, which improves adaptability to features at different time scales. The features extracted from raw data at these scales are then fed into a Deep Gated Recurrent Unit network to capture long-term dependencies and learn significant representations. At the top of MCGRU, a fully connected layer and a regression layer are built for cutting tool wear prediction. Two case studies are performed to verify the capability and effectiveness of the proposed MCGRU network, and the results show that MCGRU outperforms several state-of-the-art baseline models.
Introduction
The development of Prognostic and Health Management (PHM) has motivated research in machine health monitoring to detect faults and predict a machine’s future condition [1,2,3,4]. In modern manufacturing systems, a worn tool is harmful to the metal cutting process and often causes additional costs [5]. Cutting tools gradually become blunt during the manufacturing process, as shown in Figure 1, because of factors such as abrasion, deformation and attrition. As a result, product quality degrades. It is therefore crucial to monitor and predict cutting tool wear online so as to prevent this degradation [6].
Many methods, direct or indirect, online or offline, have been studied for monitoring the working conditions of cutting tools and predicting tool wear. Traditionally, cutting tests are performed under different working conditions, and the acquired data are analyzed with optimization techniques including the response surface methodology (RSM) and the design of experiments (DOE). This approach is time-consuming and inefficient because a large number of tests is required [7]. The finite element method (FEM) [8,9,10] has also been used in different cutting tasks [11, 12] and to predict cutting tool wear. Over the last two decades, methods based on deep learning and neural networks have started to be used for the estimation and prediction of cutting tool wear. Ko et al. [13] designed an autoregressive model followed by a highly parallel neural network to monitor the cutting state. Özel et al. [14] used neural networks to predict cutting tool wear and surface roughness. Ghosh et al. [6] designed a neural-network-based sensor fusion model that extracts and fuses features from various signals to estimate the average flank wear of the cutting tool. Sharma et al. [15] developed a method for tool wear estimation using an Adaptive Neuro-Fuzzy Inference System. Venkata et al. [16] fed the cutting speed, the nose radius and the volume of removed material to a multilayer perceptron model to predict vibration amplitude, surface roughness and tool wear. Zhao et al. [17] designed a convolutional bidirectional LSTM network to monitor machine health and to predict the tool wear depth. In general, researchers divide these methods into two major categories: physics-based methods and data-driven methods. In tool wear prediction tasks, physics-based methods based on grey models and particle filters [18] have proven to be effective.
However, these methods usually require accurate and high-quality domain knowledge, which is often unavailable under complex and noisy working conditions. Moreover, most of them cannot be updated with online data. Data-driven methods are now more attractive because they are able to address these issues. Deep learning theories and the large amounts of data collected by advanced sensors have promoted the development of data-driven online methods. Data-driven models usually involve two phases [19]: the first is to train models with collected data, and the second is to apply the trained models to online data to monitor conditions or make predictions. The key to both phases is that deep learning enables the model to better extract features and derive representations of machine conditions hidden in the data, and therefore to make better predictions on online data. In this paper, we study data-driven methods based on deep learning to predict cutting tool wear.
Data-driven methods take single or multiple sensory data as input and feed them into models that are trained to extract features and learn representations. Online data are then fed into the trained models to make predictions. Data and models are the two core parts of data-driven methods. Figure 2 shows the basic framework of data-driven methods. The raw time series data collected by sensors are sequential, and their sequential characteristics are difficult to discover for earlier work that focused on extracting multi-domain features. Those models, which try to extract statistical, frequency and time-frequency features, require intensive expert knowledge and feature engineering. Some models, such as Markov models and Kalman filters [20,21,22], are capable of handling sequential data, but they are not good at capturing long-range dependencies. It is important to capture connections and information across time because, in real working conditions, the features are often submerged by heavy background noise, which causes these models to fail. Developments in neural networks and deep learning have offered solutions to these issues, and one of them is the Recurrent Neural Network (RNN). Traditional neural networks treat inputs and outputs independently, which is not reasonable in some sequential tasks. RNN was proposed to make use of information in arbitrarily long sequences and to retain information computed at earlier steps. However, the problem of exploding and vanishing gradients in the traditional RNN weakens its power. Some improved variants of RNN have been designed to solve this problem, and one of them is the Long Short-Term Memory network (LSTM). LSTM is good at solving problems that need information about previous events [23], which means that it is better at handling sequential data of various lengths and capturing long-term dependencies.
LSTM needs sufficient data to train, while in real working conditions sufficient labeled data may not be available. The Gated Recurrent Unit (GRU) network performs better in such situations. Proposed in 2014, GRU [24] is a more efficient variant of LSTM that shares many similar properties. With performance comparable to LSTM on sequence modeling, GRU has fewer parameters and is easier to train. Here, we introduce GRU networks as one part of our network architecture. As a type of neural network, GRU is able to extract features and learn representations without expert or domain knowledge, but it may not be robust to the noise in raw sensory data. Compared with GRU, the Convolutional Neural Network (CNN) is more robust when the data contain noise. The convolutional operations in the CNN extract abstract features by convolving learnable filters with the sequential data. For this reason, in Ref. [17], Zhao et al. adopted a one-layer CNN as a local feature extractor. However, as the information hidden in the sequential data is complicated and diverse, a local feature extractor with only one kernel size cannot extract all of the useful information. To address this issue, filters of different sizes are adopted to form a multi-scale convolutional layer that extracts different significant features. Here, we use this multi-scale convolutional network to extract hidden but important features, which are then concatenated into a single feature map.
In this paper, we propose a model that combines a multi-scale CNN and GRU, named the Multi-scale Convolutional Gated Recurrent Unit Network (MCGRU), to predict cutting tool wear. In this model, the multi-scale CNN consists of six parallel branches that are independent of each other. These branches extract local features as well as high-level abstract ones. The resulting feature maps are then merged into a single feature map. Temporal information is encoded and representations are learned in a two-layer GRU network built on top of the merge layer. We experiment on an open-source dataset collected from the cutters of a high-speed Computer Numerical Control (CNC) milling machine, containing acoustic emission data, accelerometer data and dynamometer data [25]. Additionally, another CNC tool wear experiment is carried out, in which current data, vibration data and tool wear depth are sampled. Based on these sequential data and the corresponding tool wear depth, we compare the predictive ability of our model with that of several state-of-the-art models.
This paper is organized as follows. In Section 2, we review some related work about CNN and GRU. Based on classic CNN and GRU, the MCGRU is designed and its details are presented in Section 3. Two case studies on the prediction of tool wear are conducted in Section 4. More details about the model and future steps are discussed in Section 5. Finally, conclusions are presented in Section 6.
Review of Related Work
Convolutional Neural Network
Convolutional Neural Network has proven very powerful in many recognition and classification tasks [26,27,28]. It has also shown the ability to handle sequential data in natural language processing tasks [29,30,31]. In the convolutional layers, filters slide over the sequential data to extract features, and the pooling layers keep the most salient ones. Additionally, training can be sped up and the model’s performance improved by adding batch normalization layers [32]. The capability of CNN can be further improved by stacking the above layers to build a “deep” CNN. The width of a CNN can also influence its performance. In the inception module [33], parallel branches consisting of convolutional and pooling layers with different kernels are designed. This architecture allows the model to recover both local features via kernels of smaller sizes and highly abstract features via kernels of larger sizes.
In our MCGRU network, six parallel CNN branches are designed to process the input sequential data before it is fed to the GRU units. Kernels of different sizes are adopted in different branches to extract local and abstract features at the same time. The model itself learns which features are significant.
RNN, LSTM and GRU
Recurrent Neural Network (RNN) was proposed mainly to handle long-term dependencies when processing sequential data, for example in natural language processing tasks. The hidden states in RNN use the outputs of the previous states as the inputs of the next states, which means that the sequential information is preserved. As weights are shared across time, RNN is able to process sequential input of any length. However, the problem of exploding and vanishing gradients emerged as the major obstacle to the performance of the traditional RNN. To avoid this problem, the Long Short-Term Memory network was proposed by Hochreiter and Schmidhuber [34] in 1997 and was improved by Gers et al. [35] in 2000. It is able to prevent backpropagated error from vanishing [36] and to memorize a state for different time periods with the help of the input gate, forget gate and output gate, which manage the flow of information in the network. As another variant of RNN, the Gated Recurrent Unit (GRU) was also introduced to solve the vanishing gradient problem. With only a reset gate and an update gate, GRU has performance comparable to LSTM. However, GRU has fewer parameters because it lacks an output gate and has a less complex structure, which means that it is more efficient and can be used in situations where sufficient data are not available. Because of its effectiveness, GRU has been more and more widely used to learn significant representations in time series data.
In the proposed MCGRU network, a two-layer GRU is adopted to process the output of the multi-scale convolutional layers. Effective representations are learned here and used in the prediction of tool wear.
Neural Networks and Tool Wear Prediction
Neural networks have been successfully used in machine condition monitoring tasks such as tool wear prediction because of their excellent feature extraction and representation learning capabilities [37,38,39,40,41,42]. Artificial Neural Networks (ANN) were adopted first and proved to perform well in machine condition monitoring tasks. However, as interference increases and working conditions become more complex, ANN is no longer good at solving these problems. As a result, Convolutional Neural Networks (CNN) were introduced in this field. Their depth and learning ability enable them to learn what they need in these tasks. However, most of these models make use of features manually extracted and designed from raw data, while ignoring the representations and the relations between different time steps hidden in the sequential data. For tool wear prediction, the information about the condition of cutting tools remains to be discovered. Our proposed MCGRU combines a multi-scale CNN with GRU to learn features and representations without human-designed features, so that the information behind the raw sensory data can be explored as much as possible and the prediction accuracy can be improved.
Model
Before presenting the MCGRU network, some notations used in this paper are clarified here. The task is to design a model for tool wear prediction based on multiple in-process sensory data. A labeled time series dataset is given as \(D = \left\{ {\left( {{\varvec{x}}_{i} ,y_{i} } \right)} \right\}_{i = 1}^{N}\), which contains \(N\) tool conditions and their corresponding labels \(y_{i}\), i.e., each tool condition corresponds to a tool wear that is measured and recorded as \(y_{i}\). Assume that in each tool condition \({\varvec{x}}_{i}\), \(q\) channels of sensory data are sampled and the length of each channel of sensory data is \(L\). For each channel, the whole sequence is divided into \(l\) sections, i.e., \(l\) time steps. The ith cutting tool condition is
\[{\varvec{x}}_{i} = \left[ {{\varvec{x}}_{i}^{1} ,{\varvec{x}}_{i}^{2} , \cdots ,{\varvec{x}}_{i}^{l} } \right]^{{\text{T}}}\]
where vector \({\varvec{x}}_{i}^{t} \in R^{d}\) is the multiple channels of sensory data sampled at time step \(t\), i.e., the tth section, \(d = q \times \left( {L/l} \right)\) is the dimensionality of \({\varvec{x}}_{i}^{t}\), and \(\left( \cdot \right)^{{\text{T}}}\) represents the transpose. The goal is to predict the tool wear \(\hat{y}_{i}\) from \({\varvec{x}}_{i}\). In our proposed Multi-scale Convolutional Gated Recurrent Unit, the multi-scale convolutional network functions as a feature extractor and the Gated Recurrent Unit functions as a temporal encoder. Six parallel and independent CNN branches with different kernels are designed to process the raw sensory data. The extracted local and abstract features are fed into a merge layer, on top of which a two-layer GRU is designed to learn significant representations. Finally, the prediction is performed by a fully connected layer and a regression layer. The MCGRU network is shown in Figure 3.
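The sectioning described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `to_time_steps` is hypothetical, and the order in which the channels are joined inside each time-step vector is assumed, since the text only fixes the dimensionality \(d = q \times (L/l)\).

```python
def to_time_steps(channels, n_steps):
    """Split q equal-length channels into n_steps sections and join them per step.

    channels: list of q lists, each of length L (L must be divisible by n_steps).
    Returns n_steps vectors, each of dimensionality d = q * (L / n_steps).
    Joining channel after channel inside a vector is an assumption.
    """
    length = len(channels[0])
    if any(len(c) != length for c in channels):
        raise ValueError("all channels must have the same length")
    if length % n_steps != 0:
        raise ValueError("L must be divisible by the number of time steps")
    section = length // n_steps
    steps = []
    for t in range(n_steps):
        vec = []
        for c in channels:
            vec.extend(c[t * section:(t + 1) * section])
        steps.append(vec)
    return steps


# Toy example: q = 2 channels, L = 6 samples, l = 3 time steps, so d = 2 * (6 / 3) = 4
x_i = to_time_steps([[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]], n_steps=3)
print(x_i)
```

With six channels of 20480 points each and 40 time steps, the same function yields 40 vectors of dimensionality 3072, which matches the input shape used in the first case study.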
Multi-Scale CNN
In each branch of the multi-scale CNN, a five-layer CNN is adopted, which consists of two convolutional layers, two batch normalization layers and one max-pooling layer. In the first convolutional layer of each branch, the kernel size equals 1. This layer not only helps extract more significant features but also reduces the number of parameters in the model. For example, in one-dimensional convolution, a CNN containing one convolutional layer with a kernel size of 7 has more parameters than one containing two convolutional layers with kernel sizes of 1 and 7, respectively, provided that the first layer outputs far fewer channels than it receives. A batch normalization layer is placed on top of the first convolutional layer. Batch normalization in hidden layers helps to accelerate training and improve prediction accuracy. The output of the batch normalization layer is then fed into the second convolutional layer. The second convolutional layers, which use different kernels in different branches, extract features at multiple time scales hidden in the sequential data. Small kernels extract local features, while large kernels extract abstract features. Based on these multi-time-scale features, the model itself learns which ones to attend to. The last two layers are another batch normalization layer and a max-pooling layer. The max-pooling layer compresses the previous feature maps to learn more significant features. Then, in a merge layer, the feature maps from all branches are concatenated into a single feature map, so that all of the features extracted by the branches are preserved. The organization of these structures is shown in Figure 4. Details are presented in the following subsections. Here we take the operations in one branch as an example; the operations in the other branches are the same.
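The parameter saving from the kernel-size-1 layer can be checked with simple arithmetic. The sketch below uses the figures reported later for the first case study (input dimensionality 3072, 32 and 64 kernels); the helper `conv1d_params` is a hypothetical name, and counting one bias per filter is an assumption.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of trainable parameters of a 1-D convolutional layer."""
    weights = kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


d = 3072  # input dimensionality per time step

# One convolutional layer: kernel size 7, 64 filters, applied directly to the input
single = conv1d_params(d, 64, 7)

# Two convolutional layers: kernel size 1 with 32 filters, then kernel size 7 with 64 filters
stacked = conv1d_params(d, 32, 1) + conv1d_params(32, 64, 7)

print(single, stacked)
```

Under these assumptions the single layer has 1376320 parameters and the stacked pair has 112736, roughly a twelvefold reduction, because the kernel-size-1 layer shrinks the channel dimension before the wider kernel is applied.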
Convolution
In the convolutional layer of each branch in the multi-scale CNN, the 1-dimensional convolution operation is achieved by sliding a filter (kernel) \({\varvec{v}} \in R^{h \times d}\) over \({\varvec{x}}_{i} \in R^{l \times d}\) to convolve with the subsection \({\varvec{x}}_{i}^{t:t + h - 1} \in R^{h \times d}\) from time step \(t\) to time step \(t + h - 1\). The \({\varvec{x}}_{i}^{t:t + h - 1}\) is given as follows:
\[{\varvec{x}}_{i}^{t:t + h - 1} = \left[ {{\varvec{x}}_{i}^{t} ,{\varvec{x}}_{i}^{t + 1} , \cdots ,{\varvec{x}}_{i}^{t + h - 1} } \right]^{{\text{T}}}\]
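where \(h\) is the kernel size. Additionally, a bias term \(b\) is added to get the complete convolution operation, which can be given as:
\[c_{j}^{t} = \sum {\left( {{\varvec{v}}_{j} \circ {\varvec{x}}_{i}^{t:t + h - 1} } \right)} + b\]
where \(j\) indexes the jth filter \({\varvec{v}}_{j}\), \(\circ\) represents the Hadamard product, and the sum runs over all \(h \times d\) entries of the product.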
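As the filter slides over \({\varvec{x}}_{i}\) and the convolution operation is done, we get a vector \({\varvec{c}}_{j}\), which is given by:
\[{\varvec{c}}_{j} = \left[ {c_{j}^{1} ,c_{j}^{2} , \cdots ,c_{j}^{{\left( {l - h + 2p} \right)/s + 1}} } \right]\]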
where \(p\) is the amount of zero padding, \(s\) is the sliding stride of the kernel, and \(\left( {l - h + 2p} \right)/s + 1\) is the length of the output after the convolution operation. When \(s\) and \(p\) are fixed, the length of the output depends on the kernel size \(h\), so different kernel sizes result in different output sizes. This point matters because different kernel sizes are chosen in each branch of our proposed multi-scale CNN.
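The output-length formula can be evaluated directly for the kernel sizes used in the six branches. The sketch below is illustrative: the function names are hypothetical, stride 1 is assumed, and the padding rule \(p = (h - 1)/2\) on each side is the usual choice for odd kernels rather than a value stated in the text.

```python
def conv_output_length(l, h, p=0, s=1):
    """Length of a 1-D convolution output: (l - h + 2p) / s + 1 (floored)."""
    return (l - h + 2 * p) // s + 1


def same_padding(h):
    """Zeros added on each side so that stride-1 output length equals input length.

    Assumes an odd kernel size, which holds for the sizes 1, 3, ..., 11.
    """
    if h % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd kernel size")
    return (h - 1) // 2


l = 40  # number of time steps
for h in (1, 3, 5, 7, 9, 11):
    unpadded = conv_output_length(l, h)
    padded = conv_output_length(l, h, p=same_padding(h))
    print(h, unpadded, padded)
```

Without padding the six branches would produce outputs of different lengths, which could not be concatenated step by step; with the padding above every branch keeps the input length.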
In particular, the outputs of different branches must have the same size so that they can be concatenated. Therefore, zero padding is adopted in the convolution operation. Different amounts of zero padding are used in different branches, so that the output has the same size as the input no matter which kernel size is chosen. As a result, the feature map can be given by:
Batch Normalization
Instead of just normalizing the input of the CNN, we adopt batch normalization [32] layers to normalize the inputs within the network by using the variance and the mean of the values in the current mini-batch. In the batch normalization layer, the operation can be represented as follows:
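For reference, the batch normalization transform as defined in Ref. [32] is written out below. The symbols are those commonly used for that method and are assumptions with respect to this paper's own notation: \(u_{k}\) denotes the \(m\) values of one feature in the current mini-batch, \(\epsilon\) is a small constant for numerical stability, and \(\gamma\) and \(\beta\) are a learnable scale and shift.

```latex
\mu_{B} = \frac{1}{m}\sum_{k=1}^{m} u_{k}, \qquad
\sigma_{B}^{2} = \frac{1}{m}\sum_{k=1}^{m} \left( u_{k} - \mu_{B} \right)^{2}, \qquad
\hat{u}_{k} = \frac{u_{k} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}, \qquad
\mathrm{BN}(u_{k}) = \gamma \hat{u}_{k} + \beta
```

Each value is mapped to exactly one output value, which is why the transform leaves the size of the feature map unchanged.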
As the batch normalization layer does not change the size of the feature map, we therefore get:
Activation Function
After the convolution and batch normalization operations, an activation function is added to introduce nonlinearity, so that complex nonlinear relationships between inputs and outputs can be learned. As a result, the convolution, batch normalization and activation operations can be together given by:
where \(f\left( \cdot \right)\) is an activation function. Here, we choose Rectified Linear Units (ReLU) [43] as the activation function in our proposed model.
The above three operations result in a feature map, which can be given by
Max-Pooling
By introducing pooling layers in the network, the size of the previous feature maps can be further reduced and more significant and abstract features can be extracted. Here, we adopt the max-pooling operation. In one-dimensional pooling with pooling length \(k\), the max-pooling operation slides a kernel over the feature map and takes the maximum over \(k\) consecutive values. Here we let the sliding stride equal \(k\), and as a result, the output of the max-pooling operation can be given by:
where \(m_{j}^{i} = \max \left( {a_{j}^{{\left( {i - 1} \right)s}} ,a_{j}^{{\left( {i - 1} \right)s + 1}} , \cdots ,a_{j}^{{\left( {i - 1} \right)s + k - 1}} } \right)\).
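The pooling rule above, with the stride equal to the pooling length, amounts to taking the maximum of consecutive non-overlapping windows. A minimal sketch follows; the function name is hypothetical, and dropping an incomplete trailing window is an assumption, since the text does not say how a remainder is handled.

```python
def max_pool_1d(a, k):
    """Non-overlapping 1-D max pooling (stride equals pooling length k).

    An incomplete window at the end of the sequence is dropped (assumption).
    """
    return [max(a[i:i + k]) for i in range(0, len(a) - k + 1, k)]


pooled = max_pool_1d([1, 5, 2, 8, 3, 3, 9, 0], k=2)
print(pooled)  # [5, 8, 3, 9]
```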
Concatenation
In the concatenating layer, the feature maps from different branches will be concatenated into a single feature map to merge all the local and abstract features. Assuming that in the ith branch, the jth output of this branch is given by:
Then, the output of the concatenation layer is given as:
where \(N\) is the serial number of branch, and \({\mathbf{M}}_{i}\) can be represented as
where \(K_{i}\) is the number of outputs of the ith branch.
To summarize, in the multi-scale CNN, the shape of the input sequence is \(n \times l \times d\). Here, \(n\) represents the total number of working conditions. As described above, before the concatenating layer, the output shape of the ith branch is \(n \times \left( {\left( {l - k + 2p} \right)/s + 1} \right) \times K_{i}\). In different branches, kernel sizes from small to large help to extract local and abstract features. Compared with the original raw sequence, these multi-time-scale features better represent the properties of the working conditions. After these features are merged in the concatenating layer, the following GRU learns significant representations of the working conditions. The framework of the multi-scale CNN is illustrated in Figure 5.
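The merge step can be illustrated with nested lists standing in for feature maps. The sketch assumes that concatenation is done along the feature axis at each time step, which is consistent with the shapes reported for the first case study (six branch outputs of 40×32 merged into 40×192); the function name and the dummy values are hypothetical.

```python
def concatenate_branches(branch_maps):
    """Concatenate feature maps of shape (time steps, features) along the feature axis."""
    n_steps = len(branch_maps[0])
    if any(len(m) != n_steps for m in branch_maps):
        raise ValueError("all branches must have the same number of time steps")
    merged = []
    for t in range(n_steps):
        row = []
        for m in branch_maps:
            row.extend(m[t])
        merged.append(row)
    return merged


# Six dummy branch outputs, each with 40 time steps and 32 features
branches = [[[float(b)] * 32 for _ in range(40)] for b in range(6)]
merged = concatenate_branches(branches)
print(len(merged), len(merged[0]))  # 40 192
```

The length check makes explicit why every branch has to keep the same number of time steps before merging.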
Deep GRU
Under real industrial conditions, clean sample data are difficult to obtain. Compared with LSTM, GRU is better at dealing with situations where sufficient data are not available. Here, on top of the multi-scale CNN, a two-layer GRU network is designed to extract vital representations from the multi-time-scale features. The deep GRU is presented as follows.
Gated Recurrent Unit
In GRU, the inputs are the hidden state \({\varvec{h}}_{t - 1}\) at the previous time step \(t - 1\) and the data \({\varvec{x}}_{t}\) at the current time step \(t\), and the output is the hidden state \({\varvec{h}}_{t}\). The output \({\varvec{h}}_{t}\) depends on the previous hidden state \({\varvec{h}}_{t - 1}\), the update gate \({\varvec{z}}_{t}\), the reset gate \({\varvec{r}}_{t}\) and the candidate hidden state \(\tilde{\varvec{h}}_{t}\). The reset gate \({\varvec{r}}_{t}\) enables the unit to drop any information in the hidden state that is less meaningful or irrelevant, so as to focus on the information that is more important. The update gate \({\varvec{z}}_{t}\) determines the information from the previous and the candidate hidden state that can be passed to the current hidden state [23]. The relating equations can be given by:
where \(\sigma\) is the sigmoid activation function, \({\varvec{W}}_{z}\), \({\varvec{U}}_{z}\), \({\varvec{W}}_{r}\), \({\varvec{U}}_{r}\), \({\varvec{W}}_{h}\) and \({\varvec{U}}_{h}\) are shared weight matrices that are learned during training, and \({\varvec{b}}_{z}\), \({\varvec{b}}_{r}\) and \({\varvec{b}}_{h}\) are learnable biases. The basic structure of a one-layer GRU is shown in Figure 6.
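A single GRU step with the gates and parameters named above can be sketched in plain Python. The sketch assumes one common convention: the gates are \(\sigma(W x_{t} + U h_{t-1} + b)\), the candidate state is \(\tanh(W_{h} x_{t} + U_{h}(r_{t} \circ h_{t-1}) + b_{h})\), and the new state is \((1 - z_{t}) \circ h_{t-1} + z_{t} \circ \tilde{h}_{t}\). Other sources swap the roles of \(z_{t}\) and \(1 - z_{t}\), and the text here does not say which form is used. The helper names and the dictionary layout of the parameters are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, p):
    """One GRU time step; p maps parameter names (W_z, U_z, b_z, ...) to values."""
    # Update gate z_t and reset gate r_t
    z = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev), p["b_z"])]
    r = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev), p["b_r"])]
    # Candidate hidden state computed from the reset-gated previous state
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b + c)
               for a, b, c in zip(matvec(p["W_h"], x_t), matvec(p["U_h"], gated), p["b_h"])]
    # Interpolate between the previous state and the candidate (assumed convention)
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_tilde)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Sanity check with all-zero parameters: input size 3, hidden size 2
params = {"W_z": zeros(2, 3), "U_z": zeros(2, 2), "b_z": [0.0, 0.0],
          "W_r": zeros(2, 3), "U_r": zeros(2, 2), "b_r": [0.0, 0.0],
          "W_h": zeros(2, 3), "U_h": zeros(2, 2), "b_h": [0.0, 0.0]}
h_t = gru_step([0.2, -0.1, 0.7], [1.0, -2.0], params)
print(h_t)
```

With all parameters set to zero both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one, which gives a quick check that the gating arithmetic is wired correctly.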
Deep GRU
As mentioned above, the capability of a neural network can be improved by “going deeper” [44, 45]. A deep neural network contains more nonlinear operations, so more abstract features and representations can be learned. Inspired by this idea, we stack two GRU layers to get a deep architecture, in which each GRU layer contains a different number of units. In the deep GRU, as shown in Figure 7, the output of each hidden state in one layer propagates through time and is also the input of the hidden state in the next layer. Low-level features are therefore learned and passed to the next layer to learn higher-level representations. By stacking GRU layers, the network is able to learn essential representations at different time scales more effectively.
Fully Connected and Linear Regression Layer
The output representation of the GRU network is flattened as \({\varvec{h}}\) and then fed into a fully connected layer to be prepared for the linear regression layer. The operation of the fully connected layer can be given as follows:
\[{\varvec{o}} = f\left( {{\varvec{Wh}} + {\varvec{b}}} \right)\]
where o is the output of the fully connected layer, W is the transformation matrix, b is the bias, and \(f\left( \cdot \right)\) is the activation function. We use ReLU here as the activation function. Finally, the fully connected layer’s output o is fed into a regression layer and the tool wear of the ith working condition is therefore predicted, which can be given by
Training and Regularization of MCGRU
The Mean Absolute Error (MAE) is adopted as the loss in the training process, which is given by:
\[{\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|}\]
where \(n\) represents the total number of samples.
The optimizer we adopt here is Root Mean Square Propagation (RMSProp) [46]. It is a robust optimizer that uses pseudo-curvature information. RMSProp is well suited to mini-batch learning because each gradient is normalized by the magnitude of recent gradients, which enables it to handle stochastic objectives properly. It is also commonly used for recurrent neural networks such as LSTM and GRU.
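The normalization by recent gradient magnitudes can be made concrete with a scalar sketch of the RMSProp update. The learning rate, decay factor `rho` and `eps` below are illustrative assumptions; the text does not report the values used for MCGRU, and the function name is hypothetical.

```python
import math


def rmsprop_step(theta, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter.

    cache is a running average of squared gradients; dividing by its square
    root normalizes the step by the magnitude of recent gradients.
    """
    cache = rho * cache + (1.0 - rho) * grad * grad
    theta = theta - lr * grad / (math.sqrt(cache) + eps)
    return theta, cache


# Toy objective f(theta) = theta^2 with gradient 2 * theta
theta, cache = 5.0, 0.0
for _ in range(2000):
    grad = 2.0 * theta
    theta, cache = rmsprop_step(theta, grad, cache, lr=0.01)
print(abs(theta) < 0.1)
```

Because the step is divided by the running gradient magnitude, its size stays close to the learning rate whether the raw gradient is large or small, which is the property that makes the method tolerant of noisy mini-batch gradients.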
As mentioned above, GRU rather than LSTM is chosen in our proposed model because in real working conditions, there is usually no sufficient labeled data. When going deep and when there is no sufficient data, the network may be too complex to train and the problem of overfitting may appear. In order to solve this problem, regularization methods should be added within the network. Here, we adopt a Dropout [47] layer after the GRU network, as well as after the fully connected layer. Dropout layer enables the network to ignore those neurons that are randomly selected during the process of forward propagation. Therefore, the network will not rely too much on some local features. In our proposed model, we only use dropout during training process, but not in testing process, and the dropout ratio is set to be 0.3.
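The train-only behavior of dropout described above can be sketched as follows. The sketch uses the common "inverted" form, in which the kept activations are rescaled by 1/(1 - rate) during training so that nothing has to change at test time; that rescaling is an assumption, since the text only states the ratio of 0.3 and that dropout is disabled during testing. The function name is hypothetical.

```python
import random


def dropout(values, rate=0.3, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate` during training.

    Kept values are scaled by 1 / (1 - rate) (assumed convention).
    At test time the input is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0] * 10
train_out = dropout(activations, rate=0.3, training=True, rng=random.Random(0))
test_out = dropout(activations, rate=0.3, training=False)
print(train_out)
print(test_out)
```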
Experiments
Case 1: High Speed CNC Machine Tool Wear Dataset
Descriptions of Datasets
The first experiment was conducted on a high-speed CNC machine running dry milling operations [48]. This dataset is available in the “prognostic data challenge 2010” database [25]. The experimental platform and the details are shown in Figure 8. In this experiment, six cutters were used to cut an identical workpiece, and each cutter made 315 cuts. When training and testing our model, six channels of data, including forces and vibrations, are used. A LEICA MZ12 microscope was used to measure the flank wear of each flute when the experiment was finished. The measured wear values were then taken as the target values.
In this dataset, six cutting tools are used in the experiment, which means six collections of data (\(C_{1}, C_{2}, \cdots, C_{6}\)) can be used. To compare with the results in Ref. [17], we adopt three cutting tools, i.e., the three data collections \(C_{1}\), \(C_{4}\) and \(C_{6}\), as our training and testing sets here. Each data collection contains 315 samples, corresponding to 315 tool wear values. To make good use of this dataset, a three-fold strategy is adopted. Among these three data collections, two are taken as the training set and the other is the testing set. As a result, we get three cases. For example, when \(C_{1}\) is the testing set and \(C_{4}\), \(C_{6}\) are the training set, the case is denoted as \(c_{1}\). The other two cases, \(c_{4}\) and \(c_{6}\), are defined in the same way. The details of these three cases are shown in Table 1.
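The three-fold strategy can be written down as a small helper that holds out one collection at a time. The function name and the string labels for the collections are illustrative.

```python
def three_fold_cases(collections):
    """Build one case per collection: that collection is tested, the rest are trained on."""
    cases = {}
    for held_out in collections:
        train = [c for c in collections if c != held_out]
        cases[held_out.lower()] = {"train": train, "test": held_out}
    return cases


cases = three_fold_cases(["C1", "C4", "C6"])
for name, split in cases.items():
    print(name, split)
```

Each case trains on 2 × 315 = 630 samples and tests on the remaining 315, so every collection serves as unseen test data exactly once.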
As the sampling frequency is very high, the sampled sequence of each channel is split into sections of 512 points, and the first forty sections are used. As a result, each original sequence is transformed into a datum with a length of 40, and at each time step the dimensionality is 3072 (6 channels × 512 points). As described above, the input shape of the network is 630×40×3072 in the training process and 315×40×3072 in the testing process.
Experiment Setup
The models shown in Table 2 are compared with our proposed MCGRU model. Regression models including LR, SVR and MLP cannot process sequential data directly, and hence we first extract the related features. Here, ten features, comprising statistical features, frequency features and time-frequency features, are extracted from the raw data. Details are shown in Table 3. As there are six channels of signals, the dimensionality of the input is 60. LR has no hyperparameters. In SVR, the regularization parameter is set to 0.1 and the kernel is the Radial Basis Function (RBF). As for the MLP, the sizes of the three hidden layers are set to (140, 280, 900) and we choose ReLU as the activation function.
The other compared models are able to process sequential data directly. The input shape is therefore 40×3072. The five-layer CNN has the same structure as one branch of our proposed MCGRU. The kernel sizes in the two convolutional layers are 1 and 7, the numbers of kernels are 32 and 64, and the pooling size is set to 2. The settings of the MCNN (Multi-scale Convolutional Neural Network) are the same as those of our MCGRU. As for the basic recurrent models, including RNN, LSTM and GRU, the number of units is set to 192. For the deep recurrent models, including Deep RNN, Deep LSTM and Deep GRU, the numbers of units in the two layers are set to (180, 240). The CBLSTM, i.e., Convolutional Bi-Directional LSTM, was proposed by Zhao et al. [17]. Here, the same settings as in [17] are adopted for this model. The CGRU (Convolutional GRU) has a five-layer CNN with the same settings as the previous CNN model and a two-layer GRU with (180, 240) units.
In our proposed MCGRU, from branch 1 to branch 6, the kernel sizes of the two convolutional layers are set to (1, 1), (1, 3), (1, 5), (1, 7), (1, 9) and (1, 11), and the numbers of kernels are set to (32, 64). The kernel size of the pooling layer in all the branches is set to 2. Here, as the input shape is 40×3072 and zero padding is adopted, the output shapes of all branches are the same, namely 40×32. These six outputs are then concatenated to get an output shape of 40×192. The numbers of units in the next two GRU layers are set to (180, 240), and the number of output units of the fully connected layer is set to 120. All of the activation functions in our model are Rectified Linear Units (ReLU).
To evaluate the capability of the above models, the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) are adopted. The MAE measures the average magnitude of the errors without considering their direction. As a quadratic scoring rule, the RMSE also measures the average magnitude of the error but gives more weight to large errors. The MAE is given in Eq. (20), and the RMSE is given by:
\[{\text{RMSE}} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} } }\]
where \(\hat{y}_{i}\) represents the predicted tool wear and \(y_{i}\) is the actual tool wear.
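As a minimal sketch, the two metrics can be computed as follows, using the standard definitions (mean of the absolute errors, and square root of the mean of the squared errors). The function names and the four wear values are made up for illustration and are not data from the experiments.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the errors, ignoring their sign."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error; penalises large errors more."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical tool wear values, for illustration only.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 120.0, 134.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

For these values the errors are 2, −2, 0 and 4, so the MAE is 2.0 while the RMSE is √6 ≈ 2.45: the single larger error raises the RMSE more than the MAE.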
These models are trained and tested on a Linux server with two NVIDIA 1080Ti GPUs and a 4.2 GHz Intel i7-7700K CPU.
Case 2: Experiment of the Reliability of CNC Machine Tool
Descriptions of Datasets
The second experiment is a reliability test of a CNC machine tool. It is carried out on a CNC machine, as shown in Figure 9. The cutting tool, shown in Figure 10, is used to machine a 45# steel bar, and the related parameters are listed in Table 4.
As shown in Table 4, three channels of data are sampled: the vibration signal, the AE-RMS signal and the current signal of the spindle motor. The tool wear corresponding to each working condition is also measured and recorded as the label. We ran the experiment with 9 cutting tools and obtained 840 samples in total. All of the cutting tools have the same initial tool wear, and each sample corresponds to one working condition. Unlike the first experimental case, here we randomly chose samples from all of the cutting tools to form the training set and the testing set, so both sets contain samples from all nine cutting tools. The training set includes 735 of the 840 samples, and the remaining 105 are used as the testing set. As in the first case, because the sampling frequency is very high, each signal is divided into sections of 256 points and the first 20 sections are selected. Hence, each data sample is transformed into a sequence of length 20 whose dimensionality at each time step is 768 (3 channels × 256 points). As described above, the input shape of the network is 735 × 20 × 768 in the training process and 105 × 20 × 768 in the testing process.
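The segmentation and the random split described above can be sketched as follows. This is an illustration rather than the authors' preprocessing code: it assumes that the 256 points of the three channels are simply concatenated at each time step, and the raw signals are synthetic random numbers. The names `to_sequence` and `random_split` are hypothetical.

```python
import random

SECTION_LEN = 256   # points per section and channel
NUM_SECTIONS = 20   # only the first 20 sections are kept


def to_sequence(channels):
    """Turn raw channels (equal-length lists) into a 20 x 768 sequence.

    At each time step the matching 256-point section of every channel is
    concatenated (an assumed ordering; the text only gives the final size).
    """
    needed = SECTION_LEN * NUM_SECTIONS
    if any(len(signal) < needed for signal in channels):
        raise ValueError("every channel needs at least %d points" % needed)
    steps = []
    for s in range(NUM_SECTIONS):
        start = s * SECTION_LEN
        step = []
        for signal in channels:
            step.extend(signal[start:start + SECTION_LEN])
        steps.append(step)
    return steps


def random_split(samples, n_train, seed=0):
    """Shuffle the samples and split them into training and testing sets."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    train = [samples[i] for i in order[:n_train]]
    test = [samples[i] for i in order[n_train:]]
    return train, test


# Synthetic stand-in for one sample: vibration, AE-RMS and spindle current.
rng = random.Random(42)
raw_sample = [[rng.random() for _ in range(6000)] for _ in range(3)]
sequence = to_sequence(raw_sample)
print(len(sequence), len(sequence[0]))   # 20 768

# Split 840 sample indices into 735 for training and 105 for testing.
train_ids, test_ids = random_split(list(range(840)), n_train=735)
print(len(train_ids), len(test_ids))     # 735 105
```

Because the split is drawn over all 840 samples rather than per tool, both subsets mix samples from every cutting tool, which matches the description of the second case.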
Experiment Setup
The setup of the second experiment is almost the same as that of the first one. The same compared models and the same model settings are adopted. The two indexes used to evaluate the performance of the models are the MAE and the RMSE. The models are trained and tested on a Linux server with two NVIDIA 1080Ti GPUs and a 4.2 GHz Intel i7-7700K CPU.
Results
In this section, the comparison of the above models based on the MAE and RMSE is presented. Table 5 shows the MAE of each model in the first experimental case, while Table 6 shows the RMSE. The MAE and RMSE of each model in the second experiment are shown in Table 7.
As shown in the tables, the regression models LR, SVR and MLP are able to predict tool wear from the features extracted from raw data, but they are not as good as the convolutional and recurrent models. Linear Regression performs worst in this task because it is linear in nature and cannot make full use of the extracted features. The SVR and the MLP perform better because they introduce nonlinearity, so that the relationships among different features can be better explored. In SVR, the choice of kernel maps the samples to a high-dimensional space, and the RBF kernel chosen here proves effective for this regression task. The MLP, in contrast to SVR, actively learns an efficient mapping from the data, which is also effective.
However, compared to the above regression models, the convolutional and recurrent models can process raw data to learn significant features and representations, which gives them better performance. By choosing different kernels, a CNN is able to extract local or abstract features. In this task, MCNN performs better than CNN because it contains kernels of different sizes and extracts local and abstract features at the same time. The depth and width introduced in these two models also help in predicting tool wear. Long-term dependencies in sequential data cannot be discovered by the convolution operation, but they can be captured by recurrent models. As shown in the tables, in this task of processing time series data, recurrent models perform slightly better than convolutional models. LSTM and GRU are better than the basic RNN because their gates make them more powerful at capturing long-term dependencies. Moreover, the GRU performs better than the LSTM here; since the amount of data in this task is not large, this suggests an advantage of the GRU when data are insufficient. As expected, the three deep recurrent models perform better than their three basic counterparts. The Convolutional Bi-Directional LSTM (CBLSTM) proposed in Ref. [17] combines a CNN as local feature extractor with a deep bi-directional LSTM as temporal encoder, which allows it to extract useful features hidden in the raw sensory data in both the forward and backward directions. The CBLSTM performs better than most of the above convolutional and recurrent models. The CGRU also performs well, as the GRU is able to learn significant representations on the basis of the features extracted by the CNN. In particular, in the second experimental case, the Deep GRU and the CGRU perform even better than the CBLSTM, which shows the strength of the GRU in dealing with small amounts of data. Our proposed model, the MCGRU, performs best among the compared models.
The results reveal that much of the information hidden in raw sensory data cannot be discovered by human-designed features, while the Multiscale CNN is able to filter the noise from the real working environment and exploit as much of this information as possible. The Deep GRU is able to extract the temporal information to find a more accurate relationship between the input and the output, namely the raw sensory data and the predicted tool wear. As the network goes wider, more meaningful features of different time scales can be discovered, and as it goes deeper, abstract and significant representations can be learnt. The combination of the multiscale feature extractor (Multiscale CNN) and the temporal encoder (Deep GRU) is therefore shown to perform well in the task of tool wear prediction.
More specifically, the predicted tool wear, the corresponding actual tool wear, and the error between the two are illustrated in Figures 11, 12, 13 and 14. The first three figures, i.e., the results of the first experimental case, show that the degradation trend of the cutting tool is robustly captured and the error is acceptable. In Figure 14, nine ascending curves can be found in the actual tool wear because, in the second experimental case, the testing set contains data sampled from all nine cutting tools, and each ascending curve corresponds to one cutting tool. The results in this case are also satisfactory. Moreover, each training epoch takes about 1 s, and at test time it takes only 0.8 s to predict the tool wear of about 300 samples (roughly 2.7 ms per sample), which means that our proposed model is efficient enough for real-time prediction.
Discussion
In this section, we discuss the impact of the number of the branches in the Multiscale CNN and the influence of the depth of the GRU. Some insights and motivation for the future steps are also discussed.

1) As we go wider by using multiple branches to extract more features, it is important to point out that this increases the number of model parameters, which makes training more difficult and raises the risk of overfitting. Here, based on dataset \(c_{1}\), we compare six numbers of branches (2, 4, 6, 8, 10, 20) in an MCGRU, and the MAE and RMSE results are given in Table 8. As the number of branches increases, the performance of the model first improves and then remains almost the same; with 10 or 20 branches, the performance gets worse, which means that blindly increasing the number of branches harms the model rather than improving it. We therefore adopt 6 CNN branches in our MCGRU.

2) The depth of the model also affects its performance. We vary the number of GRU layers in the MCGRU over (2, 4, 6, 8) to explore the impact of depth, and the results are shown in Table 9. The performance of these four models is almost the same. A reasonable explanation is that our two experiments do not provide sufficient labeled data, so a shallow GRU is already powerful enough to discover the information in the data. When a large amount of data is available, a GRU with more layers can be tried to further improve the capability of the model.

3) Robustness is an important aspect of model performance. In a real working environment, the quality of the sampled signals may be degraded by noise, so it is important and interesting to build a model that remains robust under a large amount of noise. In addition, different signals are combined directly in our settings; it would be meaningful to design a better way of fusing the data from different sensors.
Conclusions

(1) In this paper, we proposed a Multiscale Convolutional Gated Recurrent Unit network (MCGRU) for the tool wear prediction task. We described the structure of this model in terms of its feature extractor, the Multiscale CNN, and its encoder, the Deep GRU. The Multiscale CNN is able to extract both local and abstract features with kernels of different sizes, and the Deep GRU is capable of capturing long-term dependencies and learning significant representations based on the features extracted by the Multiscale CNN.

(2) Moreover, the GRU performs better when labeled data are insufficient, as is common in real working conditions. Benefiting from these advantages, the MCGRU is able to make accurate and effective tool wear predictions based on raw sensory data, without expert knowledge or feature engineering. Its satisfactory performance is further verified by two experimental cases and by comparisons with other models.
References
[1] T Li, Z Zhao, C Sun, et al. Multireceptive field graph convolutional networks for machine fault diagnosis. IEEE Transactions on Industrial Electronics, 2020, DOI: https://doi.org/10.1109/TIE.2020.3040669.
[2] Z Mo, J Wang, H Zhang, et al. Weighted cyclic harmonic-to-noise ratio for rolling element bearing fault diagnosis. IEEE Transactions on Instrumentation and Measurement, 2020, 69(2): 432–442.
[3] L L Cui, X Wang, Y G Xu, et al. A novel switching unscented Kalman filter method for remaining useful life prediction of rolling bearing. Measurement, 2019, 135: 678–684.
[4] H Wang, S Li, L Song, et al. A novel convolutional neural network based fault recognition method via image fusion of multi-vibration-signals. Computers in Industry, 2019, 105: 182–190.
[5] N Ghosh, Y B Ravi, A Patra, et al. Estimation of tool wear during CNC milling using neural network-based sensor fusion. Mechanical Systems & Signal Processing, 2007, 21: 466–479.
[6] D E Dimla. Sensor signals for tool-wear monitoring in metal cutting operations - A review of methods. International Journal of Machine Tools and Manufacture, 2000, 40(8): 1073–1098.
[7] Y C Yen, J Söhner, B Lilly, et al. Estimation of tool wear in orthogonal cutting using the finite element analysis. Journal of Materials Processing Technology, 2004, 146(1): 82–91.
[8] J S Strenkowski, J T Carroll. A finite element model of orthogonal metal cutting. Journal of Engineering for Industry, 1985, 107(4): 349–354.
[9] E Ceretti, P Fallböhmer, W T Wu, et al. Application of 2D FEM to chip formation in orthogonal cutting. Journal of Materials Processing Technology, 1996, 59(1–2): 169–180.
[10] I S Jawahir, O W Dillon Jr, A K Balajj, et al. Predictive modeling of machining performance in turning operations. Machining Science and Technology, 1998, 2: 253–276.
[11] T Ozel, M Lucchi, C A Rodríguez, et al. Prediction of chip formation and cutting forces in flat end milling: comparison of process simulations with experiments. Technical Paper-Society of Manufacturing Engineers, 1998, 98(250): 1–6.
[12] M Shatla, Y C Yen, T Altan. Tool-workpiece interface in orthogonal cutting-application of FEM modeling. Transactions-North American Manufacturing Research Institution of SME, 2000: 173–178.
[13] T J Ko, W C Dong. Cutting state monitoring in milling by a neural network. International Journal of Machine Tools & Manufacture, 1994, 34: 659–676.
[14] ÖZEL Tugrul, K Yigit. Predictive modeling of surface roughness and tool wear in hard turning using regression and neural networks. International Journal of Machine Tools & Manufacture, 2005, 45: 467–479.
[15] V S Sharma, S K Sharma, A K Sharma. Cutting tool wear estimation for turning. Journal of Intelligent Manufacturing, 2008, 19: 99–108.
[16] K V Rao, B S N Murthy, N M Rao. Prediction of cutting tool wear, surface roughness and vibration of work piece in boring of AISI 316 steel with artificial neural network. Measurement, 2014, 51: 63–70.
[17] R Zhao, R Yan, J Wang, et al. Learning to monitor machine health with convolutional bi-directional LSTM networks. Sensors, 2017, 17(2): 273.
[18] J Wang, W Peng, R X Gao. Enhanced particle filter for tool wear prediction. Journal of Manufacturing Systems, 2015, 36: 35–45.
[19] Z Rui, D Wang, R Yan, et al. Machine health monitoring using local feature-based gated recurrent unit networks. IEEE Transactions on Industrial Electronics, 2017, 99: 1–1.
[20] T Juri, S Emilia, P Eduardo, et al. Validation of inter-subject training for hidden Markov models applied to gait phase detection in children with cerebral palsy. Sensors, 2015, 15: 24514–24529.
[21] K Wei, W Lenan. Mobile location with NLOS identification and mitigation based on modified Kalman filtering. Sensors, 2011, 11: 1641–1656.
[22] H D Yang. Sign language recognition with the kinect sensor based on conditional random fields. Sensors, 2015, 15: 135–147.
[23] J Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 2015, 61: 85–117.
[24] K Cho, B V Merrienboer, C Gulcehre, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Computer Science, 2014.
[25] P.d.c. PHM Society, https://www.phmsociety.org/competition/phm/10, 2010.1.
[26] Y L Cun, B Boser, J S Denker, et al. Handwritten digit recognition with a backpropagation network. Advances in Neural Information Processing Systems, 1990, 2(2): 396–404.
[27] T Li, Z Zhao, C Sun, et al. WaveletKernelNet: An interpretable deep neural network for industrial intelligent diagnosis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2021, DOI: https://doi.org/10.1109/TSMC.2020.3048950.
[28] T Li, Z Zhao, C Sun, et al. Adaptive channel weighted CNN with multisensor fusion for condition monitoring of helicopter transmission system. IEEE Sensors Journal, 2020, 20(15): 8364–8373.
[29] O Abdel-Hamid, A R Mohamed, J Hui, et al. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. IEEE International Conference on Acoustics, 2012: 4277–4280.
[30] Y Kim. Convolutional neural networks for sentence classification. Eprint Arxiv, 2014, arXiv:1408.5882.
[31] Z Rui, K Mao. Topic-aware deep compositional models for sentence classification. IEEE/ACM Transactions on Audio Speech & Language Processing, 2017, 25: 248–260.
[32] S Ioffe, C Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. JMLR.org, 2015: 448–456.
[33] C Szegedy, W Liu, Y Jia, et al. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1–9, DOI: https://doi.org/10.1109/CVPR.2015.7298594.
[34] S Hochreiter, J Schmidhuber. Long short-term memory. Neural Computation, 1997, 9: 1735–1780.
[35] F A Gers, J Schmidhuber, F Cummins. Learning to forget: continual prediction with LSTM. International Conference on Artificial Neural Networks, 1999: 850–855.
[36] S Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen, Diploma. Technische Universität München, 1991.
[37] R Huang, Y Liao, S Zhang, et al. Deep decoupling convolutional neural network for intelligent compound fault diagnosis. IEEE Access, 2019, 7: 1848–1858.
[38] C Sun, M Ma, Z Zhao, et al. Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing. IEEE Transactions on Industrial Informatics, 2018, PP(4): 1–1.
[39] F Jia, Y Lei, J Lin, et al. Deep neural networks: A promising tool for fault characteristic mining and intelligent diagnosis of rotating machinery with massive data. Mechanical Systems and Signal Processing, 2016, 72: 303–315.
[40] H Shao, H Jiang, H Zhang. Electric locomotive bearing fault diagnosis using a novel convolutional deep belief network. IEEE Transactions on Industrial Electronics, 2018, 65(3): 2727–2736.
[41] C Sun, M Ma, Z Zhao, et al. Sparse deep stacking network for fault diagnosis of motor. IEEE Transactions on Industrial Informatics, 2018, 14: 3261–3270.
[42] E O Ezugwu, S J Arthur, E L Hines. Tool-wear prediction using artificial neural networks. Journal of Materials Processing Technology, 1995, 49: 255–264.
[43] V Nair, G E Hinton. Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning, 2010.
[44] G E Hinton. Learning multiple layers of representation. Trends in Cognitive Sciences, 2007, 11: 428–434.
[45] Y Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[46] T Tieleman, G Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4: 26–31.
[47] N Srivastava, G Hinton, A Krizhevsky, et al. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, 15: 1929–1958.
[48] X Li, B Lim, J Zhou, et al. Fuzzy neural network modelling for tool wear estimation in dry milling operation. Annual Conference of the Prognostics and Health Management Society, 2009, 1(1): 1–11.
Acknowledgements
The authors sincerely thank Mr. Tianfu Li for his critical discussion and reading during manuscript preparation.
Funding
Supported in part by Natural Science Foundation of China (Grant Nos. 51835009, 51705398), Shaanxi Province 2020 Natural Science Basic Research Plan (Grant No. 2020JQ042), and Aeronautical Science Foundation (Grant No. 2019ZB070001).
Author information
Contributions
WX: Writing, review and editing; HM: review and discussion; ZZ: review; JL: review; CS: Revision, editing and supervision; RY: Review and supervision. All authors read and approved the final manuscript.
Authors’ Information
Weixin Xu, born in 1994, received the M.S. degree in mechanical engineering from Xi’an Jiaotong University, China, in 2020. His current research is focused on signal processing and deep learning algorithms for machinery health monitoring.
Huihui Miao, born in 1989, is currently a PhD candidate at Xi’an Jiaotong University, China. She received her bachelor’s degree from Xi’an Jiaotong University, China, in 2011. Her current research interest lies in machine learning for machinery modeling, monitoring, and diagnosis.
Jinxin Liu, born in 1988, is currently an associate professor at Xi’an Jiaotong University, China. He received the PhD degree from Xi’an Jiaotong University, China, in 2016. His current research interests include active noise and vibration control, adaptive filter and control theory, precision engineering and control, condition monitoring, and system development.
Zhibin Zhao, born in 1993, is currently a lecturer at Xi’an Jiaotong University, China. He received the PhD degree from Xi’an Jiaotong University, China, in 2020. His current research is focused on sparse signal processing and machine learning algorithms for machinery health monitoring and healthcare.
Chuang Sun, born in 1986, is currently an associate professor at Xi’an Jiaotong University, China. He received the PhD degree from Xi’an Jiaotong University, China, in 2014. His research interests include manifold learning, deep learning, sparse representation, mechanical fault diagnosis and prognosis, and remaining useful life prediction.
Ruqiang Yan, born in 1975, is currently a Professor at Xi’an Jiaotong University, China. He received the PhD degree from University of Massachusetts Amherst, USA, in 2007. His research interests include nonlinear time-series analysis, multi-domain signal processing, and energy-efficient sensing and sensor networks for the condition monitoring and health diagnosis of large-scale, complex, dynamical systems.
Ethics declarations
Competing Interests
The authors declare no competing financial interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xu, W., Miao, H., Zhao, Z. et al. Multi-Scale Convolutional Gated Recurrent Unit Networks for Tool Wear Prediction in Smart Manufacturing. Chin. J. Mech. Eng. 34, 53 (2021). https://doi.org/10.1186/s10033-021-00565-4
DOI: https://doi.org/10.1186/s10033-021-00565-4
Keywords
 Tool wear prediction
 Multiscale
 Convolutional neural networks
 Gated recurrent unit