Multi-Scale Convolutional Gated Recurrent Unit Networks for Tool Wear Prediction in Smart Manufacturing

As an integrated application of modern information technologies and artificial intelligence, Prognostic and Health Management (PHM) is important for machine health monitoring. Prediction of tool wear is one of the representative applications of PHM technology in modern manufacturing systems and industry. In this paper, a multi-scale Convolutional Gated Recurrent Unit network (MCGRU) is proposed to process raw sensory data for tool wear prediction. At the bottom of MCGRU, six parallel and independent branches with different kernel sizes form a multi-scale convolutional neural network, which improves the adaptability to features of different time scales. The features of different scales extracted from raw data are then fed into a Deep Gated Recurrent Unit network to capture long-term dependencies and learn significant representations. At the top of the MCGRU, a fully connected layer and a regression layer are built for cutting tool wear prediction. Two case studies are performed to verify the capability and effectiveness of the proposed MCGRU network, and the results show that MCGRU outperforms several state-of-the-art baseline models.


Introduction
The development of Prognostic and Health Management (PHM) has motivated research in the field of machine health monitoring to detect faults and predict the future condition of machines [1][2][3][4]. In modern manufacturing systems, a worn tool is harmful to the metal cutting process and often causes additional costs [5]. Cutting tools gradually become blunt during the manufacturing process, as shown in Figure 1, owing to factors such as abrasion, deformation and attrition. As a result, the quality of the products is degraded. It is therefore crucial to monitor and predict cutting tool wear online so as to prevent quality degradation [6].
Aiming at monitoring the working conditions of cutting tools and predicting tool wear, many methods, direct or indirect, online or offline, have been researched. Traditionally, by performing cutting tests under different working conditions, data about the cutting tools are acquired and then analyzed with the help of optimization techniques including the response surface methodology (RSM) and the design of experiments (DOE). This approach is time-consuming and inefficient because the number of the tests required is large [7]. The finite element method (FEM) [8][9][10] has also been used in different cutting tasks [11,12] and to predict cutting tool wear. Over the last two decades, methods based on deep learning and neural networks have started to be used for the estimation and prediction of cutting tool wear. Ko et al. [13] designed an autoregressive model followed by a highly parallel neural network to monitor the cutting state. Özel et al. [14] used neural networks for the prediction of cutting tool wear and its surface roughness.

Open Access
Chinese Journal of Mechanical Engineering (2021) 34:53
Ghosh et al. [6] designed a sensor fusion model based on a neural network to extract and fuse features from various signals for the estimation of the average flank wear of the cutting tool. Using an Adaptive Neuro-Fuzzy Inference System, Sharma et al. [15] developed a method for tool wear estimation. Venkata et al. [16] fed the cutting speed, the nose radius and the volume of removed material to a multilayer perceptron model for the prediction of vibration amplitude, surface roughness, and tool wear. Zhao et al. [17] designed a convolutional bi-directional LSTM network to monitor machine health and to predict the tool wear depth. In general, researchers divide these methods into two major categories: physics-based methods and data-driven methods. In tool wear prediction tasks, physics-based methods based on grey models and particle filters [18] have proven to be effective. However, these methods usually require accurate and high-quality domain knowledge, which is often unavailable under complex and noisy working conditions. Moreover, most of them cannot be updated with online data. Data-driven methods are now more attractive because they are able to address these issues. Deep learning theories and the large amounts of data collected by advanced sensors have promoted the development of data-driven online methods. Data-driven models usually include two phases [19]: the first phase trains the models with collected data, and the second phase applies the trained models to online data to monitor conditions or make predictions. The key to both phases is that deep learning enables the model to better extract features and derive representations of the machine conditions hidden in the data, and therefore to make better predictions based on online data. In this paper, we study data-driven methods based on deep learning to predict cutting tool wear.
Data-driven methods take single or multiple channels of sensory data as input and feed them into models to extract features and learn representations during training. Online data are then fed into the trained models to make predictions. Data and models are the two core parts of data-driven methods. Figure 2 shows the basic framework of data-driven methods. The raw time series data collected by sensors are sequential, and their sequential characteristics are difficult to discover with earlier models that focus on extracting multi-domain features. These models, which try to extract statistical, frequency and time-frequency features, require intensive expert knowledge and feature engineering. Some models, such as Markov models and Kalman filters [20][21][22], are capable of addressing sequential data, but they are not good at capturing long-range dependencies. It is important to capture connections and information along the time scale because, in real working conditions, the features are often submerged in heavy background noise, which causes these models to fail. The development of neural networks and deep learning has offered solutions to these issues, and one of them is the Recurrent Neural Network (RNN). Traditional neural networks treat inputs and outputs independently, which is not reasonable for some sequential tasks. The RNN was proposed to make use of information in arbitrarily long sequences and to retain the information that has already been calculated. However, the problem of exploding and vanishing gradients in the traditional RNN weakens its power. Some improved variants of the RNN have been designed to solve this problem, and one of them is the Long Short-Term Memory network (LSTM). LSTM is good at solving problems that need information about previous events [23], which means that it is better at addressing sequential data of various lengths and capturing long-term dependencies.
LSTM needs sufficient data to train, whereas in real working conditions there may not be sufficient labeled data. The Gated Recurrent Unit (GRU) network performs better in such situations. Proposed in 2014, GRU [24] is a more efficient variant of LSTM that shares many similar properties. With comparable performance to LSTM on sequence modeling, GRU has fewer parameters and is easier to train. Here, we introduce GRU networks as one part of our network architecture. As one type of neural network, GRU is able to extract features and learn representations without expert or domain knowledge, but it may not be robust because of the noise in the raw sensory data. Compared with GRU, the Convolutional Neural Network (CNN) is more robust when the data contain noise interference. The convolutional operations in the CNN extract abstract features by convolving learnable filters with the sequential data. For this reason, in Ref. [17], Zhao et al. adopted a one-layer CNN as a local feature extractor. However, as the information hidden in the sequential data is complicated and diverse, a local feature extractor with only one kernel size cannot extract all of the useful information. To address this issue, filters of different sizes are adopted to form a multi-scale convolutional layer that extracts different significant features. Here, we use this multi-scale convolutional network to extract hidden but important features, which are then concatenated into a single feature map.
In this paper, we propose a model combining a multi-scale CNN and GRU, named the Multi-scale Convolutional Gated Recurrent Unit Network (MCGRU), to predict cutting tool wear. In this model, the multi-scale CNN consists of six parallel branches that are independent of each other. These branches are able to extract local features as well as high-level abstract ones. The resulting feature maps are then merged into a single feature map. Temporal information is encoded and representations are learned in a two-layer GRU network built on top of the merge layer. We experiment on an open dataset collected from the cutters of a high-speed Computer Numerical Control (CNC) milling machine, containing acoustic emission data, accelerometer data and dynamometer data [25]. Additionally, another CNC tool wear experiment is carried out, in which current data, vibration data and tool wear depth are sampled. Based on these sequential data and their corresponding tool wear depths, we compare the prediction ability of our model with that of several state-of-the-art models.
This paper is organized as follows. In Section 2, we review some related work about CNN and GRU. Based on classic CNN and GRU, the MCGRU is designed and its details are presented in Section 3. Two case studies on the prediction of tool wear are conducted in Section 4. More details about the model and future steps are discussed in Section 5. Finally, conclusions are presented in Section 6.

Convolutional Neural Network
The Convolutional Neural Network has proven very powerful in many recognition and classification tasks [26][27][28]. It has also shown the power to address sequential data in natural language processing tasks [29][30][31]. In the convolutional layers, filters slide over the sequential data to extract features, and the pooling layers keep the most salient ones. Additionally, the training process can be sped up and the model's performance can be improved by adding batch normalization layers [32]. The capability of CNN can be further improved by stacking the above layers to build a "deep" CNN. Besides, the width of CNN can also influence its performance. In the inception module [33], parallel branches consisting of convolutional and pooling layers with different kernels are designed. This architecture allows the model to recover both local features via kernels of smaller sizes and highly abstract features via kernels of larger sizes.
In our MCGRU network, an architecture of six parallel branches of CNN is designed to process the input sequential data before it is fed to GRU units. Kernels with different sizes are adopted in different branches to extract local and abstract features at the same time. The model itself is going to determine which features are significant to be chosen.

RNN, LSTM and GRU
The Recurrent Neural Network (RNN) was mainly proposed to handle long-term dependencies when processing sequential data in tasks such as natural language processing. The hidden states in RNN use the outputs of the previous states as the inputs of the next states, which means that the sequential information is preserved. As weights are shared across time, RNN is able to process sequential input of any length. However, the problem of exploding and vanishing gradients emerged as the major obstacle to the performance of the traditional RNN. To avoid this problem, the Long Short-Term Memory network was proposed by Hochreiter and Schmidhuber [34] in 1997 and was improved by Gers et al. [35] in 2000. It is able to prevent the backpropagated error from vanishing [36] and to memorize a state for different time periods with the help of the input gate, forget gate and output gate, which manage the flow of information in the network. As another variant of RNN, the Gated Recurrent Unit (GRU) was also introduced to solve the vanishing gradient problem. With only a reset gate and an update gate, GRU has comparable performance to LSTM. However, there are fewer parameters in GRU because it lacks an output gate and has a less complex structure, which means that it is more efficient and can be used in situations where there is not sufficient data. Because of its effectiveness, GRU has been more and more widely used to learn significant representations in time series data.
In the proposed MCGRU network, a two-layer GRU is adopted to process the output of the multi-scale convolutional layers. Effective representations are learned here and then used in the prediction of tool wear.

Neural Networks and Tool Wear Prediction
Neural networks have been successfully used in machine condition monitoring tasks such as tool wear prediction because of their excellent feature extraction and representation learning capabilities [37][38][39][40][41][42]. Artificial Neural Networks (ANN) were adopted first and were shown to perform well in machine condition monitoring tasks. However, as interference increases and the working conditions become more and more complex, ANN is no longer good at solving these problems. As a result, Convolutional Neural Networks (CNN) were introduced in this field. The depth of these networks and their learning ability enable them to learn what they need in these tasks. However, most of these models make use of features manually extracted and designed from raw data, while ignoring the representations and the relations between different time steps hidden in the sequential data. For tool wear prediction, the information about the condition of cutting tools remains to be discovered. Our proposed MCGRU combines a multi-scale CNN with GRU to learn features and representations without the intervention of human-designed features, so that the information behind the raw sensory data can be explored as much as possible and the prediction accuracy can be improved.

Model
Before presenting the MCGRU network, some notations used in this paper are clarified here. The task is to design a model for tool wear prediction based on multiple in-process sensory data. A labeled time series dataset is given as $D = \{(x_i, y_i)\}_{i=1}^{N}$, which contains $N$ tool conditions $x_i$ and their corresponding labels $y_i$, i.e., each tool condition corresponds to a tool wear that is measured and recorded as $y_i$. Assume that in each tool condition $x_i$, $q$ channels of sensory data are sampled and the length of each channel of sensory data is $L$. For each channel, the whole sequence is divided into $l$ sections, i.e., $l$ time steps. The $i$th cutting tool condition is
$$x_i = [x_i^1, x_i^2, \ldots, x_i^l]^T$$
where the vector $x_i^t \in \mathbb{R}^d$ is the multiple channels of sensory data sampled at time step $t$, i.e., the $t$th section, $d = q \times (L/l)$ is the dimensionality of $x_i^t$, and $(\cdot)^T$ represents the transpose. The goal is to predict the tool wear $\hat{y}_i$ from $x_i$. In our proposed Multi-scale Convolutional Gated Recurrent Unit, the Multi-scale Convolutional Network functions as a feature extractor and the Gated Recurrent Unit functions as a temporal encoder. Six parallel and independent branches of Convolutional Neural Network with different kernels are designed to process the raw sensory data. The extracted local and abstract features are fed into a merge layer, on top of which a two-layer GRU is designed to learn significant representations. Finally, the prediction is performed by a fully connected layer and a regression layer. The MCGRU network is shown in Figure 3.
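The segmentation into $l$ time steps can be illustrated with a minimal sketch. The function name `segment` and the toy values are hypothetical, and the order in which channels are stacked inside one time step (channel after channel) is an assumption, since the text only fixes the dimensionality $d = q \times (L/l)$.

```python
def segment(channels, l):
    """Split q equal-length channels into l time steps.

    channels: list of q lists, each of length L (L must be divisible by l).
    Returns a list of l vectors, each of dimension d = q * (L // l).
    """
    L = len(channels[0])
    if L % l != 0:
        raise ValueError("L must be divisible by l")
    step = L // l
    x = []
    for t in range(l):
        vec = []
        for ch in channels:
            # Assumed layout: the t-th section of every channel, one after another.
            vec.extend(ch[t * step:(t + 1) * step])
        x.append(vec)
    return x


# Toy condition: q = 2 channels, L = 6 samples each, l = 3 time steps.
toy = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]]
x_i = segment(toy, l=3)
print(len(x_i), len(x_i[0]))  # number of time steps and dimensionality d
```

With these toy values each time step has dimensionality $2 \times (6/3) = 4$.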

Multi-Scale CNN
In each branch of the multi-scale CNN, a five-layer CNN is adopted, which consists of two convolutional layers, one max-pooling layer and two batch normalization layers. In the first convolutional layer of each branch, the kernel size equals 1. This convolutional layer not only helps to extract more significant features, but also reduces the number of parameters of the model. For example, in one-dimensional convolution, a CNN containing one convolutional layer with kernel size 7 has more parameters than one containing two convolutional layers with kernel sizes 1 and 7, respectively, provided that the layer with kernel size 1 substantially reduces the number of channels. A batch normalization layer is adopted on top of the first convolutional layer. Batch normalization layers in hidden layers help to accelerate the training and improve the prediction accuracy. Then, the output of the batch normalization layer is fed into the second convolutional layer. Finally, the feature maps from all branches are concatenated into a single feature map, so that all of the features extracted from the branches are preserved. The organization of these two kinds of structure is shown in Figure 4. Details are presented in the following subsections. Here we take the operations in one branch as an example; the operations in the other branches are the same.
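The parameter saving can be checked with a short calculation. The sketch below counts only convolution weights and biases, and it assumes the settings reported later for the first case study (input dimensionality 3072, 32 filters in the layer with kernel size 1, 64 filters in the layer with kernel size 7); the helper name `conv1d_params` is hypothetical.

```python
def conv1d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a 1-D convolutional layer."""
    return kernel_size * in_channels * out_channels + out_channels


d = 3072  # input dimensionality per time step (first case study)

# One layer with kernel size 7 applied directly to the raw input.
single = conv1d_params(7, d, 64)

# Kernel size 1 first (3072 -> 32 channels), then kernel size 7 (32 -> 64).
stacked = conv1d_params(1, d, 32) + conv1d_params(7, 32, 64)

print(single, stacked)
```

Under these assumptions the two-layer variant needs roughly one twelfth of the parameters of the single layer, because the layer with kernel size 1 shrinks the channel dimension before the wider kernel is applied.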

Convolution
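In the convolutional layer of each branch in the Multi-scale CNN, the 1-dimensional convolution operation is achieved by using a filter (kernel) $v \in \mathbb{R}^{h \times d}$ to slide over $x_i \in \mathbb{R}^{l \times d}$ and convolve with the subsection $x_i^{t:t+h-1} \in \mathbb{R}^{h \times d}$ from time step $t$ to time step $t+h-1$. The subsection $x_i^{t:t+h-1}$ is given as follows:
$$x_i^{t:t+h-1} = [x_i^t, x_i^{t+1}, \ldots, x_i^{t+h-1}]^T$$
where $h$ is the kernel size. Additionally, a bias term $b$ is added to get the complete convolution operation, which can be given as:
$$c_j^t = \mathrm{sum}(v_j \circ x_i^{t:t+h-1}) + b_j$$
where $j$ indexes the $j$th filter $v_j$ with bias $b_j$, $\circ$ represents the Hadamard product, and $\mathrm{sum}(\cdot)$ adds up all entries of the product.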
As the filter slides over $x_i$ and the convolution operation is done, we get a vector $c_j$, which is given by:
$$c_j = [c_j^1, c_j^2, \ldots, c_j^{(l-h+2p)/s+1}]$$
where $p$ is the amount of zero padding, $s$ is the sliding stride of the kernel, and $(l-h+2p)/s+1$ is the length of the output after the convolution operation. When $s$ and $p$ are set, the length of the output depends on the kernel size $h$. In particular, to concatenate the outputs from different branches, the outputs need to have the same size. Therefore, zero padding is adopted in the convolution operation. Different amounts of zero padding are adopted in different branches, so that the output has the same length as the input, no matter which kernel size is chosen. As a result, the feature map can be given by:
$$C = [c_1, c_2, \ldots, c_K] \in \mathbb{R}^{l \times K}$$
where $K$ is the number of filters in the layer.
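The effect of zero padding on the output length can be verified numerically. This is a minimal sketch that assumes a stride of 1 and odd kernel sizes, for which a padding of $(h-1)/2$ on each side preserves the length; the function names are hypothetical.

```python
def conv_output_length(l, h, p, s):
    """Output length (l - h + 2p) / s + 1 of a 1-D convolution."""
    return (l - h + 2 * p) // s + 1


def same_padding(h):
    """Zero padding per side that preserves the length for odd h and stride 1."""
    return (h - 1) // 2


l = 40  # number of time steps in the first case study
lengths = {h: conv_output_length(l, h, same_padding(h), 1)
           for h in (1, 3, 5, 7, 9, 11)}
print(lengths)

# Without padding, the output shrinks as the kernel grows.
print(conv_output_length(l, 7, 0, 1))
```

Every kernel size in the first call keeps the 40 time steps, which is what makes the later concatenation of branch outputs possible.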

Batch Normalization
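Instead of just normalizing the input of the CNN, we adopt batch normalization [32] layers to normalize the inputs within the network by using the variance and the mean of the values in the current mini-batch. In the batch normalization layer, the operation can be represented as follows:
$$\hat{c} = \gamma \frac{c - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
where $c$ is a value in the feature map, $\mu_B$ and $\sigma_B^2$ are the mean and the variance over the current mini-batch, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale and shift parameters. As the batch normalization layer does not change the feature map's size, for a layer with $K$ filters we therefore get:
$$\hat{C} = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_K] \in \mathbb{R}^{l \times K}$$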
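As a rough illustration, the training-time normalization of one feature over a mini-batch can be sketched as follows. The default values of `gamma`, `beta` and `eps` are assumptions, and the running statistics that batch normalization uses at inference time are omitted.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [1.0, 2.0, 3.0, 4.0]  # toy values of one feature across a mini-batch
out = batch_norm(batch)
print(out)
```

The output has the same number of entries as the input, with approximately zero mean and unit variance before the learnable scale and shift are trained away from their initial values.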

Activation Function
After the convolution and batch normalization operations, an activation function is added to introduce non-linearity, so that complex non-linear relationships between inputs and outputs can be learned. As a result, the convolution, batch normalization and activation operations can together be given by:
$$a_j^t = f(\mathrm{BN}(c_j^t))$$
where $c_j^t$ is the output of the convolution with the $j$th filter at time step $t$, $\mathrm{BN}(\cdot)$ denotes batch normalization, and $f(\cdot)$ is an activation function. Here, we choose Rectified Linear Units (ReLU) [43] as the activation function in our proposed model.
The above three operations result in a feature map, which, for a layer with $K$ filters, can be given by
$$A = [a_1, a_2, \ldots, a_K]$$

Max-Pooling
By introducing pooling layers in the network, the size of the previous feature maps can be further reduced and more significant and abstract features can be extracted. Here, we adopt the max-pooling operation. In one-dimensional pooling, with the pooling length $k$, the max-pooling operation uses a kernel to slide over the feature map and take the maximum over $k$ consecutive values. Here we let the sliding stride equal $k$, and as a result, the output of the max-pooling operation can be given by:
$$m_j = [\max(a_j^{1:k}), \max(a_j^{k+1:2k}), \ldots, \max(a_j^{l-k+1:l})]$$
where $a_j^{u:w}$ denotes the values of the $j$th activated feature vector $a_j$ from position $u$ to position $w$, and $l$ is the length of $a_j$.
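The activation and pooling steps can be sketched on a single feature vector. The values below are made up for illustration, and dropping a trailing remainder shorter than $k$ is an assumption that the text does not specify.

```python
def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, k):
    """Non-overlapping max-pooling: window k, stride k.

    A trailing remainder shorter than k is dropped (assumption).
    """
    return [max(values[i:i + k]) for i in range(0, len(values) - k + 1, k)]


feature = [-1.0, 2.0, 0.5, -3.0, 4.0, 1.0]  # toy normalized feature vector
pooled = max_pool1d(relu(feature), k=2)
print(pooled)  # [2.0, 0.5, 4.0]
```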

Concatenation
In the concatenating layer, the feature maps from different branches are concatenated into a single feature map to merge all the local and abstract features. Assuming that the $j$th output of the $i$th branch is denoted by $m_j^i$, the output of the concatenation layer is given as:
$$M = [M_1, M_2, \ldots, M_N]$$
where $N$ here denotes the number of branches, and $M_i$ can be represented as
$$M_i = [m_1^i, m_2^i, \ldots, m_{K_i}^i]$$
where $K_i$ is the number of outputs of the $i$th branch.
To summarize, in the Multi-scale CNN, the shape of the input sequence is $n \times l \times d$. Here, $n$ represents the total number of working conditions. As described above, before the concatenating layer, the output shape of the $i$th branch is $n \times ((l - k + 2p)/s + 1) \times K_i$. In different branches, kernel sizes from small to large help to extract local and abstract features. Compared to the original raw sequence, these multi-time-scale features can better represent the properties of the working conditions. After these features are merged in the concatenating layer, the following GRU learns significant representations of the working conditions. To be more specific, the framework of the Multi-scale CNN is illustrated in Figure 5.
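The merge step can be sketched as a concatenation along the feature axis. The example assumes six branches whose outputs each have 40 time steps and 32 features, which are the per-branch shapes reported in the experiment setup; the function name and the constant fill values are hypothetical.

```python
def concat_branches(branch_outputs):
    """Concatenate per-branch feature maps along the feature axis.

    branch_outputs: list of maps; each map is a list of time steps,
    and each time step is a list of features.
    """
    steps = len(branch_outputs[0])
    if any(len(b) != steps for b in branch_outputs):
        raise ValueError("all branches must have the same number of time steps")
    return [[f for b in branch_outputs for f in b[t]] for t in range(steps)]


# Six toy branch outputs of shape 40 x 32; branch i is filled with the value i.
branches = [[[float(i)] * 32 for _ in range(40)] for i in range(6)]
merged = concat_branches(branches)
print(len(merged), len(merged[0]))  # 40 192
```

The check on the number of time steps reflects why zero padding is needed: branches with different output lengths could not be merged this way.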

Deep GRU
Under real industrial conditions, clean sample data are difficult to obtain. Compared to LSTM, GRU is better at dealing with situations where there is not sufficient data. Here, on top of the Multi-scale CNN, a two-layer GRU network is designed to extract vital representations from the multi-time-scale features. The deep GRU is presented as follows.

Gated Recurrent Unit
In GRU, the inputs are the hidden state $h_{t-1}$ at the previous time step $t-1$ and the data $x_t$ at the current time step $t$, and the output is the hidden state $h_t$. The output $h_t$ depends on the previous hidden state $h_{t-1}$, the update gate $z_t$, the reset gate $r_t$ and the candidate hidden state $\tilde{h}_t$. The reset gate $r_t$ enables the unit to drop any information in the hidden state that is less meaningful or irrelevant, so as to focus on the information that is more important. The update gate $z_t$ determines how much information from the previous and the candidate hidden states is passed to the current hidden state [23]. The related equations can be given by:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$$
where $\sigma$ is the sigmoid activation function, $\circ$ represents the Hadamard product, $W_z$, $U_z$, $W_r$, $U_r$, $W_h$ and $U_h$ are shared weight matrices which are learned during training, and $b_z$, $b_r$, $b_h$ are learnable biases. The basic structure of a one-layer GRU is shown in Figure 6.
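A single GRU time step can be sketched in plain Python as follows. The sketch assumes the convention $h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$ (some references swap the roles of $z_t$ and $1 - z_t$), and the toy sizes and all-zero parameters are hypothetical values chosen so that the result is easy to check by hand.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds W_*, U_* (matrices) and b_* (vectors)."""

    def gate(name, hidden, activation):
        wx = matvec(p["W_" + name], x_t)
        uh = matvec(p["U_" + name], hidden)
        return [activation(a + b + c) for a, b, c in zip(wx, uh, p["b_" + name])]

    z = gate("z", h_prev, sigmoid)                    # update gate
    r = gate("r", h_prev, sigmoid)                    # reset gate
    reset_hidden = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = gate("h", reset_hidden, math.tanh)       # candidate hidden state
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Hypothetical toy sizes: input dimension 3, hidden size 2, all-zero parameters.
params = {}
for g in ("z", "r", "h"):
    params["W_" + g] = zeros(2, 3)
    params["U_" + g] = zeros(2, 2)
    params["b_" + g] = [0.0, 0.0]

h_t = gru_step([1.0, 2.0, 3.0], [2.0, -4.0], params)
print(h_t)  # z = 0.5 and the candidate is 0, so h_t is half of h_prev
```

With zero parameters both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one, which makes the role of the update gate as an interpolation weight visible.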

Deep GRU
As mentioned above, the capability of a neural network can be improved by "going deeper" [44,45]. In a deep neural network, there are more non-linear operations, so more abstract features and representations can be learned. Inspired by this idea, we stack two GRU layers to get a deep architecture, in which each GRU layer contains a different number of units. In the deep GRU, as shown in Figure 7, the output of each hidden state in one layer not only propagates through time but also serves as the input of the hidden state in the next layer. Features at a low level are therefore learned and passed to the next layer to learn higher-level representations. By stacking GRU layers, the network is able to learn essential representations at different time scales more effectively.

Fully Connected and Linear Regression Layer
The output representation of the GRU network is flattened as $h$ and then fed into a fully connected layer to be prepared for the linear regression layer. The operation of the fully connected layer can be given as follows:
$$o = f(W h + b)$$
where $o$ is the output of the fully connected layer, $W$ is the transformation matrix, $b$ is the bias, and $f(\cdot)$ is the activation function. We use ReLU here as the activation function. Finally, the fully connected layer's output $o$ is fed into a regression layer and the tool wear of the $i$th working condition is therefore predicted, which can be given by
$$\hat{y}_i = W_o o + b_o$$
where $W_o$ and $b_o$ are the weight matrix and the bias of the regression layer.
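The two output layers can be sketched as a ReLU-activated dense layer followed by a linear layer with a single output. All weights and inputs below are made-up toy values, and the helper name `dense` is hypothetical.

```python
def dense(h, W, b, activation=None):
    """Affine map W h + b, optionally followed by an element-wise activation."""
    out = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
    return [activation(v) for v in out] if activation else out


def relu(v):
    return max(0.0, v)


h = [1.0, -2.0]  # toy flattened GRU representation

# Fully connected layer with ReLU activation.
o = dense(h, W=[[1.0, 1.0], [2.0, 0.5]], b=[0.0, 0.5], activation=relu)

# Linear regression layer with a single output unit (no activation).
y_hat = dense(o, W=[[0.5, 2.0]], b=[0.1])[0]
print(o, y_hat)
```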

Training and Regularization of MCGRU
The Mean Absolute Error (MAE) is adopted as the loss in the training process, which is given by:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{20}$$
where $n$ represents the total number of samples, $y_i$ is the actual tool wear and $\hat{y}_i$ is the predicted tool wear. The optimizer we adopt here is Root Mean Square Propagation (RMSProp) [46]. It is a robust optimizer that uses pseudo-curvature information. RMSProp is useful for mini-batch learning because the gradients are normalized by the magnitude of recent gradients, which enables it to handle stochastic objectives properly. RMSProp is well suited to recurrent neural networks such as LSTM and GRU.
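The normalization by recent gradient magnitudes can be illustrated with a single RMSProp update. The learning rate, decay factor and epsilon below are common default-style values and are assumptions, since the text does not report them.

```python
import math


def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update; cache is the running average of squared gradients."""
    new_cache = [rho * c + (1 - rho) * g * g for c, g in zip(cache, grad)]
    new_w = [wi - lr * g / (math.sqrt(c) + eps)
             for wi, g, c in zip(w, grad, new_cache)]
    return new_w, new_cache


# The same weight updated with a small and a 100x larger gradient.
w_small, cache_small = rmsprop_step([1.0], [2.0], [0.0])
w_large, cache_large = rmsprop_step([1.0], [200.0], [0.0])
print(1.0 - w_small[0], 1.0 - w_large[0])
```

Both steps have almost the same size although the gradients differ by a factor of 100, which shows how dividing by the root of the running average makes the update largely independent of the gradient scale.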
As mentioned above, GRU rather than LSTM is chosen in our proposed model because, in real working conditions, there is usually not sufficient labeled data. When the network goes deep and there is not sufficient data, the network may be too complex to train and overfitting may appear. To solve this problem, regularization methods should be added within the network. Here, we adopt a Dropout [47] layer after the GRU network, as well as after the fully connected layer. A Dropout layer makes the network ignore randomly selected neurons during forward propagation, so that the network does not rely too much on some local features. In our proposed model, dropout is used only during training, not during testing, and the dropout ratio is set to 0.3.
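The training-only behaviour of dropout can be sketched as follows. The rescaling of the kept activations by $1/(1 - \text{rate})$ ("inverted dropout") is an assumption made so that no scaling is needed at test time; the text only states the ratio of 0.3 and that dropout is disabled during testing.

```python
import random


def dropout(values, rate, training, rng=random):
    """Zero each value with probability `rate` during training only.

    Kept values are rescaled by 1 / (1 - rate) (inverted dropout, assumption).
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0] * 1000
train_out = dropout(activations, rate=0.3, training=True, rng=random.Random(0))
test_out = dropout(activations, rate=0.3, training=False)
print(sum(1 for v in train_out if v != 0.0) / len(train_out))  # close to 0.7
```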

Descriptions of Datasets
The first experiment uses a high-speed CNC machine running under dry milling operations [48]. This dataset is available in the "prognostic data challenge 2010" database [25]. The experimental platform and the details are shown in Figure 8. In this experiment, six cutters are used to cut an identical workpiece, and each cutter makes 315 cuts, which gives six collections of data ($C_1, C_2, \cdots, C_6$). When training and testing our model, six channels of data including forces and vibrations are used. A LEICA MZ12 microscope was used to measure the flank wear of each flute when the experiment was finished, and the wear values were taken as the target values. To compare with the results in Ref. [17], we adopt three cutting tools, i.e., the three data collections $C_1$, $C_4$ and $C_6$, as our training and testing sets. Each data collection contains 315 samples, corresponding to 315 tool wear values. To make good use of this dataset, a three-fold strategy is adopted: among the three data collections, two are taken as the training set and the other one as the testing set. As a result, we get three cases. For example, when $C_1$ is the testing set and $C_4$, $C_6$ form the training set, the case is denoted as $c_1$. The other two cases, $c_4$ and $c_6$, are defined analogously. The details of these three cases are shown in Table 1.
Figure 8 Details of the CNC machine and the data collection system
As the sampling frequency is too high, for each channel, the sampled sequence is divided into sections of 512 points, and the first forty sections are used. As a result, each original sequence is transformed into a sequence of length 40, and at each time step the dimensionality is 3072 (6 channels × 512 points). As described above, the input shape of the network is 630×40×3072 in the training process and 315×40×3072 in the testing process.
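The three-fold cases and the resulting input shapes follow directly from the numbers above, as the following sketch shows. The dictionary layout and variable names are hypothetical; only the collection names, sample counts and section sizes come from the text.

```python
collections = ["C1", "C4", "C6"]
samples_per_collection = 315
channels, section_len, time_steps = 6, 512, 40

# Each case holds one collection out for testing and trains on the other two.
cases = {
    "c" + name[1:]: {
        "test": [name],
        "train": [c for c in collections if c != name],
    }
    for name in collections
}

dim = channels * section_len  # dimensionality per time step
train_shape = (2 * samples_per_collection, time_steps, dim)
test_shape = (samples_per_collection, time_steps, dim)
print(cases["c1"], train_shape, test_shape)
```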

Experiment Setup
The following models shown in Table 2 will be compared with our proposed MCGRU model. Regression models including LR, SVR and MLP, cannot process sequential data directly, and hence we firstly extract the related features. Here, ten features, containing statistical features, frequency features and time-frequency features are extracted from raw data. Details are shown in Table 3. As there are six channels of signals, the dimensionality of the input is 60. In LR, there is no hyper parameter. In SVR, the regularization parameter is set as 0.1 and the kernel is Radial Basis Function (RBF). As for the MLP, the parameters of three hidden layers are set as (140, 280, 900) and we choose ReLU as the activation function.
The other compared models are able to process sequential data directly. The input shape is therefore 40×3072. The five-layer CNN has the same structure as one branch of our proposed MCGRU. The kernel sizes of its two convolutional layers are 1 and 7, the numbers of kernels are 32 and 64, and the pooling size is set to 2. The setting of the MCNN (Multi-scale Convolutional Neural Network) is the same as that of the multi-scale CNN in our MCGRU. For the basic recurrent models, including RNN, LSTM and GRU, the number of units is set to 192. For the deep recurrent models, including Deep RNN, Deep LSTM, and Deep GRU, the numbers of units in the two layers are set to (180, 240). The CBLSTM, i.e., Convolutional Bi-Directional LSTMs, was proposed by Zhao et al. [17], and the same settings as in [17] are adopted for this model. The CGRU (Convolutional GRU) has a five-layer CNN with the same settings as the previous CNN model and a two-layer GRU with (180, 240) units.
In our proposed MCGRU, from branch 1 to branch 6, the kernel sizes of the two convolutional layers are set as (1, 1), (1, 3), (1, 5), (1, 7), (1, 9) and (1, 11), and the numbers of kernels are set as (32, 64). The kernel size of the pooling layer in all the branches is set as 2. Here, as the input shape is 40×3072 and zero padding is adopted, the output shapes of all branches are the same, that is, 40×32. Then, these six outputs are concatenated to get an output shape of 40×192. The numbers of units in the following two GRU layers are set as (180, 240), and the number of output units of the fully connected layer is set as 120. All of the activation functions in our model are Rectified Linear Units (ReLU).
To evaluate the capability of the above models, the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) are adopted. The MAE measures the average magnitude of the errors, without considering their direction. As a quadratic scoring rule, the RMSE also measures the average magnitude of the error, but it gives relatively higher weight to large errors. The MAE is given in Eq. (20), and the RMSE is given by:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
where $n$ is the number of samples, $\hat{y}_i$ represents the predicted tool wear and $y_i$ is the actual tool wear.
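The difference between the two metrics can be seen on a small made-up example in which one prediction is far off. The values are purely illustrative and are not taken from the experiments.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


y_true = [100.0, 120.0, 140.0, 160.0]  # illustrative tool wear values
y_pred = [101.0, 119.0, 141.0, 151.0]  # three small errors and one large error
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

The absolute errors are 1, 1, 1 and 9, so the MAE is 3, while the RMSE is the square root of 21 (about 4.58): the single large error raises the RMSE more than the MAE.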
These models are trained and tested using a Linux Server with two NVIDIA 1080Ti GPUs and a 4.2 GHz INTEL i7-7700K CPU.

Descriptions of Datasets
The second experiment is a reliability test of a CNC machine tool. It is carried out on a CNC machine, as shown in Figure 9. The cutting tool, as shown in Figure 10, is used to machine a 45# steel bar, and the related parameters are shown in Table 4.
As shown in Table 4, three channels of data are sampled, including the vibration signal, the AE-RMS signal and the current signal of the spindle motor. The tool wear corresponding to each working condition is also measured and recorded as the label. Here, we ran the experiment with 9 cutting tools and obtained 840 samples in total. All of the cutting tools have the same initial tool wear. Each sample corresponds to a working condition. Unlike in the first experimental case, here we randomly chose samples from all of the cutting tools for the training set and the testing set, so that both sets contain samples from the nine cutting tools. As a result, the training set includes 735 of the 840 samples, and the remaining 105 samples are used as the testing set. Similarly, as the sampling frequency is too high, for each signal, the sequence is divided into sections of 256 points, and the first 20 sections are selected. Hence, each data sample is transformed into a sequence of length 20, and at each time step the dimensionality is 768 (3 channels × 256 points). As described above, the input shape of the network is 735 × 20 × 768 in the training process and 105 × 20 × 768 in the testing process.
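The random split of the 840 samples can be sketched as a shuffle of sample indices. The fixed seed and the function name are assumptions added for reproducibility of the illustration; the text does not describe how the random selection was implemented.

```python
import random


def random_split(n_samples, n_train, seed=0):
    """Shuffle sample indices and split them into training and testing sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


train_idx, test_idx = random_split(840, 735)
print(len(train_idx), len(test_idx))  # 735 105
```

Because the indices are shuffled across all samples, both sets generally contain samples from every cutting tool, in contrast to the leave-one-collection-out cases of the first experiment.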

Experiment Setup
The setup of the second experiment is almost the same as that of the first one. The same compared models and the same settings for these models are adopted. The two indexes used to evaluate the performance of the models are the MAE and the RMSE. The models are trained and tested using a Linux Server with two NVIDIA 1080Ti GPUs and a 4.2 GHz INTEL i7-7700K CPU.

Results
In this section, the comparison of the above models based on the MAE and RMSE is presented. Table 5 shows the MAE of each model in the first experimental case, while Table 6 shows the RMSE. The MAE and RMSE of each model in the second experiment are shown in Table 7.
As shown in the tables, the regression models LR, SVR and MLP are able to predict tool wear from the features extracted from raw data, but they are not as good as the convolutional and recurrent models. Linear Regression performs worst in this task because it is a linear model by nature and cannot make full use of the extracted features. The SVR and the MLP perform better because they introduce nonlinearity, so that the relationships among different features can be better explored. In SVR, the samples can be mapped to a high-dimensional space by adopting different kernels, and the RBF kernel chosen here proves effective for this regression task. The MLP, in contrast to SVR, actively learns an effective mapping from the features to the output.
However, compared to the above regression models, the convolutional and recurrent models can process raw data directly to learn significant features and representations, which gives them better performance. By choosing different kernels, a CNN is able to extract local or abstract features. In this task, MCNN performs better than CNN because it contains kernels of different sizes and extracts local and abstract features at the same time. The depth and width introduced in these two models also help in predicting tool wear. Long-term dependencies in sequential data cannot be discovered by the convolution operation, but they can be captured by recurrent models. As shown in the tables, in this task of processing time series data, recurrent models perform slightly better than convolutional models. Here, LSTM and GRU are better than the basic RNN because their gates make them more powerful at capturing long-term dependencies. Moreover, the amount of data in this task is not large, a situation in which the GRU has an advantage, and here the GRU performs better than LSTM. As expected, the three deep recurrent models perform better than the three normal ones. The Convolutional Bi-Directional LSTMs (CBLSTMs) proposed in Ref. [17] combine a local feature extractor, a CNN, with a temporal encoder, deep Bi-Directional LSTMs, and are able to extract useful features hidden in the raw sensory data in both the forward and backward directions. The CBLSTMs perform better than most of the above convolutional and recurrent models. The CGRU also performs well, as the GRU is able to learn significant representations on the basis of the features extracted by the CNN. In particular, in the second experimental case, the deep GRU and CGRU perform even better than the CBLSTMs, which shows the strength of the GRU in dealing with a small amount of data. Our proposed model, the MCGRU, performs best among the compared models.
The results reveal that there is much information hidden in raw sensory data that cannot be discovered by human-designed features, while the Multi-scale CNN is able to filter the noise from the real working environment and extract as much of this information as possible. The deep GRU is able to extract the temporal information to find a more accurate relationship between the input and the output, namely the raw sensory data and the predicted tool wear. As the network goes wider, more meaningful features of different time scales can be discovered, and as it goes deeper, abstract and significant representations can be learnt. The combination of the multi-scale feature extractor (Multi-scale CNN) and the temporal encoder (deep GRU) is therefore shown to perform well in the task of tool wear prediction.
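To make the data flow of this combination concrete, the following NumPy sketch runs a single forward pass through parallel convolutional branches with different kernel sizes, a stack of GRU layers, a fully connected layer and a regression output. It is a forward-only illustration with random weights, not the authors' implementation: the kernel sizes (1 to 6), the number of filters, the hidden size, the use of ReLU, the "same" zero padding, the use of the last hidden state and the particular GRU update convention are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def conv_branch(x, w, b):
    """One branch: 1D convolution over time (zero 'same' padding) plus ReLU.
    x: (T, C_in), w: (k, C_in, C_out), b: (C_out,) -> (T, C_out)."""
    k = w.shape[0]
    left = (k - 1) // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)


def gru_layer(x, p):
    """Run one GRU layer over x of shape (T, D); return all states (T, H)."""
    n_hidden = p["U"].shape[0]
    h = np.zeros(n_hidden)
    states = []
    for x_t in x:
        gx = x_t @ p["W"] + p["b"]             # input terms for z, r, candidate
        gh = h @ p["U"][:, :2 * n_hidden]      # recurrent terms for z and r
        z = sigmoid(gx[:n_hidden] + gh[:n_hidden])               # update gate
        r = sigmoid(gx[n_hidden:2 * n_hidden] + gh[n_hidden:])   # reset gate
        cand = np.tanh(gx[2 * n_hidden:] + (r * h) @ p["U"][:, 2 * n_hidden:])
        h = (1.0 - z) * h + z * cand
        states.append(h)
    return np.stack(states)


def init_params(rng, in_dim, kernel_sizes, n_filters, n_hidden, n_layers, scale=0.01):
    """Random weights for illustration; a real model would be trained."""
    branches = [(scale * rng.standard_normal((k, in_dim, n_filters)), np.zeros(n_filters))
                for k in kernel_sizes]
    gru, d = [], n_filters * len(kernel_sizes)
    for _ in range(n_layers):
        gru.append({"W": scale * rng.standard_normal((d, 3 * n_hidden)),
                    "U": scale * rng.standard_normal((n_hidden, 3 * n_hidden)),
                    "b": np.zeros(3 * n_hidden)})
        d = n_hidden
    fc = (scale * rng.standard_normal((n_hidden, n_hidden)), np.zeros(n_hidden))
    out = (scale * rng.standard_normal((n_hidden, 1)), np.zeros(1))
    return {"branches": branches, "gru": gru, "fc": fc, "out": out}


def mcgru_forward(x, params):
    """x: one sample of shape (T, D). Returns a scalar wear estimate."""
    # Multi-scale CNN: concatenate the outputs of all branches per time step.
    h = np.concatenate([conv_branch(x, w, b) for w, b in params["branches"]], axis=1)
    # Deep GRU: each layer consumes the state sequence of the previous one.
    for layer in params["gru"]:
        h = gru_layer(h, layer)
    w_fc, b_fc = params["fc"]
    hidden = np.maximum(h[-1] @ w_fc + b_fc, 0.0)   # fully connected layer
    w_out, b_out = params["out"]
    return (hidden @ w_out + b_out).item()          # regression layer


rng = np.random.default_rng(0)
params = init_params(rng, in_dim=768, kernel_sizes=(1, 2, 3, 4, 5, 6),
                     n_filters=8, n_hidden=16, n_layers=2)
sample = rng.standard_normal((20, 768))  # one sequence: 20 steps, 768 features
prediction = mcgru_forward(sample, params)
print(prediction)
```

In practice such a model would be built and trained in a deep learning framework. The sketch only shows how widening (more branches) enlarges the feature vector passed to the GRU and how deepening (more GRU layers) stacks temporal encoders.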
To be more specific, the prediction of the tool wear, the corresponding actual tool wear, and the error between these two values are illustrated in Figures 11, 12, 13 and 14. The first three figures, i.e., the results of the first experimental case, show that the trend of the degradation of the cutting tool is robustly captured and the error is acceptable. In particular, in Figure 14, nine ascending curves can be found in the curve of actual tool wear. This is because, in the second experimental case, we have sampled data from all of the nine cutting tools to form the testing set, and each ascending curve represents the data from one cutting tool. In this case, the results are also satisfying. Moreover, each training epoch takes about 1 s. When testing, it takes only 0.8 s to predict the tool wear of about 300 samples, which means that our proposed model is efficient enough to be used in real-time prediction.
Figure 11 Results of the first experimental case, when c1 is the testing set: the prediction of the tool wear, the corresponding actual tool wear, and the error between these two values
Figure 12 Results of the first experimental case, when c4 is the testing set: the prediction of the tool wear, the corresponding actual tool wear, and the error between these two values
Figure 13 Results of the first experimental case, when c6 is the testing set: the prediction of the tool wear, the corresponding actual tool wear, and the error between these two values

Discussion
In this section, we discuss the impact of the number of the branches in the Multi-scale CNN and the influence of the depth of the GRU. Some insights and motivation for the future steps are also discussed.
1) As we go wider by using multiple branches to extract more features, it is important to point out that this increases the number of parameters in the model, which makes training more difficult and raises the risk of overfitting. Here, based on dataset c1, we compare six numbers of branches (2, 4, 6, 8, 10, 20) in an MCGRU, and the MAE and RMSE results are given in Table 8. As the number of branches increases, the performance of the model first improves and then remains almost the same, and with 10 or 20 branches the performance gets worse. This means that blindly increasing the number of branches harms the model rather than improving its performance. We therefore adopt 6 CNN branches in our MCGRU.
2) The depth of the model also affects its performance. We vary the number of GRU layers in the MCGRU to explore the impact of the depth. The number of GRU layers is set to 2, 4, 6 and 8, and the results are shown in Table 9. The performance of these four models is almost the same. A reasonable explanation is that in our two experiments there is not enough labeled data, and therefore a shallow GRU is powerful enough to discover the information behind the data. When a large amount of data is available, a GRU with more layers can be tried to further improve the capability of the model.

3) The robustness of a model is important when evaluating its performance. In a real working environment, the quality of the sampled signals may be affected by noise. It is important and interesting to build a model that remains robust when there is a large amount of noise. Moreover, in our settings, signals from different sensors are combined directly, so designing a better way of fusing the data from different sensors is a meaningful direction.
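The growth in parameters mentioned in point 1) can be illustrated with a short calculation. The sketch below counts the weights and biases of the convolutional branches only, under hypothetical assumptions that are not taken from the paper: branch i uses a kernel of size i, every branch has 768 input channels and 32 output filters, and the helper names are illustrative.

```python
def conv_branch_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 1D convolutional layer."""
    return kernel_size * in_channels * out_channels + out_channels


def multi_scale_params(n_branches, in_channels=768, out_channels=32):
    """Total parameters of n_branches parallel branches.
    Assumption: branch i uses kernel size i (1, 2, ..., n_branches)."""
    return sum(conv_branch_params(k, in_channels, out_channels)
               for k in range(1, n_branches + 1))


# The six branch counts compared in the discussion above.
for n in (2, 4, 6, 8, 10, 20):
    print(n, multi_scale_params(n))
```

Under these assumptions the count grows roughly quadratically with the number of branches, because each additional branch also uses a larger kernel. With a fixed kernel size per branch the growth would be linear. In both cases the feature vector passed to the GRU also becomes wider, which adds further parameters downstream.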

Conclusions
(1) In this paper, we proposed a Multi-scale Convolutional Gated Recurrent Unit Network (MCGRU) to address the tool wear prediction task. We explained the structure of this model by introducing its feature extractor, the Multi-scale CNN, and its encoder, the Deep GRU. The Multi-scale CNN is able to extract both local and abstract features with kernels of different sizes, and the Deep GRU is capable of capturing long-term dependencies and learning significant representations based on the features extracted by the Multi-scale CNN. (2) Moreover, the GRU performs better when labeled data are insufficient, as is often the case in real working conditions. Benefiting from these advantages, the MCGRU is able to make accurate and effective tool wear predictions based on raw sensory data, without expert knowledge or feature engineering. Its satisfactory performance is further verified by two experimental cases and by comparisons with other models.