
Before presenting the MCGRU network, we clarify the notation used in this paper. The task is to design a model that predicts tool wear from multiple channels of in-process sensory data. A labeled time series dataset is given as \(D = \left\{ {\left( {{\varvec{x}}_{i} ,y_{i} } \right)} \right\}_{i = 1}^{N}\), which contains *N* tool conditions \({\varvec{x}}_{i}\) and their corresponding labels \(y_{i}\); that is, each tool condition corresponds to a tool wear value that is measured and recorded as \(y_{i}\). In each tool condition \({\varvec{x}}_{i}\), \(q\) channels of sensory data are sampled, and each channel has length \(L\). For each channel, the whole sequence is divided into \(l\) sections, i.e., \(l\) time steps. The *i*th cutting tool condition is

$$ {\varvec{x}}_{i} = \left[ {\begin{array}{*{20}c} {{\varvec{x}}_{i}^{1} } & {{\varvec{x}}_{i}^{2} } & \cdots & {{\varvec{x}}_{i}^{l} } \\ \end{array} } \right]^{{\text{T}}} , $$

(1)

where the vector \({\varvec{x}}_{i}^{t} \in R^{d}\) contains the multiple channels of sensory data sampled at time step \(t\), i.e., the *t*th section, \(d = q * \left( {L/l} \right)\) is the dimensionality of \({\varvec{x}}_{i}^{t}\), and \(\left( \cdot \right)^{{\text{T}}}\) denotes the transpose. The goal is to predict the tool wear \(\hat{y}_{i}\) from \({\varvec{x}}_{i}\). In the proposed Multi-scale Convolutional Gated Recurrent Unit (MCGRU) network, the Multi-scale Convolutional Network functions as a feature extractor and the Gated Recurrent Unit functions as a temporal encoder. Six parallel and independent Convolutional Neural Network branches with different kernel sizes process the raw sensory data. The extracted local and abstract features are fed into a merge layer, on top of which a two-layer GRU learns significant representations. Finally, the prediction is performed by a fully connected layer and a regression layer. The MCGRU network is shown in Figure 3.
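
The segmentation of the raw signals into time steps can be illustrated with a minimal Python sketch. The function name `segment_channels`, the toy values, and the channel-by-channel ordering inside each vector are assumptions made for illustration; the text only fixes the dimensionality \(d = q * \left( {L/l} \right)\).

```python
def segment_channels(channels, num_steps):
    """Split q channels of length L into num_steps time-step vectors.

    Each returned vector concatenates the t-th section of every channel,
    so its dimension is d = q * (L / num_steps). The ordering inside the
    vector (channel after channel) is an assumption of this sketch.
    """
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length L")
    if length % num_steps != 0:
        raise ValueError("L must be divisible by the number of time steps l")
    section = length // num_steps
    steps = []
    for t in range(num_steps):
        vec = []
        for ch in channels:
            vec.extend(ch[t * section:(t + 1) * section])
        steps.append(vec)
    return steps


# Hypothetical toy data: q = 2 channels, L = 6 samples, l = 3 time steps.
channels = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]]
x_i = segment_channels(channels, num_steps=3)
print(len(x_i), len(x_i[0]))  # l = 3 rows, d = 2 * (6 / 3) = 4 columns
```

Each row of `x_i` plays the role of one \({\varvec{x}}_{i}^{t}\) in Eq. (1).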

### 3.1 Multi-Scale CNN

In each branch of the multi-scale CNN, a five-layer CNN is adopted, which consists of two convolutional layers, two batch normalization layers and one max-pooling layer. In the first convolutional layer of each branch, the kernel size equals 1. This layer not only helps to extract more significant features but also reduces the number of model parameters. For example, in one-dimensional convolution, a CNN containing a single convolutional layer with kernel size 7 can have more parameters than one containing two convolutional layers with kernel sizes 1 and 7, respectively, provided that the kernel-size-1 layer outputs fewer channels than it receives. A batch normalization layer is placed on top of the first convolutional layer. Batch normalization in hidden layers helps to accelerate training and improve prediction accuracy. The output of this batch normalization layer is then fed into the second convolutional layer. The second convolutional layers, which use different kernel sizes in different branches, extract features hidden in the sequential data at multiple time scales. Small kernels extract local features, while large kernels extract abstract features. Given these multi-time-scale features, the model itself learns which ones to attend to. The last two layers are another batch normalization layer and a max-pooling layer. The max-pooling layer compresses the preceding feature maps to retain the more significant features. Then, in a merge layer, the feature maps from all branches are concatenated into a single feature map, so that all of the features extracted by the branches are preserved. The organization of these two kinds of structure is shown in Figure 4. Details are presented in the following subsections. Here we take the operations in one branch as an example; the operations in the other branches are the same.
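
The parameter comparison between a single kernel-size-7 layer and a kernel-size-1 layer followed by a kernel-size-7 layer can be checked with simple arithmetic. The sketch below is illustrative only: the channel counts are hypothetical and are not taken from the proposed model, and the saving appears only because the kernel-size-1 layer is assumed to reduce the number of channels.

```python
def conv1d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel of a 1-D convolutional layer."""
    return kernel_size * in_channels * out_channels + out_channels


# Hypothetical channel counts, chosen only for illustration.
c_in, c_mid, c_out = 64, 16, 64

# One layer with kernel size 7.
single = conv1d_params(7, c_in, c_out)

# A kernel-size-1 layer that reduces the channels, then a kernel-size-7 layer.
stacked = conv1d_params(1, c_in, c_mid) + conv1d_params(7, c_mid, c_out)

print(single, stacked)  # 28736 8272
```

If `c_mid` were as large as `c_in`, the two-layer variant would have more parameters than the single layer, so the benefit depends on the channel reduction.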

#### 3.1.1 Convolution

In the convolutional layer of each branch in the Multi-scale CNN, the 1-dimensional convolution operation slides a filter (kernel) \({\varvec{v}} \in R^{h \times d}\) over \({\varvec{x}}_{i} \in R^{l \times d}\) and convolves it with the subsection \({\varvec{x}}_{i}^{t:t + h - 1} \in R^{h \times d}\) from time step \(t\) to time step \(t + h - 1\). This subsection is given as follows:

$$ {\varvec{x}}_{i}^{t:t + h - 1} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {{\varvec{x}}_{i}^{t} } & {{\varvec{x}}_{i}^{t + 1} } \\ \end{array} } & \cdots \\ \end{array} } & {{\varvec{x}}_{i}^{t + h - 1} } \\ \end{array} } \right]^{{\text{T}}} $$

(2)

where \(h\) is the kernel size. Additionally, a bias term \(b\) is added to get the complete convolution operation, which can be given as:

$$ c_{j}^{t} = {\varvec{v}}_{j} \circ {\varvec{x}}_{i}^{t:t + h - 1} + b $$

(3)

where \(j\) indexes the filters, \({\varvec{v}}_{j}\) is the *j*th filter, and \(\circ\) represents the Hadamard product, whose entries are summed to yield the scalar \(c_{j}^{t}\).

As the filter slides over \({\varvec{x}}_{i}\) and the convolution operation is done, we get a vector \({\varvec{c}}_{j}\), which is given by:

$$ {\varvec{c}}_{j} = \left[ {\begin{array}{*{20}c} {c_{j}^{1} } & {c_{j}^{2} } & \cdots & {c_{j}^{{\left( {l - h + 2p} \right)/s + 1}} } \\ \end{array} } \right]^{{\text{T}}} $$

(4)

where \(p\) is the amount of zero padding, \(s\) is the sliding stride of the kernel, and \(\left( {l - h + 2p} \right)/s + 1\) is the length of the output after the convolution operation. When \(s\) and \(p\) are fixed, the length of the output depends on the kernel size \(h\), so different kernel sizes result in different output lengths. This point matters because each branch of the proposed Multi-scale CNN uses a different kernel size.
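
The sliding-window operation and the output length formula can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names and toy values are hypothetical, and the window response is computed as the sum of the element-wise products plus the bias, which is the usual reading of the convolution in Eq. (3).

```python
def conv1d(x, v, b=0.0, stride=1, pad=0):
    """Slide an h x d kernel v over an l x d sequence x.

    Returns one scalar per window: the sum of the element-wise products
    between the kernel and the window, plus the bias b.
    """
    d = len(x[0])
    h = len(v)
    zero_row = [0.0] * d
    padded = [zero_row] * pad + list(x) + [zero_row] * pad
    outputs = []
    for start in range(0, len(padded) - h + 1, stride):
        window = padded[start:start + h]
        total = sum(v[r][c] * window[r][c] for r in range(h) for c in range(d))
        outputs.append(total + b)
    return outputs


def conv_output_length(l, h, p, s):
    """Output length (l - h + 2p) / s + 1, using integer division."""
    return (l - h + 2 * p) // s + 1


# Hypothetical toy input: l = 5 time steps, d = 2 features, kernel size h = 3.
x = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
v = [[1, 1], [1, 1], [1, 1]]
c = conv1d(x, v, b=0.5)
print(c)                                       # [7.5, 11.5, 13.5]
print(len(c), conv_output_length(5, 3, 0, 1))  # 3 3
```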

In particular, concatenating the outputs of different branches requires them to have the same length. Zero padding is therefore adopted in the convolution operation: each branch uses the amount of zero padding that makes its output the same length as its input, regardless of the kernel size. As a result, the feature map can be given by:

$$ {\varvec{c}}_{j} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {c_{j}^{1} } & {c_{j}^{2} } \\ \end{array} } & \cdots \\ \end{array} } & {c_{j}^{l} } \\ \end{array} } \right]^{{\text{T}}} $$

(5)
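
As a brief check of this padding rule, the sketch below assumes a stride of 1 and odd kernel sizes, in which case \(p = \left( {h - 1} \right)/2\) zeros on each side keep the output length equal to \(l\). The kernel sizes and the sequence length used here are hypothetical; the text does not list the values used in the six branches.

```python
def same_padding(h):
    """Zero padding per side that preserves the length for stride 1 and odd h."""
    if h % 2 == 0:
        raise ValueError("this sketch only covers odd kernel sizes")
    return (h - 1) // 2


def output_length(l, h, p, s=1):
    """Output length (l - h + 2p) / s + 1, using integer division."""
    return (l - h + 2 * p) // s + 1


l = 100  # hypothetical number of time steps
lengths = {h: output_length(l, h, same_padding(h)) for h in (1, 3, 5, 7)}
print(lengths)  # every kernel size maps to the same output length, 100
```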

#### 3.1.2 Batch Normalization

Instead of just normalizing the input of the CNN, we adopt batch normalization [32] layers to normalize the inputs within the network by using the variance and the mean of the values in the current mini-batch. In the batch normalization layer, the operation can be represented as follows:

$$ {\varvec{bn}}_{j} = {\text{BN}}\left( {{\varvec{c}}_{j} } \right). $$

(6)

As the batch normalization layer does not change the size of the feature map, we get:

$$ {\varvec{bn}}_{j} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {{\text{BN}}\left( {c_{j}^{1} } \right)} & {{\text{BN}}\left( {c_{j}^{2} } \right)} \\ \end{array} } & \cdots \\ \end{array} } & {{\text{BN}}\left( {c_{j}^{l} } \right)} \\ \end{array} } \right]^{{\text{T}}} . $$

(7)
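
The \({\text{BN}}\left( \cdot \right)\) operation can be sketched for the values of one feature collected over a mini-batch. The scale `gamma`, shift `beta` and stabilizer `eps` follow the standard formulation of batch normalization [32]; they are not spelled out in the text above, and their values here are assumptions. The sketch covers the training-time computation only.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it.

    gamma and beta are learnable in practice; fixed values are used here
    only for illustration.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased mini-batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Hypothetical activations of one feature across a mini-batch of four samples.
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)  # values with mean close to 0 and variance close to 1
```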

#### 3.1.3 Activation Function

After the convolution and batch normalization operations, an activation function is applied to introduce non-linearity, which allows the network to learn complex non-linear relationships between inputs and outputs. The convolution, batch normalization and activation operations can then be written together as:

$$ a_{j}^{t} = f\left( {{\text{BN}}\left( {{\varvec{v}}_{j}^{{\text{T}}} {\varvec{x}}_{i}^{t:t + h - 1} + b} \right)} \right), $$

(8)

where \(f\left( \cdot \right)\) is an activation function. Here, we choose Rectified Linear Units (ReLU) [43] as the activation function in our proposed model.

The above three operations result in a feature map, which can be given by

$$ {\varvec{a}}_{j} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {a_{j}^{1} } & {a_{j}^{2} } \\ \end{array} } & \cdots \\ \end{array} } & {a_{j}^{l} } \\ \end{array} } \right]^{{\text{T}}} . $$

(9)

#### 3.1.4 Max-Pooling

By introducing pooling layers in the network, the size of the preceding feature maps can be further reduced and more significant and abstract features can be extracted. Here, we adopt the max-pooling operation. In one-dimensional pooling with pooling length \(k\), the max-pooling operation slides a window over the feature map and takes the maximum over \(k\) consecutive values. Here we set the sliding stride \(s\) equal to \(k\), and as a result, the output of the max-pooling operation can be given by:

$$ {\varvec{m}}_{j} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {m_{j}^{1} } & {m_{j}^{2} } \\ \end{array} } & \cdots \\ \end{array} } & {m_{j}^{{\left( {l - k + 2p} \right)/s + 1}} } \\ \end{array} } \right]^{{\text{T}}} , $$

(10)

where \(m_{j}^{i} = \max \left( {a_{j}^{{\left( {i - 1} \right)s + 1}} ,a_{j}^{{\left( {i - 1} \right)s + 2}} , \cdots ,a_{j}^{{\left( {i - 1} \right)s + k}} } \right)\).
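
A minimal sketch of one-dimensional max-pooling with the stride equal to the pooling length is given below. The function name and the toy feature map are hypothetical, no padding is applied, and the indices in the code are zero-based.

```python
def max_pool1d(a, k, stride=None):
    """Max over windows of k consecutive values; the stride defaults to k."""
    s = k if stride is None else stride
    return [max(a[start:start + k]) for start in range(0, len(a) - k + 1, s)]


# Hypothetical feature map of length 7 with pooling length k = 2.
a_j = [1, 5, 2, 8, 3, 4, 7]
m_j = max_pool1d(a_j, k=2)
print(m_j)  # [5, 8, 4]
```

The trailing value 7 does not fill a complete window and is dropped, which matches the length \(\left( {7 - 2} \right)/2 + 1 = 3\) under integer division.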

#### 3.1.5 Concatenation

In the concatenating layer, the feature maps from different branches are concatenated into a single feature map to merge all the local and abstract features. Suppose that the *j*th output of the *i*th branch is given by:

$$ {\varvec{m}}_{ij} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {m_{ij}^{1} } & {m_{ij}^{2} } \\ \end{array} } & \cdots \\ \end{array} } & {m_{ij}^{{\left( {l - k + 2p} \right)/s + 1}} } \\ \end{array} } \right]^{{\text{T}}} . $$

(11)

Then, the output of the concatenation layer is given as:

$$ {\varvec{Concatenation}}{ = }\left[ {\begin{array}{*{20}c} {{\varvec{M}}_{{1}} } & {{\varvec{M}}_{{2}} } & \cdots & {{\varvec{M}}_{N} } \\ \end{array} } \right] $$

(12)

where \(N\) here denotes the number of branches (six in the proposed model), and \({\varvec{M}}_{i}\) can be represented as

$$ {\varvec{M}}_{i} = \left[ {\begin{array}{*{20}c} {{\varvec{m}}_{i1} } & {{\varvec{m}}_{i2} } & \cdots & {{\varvec{m}}_{{iK_{i} }} } \\ \end{array} } \right], $$

(13)

where \(K_{i}\) is the number of output feature maps of the *i*th branch.

To summarize, in the Multi-scale CNN, the shape of the input is \(n \times l \times d\), where \(n\) represents the total number of working conditions. As described above, before the concatenating layer, the output shape of the *i*th branch is \(n \times \left( {\left( {l - k + 2p} \right)/s + 1} \right) \times K_{i}\). Across the branches, kernel sizes ranging from small to large help to extract local and abstract features. Compared with the raw sequence, these multi-time-scale features better represent the properties of the working conditions. Once these features are merged in the concatenating layer, the subsequent GRU learns significant representations of the working conditions. The framework of the Multi-scale CNN is illustrated in Figure 5.
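
The merge step can be sketched for a single working condition as follows. The function name and the toy feature maps are hypothetical; each branch output is stored as a list of time steps, with \(K_{i}\) values per step, so that concatenation along the feature axis yields \(\sum_{i} K_{i}\) values per step.

```python
def concatenate_branches(branches):
    """Concatenate branch outputs along the feature axis.

    branches: list of feature maps, each a list of T rows holding K_i values.
    Returns T rows, each holding sum(K_i) values.
    """
    steps = len(branches[0])
    if any(len(branch) != steps for branch in branches):
        raise ValueError("all branches must have the same number of time steps")
    return [[value for branch in branches for value in branch[t]]
            for t in range(steps)]


# Hypothetical outputs of two branches: T = 3 steps, K_1 = 2 and K_2 = 1.
branch_1 = [[1, 2], [3, 4], [5, 6]]
branch_2 = [[7], [8], [9]]
merged = concatenate_branches([branch_1, branch_2])
print(merged)  # [[1, 2, 7], [3, 4, 8], [5, 6, 9]]
```

The length check reflects why all branches need outputs of the same length before they can be merged.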

### 3.2 Deep GRU

Under real industrial conditions, clean sample data is difficult to obtain. Compared with LSTM, GRU has fewer parameters and is therefore better suited to situations where data is insufficient. Here, on top of the Multi-scale CNN, a two-layer GRU network is designed to extract vital representations from the multi-time-scale features. The deep GRU is presented as follows.

#### 3.2.1 Gated Recurrent Unit

In a GRU, the inputs are the hidden state \({\varvec{h}}_{t - 1}\) at the previous time step \(t - 1\) and the data \({\varvec{x}}_{t}\) at the current time step \(t\), and the output is the hidden state \({\varvec{h}}_{t}\). The output \({\varvec{h}}_{t}\) depends on the previous hidden state \({\varvec{h}}_{t - 1}\), the update gate \({\varvec{z}}_{t}\), the reset gate \({\varvec{r}}_{t}\) and the candidate hidden state \(\tilde{\varvec{h}}_{t}\). The reset gate \({\varvec{r}}_{t}\) enables the unit to drop information in the hidden state that is less meaningful or irrelevant, so as to focus on the information that is more important. The update gate \({\varvec{z}}_{t}\) determines how much information from the previous hidden state and from the candidate hidden state is passed to the current hidden state [23]. The corresponding equations are given by:

$$ {\varvec{z}}_{t} = \sigma \left( {{\varvec{W}}_{z} {\varvec{x}}_{t} + {\varvec{U}}_{z} {\varvec{h}}_{t - 1} + {\varvec{b}}_{z} } \right), $$

(14)

$$ {\varvec{r}}_{t} = \sigma \left( {{\varvec{W}}_{r} {\varvec{x}}_{t} + {\varvec{U}}_{r} {\varvec{h}}_{t - 1} + {\varvec{b}}_{r} } \right), $$

(15)

$$ \tilde{\varvec{h}}_{t} = \tanh \left( {{\varvec{W}}_{h} {\varvec{x}}_{t} + {\varvec{U}}_{h} \left( {{\varvec{r}}_{t} \circ {\varvec{h}}_{t - 1} } \right) + {\varvec{b}}_{h} } \right), $$

(16)

$$ {\varvec{h}}_{t} = \left( {1 - {\varvec{z}}_{t} } \right) \circ {\varvec{h}}_{t - 1} + {\varvec{z}}_{t} \circ \tilde{\varvec{h}}_{t} , $$

(17)

where \(\sigma\) is the sigmoid activation function; \({\varvec{W}}_{z}\), \({\varvec{U}}_{z}\), \({\varvec{W}}_{r}\), \({\varvec{U}}_{r}\), \({\varvec{W}}_{h}\) and \({\varvec{U}}_{h}\) are weight matrices that are shared across time steps and learned during training; and \({\varvec{b}}_{z}\), \({\varvec{b}}_{r}\) and \({\varvec{b}}_{h}\) are learnable biases. The basic structure of a one-layer GRU is shown in Figure 6.
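
Equations (14) to (17) can be traced step by step with a small pure-Python sketch of a single GRU update. The helper names, the dictionary layout of the parameters and the all-zero toy weights are assumptions made for illustration; a trained network would use learned values.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(*vectors):
    """Element-wise sum of several vectors of equal length."""
    return [sum(values) for values in zip(*vectors)]


def gru_step(x_t, h_prev, p):
    """One GRU update following Eqs. (14)-(17); p holds weights and biases."""
    # Update gate, Eq. (14).
    z = [sigmoid(v) for v in add(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev), p["bz"])]
    # Reset gate, Eq. (15).
    r = [sigmoid(v) for v in add(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev), p["br"])]
    # Candidate hidden state, Eq. (16): the reset gate scales h_prev element-wise.
    reset_h = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    h_tilde = [math.tanh(v) for v in add(matvec(p["Wh"], x_t), matvec(p["Uh"], reset_h), p["bh"])]
    # New hidden state, Eq. (17): blend of the previous and candidate states.
    return [(1.0 - z_i) * h_i + z_i * c_i for z_i, h_i, c_i in zip(z, h_prev, h_tilde)]


def zero_params(input_dim, hidden_dim):
    """Hypothetical all-zero parameters, used only to make the example traceable."""
    def mat(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = mat(hidden_dim, input_dim)
        params["U" + gate] = mat(hidden_dim, hidden_dim)
        params["b" + gate] = [0.0] * hidden_dim
    return params


h_t = gru_step([1.0, 2.0], [1.0, -2.0], zero_params(2, 2))
print(h_t)  # [0.5, -1.0]
```

With all-zero parameters both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one, which makes the role of the update gate in Eq. (17) easy to verify by hand.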

#### 3.2.2 Deep GRU

As mentioned above, the capability of a neural network can be improved by “going deeper” [44, 45]. A deep neural network contains more non-linear operations, so more abstract features and representations can be learned. Inspired by this idea, we stack two GRU layers to obtain a deep architecture, in which each GRU layer contains a different number of units. In the deep GRU, as shown in Figure 7, the hidden state output by one layer at each time step both propagates through time within that layer and serves as the input to the next layer at the same time step. Low-level features are therefore learned and passed to the next layer, which learns higher-level representations. By stacking GRU layers, the network is able to learn essential representations at different time scales more effectively.

### 3.3 Fully Connected and Linear Regression Layer

The output representation of GRU network is flattened as *h* and then fed into a fully connected layer to be prepared for the linear regression layer. The operation of the fully connected layer can be given as follows:

$$ {\varvec{o}} = f\left( {{\varvec{Wh}} + {\varvec{b}}} \right), $$

(18)

where *o* is the output of the fully connected layer, *W* is the transformation matrix, *b* is the bias, and \(f\left( \cdot \right)\) is the activation function. We use ReLU here as the activation function. Finally, the fully connected layer’s output *o* is fed into a regression layer and the tool wear of the *i*th working condition is therefore predicted, which can be given by

$$ \hat{y}_{i} = {\varvec{Wo}}_{i} . $$

(19)
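
Equations (18) and (19) can be sketched as two small functions. The names and toy numbers are hypothetical, and the regression weights are written as a separate vector `w_reg` on the assumption that the \({\varvec{W}}\) in Eq. (19) is distinct from the transformation matrix in Eq. (18).

```python
def dense_relu(h, W, b):
    """Fully connected layer with ReLU activation, as in Eq. (18)."""
    return [max(0.0, sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)]


def regress(o, w_reg):
    """Linear regression layer without a bias term, as in Eq. (19)."""
    return sum(w * value for w, value in zip(w_reg, o))


# Hypothetical flattened GRU output and parameters.
h = [1.0, -1.0]
W = [[1.0, 2.0], [0.5, -0.5]]
b = [0.0, 0.5]
o = dense_relu(h, W, b)
y_hat = regress(o, [2.0, 4.0])
print(o, y_hat)  # [0.0, 1.5] 6.0
```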

### 3.4 Training and Regularization of MCGRU

The Mean Absolute Error (MAE) is adopted as the loss in the training process, which is given by:

$$ loss = MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|} , $$

(20)

where *n* represents the total number of samples.
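
As a quick worked instance of Eq. (20), the sketch below computes the MAE for three hypothetical wear measurements and predictions; the values and their units are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between labels and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measured and predicted tool wear values.
y = [100.0, 120.0, 150.0]
y_hat = [110.0, 115.0, 150.0]
loss = mean_absolute_error(y, y_hat)
print(loss)  # (10 + 5 + 0) / 3 = 5.0
```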

The optimizer we adopt here is Root Mean Square Propagation (RMSProp) [46]. It is a robust optimizer that uses pseudo-curvature information. RMSProp is well suited to mini-batch learning because each gradient is normalized by the magnitude of recent gradients, which enables it to handle stochastic objectives properly. It is also a common choice for recurrent neural networks such as LSTM and GRU.
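
The normalization by the magnitude of recent gradients can be sketched for a single scalar parameter. The learning rate, decay rate `rho`, stabilizer `eps` and the toy objective are assumptions for illustration; the text does not state the hyperparameters used for training.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter.

    avg_sq is the running average of squared gradients; dividing by its
    square root normalizes the step by the recent gradient magnitude.
    """
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq


# Hypothetical objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, avg_sq = 0.0, 0.0
for _ in range(1000):
    grad = 2.0 * (w - 3.0)
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
print(w)  # ends close to the minimizer w = 3
```

Because the step is normalized, its size stays on the order of the learning rate regardless of the raw gradient scale, so the parameter settles into a small neighbourhood of the minimizer rather than converging exactly.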

As mentioned above, GRU rather than LSTM is chosen in the proposed model because, in real working conditions, labeled data is usually insufficient. When the network is deep and data is insufficient, the network may be too complex to train and overfitting may occur. To mitigate this problem, regularization should be added within the network. Here, we adopt a Dropout [47] layer after the GRU network, as well as after the fully connected layer. During forward propagation, a dropout layer ignores a randomly selected subset of neurons, so the network does not rely too heavily on particular local features. In the proposed model, dropout is applied only during training, not during testing, and the dropout ratio is set to 0.3.
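
The difference between training and testing behaviour can be sketched as follows. The rescaling of the surviving activations by \(1/\left( {1 - 0.3} \right)\) ("inverted" dropout) is an assumption borrowed from common deep learning frameworks; the text only states the dropout ratio and that dropout is disabled at test time.

```python
import random


def dropout(values, rate=0.3, training=True, rng=random):
    """Zero each value with probability `rate` during training.

    Surviving values are rescaled by 1 / (1 - rate) so that the expected
    activation is unchanged (an assumption of this sketch). At test time
    the input is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, training=True))   # some entries zeroed, others scaled
print(dropout(activations, training=False))  # [1.0, 2.0, 3.0, 4.0]
```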