Practical Options for Adopting Recurrent Neural Network and Its Variants on Remaining Useful Life Prediction

The remaining useful life (RUL) of a system is generally predicted by utilising the data collected from sensors that continuously monitor different indicators. Recently, different deep learning (DL) techniques have been used for RUL prediction and have achieved great success. Because the data is often time-sequential, the recurrent neural network (RNN) has attracted significant interest due to its efficiency in dealing with such data. This paper systematically reviews RNN and its variants for RUL prediction, with a specific focus on understanding how different components (e.g., types of optimisers and activation functions) or parameters (e.g., sequence length, neuron quantities) affect their performance. After that, a case study using the well-studied NASA C-MAPSS dataset is presented to quantitatively evaluate the influence of various state-of-the-art RNN structures on the RUL prediction performance. The results suggest that the variants usually perform better than the original RNN and that, among them, Bi-directional Long Short-Term Memory generally has the best performance in terms of stability, precision and accuracy. Certain model structures may fail to produce a valid RUL prediction due to the gradient vanishing or gradient exploding problem if the parameters are not chosen appropriately. It is concluded that parameter tuning is a crucial step to achieve optimal prediction performance.


Introduction
Remaining useful life (RUL) prediction is an engineering discipline that works on the prediction of the future state or response of a given system based on synthesis observations, calibrated mathematical models, and simulations [1]. It generally refers to the study of predicting the specific time at which the system or the component will no longer be able to have its intended functional performance. Salunkhe et al. [2] regard RUL as the time left before observing a failure. Okoh et al. [3] define RUL as the time remaining for a component to perform its functional capabilities before failure. It is of great importance to predict the RUL of a component or a system in the industrial world, as it helps to prevent failures or accidents from happening. For example, the failure of the aircraft engine would often lead to major accidents and casualties [4]. Thus, it is essential to predict the RUL of the engine, implement maintenance accordingly and eventually prevent catastrophic failure. The degradation process of an operating device is a process of gradual deterioration and can be detected to a certain extent through the measurement of covariate variables [5]. In recent years, RUL prediction has attracted vast attention from both academic researchers and industrial operators.
RUL prediction approaches are generally categorised into model-based (physics-based) methods, data-driven methods and hybrid methods, which combine the first two [5]. Complex and noisy working conditions impede the construction of physical models, which makes it difficult to model complex dynamic systems [6]. In addition, physics-based models are difficult to update with online measured data, which limits their effectiveness and flexibility. In contrast, data-driven approaches, in which RUL is computed through statistical and probabilistic methods by utilising historic information and routinely monitored data of the system, are gaining popularity due to their quick implementation and the widespread deployment of low-cost sensors connected to the internet [7]. The precondition for setting up data-driven models for RUL prediction is the availability of multivariate historical data about the system behaviour, which must encompass all phases of the system operation and degradation scenarios under certain operating conditions. In recent years, artificial intelligence (AI) techniques, particularly deep learning (DL) techniques, have become more and more attractive because of the rapid growth in the industrial Internet of Things (IoT), Big Data and computing power [8]. Researchers have exploited applications of AI techniques for RUL prediction as well.
Deep learning is one of the sub-branches of machine learning, which originated from the Artificial Neural Network (ANN) and features multiple nonlinear processing layers. It intends to model hierarchical representations and predict patterns behind data by stacking multiple layers of information processing modules in hierarchical architectures. With the rapid development of computational infrastructure and the availability of a large volume of data, DL has become one of the main research topics in the field of prognostics, given its capability to capture the hierarchical relationship embedded in deep structures [9]. The published literature on DL approaches for RUL prediction mainly focuses on four representative deep architectures: Auto-encoder (AE), Deep Belief Network (DBN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) [10]. AE and DBN are often used for the pre-training of networks. For instance, Jia et al. [11] developed a stacked denoising autoencoder (SDA) which is fed with the frequency spectra of time-series data to diagnose rotating machinery. Chen et al. [12] proposed an SDA to identify the health state of certain systems with signals containing ambient noise and working condition fluctuations. Shao et al. [13] developed a deep AE based method to diagnose rotating machinery faults. Liao et al. [14] proposed to combine an enhanced Restricted Boltzmann Machine (RBM) with a novel regularisation term to automatically extract features that are suitable for RUL prediction. Gan et al. [15] presented a hierarchical diagnosis network that combines a wavelet packet transform (WPT) and DBN for consecutive identification of bearing fault location and severity. In contrast, CNN and RNN are generally used as predictive models and have been shown to outperform traditional prognosis algorithms in RUL prediction.
CNN based approaches are used more in fault diagnosis and surface integration inspection [16]. RNN, on the other hand, has gained much more attention and achieved more in the research of RUL prediction because of its ability to accommodate time sequence data [17]. Therefore, this paper systematically reviews the applications of RNN and its variants for RUL prediction in recent years. Many novel RNN based methods have been proposed, and the performance of RUL prediction has been greatly improved. However, most of these works focused only on how to achieve a better prediction performance using a certain approach. Very few researchers paid attention to other factors that also affect the prediction result, such as the optimizer, activation function, neuron number and sequence length. Taking the optimizers as an example, they are used to shape the model into its most accurate form by iteratively adjusting the weights. To the best of our knowledge, there is no research discussing how different optimizers affect the performance of RUL prediction using DL based approaches, or what underlying principle should guide their selection. To fill these research gaps, this paper not only presents an evaluation of the basic RNN and its variants on RUL prediction based on a case study on a publicly available dataset, but also investigates how different components (e.g., types of optimisers and activation functions) or parameters (e.g., sequence length, neuron quantities) of these approaches affect the overall performance (e.g., stability, precision, accuracy) of the RUL prediction.
The remainder of this paper is organized as follows. Section 2 briefly introduces the basic concept of RNN and its variants. Section 3 presents the different optimizers that are normally used in DL. Section 4 explains how activation functions affect the training of the network and demonstrates the advantages and drawbacks of different activation functions. Section 5 presents a case study that aims to evaluate the factors that influence the performance of RUL prediction based on a publicly available dataset.

RNN
In a traditional neural network, inputs are independent, while in RNN, the front neurons pass the information to the following neurons. As illustrated in Figure 1, in contrast to a traditional feed-forward neural network, an RNN can be regarded as numerous copies of the same neural network cell, in which each cell passes the message to the next through the hidden state. In other words, the output from a recurrent neuron is connected to the next one to characterise the current system state as a function of current sensing data and the preceding system state.
In an unrolled RNN, the sensing data ($\ldots, x_{t-1}, x_t, x_{t+1}, \ldots$) are fed into the corresponding neurons, which generate the corresponding hidden-state time series ($\ldots, h_{t-1}, h_t, h_{t+1}, \ldots$). The output of a single recurrent neuron can be expressed as:
$$h_t = \sigma(W_x x_t + W_h h_{t-1} + b), \tag{1}$$
$$y_t = \mathrm{softmax}(W_y h_t + c), \tag{2}$$
where $W_x$, $W_h$ and $W_y$ represent the input, recurrent and output weights, respectively. The symbols $b$ and $c$ denote the bias terms and $\sigma$ is the activation function, with the hyperbolic tangent or ReLU being commonly used in RNN. $y_t$ is the output of the recurrent neuron based on the hidden state $h_t$, which can be referred to as a memory space containing the information of the current input and the former hidden state $h_{t-1}$. It is worth mentioning that all the weights are shared at every step, which means that the same task is repeated at every step with different inputs and the memory is renewed accordingly.

LSTM
The main issues of the standard RNN are gradient exploding and gradient vanishing. These issues might happen when the network is too deep. In other words, when the number of time steps is too large, the information carried in the earlier neurons will be lost because no structure in a standard recurrent layer individually controls the flow of the memory itself. To solve this problem, the Long Short-Term Memory (LSTM) network, a modified structure of the recurrent cell that incorporates the standard recurrent layer along with additional "memory" control gates, has been proposed. The basic structures of RNN, LSTM and GRU are illustrated in Figure 2.
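The dependence of the gradient on the number of time steps can be illustrated with a small numerical sketch. It is not taken from the reviewed works: it assumes a linearised recurrence $h_t = W_h h_{t-1}$ (the derivative of the activation function is taken as 1) and a hypothetical recurrent weight matrix whose singular values all equal `scale`, so that the norm of $\partial h_T / \partial h_0$ is simply `scale ** steps`.

```python
import numpy as np


def gradient_norm_through_time(scale, steps=50, size=8, seed=0):
    """Spectral norm of d h_T / d h_0 for the linearised recurrence h_t = W_h h_{t-1}."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix from a QR decomposition, rescaled so that every
    # singular value of W_h equals `scale` (illustrative assumption).
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w_h = scale * q
    jacobian = np.eye(size)
    for _ in range(steps):
        jacobian = w_h @ jacobian  # chain rule across one time step
    return float(np.linalg.norm(jacobian, ord=2))


vanishing = gradient_norm_through_time(scale=0.5)  # equals 0.5 ** 50 up to rounding
exploding = gradient_norm_through_time(scale=1.5)  # equals 1.5 ** 50 up to rounding
print(f"scale 0.5: {vanishing:.2e}   scale 1.5: {exploding:.2e}")
```

With a saturating activation such as tanh, whose derivative is at most 1, the true gradient norm would be no larger than in this linearised sketch, which is why vanishing gradients are the more common failure in long sequences.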
The original LSTM was developed by Hochreiter and Schmidhuber [18] after the vanishing and exploding gradient issue had been identified in traditional RNNs. LSTM uses storage elements to transfer information from the past output, instead of making the output of the RNN cell a non-linear function of the weighted sum of the current input and the previous output.
In other words, instead of using a hidden state $h$ only, LSTM adopts a cell state $C$ to keep the long-term information, as shown in Figure 3. The main concept of LSTM is to use three gates (forget gate, input gate and output gate) to control the cell state $C$. The forget gate controls how much information from the previous cell state $C_{t-1}$ is passed to the current cell state $C_t$; the input gate decides how much of the input should be kept in the current cell state $C_t$; and the output gate determines the output $h_t$ from the current cell state $C_t$.
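The gating scheme described above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration under assumptions rather than the implementation used in the case study: the parameter layout (a dictionary mapping each gate name to a tuple of input weights, recurrent weights and bias), the helper `init_params`, the hidden size of 4 and the random toy sequence of 50 cycles with 21 channels are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; `params` maps a gate name to (W, U, b)."""
    def gate(name, activation):
        w, u, b = params[name]
        return activation(w @ x_t + u @ h_prev + b)

    f_t = gate("f", sigmoid)      # forget gate: share of C_{t-1} that is kept
    i_t = gate("i", sigmoid)      # input gate: share of the candidate that is stored
    o_t = gate("o", sigmoid)      # output gate: share of the cell state that is exposed
    c_tilde = gate("c", np.tanh)  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases (hypothetical initialisation)."""
    rng = np.random.default_rng(seed)
    return {
        name: (0.1 * rng.standard_normal((n_hidden, n_in)),
               0.1 * rng.standard_normal((n_hidden, n_hidden)),
               np.zeros(n_hidden))
        for name in ("f", "i", "o", "c")
    }


# Toy run: 50 time steps with 21 input channels and 4 hidden units.
params = init_params(n_in=21, n_hidden=4)
h, c = np.zeros(4), np.zeros(4)
sequence = np.random.default_rng(1).random((50, 21))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Because the cell state is updated additively (`f_t * c_prev + i_t * c_tilde`) rather than being squashed through an activation at every step, information can survive over many time steps when the forget gate stays close to 1.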
The output of LSTM at step $t$ is calculated using the following equations:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \tag{3}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \tag{5}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \tag{6}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{c}_t, \tag{7}$$
$$h_t = o_t \odot \tanh(C_t), \tag{8}$$
where $W$, $U$ and $b$ are the trainable weights and biases, respectively, and $i$, $f$ and $o$ represent the input gate, forget gate and output gate, respectively. These three gates have the same shape with different parameters $U$ and $W$, which need to be learned during training. The candidate state $\tilde{c}_t$ cannot be used directly: it must pass through the input gate before it is used to calculate the internal storage $C_t$. $C_t$ is affected not only by the hidden state but also by $C_{t-1}$, which is controlled by the forget gate. The output $h_t$ is obtained by applying a tanh function to $C_t$ and scaling the result with the output gate. The existence of the gates enables LSTM to capture the long-term dependencies in the sequence, and by learning the gate parameters, the network can find the appropriate internal storage. Therefore, LSTMs are naturally suited for RUL prediction tasks using sensor data with an inherent sequential nature, due to their capability of remembering information over long periods. Yuan et al. [19] proposed an LSTM approach for different types of faults, where the C-MAPSS dataset was used as the case study. Compared to the traditional RNN, Gated Recurrent Unit LSTM (GRU-LSTM) and AdaBoost-LSTM showed improved performance in all cases. They developed a vanilla LSTM approach two years later which further improved the prediction performance significantly [20]. A multi-layer LSTM approach provided by Zheng et al. [17] investigated the hidden patterns from sensors and operational data with multiple operating conditions, fault and degradation models by combining multiple layers of LSTM cells with standard feed-forward layers. The superiority of this approach in RUL prediction was validated by three widely used data

GRU
The shortcoming of LSTMs is that they are usually time-consuming to train because of the forget gate, input gate and output gate added to the structure of the memory blocks. To address this problem, an improved structure, named Gated Recurrent Unit (GRU), was proposed [21]. GRU is a more recent generation of RNN, and it looks very similar to LSTM. Instead of using the cell state, GRU uses the hidden state to transfer information. Moreover, it only has two gates (a reset gate and an update gate) instead of three. Similar to the forget and input gates of LSTM, the function of the update gate is to decide what information to keep and what to throw away. The function of the reset gate is to decide how much of the past information is used when forming the new candidate state.
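The output of GRU at step $t$ is calculated using the following equations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \tag{9}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \tag{10}$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h), \tag{11}$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t, \tag{12}$$
where $z_t$ and $r_t$ are the update gate and the reset gate, respectively, and $\tilde{h}_t$ is the candidate hidden state. Since there are fewer tensor operations in GRU, it trains relatively faster than LSTM. However, its accuracy may fall behind that of LSTM because it has fewer gates. Thus, when the computational resource is limited, or fast training is required, GRU could be a good option. For instance, Chen et al. [22] adopted a GRU network to predict the RUL for a complex system featuring multiple components, multiple states and a large number of parameters.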
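The two-gate structure and its lower cost relative to LSTM can be sketched as follows. This is an illustrative NumPy version under assumptions, not the code used in the reviewed works: the parameter layout, the helper `count_parameters` and the layer sizes (21 inputs, 128 hidden units) are hypothetical, and the state update follows the common convention in which the update gate weights the previous hidden state.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, params):
    """One GRU time step; `params` maps "z", "r", "h" to (W, U, b)."""
    w_z, u_z, b_z = params["z"]
    w_r, u_r, b_r = params["r"]
    w_h, u_h, b_h = params["h"]
    z_t = sigmoid(w_z @ x_t + u_z @ h_prev + b_z)  # update gate
    r_t = sigmoid(w_r @ x_t + u_r @ h_prev + b_r)  # reset gate
    h_tilde = np.tanh(w_h @ x_t + u_h @ (r_t * h_prev) + b_h)  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde


def count_parameters(n_blocks, n_in, n_hidden):
    """Trainable parameters for `n_blocks` weight sets of the form (W, U, b)."""
    return n_blocks * (n_hidden * n_in + n_hidden * n_hidden + n_hidden)


rng = np.random.default_rng(0)
n_in, n_hidden = 21, 128
params = {
    name: (0.1 * rng.standard_normal((n_hidden, n_in)),
           0.1 * rng.standard_normal((n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for name in ("z", "r", "h")
}
h = np.zeros(n_hidden)
for x_t in rng.random((50, n_in)):
    h = gru_step(x_t, h, params)

gru_params = count_parameters(3, n_in, n_hidden)   # two gates + candidate state
lstm_params = count_parameters(4, n_in, n_hidden)  # three gates + candidate state
print(gru_params, lstm_params)
```

In this sketch a GRU layer needs three weight sets where an LSTM layer needs four, so for the same layer size it carries three quarters of the parameters, which is consistent with the shorter training time noted above.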

Bi_directional LSTM
In recent years, another variant of RNN called Bi_directional LSTM (Bi_LSTM) has appeared frequently in the literature. The Bi-directional LSTM is proposed with the information flowing back to the former LSTM cells. The forward flow of information can discover the system variation, and the information flows back to smooth the predictions, as illustrated in Figure 4. The outputs of the forward path and the backward path are then concatenated. The governing equations of Bi-directional LSTM can be presented as:
$$\overrightarrow{h}_i = \mathrm{LSTM}(x_i, \overrightarrow{h}_{i-1}), \tag{13}$$
$$\overleftarrow{h}_i = \mathrm{LSTM}(x_i, \overleftarrow{h}_{i+1}), \tag{14}$$
$$y_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i], \tag{15}$$
where Eq. (13) refers to the forward path, Eq. (14) refers to the backward path, and $y_i$ is the output of the Bi-directional LSTM obtained by fusing the results from both directional paths.
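The two-pass structure (one pass over the sequence in time order, one in reverse order, followed by concatenation) can be sketched independently of the cell type. In the minimal example below a plain tanh recurrent cell is used as a stand-in for the LSTM cell to keep the code short; the helper names `make_cell`, `run` and `bidirectional`, the layer sizes and the random toy data are all assumptions made for illustration.

```python
import numpy as np


def make_cell(n_in, n_hidden, seed):
    """Stand-in recurrent cell (plain tanh RNN) used here in place of an LSTM cell."""
    rng = np.random.default_rng(seed)
    w_x = 0.1 * rng.standard_normal((n_hidden, n_in))
    w_h = 0.1 * rng.standard_normal((n_hidden, n_hidden))

    def step(x_t, h_prev):
        return np.tanh(w_x @ x_t + w_h @ h_prev)

    return step


def run(cell, sequence, n_hidden):
    """Apply `cell` along the sequence and return the hidden state at every step."""
    h = np.zeros(n_hidden)
    states = []
    for x_t in sequence:
        h = cell(x_t, h)
        states.append(h)
    return np.stack(states)


def bidirectional(sequence, fwd_cell, bwd_cell, n_hidden):
    forward = run(fwd_cell, sequence, n_hidden)
    # Process the reversed sequence, then flip back so that row i aligns with x_i.
    backward = run(bwd_cell, sequence[::-1], n_hidden)[::-1]
    return np.concatenate([forward, backward], axis=1)


sequence = np.random.default_rng(0).random((50, 21))
outputs = bidirectional(sequence,
                        make_cell(21, 8, seed=1),
                        make_cell(21, 8, seed=2),
                        n_hidden=8)
print(outputs.shape)
```

Each output row therefore combines a summary of the past (forward half) with a summary of the future (backward half) of the same window, which is only possible when the whole input window is available before prediction.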
As for the application, Zhao et al. [23] presented an integrated approach of CNN and bi-directional LSTM for machining tool wear prediction named Convolutional Bi-directional Long Short-Term Memory (CBLSTM) networks. CNN was first used to extract local robust features from the sequential input. Then, Bi-directional LSTM was utilised to encode temporal information. The proposed CBLSTM's capability of predicting the RUL of actual tool wear based on raw sensory data was verified with a real-life tool wear test. Zhang et al. [24] presented a Bi-directional LSTM network to discover the underlying patterns embedded in time-series to track the system degradation. The Bi-directional LSTM network was implemented to track the variation of the health index, and the RUL was predicted by the recursive one-step-ahead method. Elsheikh et al. [25] built a Bidirectional Handshaking LSTM (BHLSTM) network for RUL prediction, where short sequences of monitored observations were given with random initial wear. This method was able to predict the RUL with a random start, which makes it more suitable for real-world application as the initial condition of physical systems is usually unknown, especially in terms of manufacturing deficiencies.

Optimizer
Gradient descent is by far the most commonly used way to optimise neural networks [26]. It is an iterative optimization algorithm used to find the values of the parameters or coefficients of a function that minimize a cost function. Although various algorithms have been developed to optimize gradient descent, they are usually used as black-box optimizers because it is hard to find practical explanations of their strengths and weaknesses. Depending on how much data is used to compute the gradient of the objective function, gradient descent variants are classified into two categories: batch gradient descent (BGD) and stochastic gradient descent (SGD). BGD is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces. However, BGD can be very time-consuming because it needs to calculate the gradients for the whole dataset to perform just one update, and thus it is intractable for datasets that do not fit in memory. In addition, BGD cannot be used to update the model online. In contrast, SGD performs an update for one training example at a time, and thus it avoids the redundant computations that BGD performs for large datasets. As a result, SGD is usually much faster than BGD, and it can be used to learn the model online. The drawback of SGD is that the frequent updates with a high variance cause the objective function to fluctuate heavily. However, if the learning rate is slowly decreased over time, SGD shows the same convergence behaviour as BGD, almost certainly converging to a local minimum for non-convex optimization and to the global minimum for convex optimization.
Although SGD can often lead to good convergence, a few challenges need to be addressed. For instance, it is difficult to determine a proper learning rate and an annealing schedule, it is hard to update different features to different extents, and the optimizer may get trapped in suboptimal minima. Ruder [26] outlines some algorithms that are widely used by the deep learning community to deal with these challenges, including Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax and Nadam. Ruder also stated that Adagrad, Adadelta, RMSprop and Adam can all significantly improve the robustness of SGD and do not need much manual tuning of the learning rate. These four optimizers are therefore selected and discussed in more detail in this paper.

Adagrad
Adagrad is a gradient-based optimizer that adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. Thus, it is very suitable for sparse data. It uses a different learning rate for every parameter $\theta_i$ at every time step $t$. The gradient of the objective function $J$ with respect to the parameter $\theta_i$ at time step $t$ is written as:
$$g_{t,i} = \nabla_{\theta} J(\theta_{t,i}). \tag{16}$$
The SGD update for every parameter $\theta_i$ at each time step $t$ follows the equation:
$$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}. \tag{17}$$
Adagrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}, \tag{18}$$
where $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix in which each diagonal element $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$, and $\epsilon$ is a smoothing term used to avoid division by zero.
One of the main advantages of Adagrad is that the learning rate does not need to be tuned manually; a default value of 0.01 is commonly used. The main drawback of this optimizer is that the accumulation of the squared gradients in the denominator causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm can no longer acquire additional knowledge.
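The shrinking learning rate can be made visible with a short sketch of the Adagrad rule (accumulate the squared gradients per parameter and divide the base learning rate by the square root of the accumulator). The quadratic toy objective $J(\theta) = \tfrac{1}{2}\lVert\theta\rVert^2$, the starting point and the number of steps are assumptions chosen only for illustration; the learning rate of 0.01 is the default mentioned above.

```python
import numpy as np


def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad update; `accum` holds the per-parameter sum of squared gradients."""
    accum = accum + grad ** 2
    theta = theta - lr / np.sqrt(accum + eps) * grad
    return theta, accum


start = np.array([1.0, -2.0])
theta = start.copy()
accum = np.zeros_like(theta)
effective_rates = []
for _ in range(100):
    grad = theta.copy()  # gradient of the toy objective 0.5 * ||theta||^2
    theta, accum = adagrad_step(theta, grad, accum)
    effective_rates.append(0.01 / np.sqrt(accum + 1e-8))

print("first effective rate:", effective_rates[0])
print("last effective rate: ", effective_rates[-1])
```

Because the accumulator can only grow, the per-parameter effective learning rate decreases at every step, which is the behaviour that Adadelta and RMSprop are designed to avoid.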

Adadelta
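To counter the monotonically decreasing learning rate, an extension of Adagrad, named Adadelta, has been proposed. It restricts the accumulation of past gradients to a fixed-size window instead of accumulating all past squared gradients. Rather than storing the previous squared gradients, the sum of the gradients is recursively defined as a decaying average of all past squared gradients. Thus, the running average of the squared gradients of the objective function at time step $t$ depends on the previous average and the current gradient:
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \tag{19}$$
where $\gamma$ is the decay factor, i.e. the fraction of the previous average that is carried over to the current one, which is normally set to 0.9 [26]. Written in terms of the parameter update vector $\Delta\theta_t$, the SGD update at each time step $t$ becomes:
$$\Delta\theta_t = -\eta \cdot g_{t,i}, \tag{20}$$
$$\theta_{t+1} = \theta_t + \Delta\theta_t. \tag{21}$$
Following the update rule of Adagrad and simply replacing the diagonal matrix $G_t$ with the decaying average of past squared gradients $E[g^2]_t$, the parameter update vector can be derived as:
$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t. \tag{22}$$
As $\sqrt{E[g^2]_t + \epsilon}$ is the root mean squared (RMS) error criterion of the gradient, it can then be written as:
$$\Delta\theta_t = -\frac{\eta}{\mathrm{RMS}[g]_t} \, g_t. \tag{23}$$
Since the update should have the same hypothetical units as the parameter, the exponentially decaying average of the squared parameter updates should be used:
$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta_t^2, \tag{24}$$
$$\mathrm{RMS}[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}. \tag{25}$$
Since $\mathrm{RMS}[\Delta\theta]_t$ is unknown at time step $t$, it is approximated by its value at the previous time step, and replacing the learning rate $\eta$ with $\mathrm{RMS}[\Delta\theta]_{t-1}$ yields:
$$\Delta\theta_t = -\frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t. \tag{26}$$
Based on the update rule of Adadelta, there is no need to set a default learning rate.

RMSprop
RMSprop is an adaptive learning rate method designed for neural networks, which has been growing in popularity in recent years. Similar to Adadelta, the central idea of RMSprop is to keep a moving average of the squared gradients for each weight and then divide the gradient by the square root of this mean square:
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \tag{27}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t. \tag{28}$$
Unlike Adadelta, RMSprop still requires a learning rate; good default values for the decay parameter $\gamma$ and the learning rate $\eta$ are 0.9 and 0.001, respectively.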
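The difference between the two rules (RMSprop keeps a learning rate, whereas Adadelta replaces it with a running RMS of its own past updates) can be sketched as follows. The toy quadratic objective, the starting point, the number of steps and the value `eps=1e-6` used for Adadelta are assumptions made for this illustration; the values 0.9 and 0.001 are the defaults quoted above.

```python
import numpy as np


def rmsprop_step(theta, grad, avg_sq_grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update with a decaying average of squared gradients."""
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2
    theta = theta - lr / np.sqrt(avg_sq_grad + eps) * grad
    return theta, avg_sq_grad


def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, gamma=0.9, eps=1e-6):
    """One Adadelta update; no learning rate appears in the rule."""
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grad ** 2
    # RMS of the previous updates divided by RMS of the gradients.
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = gamma * avg_sq_delta + (1.0 - gamma) * delta ** 2
    return theta + delta, avg_sq_grad, avg_sq_delta


def loss(theta):
    return 0.5 * float(np.sum(theta ** 2))  # toy objective, gradient = theta


start = np.array([1.0, -2.0])

theta_rms, avg_g = start.copy(), np.zeros(2)
for _ in range(100):
    theta_rms, avg_g = rmsprop_step(theta_rms, theta_rms.copy(), avg_g)

theta_ada, avg_g, avg_d = start.copy(), np.zeros(2), np.zeros(2)
for _ in range(100):
    theta_ada, avg_g, avg_d = adadelta_step(theta_ada, theta_ada.copy(), avg_g, avg_d)

print(loss(start), loss(theta_rms), loss(theta_ada))
```

In both cases the denominator is a decaying average rather than a growing sum, so the effective step size does not collapse over time as it does for Adagrad.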

Adam
Another method that computes adaptive learning rates for each parameter is Adaptive Moment Estimation (Adam) [27]. Adam not only stores an exponentially decaying average of past squared gradients $v_t$, but also keeps an exponentially decaying average of past gradients $m_t$, as shown in Eqs. (29) and (30):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{29}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{30}$$
where $m_t$ refers to the estimate of the first moment (the mean) of the gradients and $v_t$ refers to the estimate of the second moment (the uncentered variance) of the gradients. As $m_t$ and $v_t$ are initialised as vectors of zeros, they are biased towards zero, especially during the initial time steps and when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1). The biases are counteracted by computing bias-corrected first and second moment estimates:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \tag{31}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \tag{32}$$
Therefore, the update rule of Adam can be derived as:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t. \tag{33}$$
The proposed values for $\beta_1$, $\beta_2$ and $\epsilon$ are 0.9, 0.999 and $10^{-8}$, respectively.
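The effect of the bias correction can be checked with a small sketch of the Adam update. The toy objective, the starting point and the learning rate of 0.001 are assumptions made for illustration (the text above only specifies $\beta_1$, $\beta_2$ and $\epsilon$). At the first step the corrected moments equal the gradient and its square, so each parameter moves by almost exactly one learning rate in the direction opposite to the sign of its gradient.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


start = np.array([1.0, -2.0])
theta, m, v = start.copy(), np.zeros(2), np.zeros(2)

# First step on the toy objective 0.5 * ||theta||^2, whose gradient is theta.
theta_after_first, m, v = adam_step(theta, theta.copy(), m, v, t=1)
print(theta_after_first)

theta = theta_after_first
for t in range(2, 101):
    theta, m, v = adam_step(theta, theta.copy(), m, v, t=t)
print(theta)
```

Without the two correction terms the first update would be scaled by $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16$, which shows how strongly the zero initialisation distorts the uncorrected moment estimates in the early steps.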

Activation Function
The activation function is a function that works on a neuron in an ANN and maps the input of the neuron to its output. More specifically, each neuron node in the neural network takes the output of the neurons in the upper layer as its input and passes its own output to the next layer (hidden layer or output layer). Thus, the activation function refers to the functional relationship between the output of the upper node and the input of the lower node in the multilayer neural network. Without an activation function, the input of each layer will be a linear function of the output of the upper layer. No matter how many layers the neural network has, the output is then just a linear combination of the input, which is similar to the original perceptron. To enable the neural network to approximate arbitrary nonlinear functions, the activation function introduces a nonlinear factor to the neuron. Nonlinear activation functions allow the network to learn complex data and complex function mappings that represent the nonlinearity between input and output. There are three types of activation functions normally used in the deep learning area: tanh and sigmoid, ReLU, and Swish. In this section, the basic mathematical expressions of these three types of activation functions are reviewed together with their advantages and drawbacks. The expressions of these activation functions and their variants are demonstrated in Figure 5.

Sigmoid & Tanh
The Sigmoid function, expressed in Eq. (34), also known as the Logistic function, is normally used for the output of the hidden layer neurons:
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{34}$$
The advantage of the Sigmoid function is that its output is limited between 0 and 1, which results in a stable optimization and makes it suitable for the output layer. The drawback is that the function can be very insensitive to small changes in the input when a variable takes a very large positive or negative value. During backpropagation, the weights will hardly be updated when the gradient gets close to zero. Therefore, the gradient will vanish, and the network will not be able to complete its training. In addition, the output of the sigmoid function is not zero mean, which leads to the input of the neurons in the following layer being non-zero mean, and this in turn affects the gradient. Besides, due to the exponential form of the sigmoid function, the computational cost is relatively high.
The Tanh function, expressed in Eq. (35), is also called the hyperbolic tangent function:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \tag{35}$$
The Tanh function is a scaled and shifted version of the sigmoid function: $\tanh(x) = 2\sigma(2x) - 1$. The Tanh function often outperforms sigmoid in practice because its output is zero mean. Nevertheless, it still suffers from gradient saturation and a relatively high computational cost.
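Both the relation between the two functions and the saturation effect can be verified numerically. The sketch below uses the standard derivatives $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$; the evaluation points are arbitrary choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2


x = np.linspace(-10.0, 10.0, 201)

# tanh is a scaled and shifted sigmoid.
identity_holds = bool(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))

# Gradients are largest at the origin and collapse for large |x| (saturation).
print("identity holds:", identity_holds)
print("sigmoid gradient at 0 and 10:", sigmoid_grad(0.0), sigmoid_grad(10.0))
print("tanh gradient at 0 and 10:   ", tanh_grad(0.0), tanh_grad(10.0))
```

The maximum slope of the sigmoid is 0.25, so in a deep or long-unrolled network each saturating layer or time step can only shrink the backpropagated gradient, which is the mechanism behind the vanishing gradient described above.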

ReLU
The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep neural networks. It is the default activation function for most feed-forward neural networks. The ReLU function is written as:
$$f(x) = \max(0, x). \tag{36}$$
The advantage of the ReLU function is that the SGD algorithm converges faster with it than with sigmoid or tanh. When the input is larger than zero, there are no problems such as gradient saturation and gradient vanishing. Since there is no need to carry out an exponential operation and only a threshold is needed to obtain the activation value, the computational cost is relatively low. The limitation of the ReLU function is that its output is not zero mean either. Besides, the Dead ReLU Problem occurs when the input falls in the negative region. During training, when $x$ is less than zero, the gradient of the current neuron and the neurons after it is always zero. In other words, the neuron will no longer respond to any data and the corresponding parameters will never be updated. To solve this problem, Leaky ReLU, Parametric Rectified Linear Unit (PReLU) and Exponential Linear Unit (ELU) were introduced.
Leaky ReLU function:
$$f(x) = \max(0.01x, x). \tag{37}$$
PReLU function:
$$f(x) = \max(\alpha x, x). \tag{38}$$
ELU:
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0. \end{cases} \tag{39}$$
The Leaky ReLU uses a small slope of 0.01 in the negative region so that the neuron can still be activated there. The difference between Leaky ReLU and PReLU is that $\alpha$ of the PReLU function is learned through backpropagation. ELU has all the advantages of ReLU without the Dead ReLU Problem. It pushes the mean activation of the neurons close to zero and, at the same time, is more robust to noise. However, because of the exponential form, its computational cost is relatively higher.
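The behaviour of the four functions in the negative region can be compared with a few lines of NumPy. This is an illustrative sketch: PReLU is represented by passing an explicit `alpha` to the leaky form (in a real network `alpha` would be a trainable parameter), and the value `alpha=0.25` and the sample inputs are arbitrary assumptions.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU; PReLU has the same form with a learned `alpha`."""
    return np.where(x > 0, x, alpha * x)


def elu(x, alpha=1.0):
    # np.minimum keeps the exponential from overflowing for large positive x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


x = np.array([-2.0, 0.0, 3.0])
print("ReLU:      ", relu(x))
print("Leaky ReLU:", leaky_relu(x))
print("PReLU:     ", leaky_relu(x, alpha=0.25))  # alpha would be learned in practice
print("ELU:       ", elu(x))
```

Only ReLU maps every negative input to exactly zero with zero slope; the three variants keep a non-zero response and gradient for negative inputs, which is what prevents a neuron from becoming permanently inactive.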

Swish
Swish is a self-gated activation function proposed by Ramachandran et al. [28], who used an automated search technique to find novel activation functions that could replace the ReLU function without changing the network architecture. By combining exhaustive and reinforcement learning-based search, they found a number of promising novel activation functions and named the best one Swish.
The Swish function can be written as:
$$f(x) = x \cdot \sigma(\beta x), \tag{40}$$
where $\sigma$ is the sigmoid function and $\beta$ is a constant or trainable parameter.
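The role of $\beta$ can be illustrated numerically: for $\beta = 0$ the function reduces to the scaled linear function $x/2$, while for large $\beta$ it approaches ReLU. The sketch below checks both limits; the input grid and the value $\beta = 50$ are arbitrary choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def swish(x, beta=1.0):
    """Swish activation: the input gated by a sigmoid of itself."""
    return x * sigmoid(beta * x)


x = np.linspace(-5.0, 5.0, 101)
linear_limit = bool(np.allclose(swish(x, beta=0.0), x / 2.0))
relu_limit = bool(np.allclose(swish(x, beta=50.0), np.maximum(0.0, x), atol=1e-2))
print(linear_limit, relu_limit)
print("swish(-1) with beta=1:", swish(-1.0))
```

Unlike ReLU, Swish is smooth and non-monotonic: for moderately negative inputs it returns a small negative value instead of zero.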

Benchmark Dataset Overview
The case study investigates the influence of various practical options, namely optimizers, activation functions and other parameters such as sequence length and neuron number, when adopting RNN and its variants for RUL prediction. We selected NASA's Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset, which models the damage propagation of aircraft gas turbine engines [29]. This engine simulator produced four subsets, each of which includes three operational condition indicators. Each subset has a different number of engines with varied operational cycles.
In the dataset, engine profiles were simulated with different initial degradation conditions. Maintenance was not considered during the simulation. The dataset includes one training set and one testing set for each subset. The training set consists of the historical run-to-failure measurement records of the engines from 21 on-board sensors. The objective is to predict the RUL of each engine based on the given sensor measurements. The information of the four subsets is listed in Table 1. Specifically, FD001 refers to the engine failure arising from the high-pressure compressor under a single operating condition. FD002 refers to the engine failure from the high-pressure compressor under six operating conditions. FD003 refers to the engine failure from both the high-pressure compressor and the fan under a single operating condition. FD004 refers to the engine failure from both the high-pressure compressor and the fan under six operating conditions. In this case study, FD001 was used because its data volume is relatively small, which makes the tests more time-efficient.

Data pre-processing
The raw sensor data were normalised to [0,1]. No dimension reduction or feature extraction was performed in this case study, and the entire sensor data stack was used as the input for training. In addition, since there is no target output in the raw datasets, the RUL was labelled at every cycle for each sample before training the models.
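The pre-processing steps can be sketched as follows. This is an illustration under stated assumptions rather than the exact pipeline of the case study: the RUL label is assumed to be linear (the number of cycles left until the last recorded cycle), each training sample is assumed to be a sliding window labelled with the RUL at its final cycle, and the engine record is random toy data with a hypothetical length of 192 cycles instead of the real C-MAPSS files.

```python
import numpy as np


def min_max_normalise(data):
    """Scale each sensor channel (column) to the range [0, 1]."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant channels are mapped to 0
    return (data - lo) / span


def label_rul(n_cycles):
    """Assumed linear RUL label: cycles remaining until the last recorded cycle."""
    return np.arange(n_cycles - 1, -1, -1)


def sliding_windows(features, labels, seq_len):
    """Cut one engine record into overlapping windows labelled at the window end."""
    n_windows = len(features) - seq_len + 1
    xs = np.stack([features[i:i + seq_len] for i in range(n_windows)])
    ys = np.asarray(labels[seq_len - 1:])
    return xs, ys


# Toy run-to-failure record: 192 cycles of 21 sensor channels.
raw = np.random.default_rng(0).normal(size=(192, 21))
normalised = min_max_normalise(raw)
rul = label_rul(len(raw))
x_train, y_train = sliding_windows(normalised, rul, seq_len=50)
print(x_train.shape, y_train.shape, y_train[0], y_train[-1])
```

The window length plays the role of the sequence length studied in the case study: a longer window gives the network more history per sample but reduces the number of samples that can be cut from each engine record.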

Performance Evaluation
In this case study, the mean square error (MSE) was used to evaluate the performance of the trained neural networks. The mathematical expression is:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} d_i^2, \tag{41}$$
where $n$ is the total number of true RUL targets in the related test set and $d_i$ refers to the difference between the true RUL and the predicted RUL of the $i$-th target.
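As a worked example of this metric, the short sketch below computes the MSE for four RUL targets. The numbers are hypothetical and chosen only so that the arithmetic is easy to follow: the differences are 2, -2, -6 and 0, so the MSE is (4 + 4 + 36 + 0) / 4 = 11.

```python
import numpy as np


def mean_squared_error(true_rul, predicted_rul):
    """Mean of the squared differences between true and predicted RUL."""
    d = np.asarray(true_rul, dtype=float) - np.asarray(predicted_rul, dtype=float)
    return float(np.mean(d ** 2))


true_rul = [112, 98, 69, 82]        # hypothetical targets (cycles)
predicted_rul = [110, 100, 75, 82]  # hypothetical predictions (cycles)
mse = mean_squared_error(true_rul, predicted_rul)
print(mse)
```

Because the differences are squared, one large error (here the difference of 6 cycles) dominates the score, and early and late predictions are penalised equally.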
The RNN algorithm and its variants were tested with the dataset FD001, and three different layer structures were used for each method. Each algorithm and structure was tested five times to obtain statistical results, which are illustrated in the form of box plots. The results are presented and discussed according to four main factors (optimizers, activation functions, neuron numbers and sequence lengths) against three assessment criteria (stability, precision and accuracy). The ranks for precision and accuracy are presented for these four factors. As for the stability, if the network can produce a reliable result, it is marked as 1; otherwise, it is marked as 0.

Optimizers
Different neural network structures were tested with a fixed activation function (ReLU), neuron number (128) and sequence length (50), and four different optimizers: RMSprop, Adam, AdaGrad and AdaDelta. The prediction results are displayed using box plots so that the stability, precision and accuracy of these optimizers can be evaluated. As indicated in Figure 6, gradient exploding or gradient vanishing took place when adopting AdaGrad in RNN_2LAYERS, RNN_3LAYERS, LSTM_2LAYERS, LSTM_3LAYERS and GRU_3LAYERS, and when adopting AdaDelta in RNN_LAYERS and Bi_LSTM_3LAYERS. This observation suggests that RMSprop and Adam are less sensitive to the parameters than AdaGrad and AdaDelta, which means they are more workable in this case. More specifically, AdaGrad and AdaDelta are more likely to lose their stability when the network gets more complicated. In terms of accuracy, generally speaking, AdaGrad helps to achieve the most accurate prediction result in most network structures, regardless of the stability. As for RMSprop, a change in the number of layers makes a great difference to the prediction performance. In contrast, this influence can hardly be seen when adopting Adam and AdaDelta as the optimizers. As for the precision, RMSprop has the worst performance among these four optimizers, whereas the other three can all produce relatively precise outcomes.
The assessment of the four optimizers has been made for all network structures, following the example set out in Table 2, and the optimal optimizers are summarized in Table 3. In this case, AdaGrad can be regarded as the optimal optimizer for most of the network structures.

Activation Functions
In this section, the evaluation of five activation functions is performed with a fixed optimizer (Adam), neuron number (128) and sequence length (50). Both Sigmoid and Tanh functions have also been tested, but these two activation functions were found to be greatly affected by the gradient vanishing and gradient exploding problem. Therefore, these two functions are not discussed in this section. As demonstrated in Figure 7, the performance of ReLU, Leaky_ReLU, PReLU and ELU is quite similar, and they are generally better than Swish in both precision and accuracy. However, gradient exploding or gradient vanishing occurred when adopting ReLU, PReLU and ELU in RNN_1LAYER and GRU_3LAYERS, which suggests that Swish and Leaky_ReLU are more stable than these three activation functions. The optimal activation functions in this case for different algorithms are listed in Table 4.

Sequence Length
In this section, the impact of different sequence lengths on the prediction result is compared with a fixed optimizer (Adam), activation function (ReLU) and neuron number (128). As indicated in Figure 8, in general, the longer the sequence length, the better the performance the algorithms achieve. In this case, gradient vanishing happened when adopting the GRU_3LAYERS network structure, which suggests that the choice of the sequence length may also affect the workability of GRU. The optimal sequence length for different algorithms is listed in Table 5, considering the workability, precision and accuracy.
Figure 9 shows the influence of different neuron numbers on the performance of RUL prediction with a fixed optimizer (Adam), activation function (ReLU) and sequence length (50). Gradient exploding or gradient vanishing occurs in this case when using the network structure RNN_1LAYER and all GRU structures, which may suggest that neuron number is a sensitive parameter for RNN and GRU in terms of stability. The effect of the neuron number varies significantly between algorithms. Taking the LSTM network structure as an example, the influence of different neuron numbers is smaller when using LSTM_1LAYER than the other two. In addition, for LSTM_3LAYERS, the observation shows that the more neurons are used, the less accurate the result turns out to be, while this tendency cannot be found in the other two network structures. As only three different neuron numbers were tested, the optimal neuron number for each network structure could not be determined. Nevertheless, the best neuron number found for different algorithms in this case is listed in Table 6 for reference.
Figure 10 demonstrates the performance of different algorithms using a certain group of parameters. In this case, gradient exploding or gradient vanishing occurred when using RNN_1LAYER and GRU_3LAYERS.
In general, the performance of the LSTM, Bi_LSTM and GRU network structures appears to be relatively close and significantly better than that of RNN. The accuracy of LSTM is close to that of Bi_LSTM and GRU, but its precision is relatively poor. GRU turns out to be very accurate and precise, but it suffers from stability problems. Thus, a Bi_LSTM structure might be a better option in this case. A more detailed comparison of different algorithms is displayed in Table 7. The optimal parameters (with the base parameters) for this subset are highlighted using a yellow hatch for every network structure. Although the globally optimal parameters cannot be selected for the dataset based on this table, since not all combinations have been considered, it provides a set of useful options with a reasonable level of confidence.