Challenges and Opportunities of AI-Enabled Monitoring, Diagnosis & Prognosis: A Review

Prognostics and Health Management (PHM), including monitoring, diagnosis, prognosis, and health management, occupies an increasingly important position in reducing costly breakdowns and avoiding catastrophic accidents in modern industry. With the development of artificial intelligence (AI), especially deep learning (DL) approaches, the application of AI-enabled methods to monitor, diagnose, and predict potential equipment malfunctions has made tremendous progress, with verified success in both academia and industry. However, a review that covers AI-enabled monitoring, diagnosis, and prognosis simultaneously is still lacking, and the importance of an open source community, including open source datasets and codes, has not been fully emphasized. To fill this gap, this paper provides a systematic overview of the current development, common technologies, open source datasets, codes, and challenges of AI-enabled PHM methods from the three aspects of monitoring, diagnosis, and prognosis.


Introduction
As a key ingredient of modern industry, mechanical equipment, such as helicopters, high-speed rail, and aero engines, operates for long periods in increasingly harsh environments while its structure is also becoming increasingly complex, which may result in sudden equipment failure, long maintenance cycles, high maintenance costs, and large downtime losses. Unlike traditional maintenance methods (corrective maintenance and periodic maintenance), Prognostics and Health Management (PHM) integrates advanced sensors with various intelligent approaches to monitor the status of the mechanical system, which enables timely and optimal maintenance while reducing manual labor, spares, and maintenance cost.
PHM mainly consists of monitoring, diagnosis, prognosis, and health management [1,2], whose relationships are summarized in Figure 1. Monitoring refers to fault detection, and the purpose is to determine whether the system is in a normal operating state, in which anomaly detection is one of the most important tools to trace the corresponding health state. Diagnosis refers to the identification of the fault type and its corresponding degree. Prognosis makes use of appropriate models to assess the degree of performance degradation and further predicts the remaining useful life (RUL). Health management integrates outputs from monitoring, diagnosis, and prognosis and makes optimal maintenance and logistic decisions via considering economic costs and other available resources. In general, PHM will greatly improve the operational safety, system reliability, and maintainability of equipment, and reduce the cost of equipment throughout its life cycle at the same time.
Open Access
Chinese Journal of Mechanical Engineering
Traditional maintenance methods generally rely on experts to observe and diagnose equipment manually and to determine the fault type and its location by reasonably mounting the sensors and analyzing the result using appropriate algorithms. This type of method increases manual labor, and the efficiency of maintenance largely depends on expert experience. With the development of sensor technology, a large number of sensors are installed on mechanical equipment to collect multi-source data, including vibration, temperature, images, etc., which provides a foundation for the potential implementation of PHM. However, because mechanical equipment operates for long periods in an extremely complex environment, the measured signal often contains heavy background noise, that is, the related fault features are often submerged in the interference. Traditional signal processing methods, such as FFT (fast Fourier transform) and simple metric construction, cannot extract and analyze the feature information with high efficiency and precision. Advanced signal processing methods, such as sparse representation [3,4] and time-frequency analysis [5], often have parameters that need to be adjusted carefully, resulting in a heavy workload. With the development of big data techniques and artificial intelligence (AI) algorithms, AI-enabled PHM is becoming increasingly popular and has already achieved wide success in both academia and industry. The main advantage of AI-enabled PHM is that it can perform monitoring, diagnosis, and prognosis at a high level of automation and requires little intervention and expert knowledge. AI-enabled PHM is mainly about using traditional machine learning (ML) or deep learning (DL) methods to perform the final health management. Traditional ML algorithms such as K-nearest neighbor (KNN), artificial neural network (ANN), support vector machine (SVM), etc., have been successfully applied to PHM and have achieved considerable progress. However, their applications still depend, to a large extent, on hand-crafted feature extraction. As long as the extracted features represent fault features effectively, traditional ML models can successfully establish the mapping between features and the mechanical health status.
However, hand-crafted feature extraction still relies on expert knowledge, which also differs considerably between different signals or equipment. Moreover, when handling massive heterogeneous data, methods based on hand-crafted features are time-consuming, and such experience-oriented methods tend to lose accuracy in the context of big data. Therefore, how to establish AI-enabled PHM with high efficiency and precision remains a challenging problem. Since Hinton et al. [6] first proposed and realized a DL model in 2006, DL has become a disruptive technology in AI. DL has achieved a significant breakthrough and extensive applications in a wide range of fields, especially computer vision and natural language processing. In 2015, Nature organized a special issue to summarize the development of AI in depth and took DL as one of the six breakthrough technologies in this field [7]. Because of its strong representation learning ability, DL is well suited to automatic data analysis: it can establish the mapping from the data side to the task side by learning representation features automatically from a large amount of data. Consequently, the application of DL in PHM is becoming increasingly popular, because it has the potential to process large amounts of data, extract features from high-dimensional data, and form an "end-to-end" monitoring, diagnosis, and prognosis system automatically.
To explain the popularity of AI-enabled PHM, we conducted a literature search of the Web of Science Core Collection covering the past five years. The basic descriptor is TI = ((AI OR artificial intelligence OR machine learning OR support vector machine OR SVM OR data-driven OR deep OR autoencoder OR convolutional network* OR neural network*) AND (fault detection OR fault isolation OR fault diagnosis OR intelligent diagnosis OR prognosis OR residual useful life prediction OR condition monitoring OR health management)). It is worth mentioning that it is impossible to cover all the related papers because the names of AI-enabled algorithms are often different. As shown in Figure 2, research on AI-enabled PHM has increased rapidly, and embedding AI into PHM is of great importance.
Zhao et al. Chin. J. Mech. Eng. (2021) 34:56
To summarize the research on AI-enabled PHM, many scholars have published review papers from different angles. Hamadache et al. [8] introduced the fundamentals of PHM techniques for rolling element bearings (REBs) and reviewed contemporary techniques including modern AI techniques and DL approaches for fault detection, diagnosis, and prognosis of REBs. Lee et al. [9] detailed previous and on-going efforts in PHM for rotary machinery systems, and introduced a systematic PHM design methodology and its applications. Ellefsen et al. [10] reviewed four DL-based techniques applied to PHM for autonomous and semi-autonomous ships. However, these reviews mainly discussed the applications of PHM for a specific object. Liu et al. [11] and Lei et al. [12] reviewed applications of AI techniques to machine fault diagnosis. Lei et al. [13] and Zhang et al. [14] systematically reviewed recent advances in machinery prognostics. These papers mainly focused on only one of the three aspects of monitoring, diagnosis, and prognosis. Khan et al. [15] and Fink et al. [1] provided a systematic review of DL and its applications in PHM. However, these papers neither emphasize the importance of anomaly detection in monitoring nor summarize open source datasets and codes. Therefore, a review is needed that covers monitoring, diagnosis, and prognosis based on AI techniques, with an emphasis on DL and on the need for an open source community.
To fill the aforementioned gap, this paper systematically reviews the current development, common technologies, open source datasets, codes, and challenges of AI-enabled PHM from the three aspects of monitoring, diagnosis, and prognosis. We focus on the applications of AI-enabled algorithms, especially DL, in monitoring, diagnosis, and prognosis. More importantly, we emphasize the importance of open source datasets and codes for the healthy development of the AI-enabled PHM research community. Last but not least, this paper provides some promising future directions in the field of AI-enabled PHM. It is worth mentioning that this review does not cover the remaining part of PHM, namely health management.

Introduction to AI-Enabled Monitoring
As the key and basic task of PHM, monitoring of machinery has not attracted enough attention, as shown in Figure 3. Moreover, many existing studies on monitoring are based on supervised methods [16][17][18][19]. This means that both normal data and anomaly data are required for model training, which is usually inconsistent with the monitoring scenario, since anomaly data with faults are not always available and even the form and location of the failure may be unknown. Thus, methods relying on known faults would fail when confronted with a new fault, which would result in catastrophic missed alarms. In this section, we specifically define the monitoring task as anomaly detection and review the related papers.

Anomaly Detection
The generalized concept of anomaly detection can be divided into three categories, including supervised learning, semi-supervised learning, and unsupervised learning. As described in the previous sub-section, supervised learning methods are not suitable for the monitoring task, and data from the healthy state is often available. So in this paper, anomaly detection refers to semi-supervised anomaly detection in the following discussion, specifically.
Semi-supervised anomaly detection can be regarded as a one-class classification problem, which means that only the data at the health state is available. The goal of anomaly detection is to detect the fault that may occur in the future based on the existing data. The failure would occur on any component of the machinery with any external manifestation, so it is a typical open-set task.
According to the different strategies of anomaly determination, anomaly detection methods for monitoring are divided into five categories in this paper: distance-based methods, model-based methods, distribution-based methods (also called density-based methods), hybrid methods, and others. We will review these methods in the following subsections.

Distance-Based Methods
Distance-based methods focus on the distance between data collected in the anomaly state and data collected in the health state. The distance is calculated in the signal space or, after feature extraction, in the latent feature space. These methods assume that data collected in the health state lie close to each other in the signal space or latent feature space, whereas data collected in the anomaly state lie far away from them. Various metrics can be applied for distance calculation, including Euclidean distance, Manhattan distance, cosine distance, Chebyshev distance, etc. Meanwhile, to account for the contributions of different features to the distance calculation and the compactness of the feature space, a variety of pre-processing and representation learning strategies can also be applied before the distance calculation. In Ref. [20], comb filtering was applied to smooth the original signal, Gini-guided residual singular value decomposition and Principal Component Analysis (PCA) were used for feature extraction, and then an iterative Mahalanobis distance was calculated to obtain an anomaly score. In Ref. [21], Short-Time Fourier Transform (STFT), Hidden Markov Model (HMM), and dimension reduction were applied to the original signal for feature extraction, and a distance-based strategy was used for anomaly score calculation. In Ref. [22], a classical one-class classification method, Support Vector Data Description (SVDD), was applied together with a Genetic Algorithm (GA) for parameter optimization. An improved SVDD with artificially generated outliers was proposed for rolling element bearing detection [23]. Multi-sensor data was utilized in Ref. [24] with a correlation-based anomaly detection method for predictive maintenance. In Ref. [25], a self-organizing map and KNN were used for cooling fan bearing monitoring. In Ref. [26], the K-means clustering method was utilized to obtain cluster center points, and an anomaly score was calculated based on the distance from the center points. In Ref. [27], a one-class SVM was utilized for kinematic chain monitoring using data processed by Laplacian score ranking guided feature selection.
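The distance-based idea can be illustrated with a minimal sketch. It is not the method of any cited reference: it assumes that feature vectors have already been extracted, uses synthetic Gaussian data in place of real measurements, and takes the 99% quantile of the healthy-state scores as a hypothetical alarm threshold. The function names are illustrative.

```python
import numpy as np

def fit_health_model(healthy_features):
    """Estimate the centre and inverse covariance of healthy-state features."""
    mean = healthy_features.mean(axis=0)
    cov = np.cov(healthy_features, rowvar=False)
    # Small ridge term keeps the covariance invertible (assumption of this sketch).
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(features, mean, inv_cov):
    """Mahalanobis distance of each row of `features` to the healthy centre."""
    diff = features - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Synthetic stand-in for healthy-state feature vectors (500 samples, 3 features).
rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(500, 3))
mean, inv_cov = fit_health_model(healthy)

# Hypothetical threshold: 99% quantile of the scores seen in the health state.
threshold = np.quantile(mahalanobis_score(healthy, mean, inv_cov), 0.99)

# One sample near the healthy cluster and one far away from it.
test = np.array([[0.1, -0.2, 0.3], [6.0, 6.0, 6.0]])
scores = mahalanobis_score(test, mean, inv_cov)
alarms = scores > threshold
```

Only healthy data are used for fitting, which matches the one-class setting described above; the choice of metric and threshold remains application dependent.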

Model-Based Methods
Model-based methods try to establish a prediction model to reflect the intrinsic regularity between parameters or on the timeline based on the health state data. It is assumed that, when an internal or external failure occurs on equipment, the intrinsic regularity of the data would deviate from the original model. So the occurrence of anomaly can be represented by the degree of deviation between the model prediction and actual data.
In Ref. [28], a Long Short-Term Memory (LSTM) model was trained based on features extracted by Stacked AutoEncoders (SAE) to predict the vibration signal in the next N time steps. The residual between predicted and actual signals was utilized to indicate the occurrence of an anomaly for rotary machinery. In Ref. [29], a Generative Adversarial Network (GAN) was trained to discriminate fake data from real data, and the output of the discriminator was further regarded as the anomaly indicator. Similarly, a reconstruction model based on LSTM and GAN was trained with normal data in Refs. [30,31], and the anomaly score was influenced by the output of the discriminator and the reconstruction error simultaneously. In Ref. [32], an AutoEncoder (AE) based GAN was trained to generate artificial normal data. Then the test data was fed into the AE to get reconstructed latent features and reconstructed signals. Finally, an anomaly score was calculated based on the reconstruction error. In Refs. [33,34], the Yet Another Segmentation Algorithm (YASA) was utilized for data segmentation, and the segmentation results were fed into a one-class SVM for anomaly detection of offshore oil extraction turbo machines. In Refs. [35,36], an LSTM prediction model was trained, and the residual between predictions and actual signals was utilized as an anomaly score for water treatment system and aircraft anomaly detection, respectively. In Refs. [37][38][39], an AE model for data reconstruction was built and the reconstruction error was used as the anomaly indicator. In Ref. [40], a rotary speed to vibration regression model was trained to predict the vibration signal, and the residual was further used for anomaly detection. A comparison between autoregressive-based models and network-based models was implemented in Ref. [41] for wind turbine fault detection. In Ref. [42], an HMM was trained for screw compressor anomaly detection after dimension reduction by PCA. In Refs. [43,44], the autoregressive integrated moving average (ARIMA) process was proposed for data prediction and anomaly detection with multiple sensors. In Ref. [45], an AE model was trained with data processed by a series-to-image transform for anomaly detection.
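The prediction-residual principle shared by these model-based methods can be sketched with a simple autoregressive (AR) predictor standing in for the LSTM, AE, or ARIMA models of the cited works. The signal is synthetic, the model order and the threshold rule (three times the largest healthy residual) are assumptions of this sketch, and the function names are illustrative.

```python
import numpy as np

def lagged_matrix(signal, order):
    """Build rows [x[t-1], ..., x[t-order]] and the matching targets x[t]."""
    rows = [signal[order - k - 1 : len(signal) - k - 1] for k in range(order)]
    return np.column_stack(rows), signal[order:]

def fit_ar(healthy_signal, order=4):
    """Least-squares fit of AR coefficients on healthy-state data only."""
    X, y = lagged_matrix(healthy_signal, order)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def residual_score(signal, coeffs):
    """Absolute one-step-ahead prediction error, used as the anomaly score."""
    X, y = lagged_matrix(signal, len(coeffs))
    return np.abs(y - X @ coeffs)

rng = np.random.default_rng(1)
t = np.arange(2000)

# Synthetic healthy signal: a sinusoid with small measurement noise.
healthy = np.sin(0.05 * t) + 0.01 * rng.normal(size=t.size)
coeffs = fit_ar(healthy, order=4)

# Hypothetical threshold with a safety margin over the healthy residuals.
threshold = 3.0 * residual_score(healthy, coeffs).max()

# Test signal with an impulse injected at sample 1500 to mimic a fault.
faulty = np.sin(0.05 * t) + 0.01 * rng.normal(size=t.size)
faulty[1500] += 1.0
scores = residual_score(faulty, coeffs)

# Scores start at sample `order`, so shift indices back to signal time.
alarm_index = np.flatnonzero(scores > threshold) + len(coeffs)
```

The same structure applies when the AR predictor is replaced by a learned model: fit on healthy data, predict, and score the deviation.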

Distribution-Based Methods
Distribution-based methods (density-based methods) try to estimate the distribution of normal data. It is assumed that anomaly data would be subject to a different distribution from normal data. So if the anomaly data is input into the distribution model, a low probability will be obtained. This type of method can also be understood as density-based methods. Anomaly events can be regarded as low-probability events and they exhibit low density characteristics in the sample space. Therefore, without distribution model estimation, the sample density of the test data can also represent the probability of an anomaly.
In Ref. [46], a multivariate Gaussian distribution model was built after a set of feature extraction processes, including Hodrick-Prescott (HP) filtering and Gradient of Change (GoC), for anomaly detection. The Gaussian distribution was also used in Ref. [47] after feature extraction with SAE for gas turbine engine gas path anomaly detection. Generalized Extreme Value (GEV) distribution was applied in Ref. [48] for power generation monitoring. The martingale-test was performed in Ref. [49] to detect the change point of gearboxes based on the graph model. In Ref. [50], correlation coefficients of segmented signals were calculated and the derivation of the anomaly score was based on the Probability Density Function (PDF) of correlation coefficients.
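As a minimal illustration of the distribution-based idea, the sketch below fits a multivariate Gaussian to healthy-state features and flags samples whose log-density falls below a low quantile of the healthy log-densities. The synthetic data, the 1% quantile, and the function names are assumptions of this sketch and do not reproduce the pipelines of the cited references.

```python
import numpy as np

def fit_gaussian(healthy_features):
    """Estimate mean and covariance of the healthy-state feature distribution."""
    mean = healthy_features.mean(axis=0)
    cov = np.cov(healthy_features, rowvar=False)
    # Small ridge term for numerical stability (assumption of this sketch).
    cov = cov + 1e-6 * np.eye(healthy_features.shape[1])
    return mean, cov

def log_density(features, mean, cov):
    """Log of the multivariate Gaussian density evaluated at each row."""
    dim = mean.size
    diff = features - mean
    solved = np.linalg.solve(cov, diff.T).T
    maha_sq = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha_sq + logdet + dim * np.log(2.0 * np.pi))

# Synthetic stand-in for healthy-state features (1000 samples, 2 features).
rng = np.random.default_rng(3)
healthy = rng.normal(0.0, 1.0, size=(1000, 2))
mean, cov = fit_gaussian(healthy)

# Hypothetical rule: samples less likely than 99% of healthy data are anomalous.
threshold = np.quantile(log_density(healthy, mean, cov), 0.01)

test = np.array([[0.0, 0.0], [5.0, 5.0]])
test_log_density = log_density(test, mean, cov)
is_anomaly = test_log_density < threshold
```

A low density under the healthy-state model plays the role of the low probability described above.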

Hybrid Methods
In order to break the limitation of a single algorithm, some other research took the advantages of the above methods and constructed hybrid methods for anomaly detection.
In Refs. [51,52], model-based and distance-based methods were combined for anomaly detection. In Ref. [51], a PCA matrix was obtained for dimension reduction based on the normal data, and the residual of PCA was calculated as the input of the SVDD method for fault detection. Similarly, a model based on AE and LSTM was combined with SVDD in Ref. [52] for initial fault detection of bearings. In Refs. [53,54], model-based and distribution-based methods were combined for anomaly detection. In Ref. [53], the probability of the anomaly state was defined as the combination of the reconstruction error and the latent feature with an AE model, and a fault-attention factor was applied to re-weight the anomaly score. In Ref. [55], a set of anomaly detection methods were compared, including Gaussian Mixture Model (GMM), Parzen window density estimation, Local Outlier Factor (LOF), k-means clustering, PCA-based methods, and SVDD-based methods. In Ref. [56], a Hierarchical Temporal Memory (HTM) model was built and the distribution of the model prediction error was estimated for real-time, continuous, online detection of streaming data.

Other Methods
Different from methods we described above, some researchers paid more attention to representation learning or the relationship between equipment groups and provided novel perspectives for condition monitoring.
In Ref. [57], the advantages of equipment groups were exploited by applying a clustering algorithm to an electrical machine fleet after domain-specific pre-processing. This is similar to the density-based methods, except that the clustering objects change from samples on the time scale to different pieces of equipment on the space scale. However, it also places a high requirement on the consistency of the operation states of the different pieces of equipment in the group. In Ref. [58], a non-parametric k 2 decomposition method was used to isolate the fault from multi-variate processes by measuring the relative contribution of an individual variable. In Ref. [59], an AR-based model was proposed to detect the condition change between adjacent periods, but this method was not applicable to non-stationary working conditions. In Ref. [60], speed-energy spectra of the jet engine were calculated to reflect the operation state, and the difference between spectra was regarded as the anomaly indicator. In Ref. [61], a hybrid feature selection method based on ReliefF and an adaptive GA was proposed, and a recursive one-class SVM was trained after pre-processing with an Extended Kalman Filter (EKF) to realize online updated detection for chillers. In Ref. [62], dictionary learning was used, and the change of the dictionary was monitored for condition monitoring of rotating machinery.

Data Type Summary
Among different monitoring scenarios, the data used for anomaly detection is diverse. Although photographs of equipment can directly reflect its health condition, it is hard to obtain surface photographs of all components, and most of the key components are invisible due to the complexity of the structure. A better alternative is the vibration signal of rotating machinery, from which global health information for the entire equipment can be obtained simultaneously. Vibration signals are also the most commonly used data in recent research work. Besides, temperature signals [24,51,61], electric current signals [25,57], sound signals [49,59], pressure signals [24], speed signals [40,41,63], voltage signals [64], and acoustic emission signals [62] are also used as the source data for monitoring. Moreover, parameters related to equipment operation conditions are also used for many types of large-scale equipment [29,31,36,47].

Open Source Datasets
Most of the methods proposed above are evaluated on real industrial data, but there is no way for performance comparison due to the privacy of datasets. So open source datasets are necessary for performance comparison between different methods. Here, we summarize a list of open source datasets utilized in existing papers.
Many researchers used fault classification datasets, such as CWRU datasets [65], Tennessee Eastman Process datasets [66], and SEU datasets [67], for anomaly detection. The categories with the healthy condition are regarded as normal data and the categories with fault occurring are regarded as anomaly data. The run-to-failure datasets, such as IMS datasets [68] and PHM2012 datasets [69], were also used in condition monitoring tasks with artificial division of normal and anomaly states based on the degradation state of components.
There are also some datasets specially made for monitoring. For example, Airbus Helicopter Accelerometer dataset [70] used in Ref. [45] was collected by Airbus SAS with vibration measurements of helicopter in different directions (longitudinal, vertical, and lateral). Numenta Anomaly Benchmark [71] with over 50 labeled real-world and artificial time series data files was used in Ref. [56]. Secure Water Treatment (SWaT) dataset [72] was established for the research of the protection of Cyber Physical Systems (CPS) such as those for water treatment, power generation and distribution, and oil and natural gas refinement, and was widely used for the performance evaluation of anomaly detection method. The details of these datasets are listed in Table 1.

Open Source Codes
Although a large number of methods have been proposed for monitoring, only a few source codes of these methods are publicly available, which is not conducive to the sustainability of research. In this subsection, we summarize the codes of related papers that are available online to provide a convenient way for researchers to get started in this field. More open source studies are also required in the future.
The complete PyTorch code of the Fault-attention Generative Probabilistic Adversarial Autoencoder proposed in Ref. [53] for anomaly detection with SEU datasets was released online in Ref. [73]. The Numenta Platform for Intelligent Computing (NuPIC), proposed in Ref. [56] and based on HTM learning algorithms for anomaly detection and prediction of streaming data sources, is available online in Ref. [74]. The Multivariate Anomaly Detection with GAN (MAD-GAN) framework proposed in Ref. [31] is also available online in Ref. [75].

Challenges
All of the above methods are more or less based on certain assumptions, but in real-world applications, these assumptions would fail due to the inherent characteristics of equipment or the complexity of the external environment. To realize reliable and accurate monitoring and abnormal early warning in industrial applications, more realistic challenges need to be considered. The main challenges hindering the implementation of anomaly detection in reality can be summarized as follows.

Balance between Recall and Precision
During the construction of anomaly detection models, the choice of the decision threshold or the decision boundary is an inevitable process. It is essentially a tradeoff between recall and precision. Low recall would lead to the omission of fault and cause catastrophic damages. Low precision would cause excessive and unnecessary maintenance and inspection, and the associated cost of condition monitoring will increase. Therefore, the optimal threshold can be chosen via comparing the cost of omission and the cost of over-maintenance and minimizing the expected cost.
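The cost-based threshold selection described above can be sketched as follows. The sketch assumes that a small labelled validation set of anomaly scores is available (label 1 for fault, 0 for healthy), which is not guaranteed in a purely one-class setting, and the cost values and function names are hypothetical.

```python
import numpy as np

def expected_cost(threshold, scores, labels, cost_miss, cost_false_alarm):
    """Average cost per sample when raising an alarm for scores above threshold."""
    alarms = scores > threshold
    misses = np.sum(~alarms & (labels == 1))        # faults without alarm (low recall)
    false_alarms = np.sum(alarms & (labels == 0))   # healthy with alarm (low precision)
    return (cost_miss * misses + cost_false_alarm * false_alarms) / len(scores)

def best_threshold(scores, labels, cost_miss, cost_false_alarm):
    """Pick the candidate threshold with the lowest expected cost."""
    # Candidates: every observed score, plus one value below all of them.
    candidates = np.concatenate(([scores.min() - 1.0], np.unique(scores)))
    costs = [expected_cost(c, scores, labels, cost_miss, cost_false_alarm)
             for c in candidates]
    return candidates[int(np.argmin(costs))]

# Hypothetical validation scores and labels (1 = fault, 0 = healthy).
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# A missed fault is ten times as costly as a false alarm: favour recall.
recall_oriented = best_threshold(scores, labels, cost_miss=10.0, cost_false_alarm=1.0)

# A false alarm is ten times as costly as a missed fault: favour precision.
precision_oriented = best_threshold(scores, labels, cost_miss=1.0, cost_false_alarm=10.0)
```

With these toy values, the recall-oriented setting selects a lower threshold than the precision-oriented one, which reflects the trade-off between omission and over-maintenance.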

Unified Benchmark Datasets
Although there are many datasets which can be used for anomaly detection research, benchmark datasets specially designed for AI-enabled monitoring are still required. Unified benchmark datasets can provide a normalized data processing flow and make performance comparisons between different models more convenient. Meanwhile, realistic anomaly detection datasets for monitoring can make related research based on these datasets practically significant.

Quick Alarm for Early Failure
In the early failure stage of machinery, only weak fault features appear in the collected data, and there is little difference between the signal characteristics under the health state and the fault state. To quickly detect early failure and prevent the fault from spreading, the sensitivity of methods to early failure is of the utmost importance. The existing methods do not take this perspective into consideration.

Adaptability under Variable Working Conditions
In reality, the operating condition of equipment, such as rotary speeds, loads, and temperatures, is variable and the corresponding data characteristics would also change accordingly. An ideal monitoring algorithm should be robust for different operating conditions. However, due to the complexity of equipment, it is impossible to train the model with data under all possible operating conditions. Thus, models are required to recognize unseen operating conditions out of the training sets and do not make a false alarm under unseen operating conditions.

Introduction to AI-Enabled Diagnosis
Fault diagnosis plays an important role in exploring the relationship between measured data and machine health states [76,77], which has been the research hot-spot of PHM. Traditionally, the relationship is found via expert knowledge and engineering experience. However, in engineering scenarios, people would like to shorten the maintenance cycle and improve the diagnostic accuracy through an automatic method. Especially with the help of AI, fault diagnosis is expected to become smart enough to automatically detect and identify health states.
AI-enabled diagnosis aims to diagnose health states with applications of ML theories, such as SVM [78,79], ANN [80,81], and deep neural networks (DNN) [82,83]. These methods utilize ML theories to capture features hidden in measured data with less expert knowledge. They attempt to build a bridge that automatically detects health states from the collected data. In recent years, AI-enabled methods have swept the field of mechanical fault diagnosis [84,85]. They have been widely used to solve problems in fault diagnosis, such as class imbalance, variable working conditions, and fault diagnosis under strong background noise. Therefore, in this section, we mainly review how the existing AI-enabled diagnostic methods solve these problems.

Vanilla Fault Diagnosis
Generally, the datasets used to explore intelligent diagnosis algorithms do not suffer from class imbalance, low signal-to-noise ratio, or variable operating conditions. We call diagnosis in this situation vanilla fault diagnosis. AI-enabled diagnosis algorithms are mainly divided into two categories: traditional ML-based methods and DL-based methods. DL-based methods are closer to the expectations of automatic fault classification, as shown in Figure 4. They can extract features automatically without human intervention and can establish the relationship between the learned features and the fault patterns. Therefore, in the following section, we will mainly review DL-based algorithms that are widely used in fault diagnosis.

(1) AE-based methods
In the past five years, AE has made tremendous development in the field of PHM. AE has a strong ability to learn feature representation, and by inputting the extracted features into the classifier, fault diagnosis can be realized. For example, Lu et al. [86] introduced stacked denoising autoencoder (SDAE) to fault diagnosis, and Liu et al. [87] proposed a rolling bearing fault diagnosis method by using SAE to extract features and adopted a fully connected layer to classify the fault modes. Ma et al. [88] proposed a deep coupling AE to achieve multimodal data fusion and fault diagnosis. Shi et al. [89] proposed a fault diagnosis method based on SAE, which integrated compression sensing and wavelet

) RNN-based methods
Due to their one-way connections without feedback, DNNs cannot learn the temporal dependencies contained in the signal. A recurrent neural network (RNN) can store the data information (short-term memory) of the most recent periods in the form of excitation, which makes it suitable for processing time series. For example, Liu et al. [138] proposed a low-speed lightweight RNN, which has a small storage space occupancy rate and low calculation delay. Miki et al. [139] proposed an LSTM-based method for time-series analysis and a training method for weakly supervised training. Rao et al. [140] proposed a many-to-many-to-one bi-directional LSTM to automatically extract the rotating speed from vibration signals. Shao et al. [141] proposed a method based on the enhanced deep gated recurrent unit and the complex wavelet packet energy moment entropy for early fault diagnosis of bearings. Shi et al. [52] presented a fault diagnosis framework based on SDAE and LSTM, which can effectively detect initial anomalies of rolling bearings and accurately describe the deterioration trend. To improve the diagnostic accuracy, Zhang et al. [142] presented an attention-based equitable segmentation gated recurrent unit network, which consists of an equitable segmentation approach and an improved deep model.
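The short-term memory mentioned above can be made concrete with a minimal forward pass of a plain (Elman-type) recurrent layer. This is a generic illustration with random, untrained weights, not any of the cited architectures; the layer sizes and names are assumptions of this sketch.

```python
import numpy as np

def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a single recurrent layer over `inputs` of shape (time, features)."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state.
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden + bias)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(2)
n_features, n_hidden, n_steps = 1, 8, 20
w_in = rng.normal(scale=0.5, size=(n_hidden, n_features))
w_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
bias = np.zeros(n_hidden)

# Two sequences that differ only at one sample, two steps before the end.
signal = rng.normal(size=(n_steps, n_features))
perturbed = signal.copy()
perturbed[-3] += 1.0

states_a = rnn_forward(signal, w_in, w_rec, bias)
states_b = rnn_forward(perturbed, w_in, w_rec, bias)

# The final inputs are identical, yet the final states differ:
# the recurrent connection carries information from earlier samples.
memory_effect = np.linalg.norm(states_a[-1] - states_b[-1])
```

A feed-forward layer applied to the last sample alone would give identical outputs for both sequences, which is the limitation the recurrent connection removes.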

Fault Diagnosis under Imbalanced Dataset
During the operation of the machine, the collected datasets are often highly imbalanced, which contain many samples in the normal state but a paucity of samples from the fault state. Facing the imbalanced datasets, intelligent fault diagnosis approaches are biased towards the major classes and hence show very poor classification accuracy on the minor classes.
From current research, there are three ways to solve this problem, including the data synthesis methods, designing a powerful feature extractor, and designing the corresponding loss function, which are summarized as follows.
(1) Data synthesis based methods
Data synthesis based methods are the most direct way for solving the class-imbalanced problem. Traditional data synthesis methods are SMOTE [143] and cost sensitivity based methods [144]. For example, Razavi-Far et al. [145] used an imputation-based oversampling technique for class-imbalanced learning and the proposed scheme was evaluated on three experimental scenarios with different imbalance ratios. Zhang et al. [146] adopted a weighted minority oversampling strategy to balance the data distribution, and used a data synthesis strategy to avoid generating incorrect or unnecessary samples.
Recently, GANs [147] have been applied to generate artificial data for the minor classes or for data augmentation [148,149]. For example, Mao et al. [150] used GAN to generate synthetic samples for minority fault classes and improved the generalization ability of the fault diagnosis model. Luo et al. [151] proposed a conditional deep convolutional GAN. By using the conditional auxiliary generative samples as the input, fault diagnosis under the imbalanced dataset was achieved. Zhang et al. [152] utilized GAN to learn the mapping between the distributions of noise and real machinery temporal vibration data, and then used the generated samples to balance the minor classes.
Wang [160] proposed an adaptive channel weighted CNN (ACW-CNN) and used Focal loss for condition monitoring of the helicopter transmission system function. With the help of Focal loss, the ACW-CNN could reduce the weight of easily classified categories and increase the weight of categories that were not easy to classify, so that the model could pay more attention to the minor classes. Xun et al. [161] proposed a deep cost adaptive CNN based intelligent classification method for imbalanced data, which used the cost adaptive loss function to adaptively assign different misclassification costs for all categories.
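The re-weighting effect attributed to Focal loss above can be made concrete with a small sketch of the standard per-sample formulation, FL = -alpha * (1 - p_t)^gamma * log(p_t). The values of `gamma` and `alpha` and the two example probability vectors are illustrative assumptions, not the settings of [160].

```python
import math

def cross_entropy(probs, target):
    """Standard cross-entropy for one sample."""
    return -math.log(max(probs[target], 1e-12))

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = [0.95, 0.05]  # well-classified sample, true class 0
hard = [0.30, 0.70]  # poorly classified sample, true class 0

# Fraction of the cross-entropy loss that survives the focal modulation.
easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

With `gamma = 2`, the easy sample keeps only (1 - 0.95)^2 of its cross-entropy loss while the hard sample keeps (1 - 0.30)^2, so training gradients are dominated by samples that are still misclassified, which are typically the minor classes.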

Fault Diagnosis under Variable Working Condition
Fault diagnosis under variable working conditions is still a challenge due to the domain discrepancy problem. To achieve fault diagnosis under variable working conditions, there are currently two widely adopted methods, that is, discriminative feature extraction based methods and transfer learning based methods.
(1) Discriminative feature extraction based methods
Designing a DNN model that can extract discriminative features is a common way for intelligent diagnosis under variable working conditions. For example, Peng et al. [162] proposed a multi-branch and multi-scale CNN to learn discriminative features from multiple signals and time scales of vibration signals. Qiao et al. [163] also proposed an adaptive weighted multi-scale CNN to adaptively extract robust and discriminative multi-scale features from raw vibration signals. With the help of these extracted features, the model could achieve supe-  [167], that is, the model-based transfer learning methods, instance-based transfer learning methods, mapping-based transfer learning methods, and adversarial-based transfer learning methods.
Model-based transfer learning methods first pre-train the model on data from the source domain, and then fine-tune part of the network parameters using data from the target domain. Hasan et al. [168] used a discrete orthonormal Stockwell transform to process the raw signal, and trained a CNN model with the obtained vibration images under different working conditions. Then, part of the parameters of the pretrained CNN were frozen and transferred to the target network for fault diagnosis. Du et al. [169] employed STFT to transform bearing vibration signals to time-frequency images, and used the processed data as the input of a deep residual network. Then, the model-based transfer learning strategy was used to achieve high performance in another working condition. He et al. [170] trained the model using sufficient auxiliary data in the source domain and used multi-wavelet as an activation function for discriminative feature extraction, and then the model parameters were transferred to the target domain. Wu et al. [171] proposed a model-based few-shot transfer learning method by considering the variability of working conditions and the scarcity of fault samples in the real working condition. Shao et al. [172] developed a novel DL framework using transfer learning, in which the pre-trained network was fine-tuned by time-frequency images of vibration signals.
Instance-based transfer learning methods explore the way to reweight instances in the source domain to improve the diagnostic accuracy or align the distribution between the target domain and source domain. For example, Zhang et al. [173] used wide kernels in the first layer to extract more informative features and used small convolutional kernels in the latter layers for the multi-layer nonlinear mapping. Xiao et al. [174] trained a CNN with data from the target domain and source domain, and used a modified TrAdaBoost algorithm to update the weight of each training sample to form a stronger diagnostic model.
Mapping-based transfer learning methods refer to mapping the data from the source and target domains into the same feature space. For example, Azamfar et al. [175] and Singh et al. [176] used a DL-based domain adaptation method for intelligent fault diagnosis by simultaneously minimizing the cross-entropy loss in the source domain and the maximum mean discrepancy between the source and target domains. Che et al. [177] and An et al. [178] used multi-kernel maximum mean discrepancies to match features between the source and target domains, and optimized with a combined transfer learning method. Qian et al. [179] reduced the input dimension by sparse filtering, and proposed a joint distribution adaptation to align the data distributions of the source and target domains, which helps capture discriminative features. Li et al. [180] proposed a representation clustering algorithm to simultaneously minimize the intra-class distance and maximize the inter-class distance, and domain adaptation was used to reduce the maximum mean discrepancy between the source and target domains. Li et al. [181] used knowledge mapping to explore domain-invariant knowledge between the source domain and the target domain, which helps to obtain a powerful feature extractor.
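The maximum mean discrepancy (MMD) used by several of the mapping-based methods above measures how far apart two feature distributions are in a kernel-induced space. The following is a minimal sketch of the biased estimate of the squared MMD with a single Gaussian kernel. The bandwidth `sigma` and the toy feature sets are assumptions, and the cited works may use multi-kernel or otherwise different variants.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD:
    mean k(s, s') + mean k(t, t') - 2 * mean k(s, t)."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Hypothetical features from one working condition and two shifted copies.
source = [[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1]]
near = [[x + 0.1, y + 0.1] for x, y in source]  # mild condition change
far = [[x + 2.0, y + 2.0] for x, y in source]   # strong condition change

mmd_same = mmd_squared(source, source)
mmd_near = mmd_squared(source, near)
mmd_far = mmd_squared(source, far)
```

In a domain adaptation setting this quantity would be added to the source-domain classification loss, so that the feature extractor is pushed to produce features whose source and target distributions are close.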
Adversarial-based transfer learning methods use adversarial training against a domain discriminator to reduce the discrepancy between the feature distributions of the source and target domains, so that the feature extractor can extract more robust features [182,183]. For example, Lu et al. [184] and Han et al. [185] used adversarial domain adaptation to train the proposed DNN to extract representative information. Xu et al. [186] used adversarial domain adaptation to train a two-branch network to extract domain-invariant features, and used a scaled exponential linear unit activation function for the nonlinear activation.

Fault Diagnosis for Low Signal-to-Noise Ratio Signals
In real industrial scenario, the fault patterns are often overwhelmed by heavy background noise. As a result, algorithms with excellent performance under ideal conditions are often severely degraded in practical applications, showing weak generalization ability. Therefore, it is necessary to develop some advanced methods to enhance generalization ability of current algorithms. According to the current publications, there are two mainstreams to address this issue, that is, robust feature extraction based methods and building robust models.
(1) Robust feature extraction based methods
AE has a strong ability for feature extraction. Recently, researchers have developed many AE variants, such as deep auto-encoder (DAE), SDAE, and contractive auto-encoder (CAE), to automatically extract high-level representative features from data collected in noisy environments.
For example, Chen et al. [187] used a deep SAE trained with Gaussian noise to avoid over-fitting and learned more robust features from a noisy working environment. Guo et al. [188] employed the SDAE to denoise random noise and to extract fault features from the vibration signals. Jiang et al. [189] proposed a feature learning approach named stacked multilevel-denoising AE, which is able to learn more robust and discriminative fault features to improve diagnosis accuracy on vibration signals with abundant noise. Shen et al. [190] constructed a stacked CAE model to extract more robust features than a standard stacked AE. Wang et al. [191] proposed a hybrid method by combining GAN and SDAE, where SDAE was used as the discriminator of GAN to automatically extract effective fault features from input samples and to discriminate their authenticity. Liu et al. [192] trained a 1D denoising convolutional AE model with noisy signals to perform fault classification. Qi et al. [193] combined SAE and CAE to obtain sparser and robust features under noise interference. Zhang et al. [194] designed a deep CAE to automatically learn invariant feature representation from raw signals.

(2) Constructing robust models
Extracting highly robust features from low signal-to-noise ratio signals is time-consuming and labor-intensive. Therefore, it is necessary to establish an end-to-end fault diagnosis model with high robustness. For example, Gan et al. [195] proposed a hierarchical diagnostic network, which stacked multiple DBN layers to overcome the overlapping problem caused by noise or other interference. Shao et al. [196] proposed an improved convolutional deep belief network with compressed sensing to improve the generalization performance of the constructed deep model. You et al. [197] proposed a hybrid technique, which used a CNN as the feature extractor under noisy environments and an SVM as the classifier. Zhang et al. [173] proposed a deep CNN with wide first-layer kernels, which used the wide kernels to extract features and to suppress high-frequency noise. Zhang et al. [198] designed a deep CNN with new training methods to achieve high accuracy in a noisy environment. Peng(a) et al. [199] constructed a deep residual learning network, which can adaptively learn the deep fault features from the original vibration signals to achieve high diagnostic accuracy under a strong noise environment. Peng(b) et al. [200] proposed a deep CNN to identify the failure modes of rotating vector reducer under strong background noise. Zan et al. [201] presented a fault diagnosis model based on a multi-dimension input CNN, which used multiple input layers to fuse the original signal and to learn the signal characteristics automatically for improving recognition accuracy and anti-jamming ability. Jin et al. [202] designed an adaptive anti-noise DNN framework to deal with the diagnosis problem under heavy noise without manual feature selection or denoising procedures. Peng(a) et al. [162] proposed a multi-branch and multi-scale CNN that could automatically learn and fuse abundant and complementary fault information from high complexity, strong coupling, and low signal-to-noise ratio vibration signals.
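To make the notion of a low signal-to-noise ratio concrete, the sketch below corrupts a clean signal with additive white Gaussian noise at a chosen SNR, using SNR(dB) = 10 log10(P_signal / P_noise). This is one common way of probing robustness and is an assumption here: the review does not state which noise protocol each cited work used, and the sine wave merely stands in for a real vibration record.

```python
import math
import random

def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the result has approximately the requested SNR."""
    rng = random.Random(seed)
    signal_power = sum(v * v for v in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the clean signal and the added noise."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical clean signal: a 50 Hz sine sampled at 5 kHz for one second.
clean = [math.sin(2.0 * math.pi * 50.0 * n / 5000.0) for n in range(5000)]
noisy = add_noise(clean, snr_db=-4.0)  # noise power exceeds signal power
achieved = measured_snr_db(clean, noisy)
```

At negative SNR values the noise carries more power than the fault-related component, which is the regime in which models tuned on clean laboratory data tend to degrade.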

Open Source Datasets
In the field of AI-enabled fault diagnosis, it is quite difficult to obtain high-quality datasets from real industrial scenarios, and open source codes are also scarce. Fortunately, some institutions have released datasets and codes for research and applications. We therefore collect the commonly used datasets, and their descriptions are listed in Table 3.

Open Source Codes
There are relatively few open source codes for intelligent diagnosis. In this subsection, we summarize some online available codes of related papers as follows. A CNN-based method for bearing fault diagnosis was provided by Ref. [207], and in Ref. [208], the author released code for rolling bearing faults. In Ref. [209], the author released an interpretable DNN for industrial intelligent diagnosis. In Ref. [210], the author released a multireceptive field graph convolutional network for machine fault diagnosis. In Ref. [172], the author released code for few-shot transfer learning for intelligent fault diagnosis of machines. In Ref. [168], the author released a unified intelligent fault diagnosis library based on unsupervised deep transfer learning and provided the corresponding comparative study. Besides, in Ref. [211], the author provided the baseline (lower bound) accuracy and released a unified intelligent fault diagnosis library based on various DL-based models. In Ref. [212], a CNN based on LeNet-5 was proposed for fault diagnosis.

Challenges
AI-enabled diagnosis has made great progress: it reduces the dependence on manpower and can automatically identify the health states from the past to the present. However, there are still some issues that need to be further discussed. In this section, we attempt to discuss the challenges and give some feasible solutions.

Interpretability
Interpretability helps users understand the results generated by the model. A main limitation of AI-enabled methods in mechanical fault diagnosis is that they operate as a "black box" and are not interpretable, which does not offer insight into how and why they can make the final decision. To bridge the gap, there are two research interests worthy of further study: (1) Most of the current AI-enabled diagnosis algorithms are migrated from the field of image processing and lack expert knowledge in the field of fault diagnosis. Therefore, we can combine prior knowledge commonly used in fault diagnosis to design our network. For example, we can design a convolution kernel that can extract useful features in vibration signals [209], or design a network structure that can be interpreted. (2) We can combine signal processing methods or traditional ML algorithms with DL algorithms to obtain a deep model with interpretable output. Sparse coding [213] may be a good choice to achieve this goal.

Transfer Learning
Transfer learning based methods have achieved a breakthrough in fault diagnosis under variable working conditions. However, there are still some challenges that need to be further discussed: (1) The backbones of transfer learning based algorithms are often different, which makes it difficult to directly compare the results, and the impact of different backbones has not been thoroughly studied. (2) If the assumptions related to the source and target domains are invalid, transfer learning based algorithms might use diagnosis knowledge from the source domain to carry out a negative transfer, thereby reducing the transfer performance of the model.

Class Imbalance and Few-Shot Learning
In real engineering scenarios, the collected data, especially for the key components, is far from big data and is highly imbalanced, which makes it difficult to train AI-enabled models. Although there are many algorithms for solving the class-imbalance problem, it is still difficult to synthesize data from only a few samples. Therefore, how to use few-shot learning to solve the imbalance problem still needs to be further discussed.

Introduction to AI-Enabled Prognosis
Prognosis aims to evaluate the current health state of the equipment, which is known as degradation assessment (DA), and to predict its future failure time, which is known as remaining useful life (RUL) estimation, so as to provide the basis for subsequent predictive maintenance. In industry, the operating condition of critical equipment is of great concern, as its sudden shutdown or failure would bring huge economic losses, and even endanger the life safety of operators. Compared with the traditional scheduled maintenance strategy, the prognostic based maintenance strategy provides proactive decision making capability that can effectively avoid downtime and costs, improve manufacturing productivity, and more importantly, provide early warning for catastrophic system failure. According to the literature statistics, the prognosis methods generally fall into four groups, i.e., physics-based, statistics-based, data-driven, and hybrid methods. Physics-based methods usually rely on dynamic modeling, such as the finite element model [214] and simulation [215], etc., to calculate the dynamic response and degradation process of the system with a given input. However, physics-based methods require accurate mathematical models and expert knowledge about the specific system, which is difficult to implement on complex mechanical equipment. Statistics-based methods commonly assume that the RUL of equipment obeys an empirical distribution, such as a Weibull distribution [216]. It is worth noting that statistics-based methods need data to update the parameters of the empirical distribution to fit the degradation process of the device, which is in fact data-dependent. Data-driven methods mine the characteristics of the device degradation process from the historical run-to-failure data to identify the degradation pattern of current equipment. The hybrid methods are formed by the combination of the above three methods, thus obtaining the corresponding merits.
The focus of this section is to review the data-driven DA and RUL estimation methods based on AI, especially those based on DL. Since the research areas of DA and RUL estimation partially overlap, this section will summarize these problems from different horizons, to provide more diverse information and discussions. For the former aspect, a hierarchical overview is given by categories of DA methods. For the latter aspect, the motivations of the RUL approaches are discussed. Additionally, a brief introduction to the open source datasets and codes will be given since we believe the open source behavior will drive the prognostic community to grow rapidly. Last but not least, to provide more accurate information for the predictive maintenance, many pain points deserve attention, so the challenges of prognosis will be given at the end of this section.

Overview
Mechanical equipment usually has four states: normal state, performance degradation state, maintenance state, and decommissioning state. From the deterioration of equipment performance to the complete failure of equipment, it usually goes through a series of different performance degradation stages. DA of mechanical equipment synthesizes the state indexes of the equipment, evaluates the degree of performance degradation, and supports formulating a maintenance plan and taking targeted action. Scholars in the field of mechanical equipment health management have done a lot of research on DA. This subsection reviews the DA methods for mechanical equipment, which can be divided into two categories, traditional ML-based methods and DL-based methods, and summarizes the merits and shortcomings of these methods.

Traditional ML-Based Methods
Since ML algorithms play an important role in most respects of DA, scholars engaged in prognosis have carried out a lot of research in this area. From the analysis of the experimental results, the ML techniques, such as data dimension reduction, feature fusion and pattern recognition, etc., are very effective for DA problems.

(1) Fuzzy C-means clustering
Tong et al. [217] proposed a bearing DA model based on information theory metric learning and fuzzy C-means (FCM) clustering. The constructed degradation index showed superior performance. To solve the instability problem of bearing in the initial stage of operation, Liu et al. [218] proposed a method based on wavelet packet decomposition and autoregressive (AR) model to calculate the entropy of the health factor index, through FCM of bearing performance degradation process. Zhou et al. [219] proposed a rolling bearing DA method based on auto-associative neural network (AANN) and FCM. The features were extracted by wavelet packet decomposition and AR, and the features after dimension reduction were input into AANN. Then, the difference between the output vector and the input vector of AANN was input into FCM as the feature vector. In order to find the bearing fault in real-time, Zhou et al. [220] proposed a method based on wavelet packet Tsallis entropy and FCM to evaluate the performance degradation state of bearings.

(2) HMM
Jiang et al. [221] proposed a bearing DA method based on HMM and nuisance attribute projection (NAP). Aiming at the problem of poor robustness in DA, Jiang et al. [222] proposed a method of NAP based on student t-hidden Markov model, and removed interference components from performance degradation features by NAP. Hu et al. [223] proposed a first-order Markov state space model. For better expression in the state space model, the degraded state was transformed into PDF which formed HMM and Bayesian recursive estimation mechanism. Wang et al. [224] proposed a method of bearing DA based on hierarchical Dirichlet process (HDP)-HMM, in which HDP was used to obtain the state number of equipment in operation and HMM was used to evaluate performance degradation. To establish the index with an obvious trend, Li et al. [225] proposed the negative log likelihood probability based on the two-dimensional HMM as the bearing performance degradation index, showing the sensitivity to weak defects. Liu et al. [226] proposed a bearing DA method based on orthogonal local preserving projection (OLPP) and continuous HMM. The continuous HMM was used to train the data after dimension reduction by OLPP, and then the performance could be evaluated quantitatively by calculating the logarithmic likelihood of the data.

(3) PCA
To adapt the application of signal decomposition and feature extraction to wind turbine under high background noise, Pan et al. [227] proposed a DA method of vibration signal denoising fusion performance based on complete ensemble EMD with adaptive noise and kernel PCA. Ma et al. [228] proposed a DA method based on multi-sensor information fusion, which was extracted by the proposed method, to extract features and to establish the relevant DA model. Feng et al. [229] proposed a bearing DA method based on integrated EMD and PCA, which showed good effect in denoising and degradation evaluation.

(4) SVDD
Wang et al. [230] proposed a DA method of the rolling bearing based on VMD and SVDD. The characteristic vectors combined with VMD singular values, root mean square values, and sample entropy values, were selected as the evaluation indexes of the degradation degree, and then the performance degradation index of the test samples could be obtained by SVDD. Zhou et al. [231] proposed a bearing DA method based on lifting wavelet packet symbolic entropy (LWPSE) and SVDD. The SVDD was trained by fitting the hyper-sphere around the normal samples, and then the relative distance between LWPSE and the hyper-sphere boundary of the test signal was calculated as the bearing DA indicator.

(5) Clustering
Ding et al. [232] used manifold learning to extract features, achieved the comparison between abnormal data and health data, and calculated the feature clustering index to evaluate the degree of performance degradation. Tiwari et al. [233] proposed a DA method based on local mean decomposition (LMD) and spectral clustering to solve the problem of the high-dimensional feature space in rolling bearing DA. LMD was used to decompose signals, and spectral clustering was used to classify features. Lu et al. [234] proposed a compact Gaussian mixture clustering algorithm based on complementary ensemble EMD, which could distinguish the scattered features and obtain better DA results. Zhang et al. [235] proposed a bearing DA model based on the multi-scale entropy and K-medians clustering. The multi-scale entropy of bearing vibration signals was extracted from the original data, and the test data was input into the established K-medians clustering model, then the bearing failure degree could be quantitatively evaluated by the membership degree of the model output. Wang et al. [236] proposed a bearing DA method based on the basic scale entropy and Gath-Geva fuzzy clustering. Gath-Geva fuzzy clustering was used to divide the degradation stage and further to evaluate the degradation degree of bearing performance. Akhand et al. [237] proposed an evaluation method of bearing performance degradation based on EMD and K-medians clustering. The K-medians clustering was applied to features extracted from bearing signals by EMD, and then the dissimilarity between the test data and normal state was taken as the bearing performance degradation index.

(6) Hybrid methods
Zhou et al. [238] proposed a performance degradation evaluation method of the wind turbine bearing based on HMM and FCM. The FCM and HMM models were constructed via using the features extracted by wavelet packet and AR, and could better describe the decline trend of the bearing.

(7) Other methods
Prashant et al. [239] proposed a method for evaluating the performance degradation of ball bearings based on curve component analysis and self-organizing mapping (SOM) network, which was more sensitive to weak degradation. Akhand et al. [240] proposed a kind of bearing DA index based on SOM. The time-domain and frequency-domain features were extracted from the original bearing vibration signals and input into the SOM classifier to achieve the degradation metric by minimizing the quantization error of SOM. Because the global trend of the signal could not accurately reflect the running state of the rolling bearing, Zhu et al. [241] proposed a bearing DA method based on the improved fuzzy entropy. The baseline part was not removed when calculating the fuzzy entropy, but used as the index of DA of rolling bearing. Qin et al. [242] proposed a method based on segmentation vote and SVM. LMD and PCA were used to obtain effective indexes of bearing performance degradation.
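As one illustration of the clustering-based indicators reviewed in this subsection, the sketch below computes standard fuzzy C-means memberships for a feature vector with respect to fixed cluster centres, and uses one minus the membership of the healthy cluster as a degradation index. The centres, the two-feature layout (RMS, kurtosis), and the index definition are assumptions for illustration and do not reproduce any specific cited method.

```python
import math

def fcm_memberships(x, centers, m=2.0):
    """Fuzzy C-means membership of sample x in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with fuzzifier m > 1."""
    dists = [math.dist(x, c) for c in centers]
    # A sample lying exactly on a centre belongs fully to that centre.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def degradation_index(features, healthy_center, failure_center, m=2.0):
    """Assumed index: 0 at the healthy centre, approaching 1 near the failure centre."""
    u_healthy, _ = fcm_memberships(features, [healthy_center, failure_center], m)
    return 1.0 - u_healthy

# Hypothetical cluster centres in an (RMS, kurtosis) feature space.
healthy = [0.1, 3.0]
failure = [1.0, 8.0]

# Feature vectors drifting from the healthy centre towards the failure centre.
fractions = [0.1, 0.3, 0.6, 0.9]
trajectory = [[h + f * (g - h) for h, g in zip(healthy, failure)] for f in fractions]
index_curve = [degradation_index(p, healthy, failure) for p in trajectory]
```

In a full FCM pipeline the centres would be learned iteratively from training features rather than fixed by hand. Only the membership step is shown here.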

DL-Based Methods
Many scholars verified the effectiveness of DL algorithms, such as CNN and RNN, in DA. Here, we review DL-based methods from four perspectives, including CNN, RNN, hybrid methods, and other methods.
(1) CNN
For better DA, Zhang et al. [243] proposed a method for health index (HI) construction based on deep multi-layer perceptron CNN. In order to improve the DA of rolling bearings, Dong et al. [244] proposed a method based on the DAE, t-distribution stochastic neighborhood embedding (t-SNE) and improved CNN. The features were constructed by the DAE and t-SNE, and the degree of bearing performance degradation was characterized by Mahalanobis distance. In order to solve the problem of outliers in HI, Zhang et al. [245] proposed a method combining deep convolution inner ensemble learning with outlier removal to evaluate the degradation degree of bearing performance. The deep convolution inner ensemble learning was used to extract features from the original vibration signals, and then outlier removal based on the sliding threshold was used to remove outliers in HI. Guo et al. [246] proposed a method of bearing HI construction based on deep convolution feature learning, which used convolution kernels to extract features from the original vibration signals, and mapped the features into HI through the nonlinear transformation.

(2) RNN
Akpudo et al. [247] proposed an LSTM model, and the root mean square (RMS) statistical features in time domain were used as the key features to evaluate the degree of bearing degradation. Zhang et al. [248] proposed a bearing DA method based on RNN, which evaluated the bearing performance degradation degree through the waveform entropy index, and identified the bearing running state via inputting the waveform entropy index into RNN. Cheng et al. [249] proposed a DA method based on adaptive kernel spectral clustering (AKSC) and RNN. The DA method constructed a DA feature based on Euclidean distance, and used AKSC and RNN to identify machine faults. Shi et al. [52] proposed a bearing failure DA method based on SDAE. SDAE was used to reconstruct the rolling bearing signal processed by the sliding window. LSTM was used to predict the vibration value of rolling bearing in the next cycle based on the reconstructed signals. Meanwhile, the performance degradation degree of a bearing was evaluated by the reconstruction error.

(3) Hybrid methods
Wang et al. [250] proposed a structure based on CNN and LSTM. CNN was used to extract local features of the original sensor, and LSTM was used to extract the sequence features of the original signals. H-statistics calculated by d-statistics and q-statistics were used to evaluate the performance degradation of rolling bearings.

(4) Other methods
Xu et al. [251] proposed an improved unsupervised deep belief network model named median filtering DBN. The absolute amplitude of the original vibration signals was used as the direct input for less dependence on the artificial experience. Pan et al. [252] proposed a DA method based on DBN and SOM, which defined the minimum quantization error as the HI of early fault detection of the wind power transmission. Tu et al. [253] proposed a method combining ANN and AR to evaluate the degradation degree of rolling bearing. ANN was used to evaluate the performance degradation degree of bearing, and AR was used to evaluate the bearing performance according to the bearing DA results. Gai et al. [254] proposed a bearing DA method combining EMD-SVD (singular value decomposition) and the fuzzy neural network. Li et al. [255] proposed a DA method based on DNN and wavelet packet decomposition. After extracting wavelet coefficients and energy features from vibration signals, DNN was used to predict the performance degradation degree of rotating machinery.

RUL Estimation
Several definitions of RUL estimation have been introduced in Refs. [256-258]. To avoid confusion, this paper follows the definition from the International Organization for Standardization, in which RUL estimation is defined as the estimation of the time to failure. Data-driven RUL estimation methods can be divided into two strategies: matching-based and regression-based. For the former, a library of HIs (also called degradation indexes) is first constructed offline, and then the HI of the current equipment is calculated and matched against the library [259-264]. The key point of matching-based RUL estimation is to construct a monotone, smooth and obviously trending HI library. Since the construction of HI has been discussed in the previous subsection in detail, matching-based RUL estimation methods will not be included in this subsection. The regression-based data-driven RUL estimation methods mainly leverage the historical run-to-failure data to model the degradation process of equipment, and then evaluate the health status of the current operating equipment.
As shown in Figure 5, the regression-based data-driven RUL estimation framework generally includes the following steps:
1) Data acquisition: Data measured by appropriate sensors are collected and saved.
2) Data processing: Data are cleaned at this stage to improve data quality; common methods include denoising and interpolation.
3) Feature extraction and selection: Features are extracted and selected to reflect the degradation trend. Traditional methods often use statistical features, such as kurtosis and RMS, while DL-based methods can use unsupervised learning to extract deep features automatically.
4) States partition: The run-to-failure data are divided into the health state and the degradation state, which is also called the determination of the first predicting time [265] or the fault occurrence time [266].
5) HI construction: The previously extracted features are fused to construct a monotonic, smooth HI with a clear trend. Some methods take the HI as the label of the regression model, while others directly take the RUL as the label. Therefore, HI construction is not necessary for RUL estimation. It is worth noting that the HI and the RUL are easily confused: they are not the same quantity, although a mapping relationship exists between them. The HI is usually represented as a curve with jitter, while the RUL is represented as a linear or piece-wise linear function.
6) Model building and optimization: The regression model is built offline based on historical data, and the trained model is then deployed online.
7) Performance evaluation: The performance of the model is evaluated. For RUL estimation, common evaluation metrics include mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), mean relative error (MRE), the scoring function (SF), and some variants. Evaluation metrics guide the training of the model; for example, Elsheikh et al. [267] proposed a safety-oriented metric that was biased towards earlier estimation. Therefore, an appropriate evaluation metric should be designed according to the application scenario.
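The evaluation metrics listed in the performance evaluation step can be illustrated with a minimal sketch. The function name `rul_metrics` is hypothetical, and the exact form of the scoring function is an assumption: the text does not define SF, so the sketch uses the asymmetric exponential score commonly reported with the C-MAPSS benchmark (constants 13 and 10), which penalises late predictions more heavily than early ones.

```python
import math


def rul_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MRE and an asymmetric score for RUL predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # d > 0 means the predicted RUL is too long (a late, less safe prediction).
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(d * d for d in errors) / n
    mae = sum(abs(d) for d in errors) / n
    # The relative error is only defined where the true RUL is non-zero.
    rel = [abs(d) / abs(t) for d, t in zip(errors, y_true) if t != 0]
    mre = sum(rel) / len(rel) if rel else float("nan")
    # Assumed scoring function: exponential penalty, steeper for late predictions.
    score = sum(
        math.exp(d / 10.0) - 1.0 if d >= 0 else math.exp(-d / 13.0) - 1.0
        for d in errors
    )
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "mre": mre, "score": score}
```

With this choice, an overestimate of 10 cycles receives a larger score than an underestimate of 10 cycles, which is one way to encode the safety-oriented bias towards earlier estimation mentioned above.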

Traditional ML-Based Methods
RUL estimation based on traditional ML methods has been developed for more than 20 years, and a variety of traditional ML methods, such as SVM [268][269][270], HMM [271,272], and ANN [273][274][275][276], have been widely studied for this task. Since the theories and applications of these technologies are relatively mature, there have been many excellent reviews of traditional ML-based RUL estimation [13,14]. Moreover, this section focuses on DL-based RUL estimation, so only recent ML-based RUL estimation methods are briefly introduced. Traditional ML-based RUL estimation generally consists of two steps. The first step is to extract high-quality features from the original signals. The second step is to train the regression model on the extracted features. Since the model theory is largely mature, the difficulty of applying these models lies in matching them to the specific task, achieving better accuracy and efficiency, and interpreting the results.
In addition to the aforementioned classic methods, some scholars have done valuable work targeting the pain points of the RUL estimation task. Considering the small sample size of run-to-failure data, transfer learning was used to address the problem of inconsistent data distributions [266,277]. Zhang et al. [278] proposed a prognostic method based on a dynamic Bayesian network with mixture-of-Gaussians output to deal with missing data in real scenarios. Compared with DL-based methods, traditional ML-based methods are more interpretable. However, due to their limited model capacity, it is difficult for traditional ML-based methods to fit massive and high-dimensional data.

DL-Based Methods
There have been several insightful reviews of DL-based RUL estimation, such as Ref. [279]. Most of the literature has been classified and reviewed according to the type of DL model, which gives a review a clear hierarchy but can also be misleading. Although different DL models can accomplish specific tasks, focusing too heavily on the models themselves tends to neglect the RUL estimation task itself. Therefore, this paper organizes the review according to motivations rather than categories of DL models.
(1) Spatial-temporal feature extraction
DL-based methods can realize automatic feature extraction and easily combine the extracted features with the subsequent regression model to construct an end-to-end optimization pipeline. As a result, most DL-based RUL estimation methods leverage this powerful spatial-temporal feature extraction ability. These methods extract spatial-temporal features from the time domain, the frequency domain, and the time-frequency domain, in which the spatial and temporal features benefit from CNN and RNN, respectively.
Original signal: It is feasible and convenient to directly input the original signal into DL models for RUL estimation. Li et al. [280] directly input multi-source sensor signals into an LSTM. The CNN also showed potential for processing raw signals in the RUL task [281]. With the exploration of the ability of 1D CNN to extract local features of time series, some scholars found that the spatial-feature extraction capability of CNN and the temporal-feature extraction capability of RNN can be obtained simultaneously by connecting CNN and RNN structures sequentially [282,283]. For better feature representation, more complex parallel networks have been designed. Li et al. [284] built two feature extraction branches with LSTM and CNN respectively, and then modeled the fused features with another LSTM. Thanks to CNN and LSTM, these methods combine spatial and temporal features through network structures that automatically extract features from the original signals. Benefiting from the end-to-end training mode of DL methods, the feature extraction networks and prediction networks are optimized jointly, which improves the efficiency and accuracy of fault prognosis.
Signal processing: Although the sliding window sampling strategy preserves part of the non-stationary relations between different windows, the calculation of statistical features loses the non-stationary information within a single window. Zhao et al. [285] extracted manual features to form feature sequences, and then modeled the sequences with a Gated Recurrent Unit network. Although more non-stationary relations can be retained by reducing the window length, there is a trade-off between feature quality and window length, and dense sliding windows incur a heavy computational cost. Therefore, introducing time-frequency analysis is a natural idea, as techniques such as the envelope spectrum [286], discrete wavelet transform [287], continuous wavelet transform [288], and STFT [289,290] can better preserve the non-stationary relationships within a single sliding window. In addition to these classical time-frequency analysis techniques, there were also some methods that extracted frequency-domain features directly [291][292][293]. These methods combined the advantages of both signal processing and DL, incorporated more domain priors, and performed well in the RUL estimation task.
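To make the sliding-window statistical features concrete, the following is a minimal sketch and not the procedure of any cited paper. The function name `window_features` and its parameters `window` and `step` are hypothetical; RMS and kurtosis are chosen because the text names them as typical statistical features. Each window is collapsed to two numbers, which shows why information about how the signal evolves inside a window is lost.

```python
import math


def window_features(signal, window, step):
    """Slide a window over a 1-D signal and return (RMS, kurtosis) per window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        mean = sum(seg) / window
        var = sum((x - mean) ** 2 for x in seg) / window
        rms = math.sqrt(sum(x * x for x in seg) / window)
        # Kurtosis as the fourth standardised moment (not excess kurtosis);
        # a constant segment has zero variance, so 0.0 is returned for it.
        if var > 0:
            kurt = sum((x - mean) ** 4 for x in seg) / window / var ** 2
        else:
            kurt = 0.0
        feats.append((rms, kurt))
    return feats
```

A smaller `step` or `window` yields a longer feature sequence and retains more of the relations between windows, at the cost of more computation and noisier per-window statistics.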
Unsupervised learning: Unsupervised learning can extract a good data representation without label information, and thanks to a large body of work, its performance on some tasks has approached or surpassed that of supervised learning. Thus, in addition to the above supervised methods for feature extraction, unsupervised learning can also obtain excellent feature representations, mainly through DBN [294,295], AE, and its variants [244][296][297][298][299].
(2) Multi-modal and multi-task
Different types of data (temperature, pressure, vibration, etc.) from the same equipment are collected by several sensors simultaneously, and different sensors reflect different condition information, which requires the model to be able to process multi-modal data. As the task division of PHM becomes more and more fine-grained, modeling each task separately is time-consuming, and two models for highly similar tasks may produce inconsistent decisions, so using a unified model to accomplish multiple tasks is a promising approach. Multi-modal learning and multi-task learning each rely on a shared trunk model, despite their different motivations. For the former, the trunk model takes multiple input sources to extract redundant features from a broad perspective, and this strategy ensures that the model can work effectively when some modalities are absent. For the latter, the trunk model has multiple output branches, each of which corresponds to a specific task, while the trunk model provides a shared feature subspace.
Multi-modal learning: In most methods validated on the C-MAPSS dataset, multi-sensor data were used as the input of the trunk model. However, the concept of multi-modality was vague in these papers and was highlighted in only a few of them [300,301]. The C-MAPSS dataset has 21 sensors and 3 operational conditions (altitude, Mach number, and sea-level temperature), so it is natural to consider multi-modal learning on this dataset. For other specific tasks, approaches based on multi-modal data were less popular because collecting additional data often meant higher costs. Herp et al. [302] proposed a prognostic model for the wind turbine main bearing based on multi-modal data (actual wind speed, temperature, active power, etc.). He et al. [303] considered data from 6 sensors and 5 operational settings in the RUL estimation task for the ion mill etching flowcool system.
Multi-task learning: The purpose of multi-task learning is to explore the common feature subspace between tasks with a joint model. Miao et al. [304] completed the RUL estimation task together with the DA task, and trained a trunk LSTM network by weighting two loss terms. Liu et al. [305] combined the RUL estimation task with the fault recognition task based on a trunk CNN model. Aggarwal et al. [256] discussed the relationship between short-term failure prediction and long-term RUL estimation, arguing that a joint model could prevent inconsistent results.
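The idea of training one trunk model on a weighted combination of task losses can be sketched as follows. This is an illustration only: the text does not give the weighting scheme of the cited works, so the convex combination, the function name `multitask_loss`, and the parameter `rul_weight` are all assumptions. The regression branch is scored with MSE and the classification branch with cross-entropy.

```python
import math


def multitask_loss(rul_true, rul_pred, class_true, class_probs, rul_weight=0.5):
    """Weighted sum of an RUL regression loss and a fault classification loss."""
    if not 0.0 <= rul_weight <= 1.0:
        raise ValueError("rul_weight must lie in [0, 1]")
    # Regression branch: mean squared error on the RUL output.
    mse = sum((p - t) ** 2 for t, p in zip(rul_true, rul_pred)) / len(rul_true)
    # Classification branch: mean cross-entropy of the true class probability.
    eps = 1e-12  # guards against log(0)
    ce = -sum(
        math.log(max(probs[c], eps)) for c, probs in zip(class_true, class_probs)
    ) / len(class_true)
    # Assumed combination rule: a convex combination of the two task losses.
    return rul_weight * mse + (1.0 - rul_weight) * ce
```

In a real network both terms would be computed from the two output branches of the shared trunk, so that gradients from both tasks shape the common feature subspace.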

(3) States partition
The difficulty of RUL estimation is to determine the label for the input at each time step. Some methods take the HI as the label for the regression model. However, there is still a mapping relationship between the HI and the actual RUL, and since most HIs are not completely monotonic, one HI point may correspond to several RUL points. Therefore, many methods directly use the actual RUL as the label, in the form of linear or piece-wise linear functions. If the degradation mechanism of a physical system is unknown, it is natural to use the linear function, since the actual RUL of the system decreases step by step. However, the piece-wise linear function will be more consistent with the data distribution if the degradation mechanism is known. The run-to-failure data typically contain the health state and the degradation state of the equipment. Generally, the distribution of health state data is concentrated, while the distribution of degradation state data tends to be scattered [291]. To provide corresponding labels for each state, the piece-wise linear function assumes that the expected value of the RUL is constant in the health state and linearly decreasing in the degradation state.
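The linear and piece-wise linear labelling schemes can be sketched for a single run-to-failure sequence as follows. The function name `piecewise_linear_rul` and the parameter `max_rul` are hypothetical; `max_rul` stands for the constant RUL assigned to the health state, and the sketch assumes that the last recorded step corresponds to failure (RUL of 0).

```python
def piecewise_linear_rul(n_cycles, max_rul=None):
    """Return one RUL label per time step of a run-to-failure sequence.

    With max_rul=None the label decreases linearly to 0 at the last step.
    Otherwise the label is held at max_rul during the health state and
    decreases linearly once the true RUL falls below that value.
    """
    if n_cycles <= 0:
        raise ValueError("n_cycles must be positive")
    linear = [n_cycles - 1 - t for t in range(n_cycles)]
    if max_rul is None:
        return linear
    return [min(r, max_rul) for r in linear]
```

For a sequence of 5 steps, `piecewise_linear_rul(5)` gives `[4, 3, 2, 1, 0]`, while `piecewise_linear_rul(5, max_rul=2)` gives `[2, 2, 2, 1, 0]`. Choosing `max_rul` is equivalent to choosing the degradation occurrence time.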
The challenge for the piece-wise linear function is to determine the degradation occurrence time, i.e., how to divide run-to-failure data into the health state and the degradation state. Most methods rely on empirical rules to select an appropriate threshold, such as observing the trend of the HI, so the selection of the threshold mainly depends on expert experience. Mao et al. [297] calculated the Pearson correlation coefficient between the health state and the degradation state based on features from SDAE. Although the step of feature extraction was skipped, the setting of a threshold was still inevitable. Therefore, some scholars considered how to reduce the impact of expert knowledge on states partition to improve the generalization ability of the RUL model. Li et al. [265] regarded states partition as an anomaly detection problem and used GAN to learn the health state data distribution, and then determined the first predicting time. Xia et al. [306] divided the run-to-failure data into different degradation stages and then classified these states with DL models. Yang et al. [307] also regarded states partition as a classification problem, but added more rules from engineering experience, such as rapid or slow degradation patterns and a "3/5" principle. States partition is still challenging for the RUL estimation task, as it is impossible to construct a common model or rule for all devices and application scenarios.
(4) Transfer learning
In general, the RUL estimation task constructs an offline model on existing historical run-to-failure data and then makes online predictions for a new object. However, collecting large amounts of run-to-failure data in real-world scenarios is expensive and time-consuming. In industry, major equipment is not allowed to operate in a near-failure condition, which means that it is also difficult to collect complete degradation process data.
As a result, the RUL estimation task naturally faces few-shot and generalization problems. Therefore, some scholars have studied this issue using transfer learning. In a relatively early study, Zhang et al. [308] adopted a fine-tuning strategy, in which the model was pre-trained on a large related dataset and then fine-tuned on the target dataset with only a few samples. Sun et al. [296] proposed a deep transfer learning method based on SAE using three transfer strategies (weight transfer, feature transfer, and weight update) for tool RUL estimation. Mao et al. [297] aligned the features of the source and target domains by transfer component analysis. Yu et al. [309] proposed a transfer learning method for RUL estimation that reduces the distribution discrepancy between the source and target domains based on the maximum mean discrepancy. Meanwhile, feature alignment strategies based on adversarial learning for RUL estimation have been applied in Refs. [265,310].
(5) Uncertainty modeling
If the cost and risk of decisions need to be considered, it is necessary to estimate the uncertainty of the RUL estimation. Sankararaman et al. [311] argued that the traditional classification of uncertainty sources (physical variability and lack of knowledge) may not be applicable to RUL estimation, and proposed a different classification, in which the sources of uncertainty in RUL estimation are: the current unknown state of the system, future uncertainty (i.e., the loading, operating, environmental, and usage conditions), modeling uncertainty, and the actual data distribution. Purposeful modeling can be performed if these factors are clearly known. The wandering setpoints, input current, and fault magnitude were used to model the uncertainty for sensor prognosis based on the Gaussian process in Ref. [312]. The Wiener process is also a common uncertainty modeling method [313,314].
If the uncertainty factors are too complex to be recognized, latent modeling can be carried out, and the mainstream method is based on the Bayesian theory. Peng et al. [315] and Wang et al. [316] proposed to quantify the uncertainty of the RUL estimation using the Bayesian DL network based on Monte Carlo (MC) dropout, which was proved to be an effective Bayesian approximation in Ref. [317]. In addition to the Bayesian method, the sampling-based ensemble learning can also realize the uncertainty modeling and quantification for the RUL estimation [318,319]. By training the sub-models with various sub-datasets, an ensemble learning method can be formed, and the actual distribution of few-shot data can be estimated effectively. In fact, latent modeling avoids the challenge of identifying the source of uncertainty, but it becomes more difficult to interpret the uncertainty. Sankararaman et al. [311] discussed in detail the significance, interpretation, and quantification of uncertainty in the RUL estimation, and then compared several methods of uncertainty propagation.
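The sampling-based ensemble idea can be illustrated with a minimal bootstrap sketch. It is not the method of Refs. [318,319]: a simple least-squares line stands in for each sub-model, and the names `fit_line`, `bootstrap_rul`, `n_models`, and `seed` are hypothetical. Each sub-model is trained on a resampled sub-dataset, and the spread of the sub-model predictions is taken as the uncertainty estimate.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample with a single distinct x: fall back to the mean.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def bootstrap_rul(xs, ys, x_new, n_models=200, seed=0):
    """Return the mean and standard deviation of ensemble RUL predictions."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        # Draw a sub-dataset of the same size with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a * x_new + b)
    return sum(preds) / len(preds), statistics.stdev(preds)
```

The returned standard deviation reflects how sensitive the prediction is to the particular samples available, which is most informative when only a few run-to-failure records exist.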

Open Source Datasets
AI-enabled prognosis requires a large amount of high-quality run-to-failure data, a requirement that is difficult to satisfy in real-world scenarios. Several factors contribute to this: 1) The device will not be allowed to operate near the failure time for security and economic reasons, which means that data covering the full degradation process are rare; 2) Different conditions and fault types produce different degradation processes, and the cost of exhaustive experiments is unacceptable; 3) Individual differences lead to inconsistencies between the distribution of data from historically failed equipment and that of data from equipment currently in operation, and it is expensive to perform experiments on a large number of subjects.
Fortunately, as shown in Table 4, several mechanical prognosis datasets have been shared from a few institutions, and NASA collected some of the open source datasets to build the Prognostics Center of Excellence (PCoE) database [320]. Additionally, the open source datasets provide a baseline standard for validation of various algorithms.

Open Source Codes
Open-sourcing code in the prognosis field is necessary and important for promoting theoretical research and applications, and will also have a positive effect on the upstream and downstream tasks of PHM. However, there are only a few open source projects for prognosis because of the difficulty of this research. Oyharcabal et al. [325] coupled the convolution kernel to the operation of RNN and verified it on the C-MAPSS dataset. Lahiru et al. [326] described the overall process of RUL estimation for the C-MAPSS dataset in detail, including data structures, labels, data augmentation, etc. Libera et al. [327] applied Bayesian and frequentist DL models to RUL estimation. Chen et al. [328] used an attention mechanism to model the importance of extracted features and also released the source code.

Generalization Ability
As previously mentioned, few-shot data make it difficult to accurately predict the RUL of a new object. Transfer learning has been used to enhance the generalization ability of RUL estimation, but mainly from the feature or model perspective. A natural way to enhance this ability is to increase the data volume, so how to generate more high-quality samples from the existing data remains a challenge. Although it is impossible to obtain the actual data distribution, it can be estimated, for example by using a resampling strategy or an adversarial learning strategy to generate high-quality virtual samples. Additionally, with a digital twin model of a specific mechanical system, a large number of degradation processes under different conditions can be generated by changing the working conditions, fault types, and other variables.

Prognosis in Real-World Scenarios
There are many limitations and uncertainties in the real open world, such as restricted computing resources, variable working conditions, unknown failure modes, etc. The lack of computational power means that AI-enabled prognosis methods cannot be directly applied to the real scenario. Meanwhile, to ensure real-time prognosis, it is necessary to design a lightweight model, and common methods include model compression and pruning. Additionally, the open scenario requires the model to be able to continuously update parameters online, since the data distribution of the open scenario and that of training data are often inconsistent.
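As one illustration of how pruning reduces model size, the sketch below applies unstructured magnitude pruning to a flat list of weights. The text only names pruning in general, so this specific variant is an assumption, as are the function name `magnitude_prune` and the parameter `sparsity`.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

In practice the pruned model is usually fine-tuned afterwards to recover accuracy, and the zeroed weights only save computation when the deployment runtime exploits sparsity.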

Combination of Data-Driven and Model-Driven Methods
As equipment becomes increasingly complex, it is usually difficult for a single method to accurately evaluate and predict the RUL. By combining data-driven and model-driven methods to establish more effective health indicators, we can make full use of both the powerful feature extraction ability of data-driven methods and the interpretability of model-driven methods. For example, Yucesan et al. [329] designed a physics-informed layer based on damage increment within deep neural networks to predict wind turbine main bearing fatigue. In addition, Wang et al. [330] proposed a cross physics-data fusion scheme and a loss function which embeds physical discipline for machine tool wear prediction.

Conclusions
In this paper, we review the current development of AI-enabled approaches, especially DL-based approaches, in monitoring, diagnosis, and prognosis, which are three essential ingredients of PHM. In addition, we emphasize the importance of open source datasets and codes for the healthy development of the AI-enabled PHM research community. For monitoring, we summarize the main challenges, including the balance between recall and precision, unified benchmark datasets, quick alarms for early failure, and adaptability under variable working conditions. For diagnosis, we conclude that the main challenges are interpretability, transfer learning, class imbalance learning, and few-shot learning. For prognosis, we further summarize the challenges as generalization ability, prognosis in real-world scenarios, and the combination of data-driven and model-driven methods. We hope that this review provides valuable discussion for future research and draws researchers' attention to the construction of the open source community.