In Section 2, we provided an outline of the martingale test for change detection. Two problems have to be further considered for long-term machinery monitoring:

(1)
How to deal with multi-change detection in long-term monitoring?

(2)
Is it possible to adaptively compute the threshold value when making the change decision?
In the following, we discuss these two problems in detail and present our proposed schemes.
3.1 Change Detection Using Incremental Sliding-Window
Problem 1. As shown in Eq. (3) and Eq. (5), \(M_{t}\) can be sequentially processed with a fixed-length L sliding window over the given data stream, and all possible change candidates \(t \in \{ 1,2, \ldots ,n\}\) are tested. This process, however, may be unsuitable for long-term monitoring applications. A key feature of real machine operations is temporal variation, i.e., one operation can last for a long time or only a few seconds. Hence, it is difficult to use a fixed-length L sliding window to capture transitions (i.e., changes from one operational state to another) in long-term monitoring. More specifically, a small length L causes over-detection of changes, while a large length L causes a large detection delay. To overcome this problem, we combine the martingale with an incremental sliding-window strategy [27] to design a real-time change detection algorithm for Eq. (5).
Proposed scheme: By virtue of the incremental sliding window, the length L is automatically updated depending on whether a change is detected at time t:
$$\begin{aligned} &\text{If no change is detected at } t:\;\; n_{t + 1} = n_{t},\quad L_{t + 1} = L_{t} + \Delta L, \\ &\text{If a change is detected at } t:\;\; n_{t + 1} = t,\quad L_{t + 1} = L_{1}, \end{aligned}$$
(6)
where \(n_{t}\) is the starting time when computing the current martingale and \(L_{t}\) is the length of the corresponding sliding window at time t. The process starts with \(n_{1} = 1\) and \(L_{1} = 1\), and ends when \(n_{t} + L_{t} > n\), where n is the length of the given data stream in offline applications, or at a predefined stopping time in online applications. Here, it is worth mentioning that ∆L is the increment step used to grow the sliding window; it was set to 1 in the following experiments.
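The update rule of Eq. (6) can be sketched as follows (a minimal illustration; the function name and the boolean `change_detected` flag are our own notation, not from the paper):

```python
def update_window(t, n_t, L_t, change_detected, L_1=1, delta_L=1):
    """Incremental sliding-window update of Eq. (6).

    While no change is detected, the window start n_t is kept and the
    window length grows by delta_L; once a change is detected at time t,
    the martingale restarts from t with the initial length L_1.
    """
    if change_detected:
        return t, L_1           # restart: n_{t+1} = t, L_{t+1} = L_1
    return n_t, L_t + delta_L   # grow:    n_{t+1} = n_t, L_{t+1} = L_t + dL
```

With ∆L = 1, the window therefore expands one sample at a time between detected changes, which matches the setting used in the experiments.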
3.2 Adaptive Threshold for Change Detection
Problem 2. When making the change decision by testing the null hypothesis in Eq. (5), the threshold \(\lambda\) is essential, as it balances the detection precision and recall (their definitions are given in Section 4.3). In general, the value of \(\lambda\) is predefined empirically or confirmed by a prior estimation for change detection. It is, however, often difficult to determine the optimal value in real-world applications. To address this problem, unlike existing works that directly used the original monitored variables for change detection (e.g., Refs. [6,7,8,9,10,11,12, 25]), we utilize the Hilbert space embedding of distributions (HED, also called the kernel mean or mean map) to map the original data \(\{ x_{i} \}\), \(i \in \{ 1,2, \ldots ,n\}\), into a Reproducing Kernel Hilbert Space (RKHS) (see Figure 1). Without going into details, the idea of using HED for change detection is straightforward: the probability distribution is represented as an element of an RKHS, and changes can thus be detected using a well-behaved smoothing kernel function, whose values are small on data belonging to the same pattern and large on data from different patterns.
Proposed scheme: Inspired by Ref. [28], probability distributions can be embedded in an RKHS. The core of the HED is the mean mapping function:
$$\begin{aligned} \mu (P_{x}) &= E\left[ k(x_{i}) \right], \\ \mu (\{ x_{i} \}) &= \frac{1}{t}\sum\limits_{i = 1}^{t} k(x_{i}), \end{aligned}$$
(7)
where \(\{ x_{i} \}\), \(i = 1,2,\ldots,t\), are assumed to be i.i.d. samples from the distribution \(P_{x}\). Under mild conditions, \(\mu (P_{x})\) (and likewise \(\mu (\{ x_{i} \})\)) is an element of the Hilbert space. The mapping \(\mu (P_{x})\) is attractive because each distribution \(P_{x}\) has a one-to-one correspondence with its mapping \(\mu (P_{x})\). Thus, we can use the function norm \(s(\mu (P_{x}), k(x_{t}))\) (instead of \(s(P\{ x_{1}, x_{2}, \ldots, x_{t - 1} \}, x_{t})\) used in Ref. [1]) to quantify the strangeness value \(s_{t}\) for \(x_{t}\). We do not need access to the actual distribution but only finite samples to estimate \(\mu (P_{x})\).
Lemma 1. As long as the Rademacher average [29], which measures the "size" of a class of real-valued functions with respect to a probability distribution, is well behaved, the finite-sample estimation error converges to zero; thus finite samples empirically approximate \(\mu (P_{x})\) (see Ref. [28] for more details).
The success of kernel methods largely depends on the choice of the kernel function k, which is chosen according to domain knowledge or from universal kernels. In this paper, we employ the widely used Gaussian radial basis function (RBF) kernel, given by
$$k(x_{i}) = \exp \left( - \frac{1}{2\sigma^{2}} \left\| x_{i} - \bar{x} \right\|^{2} \right),$$
(8)
where \(\bar{x}\) and \(\sigma\) are the sample mean and standard deviation of the data stream \(\{ x_{1}, x_{2}, \ldots, x_{i} \}\). We next construct \(s_{t}\) to measure the strangeness of \(x_{t}\) relative to the past data stream up to time t−1, i.e., \(\{ x_{1}, x_{2}, \ldots, x_{t - 1} \}\), in the RKHS, as
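The RBF kernel value of Eq. (8) can be computed as in the following sketch (the function name is ours; \(\bar{x}\) and \(\sigma\) are assumed to have been estimated from the stream seen so far):

```python
import numpy as np

def rbf_kernel(x_i, x_bar, sigma):
    """Gaussian RBF kernel of Eq. (8): measures how close a sample x_i
    is to the stream mean x_bar, scaled by the stream std sigma.
    Returns 1 at the mean and decays toward 0 with distance."""
    x_i = np.asarray(x_i, dtype=float)
    x_bar = np.asarray(x_bar, dtype=float)
    sq_dist = np.sum((x_i - x_bar) ** 2)   # ||x_i - x_bar||^2
    return float(np.exp(-sq_dist / (2.0 * sigma ** 2)))
```

Note that in a streaming setting \(\bar{x}\) and \(\sigma\) would be updated incrementally as each new sample arrives.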
$$s_{t} = s(\mu (P_{x}), k(x_{t})) = \left\| k(x_{t}) - k_{t - 1}^{c} \right\|,$$
(9)
where \(k_{t - 1}^{c}\) is the kernel center of the data stream and \(\left\| \cdot \right\|\) is a distance metric. It is worth mentioning that in real engineering scenarios, the CM variables are often composed of multi-dimensional values measured from multiple sensors at each time instance. We therefore use the Mahalanobis distance [30] to compute the strangeness \(s_{t}\), taking correlations between variables into account so that different patterns in each dimension can be identified and analyzed [30]:
$$s_{t} = \sqrt{ \left( k(x_{t}) - k_{t - 1}^{c} \right)^{\prime} \, \Sigma^{-1} \left( k(x_{t}) - k_{t - 1}^{c} \right) },$$
(10)
where \(\Sigma\) is the covariance matrix.
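The Mahalanobis strangeness of Eq. (10) can be sketched as follows (an illustration under the assumption that the kernel center \(k_{t-1}^{c}\) and covariance \(\Sigma\) have already been estimated from past data; names are ours):

```python
import numpy as np

def strangeness(k_xt, k_center, cov):
    """Mahalanobis strangeness of Eq. (10): distance between the mapped
    point k(x_t) and the kernel center k^c_{t-1}, weighted by the inverse
    covariance so that correlated sensor channels are decorrelated."""
    d = np.asarray(k_xt, dtype=float) - np.asarray(k_center, dtype=float)
    cov_inv = np.linalg.inv(np.asarray(cov, dtype=float))
    return float(np.sqrt(d @ cov_inv @ d))
```

With an identity covariance the expression reduces to the Euclidean norm of Eq. (9), so Eq. (9) can be seen as the special case of uncorrelated, unit-variance channels.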
Since we use the RBF kernel given in Eq. (8), an isolated data point can be certified if \(s_{t} \ge \alpha \cdot \sigma\), where \(\alpha\) is a fixed global factor controlling the confidence level of detection and \(\sigma\) is the standard deviation computed from the existing data (that is, an adaptive threshold).
Based on this fact, the kernelized change decision can be made by rewriting Eq. (5) as follows:
$$\begin{aligned} &\text{If } 0 < M_{t} < \alpha \cdot K \cdot \sigma_{t - 1}:\;\; \text{no change,} \\ &\text{If } M_{t} \ge \alpha \cdot K \cdot \sigma_{t - 1}:\;\; \text{change occurs,} \end{aligned}$$
(11)
where K is a projection coefficient of data from the RKHS to the martingale space, and \(\sigma_{t - 1}\) can be computed adaptively from the data stream up to time \(t - 1\). In real implementations, the employed Gaussian function is often standardized to a normal Gaussian (i.e., \(\mu\) = 0 and \(\sigma\) = 1). Figure 2 gives typical confidence levels corresponding to different α in a Gaussian distribution. Thus, \(K\) can be fixed as \(K \approx 2.17\) by an offline estimation, made as follows: (a) given a set of data streams containing changes, we first define the detection accuracy as q = N/P, where N is the number of correctly detected changes and P is the number of total detected changes, and set a threshold value \(\lambda^{*}\) that guarantees perfect accuracy, i.e., q = 100%; (b) we then decrease \(\lambda^{*}\) gradually as long as q does not decrease; (c) once q decreases, K is computed by
$$K = \frac{\lambda^{*}}{5 \cdot \sigma},$$
since five times σ guarantees that all changes can be detected.
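The adaptive decision rule of Eq. (11) can be sketched as follows (a minimal illustration; the function name is ours, and \(K \approx 2.17\) is the offline estimate reported above):

```python
def change_decision(M_t, sigma_prev, alpha, K=2.17):
    """Kernelized change decision of Eq. (11): flag a change when the
    martingale value M_t reaches the adaptive threshold
    alpha * K * sigma_{t-1}, where sigma_prev is the standard deviation
    estimated from the data stream up to time t-1."""
    return M_t >= alpha * K * sigma_prev
```

Because \(\sigma_{t-1}\) is re-estimated from the stream at every step, the effective threshold tracks the current data, removing the need to hand-tune a fixed \(\lambda\).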