BLIND FAILURE PREDICTION
In the previous article, we saw how to recognize a problem involving cyclic data. Now we will work through a concrete case of such a problem: using our feature generation software designed for cyclic problems, we will observe what gains it brings when combined with classical anomaly detection models.
The data comes from vibration measurements on a rotating system, here a ball bearing (dataset link).
Getting started with the data
The dataset consists of many files; some correspond to healthy measurements, others to certain types of defects. Here, we approach the problem simply by taking the file H-A-1.mat, which contains the reference data for the healthy state, and O-A-1.mat, which contains the data for the fault to be detected. There will be only one type of defect, located on the outer race of the ball bearing.
Each file contains data sampled at 200,000 Hz, for 10 seconds of measurement. Two sensors are available: vibration data from an accelerometer, and the rotational speed of the bearings.
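As a starting point, each .mat file can be read with SciPy. This is a minimal sketch; the variable names inside the files ("Channel_1" for the accelerometer, "Channel_2" for the speed) are assumptions to be checked against the real files, for example by inspecting the keys of the loaded dictionary.

```python
# Minimal sketch of loading one measurement file with SciPy.
# The in-file variable names ("Channel_1", "Channel_2") are ASSUMPTIONS;
# inspect loadmat(path).keys() on the real H-A-1.mat to find the actual names.
import numpy as np
from scipy.io import loadmat

FS = 200_000  # sampling rate in Hz; 10 s of measurement -> 2,000,000 points per channel


def load_measurement(path):
    mat = loadmat(path)
    vibration = np.ravel(mat["Channel_1"])  # accelerometer signal (assumed name)
    speed = np.ravel(mat["Channel_2"])      # rotational speed (assumed name)
    return vibration, speed
```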
Here, the problem we choose is the following: from the healthy data alone, build a model that can distinguish healthy data from anomalous data.
To deal with this problem, we will use three classic models from the machine learning library sklearn, among them One-Class SVM and Local Outlier Factor.
The objective is not to go into the details of these methods, nor to optimize the resulting models as much as possible. Rather, it is to see how an adequate feature generation tool can improve the performance of models learned from the data.
However, it is still necessary to look at the inputs and outputs these models expect.
These three models, like many machine learning models, do not work natively on time series. They learn to associate one fixed-size input with one output. For example, they can associate a 10-second segment of measurements with a "healthy" or "abnormal" label, but they are not designed to follow the state of a system continuously over time by producing a time series describing the evolution of its health.
A particularity of these models is that they are anomaly detection models: they learn using only healthy data. During evaluation on new data, the trained models return a "healthy" or "anomaly" label; the "anomaly" label is returned when a deviation from the healthy state is detected.
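This fit-on-healthy-only workflow can be sketched as follows with sklearn. The random arrays stand in for segments or features, and the hyperparameters are placeholders, not the values used for the article's results; note that LocalOutlierFactor needs novelty=True to be able to score data unseen at fit time.

```python
# Sketch of the anomaly-detection workflow: fit on healthy data only,
# then label new data. Hyperparameters here are placeholders.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
healthy_train = rng.normal(size=(100, 20))  # stand-in for healthy training segments
new_segments = rng.normal(size=(10, 20))    # stand-in for data to evaluate

models = {
    "One-Class SVM": OneClassSVM(nu=0.05),
    # novelty=True is required to call predict() on data unseen at fit time
    "Local Outlier Factor": LocalOutlierFactor(novelty=True),
}
predictions = {}
for name, model in models.items():
    model.fit(healthy_train)                          # training uses healthy data only
    predictions[name] = model.predict(new_segments)   # +1 = "healthy", -1 = "anomaly"
```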
To apply these models to the problem, we will cut the healthy and anomalous data into small segments. These segments are the model inputs, and each receives a "healthy" or "abnormal" label depending on where it was extracted. The models are trained on a fraction (here 12.5%) of the healthy data, and validated on the rest of the data.
A first possible approach is to divide both the healthy data and the fault data into segments of regular size. Here, we choose segments of 10,000 points (thus 10 times the size of Figure 4). Each segment is associated with its label: "healthy" or "abnormal".
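This fixed-length segmentation is a simple reshape. A minimal sketch, with the incomplete tail of the signal discarded:

```python
# Fixed-length segmentation: cut a 1-D signal into consecutive windows
# of `length` points, discarding the incomplete tail.
import numpy as np


def segment_fixed(signal, length=10_000):
    n_segments = len(signal) // length
    return signal[: n_segments * length].reshape(n_segments, length)

# 10 s at 200,000 Hz -> 2,000,000 points -> 200 segments of 10,000 points per file
```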
Once this segmentation is done, we can apply the three models mentioned above to separate healthy data from abnormal data, learning only on part of the healthy data.
To visualize the results concisely, we will use an ROC curve. It is a classic tool for visualizing the various false positive rate / true positive rate trade-offs a model can achieve. False positives correspond to false alarms, and the true positive rate corresponds to the anomaly detection rate. For each model, an anomaly detection threshold can be varied to move along the curve.
To read the results for a model, simply fix an acceptable false positive rate (e.g. 0.05 for 5% false positives) and look at the associated detection rate (for example here, 42% for Local Outlier Factor). The higher the detection rate at a given false alarm rate, the better the model.
A value that allows models to be compared globally is the area under the curve (AUC). The closer this value is to 1, the better the model.
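Both quantities are available in sklearn. A small sketch on toy scores (the scores and labels below are made up for illustration): higher score means more anomalous, and the operating point is read off by picking the largest false positive rate not exceeding the chosen budget.

```python
# ROC curve and AUC from anomaly scores with sklearn.
# y_true uses 1 for "anomaly", 0 for "healthy"; higher score = more anomalous.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.7, 0.3, 0.8, 0.9, 0.6, 0.95])

fpr, tpr, _ = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Operating point: detection rate at the largest FPR <= 5%
idx = np.searchsorted(fpr, 0.05, side="right") - 1
detection_rate_at_5pct = tpr[idx]
```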
Without detailing the results, we observe that the best model is the One Class SVM, with an area under the curve of 0.92. Whether such a result is good or not depends entirely on the context in which the model is used and the difficulty of the problem. We are not trying to interpret this value, but simply to use it to easily compare the generated models.
Now that we've established our baseline results, let's see if we can improve them. In our DiagFit software, there is a feature generation module, specially adapted to the cyclic case. However, this tool can be applied to any time series, which is useful in cases where the cyclicality of the data is not obvious. The results obtained by this method are shown in the figure below.
After evaluating the models thus constructed from simple segmentation and cyclic feature generation, we can see that the AUC of the models does not improve. Learning the models on the transformed data therefore did not improve performance.
In fact, the data for this problem is relatively far from an ideal case of cyclic data. Our feature generation tool highlights fine differences in the temporal evolution of the data. In the case of vibrational data, this approach can work, but by nature this data contains a lot of random noise, which can mask finer changes in the temporal evolution of the data.
Moreover, the rotational speed increases in the studied dataset, so each segment built through a regular segmentation in time can contain a variable number of revolutions. Each segment therefore corresponds to a slightly different physical phenomenon, which further distances us from an ideal cyclic case.
Fortunately, it is possible to solve this last problem, by performing a more intelligent segmentation of the data.
Segmentation by cycle
In the previous part, the results were obtained by cutting the data into regular segments over time. This simple method has the disadvantage of creating segments where the number of rotations is not constant.
Fortunately, the data set provides a second piece of information that allows us to construct segments that have the same number of revolutions for the ball bearing studied. Indeed, the second sensor is the rotational speed of the bearing, measured via a tachometer.
From this data, it is possible to make a cut where each segment contains exactly the same number of turns for the bearing (here we choose 1024 turns). The segments are not of a significantly different size: before resampling, the average length of a cycle is 10,092 points (against 10,000 with simple segmentation). The advantage is that each segment obtained via this cut now corresponds to more comparable physical phenomena, which makes it possible to approach the ideal case of cyclic data.
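One way to implement this cycle-synchronized cut is to integrate the measured speed into a cumulative revolution count, cut at every multiple of the chosen number of revolutions, and resample each segment to a common length so the models still receive fixed-size inputs. This is a sketch under assumptions: the speed is taken to be already in revolutions per second (if the dataset stores RPM or raw tachometer pulses, a conversion step is needed first).

```python
# Sketch of cycle-synchronized segmentation. ASSUMES speed_rps is in
# revolutions per second, sampled at the same rate fs as the vibration.
import numpy as np


def segment_by_revolutions(vibration, speed_rps, fs, n_revs=1024, out_len=10_000):
    # cumulative number of revolutions at each sample
    revs = np.cumsum(speed_rps) / fs
    n_cuts = int(revs[-1] // n_revs)            # number of complete blocks of n_revs
    cut_revs = np.arange(n_cuts + 1) * n_revs
    cut_idx = np.searchsorted(revs, cut_revs)   # sample index of each cut (revs is increasing)
    segments = []
    for start, stop in zip(cut_idx[:-1], cut_idx[1:]):
        seg = vibration[start:stop]
        # resample the segment to a fixed length by linear interpolation
        x_old = np.linspace(0.0, 1.0, len(seg))
        x_new = np.linspace(0.0, 1.0, out_len)
        segments.append(np.interp(x_new, x_old, seg))
    return np.array(segments)
```

With a constant speed this reduces to the fixed-length cut; its value appears when the speed varies, since each segment then still covers exactly n_revs revolutions.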
First of all, we can observe that if we directly apply the same 3 methods as in the previous section on these synchronized segments, the gain brought by this segmentation is not obvious:
Indeed, we do not observe any significant change in the AUC of the models. However, by using our tool for generating automatic features in the cyclic case, we observe this time a major gain in the performance of the models, contrary to what had been obtained with a simple segmentation:
Thanks to this segmentation that is regular in the number of revolutions rather than in duration, we obtain much more interesting results. The area under the curve improves greatly for all models. Taking for example a false alarm rate of 5%, the detection rate goes from 42% to 99% for the Local Outlier Factor model. The results for the other models are similar, with a large increase in model quality. Addressing the problem with tools specific to cyclic data therefore yields much better results than the basic approach.
Note: Of course, if the objective was to evaluate performance precisely, this evaluation should be done on the entire dataset, with different files of healthy measurements, and different types of defects. It is possible that we would then see a slight drop in the performance of the models, the problem becoming more complex because of the different contexts and types of defects.
The final word
We saw an example problem where smart data segmentation makes our feature generation software for cyclic problems work better. In this example, by segmenting by number of revolutions rather than by time, the performance of the best model goes from an AUC of 0.87 to an AUC of 0.995.
The improvement in the performance of the models shows that dealing with the problems via algorithms specifically designed for the cyclic case makes it possible to obtain very good performance in certain cases.
In practice, cycles are often directly visible in the data, and cases like this one, where processing is needed to bring out the cyclicity, are rarer.
Other feature generation techniques, and other failure prediction models exist outside of the three models used here. In a future article, we will compare the results obtained using different methods, on several datasets.