Turning limitations into opportunities: unsupervised anomaly detection

Ever wondered how to teach a model to detect anomalies? Spotting rare anomalies with only a handful of examples is a classic challenge, especially in complex systems like Kubernetes.

Now imagine a harder case: anomalies make up less than 0.5% of your dataset, your task is to build a model that detects exactly those 0.5%, and, most importantly, you can’t use the given anomalies to train your ML models.

Breaking the Mold with Artificial Anomalies

Our project faced exactly this dilemma: anomalous data was insufficient, practically absent, so we couldn’t effectively train supervised models. The solution? We decided to break the data ourselves and generate our own anomalies. By simulating various types of anomalies, such as extreme values or unusual patterns in Kubernetes clusters, we could teach our models what to look for.

Training Pipeline

The idea is to train the model on real normal cases plus generated anomalies, then test it on real anomalies and compare performance on the generated and real sets. In the ideal scenario, the two should match. Performance on the generated anomalies may also turn out a bit worse: that would mean your generated anomalies are harder to detect than the real ones, so the model should be more robust in the real world.
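The pipeline above can be sketched as follows. This is a minimal illustration on synthetic data: the metric values, cluster sizes, and the random-forest classifier are all assumptions for the sake of the example, not our production setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Synthetic "normal" behaviour: 4 metrics (e.g. CPU, memory) around a baseline.
normal = rng.normal(loc=0.5, scale=0.1, size=(1000, 4))

# Generated anomalies: extreme spikes injected into copies of normal rows.
generated = normal[:100].copy()
generated[:, 0] += rng.uniform(0.5, 1.0, size=100)

X_train = np.vstack([normal[:800], generated])
y_train = np.array([0] * 800 + [1] * 100)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Held-out "real" anomalies are never used for training, only for evaluation.
real_anomalies = normal[800:900].copy()
real_anomalies[:, 0] += rng.uniform(0.4, 1.2, size=100)
X_test = np.vstack([normal[900:], real_anomalies])
y_test = np.array([0] * 100 + [1] * 100)

pred = clf.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```

Comparing these scores with the model’s scores on a held-out slice of generated anomalies gives exactly the match/mismatch signal described above.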


Crafting Anomalies

The types of anomalies we introduced varied: some were simple, like extreme spikes in usage, while others were complex, like subtle yet unusual resource request sequences that could indicate deeper issues. Here is an approximate list of the anomaly types we investigated and used in our solution:

1. Extreme Value Anomalies: These involve values that are significantly higher or lower than the norm. This type of anomaly is typically used to simulate scenarios where there is a sudden spike or drop in the data metrics, such as CPU usage or memory.

2. Pattern-Based Anomalies: Anomalies that break the normal pattern or sequence of events. These might include unusual sequences of system calls or atypical patterns in network traffic, which can indicate sophisticated attacks or system malfunctions.

3. Contextual Anomalies: Situations where data points are anomalous in a specific context but might not be outliers in a different setting. For example, high memory usage might be normal during the day but could be considered anomalous if occurring at night.

4. Collective Anomalies: These occur when a collection of related data points is anomalous compared to the entire dataset. This could be seen in clusters of similar events that are unusual, like repeated login failures from the same IP address within a short timeframe.
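The four anomaly types above can each be implemented as a small injection function over a metric time series. The helpers below are illustrative sketches on synthetic data, not our exact generators; the function names and parameter values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extreme_value(series, magnitude=5.0):
    """Extreme value: spike one random point far above the typical range."""
    out = series.copy()
    i = rng.integers(len(out))
    out[i] += magnitude * out.std()
    return out

def pattern_break(series, start, length):
    """Pattern-based: shuffle a window to break the normal temporal order."""
    out = series.copy()
    window = out[start:start + length]
    rng.shuffle(window)  # shuffles the view in place
    return out

def contextual(series, night_mask, day_level):
    """Contextual: place a normal daytime level at night-time positions."""
    out = series.copy()
    out[night_mask] = day_level
    return out

def collective(series, start, length, level):
    """Collective: a run of individually plausible values, unusual as a group."""
    out = series.copy()
    out[start:start + length] = level
    return out

# Example: one day of a periodic metric sampled every 5 minutes.
series = np.sin(np.linspace(0, 2 * np.pi, 288)) + 2.0
anomalous = extreme_value(series)
```

Each helper returns a corrupted copy, so normal series and their anomalous counterparts can be paired up as labeled training examples.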

The main point was to make the generated anomalies resemble real anomalies as closely as possible. We also highly recommend using t-SNE, PCA, or other dimensionality reduction and visualization techniques to see how linearly or non-linearly separable the normal data, real anomalies, and generated anomalies are.

Gaussian Predictors

Alongside traditional machine learning models, we explored a Gaussian predictor. This simple statistical model, surprisingly effective, uses basic assumptions about data normality to identify outliers. While it wasn’t the final solution due to its lower precision, it proved invaluable for quick scans and preliminary assessments. We strongly suggest trying simple models first, as their performance can give hints for training more complex ML algorithms. A model as simple as a Gaussian predictor may even outperform the ML algorithms and end up being your final solution.
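A minimal sketch of such a predictor, assuming independent per-feature Gaussians fitted on normal data only (the class name, threshold choice, and data are all illustrative):

```python
import numpy as np

class GaussianPredictor:
    """Fit per-feature Gaussians on normal data; flag low-likelihood points."""

    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0) + 1e-9  # avoid division by zero
        return self

    def score(self, X):
        # Log-density under independent per-feature Gaussians.
        z = (X - self.mu) / self.sigma
        return -0.5 * (z ** 2 + np.log(2 * np.pi * self.sigma ** 2)).sum(axis=1)

    def predict(self, X, threshold):
        # 1 = anomaly (log-density below threshold), 0 = normal.
        return (self.score(X) < threshold).astype(int)

rng = np.random.default_rng(7)
normal = rng.normal(0.5, 0.1, size=(1000, 4))
model = GaussianPredictor().fit(normal)

# One possible threshold: the 1st percentile of scores on normal data.
threshold = np.percentile(model.score(normal), 1)

spike = normal[:20] + np.array([1.0, 0.0, 0.0, 0.0])  # extreme-value anomaly
print(model.predict(spike, threshold).mean())
```

Because it only needs means, standard deviations, and a percentile, this model fits in milliseconds and makes a useful baseline before reaching for heavier algorithms.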

Conclusion

This project illustrated that sometimes, the key to solving complex problems in data science is as simple as rethinking the available resources — turning limitations into opportunities for innovation. By creatively generating our own anomalies, we were able to train supervised anomaly detection models without real labels for training, achieving relatively high precision and recall. This approach not only solved our initial data scarcity problem but also provided a scalable method to train models in different scenarios lacking specific examples of anomalies.

Need some help with your model? Looking for a custom solution for your enterprise? Contact our team to explore the opportunities together.
