Anomaly Detection Algorithms in Python

What are Anomalies?

Anomalies are data points that stand out from the rest of a data set because they do not follow its normal behaviour patterns. They arise when data points or their features fall outside the pattern that a model expects.

Classification of Anomalies:

  • Change in events: A sudden shift away from the previous normal behaviour.
  • Drifts: A slow, unidirectional change in the data over a long period.
  • Outliers: A short-lived anomalous pattern that appears as non-systematic behaviour in the data set.

Anomaly Detection in Python

Anomaly detection is a data processing technique, most often unsupervised, that identifies anomalies in a data set. It is used for tasks such as detecting fraudulent transactions and diseases. It also suits problems with a high class imbalance. Several techniques are available for detecting anomalies, and they help in building robust data science models.

Anomaly detection is the process of identifying anomalies and then filtering or transforming them in the analysis pipeline. There are different ways to detect anomalies in Python. We can train machine learning models to detect anomalies in real time. Anomalies can also be detected with statistical methods based on the mean, median, and quantiles. Data visualisation and exploratory data analysis techniques can be used as well.
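As an illustration of the quantile-based approach, here is a minimal sketch that flags values outside the interquartile range (IQR) fences. The helper name iqr_outliers, the multiplier k=1.5, and the sample readings are assumptions chosen for the example, not part of any library.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sensor readings with one obvious spike.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
mask = iqr_outliers(readings)
print(np.asarray(readings)[mask])  # the values flagged as anomalies
```

The multiplier k controls how strict the rule is: a larger value flags fewer points.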

What is the need for Anomaly Detection?

Anomaly detection is mainly used for spam detection, fraudulent transaction detection, and similar tasks. In practice, it can be used for classification tasks, especially when the training data has a high class imbalance. It can also be used to predict equipment failure, detect IT failures and DDoS attacks, and manage cloud costs.

Anomaly detection is also used in cybersecurity, where it can evaluate huge data streams to detect unusual traffic patterns, changes in access requests, etc. It underpins many security applications and services, such as intrusion detection systems, firewalls, and other security tools.

Algorithms and Methods Used for Anomaly Detection

There are different anomaly detection algorithms, including supervised, unsupervised, and semi-supervised algorithms:

  • Supervised Algorithms: These algorithms are trained on labelled data. They detect anomalous data points based on their deviation from the normal data.
  • Unsupervised Algorithms: These algorithms are trained on unlabelled data. They use statistical techniques to detect anomalous data points based on their deviation from the data distribution.
  • Semi-supervised Algorithms: These algorithms combine supervised and unsupervised learning techniques. They are trained on both labelled and unlabelled data.

Algorithms in Supervised Anomaly Detection

  • Support Vector Machines
  • K-Nearest Neighbors (k-NN)
  • Random Forests

Algorithms in Unsupervised Anomaly Detection

  • Artificial Neural Networks (ANNs)
  • Isolation Forest
  • Gaussian Mixture Models
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Algorithms in Semi-Supervised Anomaly Detection

  • Transfer learning
  • Pre-trained Models

An unsupervised anomaly detection model first establishes a baseline distribution, or outline, of the data, for example by calculating the distances between points, and then flags the points that deviate from it. Unsupervised methods work with unlabelled data sets, which avoids manually labelling huge amounts of training data.
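One simple way to turn distances between points into an anomaly score is the mean distance to the k nearest neighbours. The sketch below is only an illustration of that idea. The helper name knn_distance_scores, the choice k=3, and the synthetic data are assumptions.

```python
import numpy as np

def knn_distance_scores(X, k=3):
    """Score each row by its mean distance to its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Full pairwise distance matrix; acceptable for small data sets only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # After sorting each row, column 0 is the zero distance to the point itself.
    nearest = np.sort(dists, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)

rng = np.random.default_rng(42)
# 100 unlabelled "normal" points plus one point placed far away from them.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), [[8.0, 8.0]]])
scores = knn_distance_scores(X, k=3)
most_anomalous = int(np.argmax(scores))
print(most_anomalous, scores[most_anomalous])
```

No labels are used anywhere: the score depends only on how isolated a point is from its neighbours.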

Artificial neural networks (ANNs) are a widely used option among unsupervised anomaly detection techniques. We can train an ANN on a huge unlabelled data set to learn complex patterns and then use it to classify anomalies. ANNs can be used, for example, to detect anomalies in images.

Density-based spatial clustering of applications with noise (DBSCAN) is another unsupervised anomaly detection technique. It builds clusters from continuous regions of high point density and, in an unsupervised manner, treats the points that fall outside every cluster as outliers. It can, however, learn incorrect patterns and overfit the trend in the data if its parameters are poorly chosen.

The supervised anomaly detection technique needs data labelled as normal or anomalous. Training in these cases is more difficult because the task becomes classification on imbalanced data. The model is then optimised to classify the data rather than to find the abnormal points correctly: a classifier can reach high accuracy (98%, say) simply by assigning almost every point to the normal class, while still missing the anomalies. For this reason, supervised techniques are generally less suitable for detecting anomalies, as they may not separate the anomalies from the rest of the data set.
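The 98% figure can be reproduced with a toy calculation. The sketch below assumes, purely for illustration, a data set in which 2% of the points are anomalous and a trivial classifier that predicts "normal" for everything.

```python
import numpy as np

# Assumed toy labels: 98 normal points (0) and 2 anomalies (1).
y_true = np.array([0] * 98 + [1] * 2)

# A trivial "classifier" that labels every point as normal.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the anomaly class: the fraction of true anomalies that were found.
recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f}, anomaly recall={recall:.2f}")
```

Accuracy looks excellent even though no anomaly is detected, which is why metrics such as recall or precision on the anomaly class are more informative for imbalanced data.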

A semi-supervised algorithm for detecting anomalies uses both unlabelled and labelled data. First, it is trained on an unlabelled data set. Then, the trained model is fine-tuned on labelled anomalous data so that it can detect anomalies in the data distribution.

Let's understand a few of the algorithms in detail.

1. DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm used to cluster the normal data and detect outliers in an unsupervised manner. Clusters are formed from continuous regions of high point density. Outliers are easy to spot with this density-based approach because they are the points left without any cluster.

Not every point is assigned to a cluster. DBSCAN uses two parameters: minPts (the minimum number of data points needed to form a cluster) and eps (the maximum distance between two points for them to be treated as neighbours in the same cluster).

The algorithm starts from a randomly chosen unvisited point and recursively looks for the points within the eps distance. Points within the eps distance of the current point are treated as part of the same cluster. When at least minPts points are found, they form a cluster. When no more points can be added to the cluster, the algorithm picks another random unvisited point, and it repeats this until no point is left to be visited.

Implementation:

The DBSCAN algorithm randomly chooses a point and moves outward from it recursively. It checks for points that fall within the eps distance of their nearest neighbour. These points are assigned to the same cluster, and the algorithm repeats these steps to form the different clusters. Once all points have been covered, the points that do not belong to any cluster are defined as outliers.
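The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the sklearn implementation: the function name dbscan_sketch is hypothetical, points are visited in index order rather than at random, and the recursion is replaced by an explicit queue. A point counts as its own neighbour when comparing against min_pts.

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Return one label per row of X: a cluster id (0, 1, ...) or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise distances; O(n^2) memory, so suitable for small data only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_pts:
            continue  # not dense enough; stays noise unless a cluster claims it later
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # point within eps of a dense point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # dense point: keep expanding the cluster
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
# Two tight synthetic blobs and one isolated point.
X = np.vstack([
    rng.normal(0.0, 0.2, (30, 2)),
    rng.normal(5.0, 0.2, (30, 2)),
    [[2.5, 10.0]],
])
labels = dbscan_sketch(X, eps=0.8, min_pts=5)
print(labels)
```

The isolated point never gathers min_pts neighbours and is never within eps of a dense point, so it keeps the label -1.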

For anomaly detection, the data set must be clean, as distance is the most important aspect of the algorithm. In Python, the sklearn package implements the algorithm as the DBSCAN class in the sklearn.cluster module.

After the model is fitted, the cluster label of each sample is stored in the labels_ attribute, where outliers are labelled -1. The indices of the core samples are stored in the core_sample_indices_ attribute. Together, these attributes can be used to retrieve the clustered points and the outliers in the fitted data.
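A minimal usage sketch with sklearn follows. The synthetic data and the values eps=0.8 and min_samples=5 are assumptions picked for this example; suitable values depend on the scale and density of the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense synthetic blobs plus two isolated points (rows 100 and 101).
X = np.vstack([
    rng.normal(0.0, 0.2, (50, 2)),
    rng.normal(5.0, 0.2, (50, 2)),
    [[2.5, 10.0], [-4.0, 6.0]],
])

model = DBSCAN(eps=0.8, min_samples=5).fit(X)

labels = model.labels_                     # cluster id per sample, -1 for outliers
outlier_idx = np.flatnonzero(labels == -1)
core_idx = model.core_sample_indices_      # indices of the core samples
n_clusters = len(set(labels.tolist()) - {-1})

print("clusters:", n_clusters)
print("outlier indices:", outlier_idx)
```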

2. Support Vector Machine

It is a supervised machine learning model used for classification. It classifies the data by constructing hyperplanes after projecting the data into an alternate vector space. For anomaly detection, the SVM algorithm works on two classes and trains a model that maximises the gap between the data groups in the vector space. Multi-class methods can also be used with SVM.

The points that fall outside the learned region are the anomalies. The one-class SVM algorithm is widely used to detect anomalies in a distribution. It checks whether a data point belongs to the normal class or not. It uses non-linear functions that project the training data into a higher-dimensional space, where a hyperplane can separate the vectors.

In Python, the sklearn package implements this algorithm as the OneClassSVM class in the sklearn.svm module.
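The following is a minimal sketch of one-class SVM usage. The synthetic training data, the RBF kernel, and nu=0.05 (roughly, an upper bound on the fraction of training points treated as outliers) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Train on synthetic data assumed to be mostly normal.
X_train = rng.normal(0.0, 1.0, (200, 2))

# One new point near the centre of the training data and one far away from it.
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

pred = model.predict(X_new)                # +1 for inliers, -1 for outliers
scores = model.decision_function(X_new)    # negative values lie outside the learned region

print(pred, scores)
```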

3. Isolation Forest Model

This anomaly detection algorithm uses a tree-based approach to isolate anomalies after training on the data in an unsupervised manner. Like a random forest, it builds randomly initialised decision trees and splits the nodes into different branches. Because anomalies differ from the bulk of the data, they tend to be isolated after fewer splits than normal points, so the depth at which a point is isolated indicates how anomalous it is.

It isolates anomalies from the data set across all decision trees. In Python, the sklearn package implements this algorithm as the IsolationForest class in the sklearn.ensemble module. Its parameters include n_estimators, the number of trees; max_samples, the number of samples used to build each tree; and the vital contamination factor, the expected proportion of abnormal data in the training data. Choosing a suitable contamination value is difficult because the true proportion of anomalies is usually unknown. This is a major drawback of this algorithm.
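A minimal sketch of these parameters in use is shown below. The synthetic data and the value contamination=0.01 are assumptions for the example; in practice the contamination value has to be estimated or tuned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 300 synthetic normal points plus two far-away points (rows 300 and 301).
X = np.vstack([
    rng.normal(0.0, 1.0, (300, 2)),
    [[7.0, 7.0], [-8.0, 6.0]],
])

model = IsolationForest(
    n_estimators=100,      # number of trees
    max_samples="auto",    # samples drawn to build each tree
    contamination=0.01,    # assumed proportion of anomalies
    random_state=0,
).fit(X)

pred = model.predict(X)            # +1 for inliers, -1 for anomalies
scores = model.score_samples(X)    # lower scores are more anomalous

print("flagged indices:", np.flatnonzero(pred == -1))
```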

4. Anomaly Detection using Autoencoders

Autoencoders use a semi-supervised method for detecting anomalies in the data. Autoencoders have many parameters to train, so they need a lot of data for tuning. We first pre-train the autoencoder so that the model learns the structure of the data. Then, the model is fine-tuned using labelled data so that it can detect anomalies or abnormalities in the data.

During training, the autoencoder encodes the input data into a low-dimensional space. It extracts the essential features from the data and learns how to reconstruct it (decode the data with minimal error). An autoencoder contains two parts, an encoder and a decoder, which reduce the dimensions and reconstruct the encoded data, respectively.
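The encode-and-reconstruct idea can be sketched with a very small network. The example below is a stand-in rather than a full autoencoder pipeline: it uses sklearn's MLPRegressor trained to reproduce its own input through a two-unit bottleneck, covers only the unsupervised reconstruction step (no fine-tuning on labelled data), and uses synthetic data. The 99th-percentile threshold is likewise an assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic normal data: 6 features generated from 2 hidden factors plus small noise.
latent = rng.normal(0.0, 1.0, (500, 2))
mixing = rng.normal(0.0, 1.0, (2, 6))
X_normal = latent @ mixing + rng.normal(0.0, 0.05, (500, 6))
# Synthetic anomalies that do not follow the 2-factor structure.
X_anomalous = rng.normal(0.0, 3.0, (5, 6))

# Input -> 2-unit bottleneck (encoder) -> 6 outputs (decoder), trained so output matches input.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,),
    activation="tanh",
    solver="lbfgs",
    max_iter=1000,
    random_state=0,
)
autoencoder.fit(X_normal, X_normal)

def reconstruction_error(model, X):
    """Mean squared reconstruction error per sample."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)

errors_normal = reconstruction_error(autoencoder, X_normal)
errors_anomalous = reconstruction_error(autoencoder, X_anomalous)

# Flag samples whose error exceeds the 99th percentile of the training errors.
threshold = np.percentile(errors_normal, 99)
flags = errors_anomalous > threshold
print(flags)
```

Points that the model reconstructs poorly are treated as anomalous, because the bottleneck has only learned to represent the structure of the normal data.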