Isolation Forest

Isolation Forest is an anomaly detection algorithm designed to locate outliers or abnormalities in a dataset. Unlike earlier methods that build a profile of normal data points and flag whatever deviates from it, Isolation Forest identifies anomalies directly. Its core premise is that anomalies are usually rare and distinct from regular instances, which makes them easier to separate.

The Isolation Forest workflow builds an ensemble of isolation trees; each tree is constructed by repeatedly choosing a random feature and a random split value until every point is isolated in its own leaf node. Anomalies are expected to require fewer partitions to isolate than typical instances, so they can be identified by their lower average path length across all trees.
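To make the idea concrete, here is a minimal sketch using scikit-learn's IsolationForest on a tiny made-up 1-D dataset (the values are illustrative, not from this article):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Five typical values and one obvious outlier (illustrative data).
X = np.array([[10.0], [11.0], [10.5], [9.8], [10.2], [95.0]])

clf = IsolationForest(random_state=42).fit(X)
print(clf.predict(X))            # -1 marks predicted outliers, 1 marks inliers
print(clf.decision_function(X))  # the isolated point receives the lowest score
```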

Code:

Now, we will try to find and eliminate outliers (anomalies) elegantly with the help of Isolation Forest.

Importing Libraries
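The import cell is not shown in the extract; a typical set of imports for this walkthrough would be:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
```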

Reading the Dataset
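Assuming the housing price data is available locally as train.csv (the file name is an assumption), it can be loaded with pandas:

```python
# Load the housing price dataset; the file name is an assumption.
df = pd.read_csv("train.csv")
print(df.head())
```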

Output:

[Output image: the first rows of the dataset]

The next step is some minor pre-processing before we can run IsolationForest. We remove columns with a high number of NaNs (>1000) and fill in missing values for all features. Finally, we validate that there is no missing data.
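A sketch of this step; the median/mode imputation strategy is an assumption, since the article only says that missing values are filled in:

```python
# Drop columns with more than 1000 missing values.
nan_counts = df.isnull().sum()
df = df.drop(columns=nan_counts[nan_counts > 1000].index)

# Fill remaining missing values: median for numeric columns, mode otherwise
# (the exact imputation strategy is an assumption).
for col in df.columns:
    if df[col].dtype in ("int64", "float64"):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Validate that no missing data remains.
print(df.isnull().sum().sum())  # prints 0 if nothing is missing
```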

Output:

[Output image: confirmation that no missing values remain]

Isolation Algorithm

The Isolation Forest (IF) approach performs best when trees are built from subsamples of the dataset rather than the complete dataset. This is quite different from most other strategies, which tend to gain accuracy from more data. Sub-sampling helps here because, on the full dataset, normal instances lying close to outliers can disrupt the isolation process (the swamping and masking effects). In this example, we set max_samples=100, so each base estimator (isolation tree) is trained on a random subset of 100 samples; we fit a separate forest for each feature of interest.

  • fit: fits the base estimators on the given feature, building each tree from a max_samples-sized subset.
  • predict: returns -1 if the observation is an outlier and 1 otherwise.
  • decision_function: computes the outlier score of each observation using the fitted model.
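Putting these together for a single feature, a sketch (the column name LotArea is taken from the results discussed below):

```python
# Fit a separate IsolationForest on one feature, as described above.
X = df[["LotArea"]]  # scikit-learn expects 2-D input

clf = IsolationForest(max_samples=100, random_state=42)
clf.fit(X)                         # build each tree from 100-sample subsets
labels = clf.predict(X)            # -1 for outliers, 1 for inliers
scores = clf.decision_function(X)  # outlier score per observation
```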

The stats dataframe contains the original sample values, their outlier scores, whether IF flags them as outliers, and basic feature statistics such as the min, max, and median values.
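A sketch of how such a dataframe could be assembled; the helper function and column names below are assumptions modeled on that description:

```python
def feature_stats(df, feature_name, max_samples=100):
    """Fit IF on one feature and collect values, scores, and outlier flags."""
    X = df[[feature_name]]
    clf = IsolationForest(max_samples=max_samples, random_state=42).fit(X)
    stats = pd.DataFrame({
        "value": df[feature_name],
        "score": clf.decision_function(X),
        "outlier": clf.predict(X),  # -1 = outlier, 1 = inlier
        "min": df[feature_name].min(),
        "max": df[feature_name].max(),
        "median": df[feature_name].median(),
    })
    return stats.sort_values("score")

print(feature_stats(df, "LotArea").head())
```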


Output:

[Output images: per-feature sample values, outlier scores, and summary statistics]

Let's have a look at the results.

  • LotArea has four major outliers (label -1) with outlier scores of around -0.33 and values above 100,000, far from the feature's average of 10,168. LotArea ranges from 1,300 to 215,245, so reducing the influence these four observations have on the variance of this feature may aid our modeling approach later on.
  • YearBuilt is less varied than LotArea, with a most extreme outlier score of around -0.25, which shows that its values are not too distant from the mean. The feature's lowest values (~1880) were identified as outliers by IF.
  • BsmtUnfSF resembles YearBuilt but has a significantly larger variance.
  • GarageYrBlt plainly contains outliers at a value of 0 according to IF, which is understandable: these observations correspond to houses without a garage. Since the majority of residences do have a garage year, the zero values depart significantly from the average.

Next, we use pandas' clipping capability to cap outliers at chosen thresholds. Clipping sets a minimum and a maximum value for a given feature: observations below the minimum are assigned the minimum, and observations above the maximum are assigned the maximum. The thresholds are only examples; you can change them as you see fit.
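A sketch using pandas' clip; the thresholds below are illustrative placeholders, not the article's exact settings:

```python
# Cap features at illustrative minimum/maximum thresholds (assumed values).
df["LotArea"] = df["LotArea"].clip(lower=1300, upper=50000)
df["GarageYrBlt"] = df["GarageYrBlt"].clip(lower=1900)
```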

We now retrain the IsolationForest model to check whether clipping the values improves the outlier scores. Notice how the scores produced by IsolationForest for the example features are less extreme after clipping.
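A sketch of the retraining step, reusing the per-feature fit from above:

```python
# Refit IsolationForest on the clipped features and inspect the new scores.
for feature_name in ["LotArea", "GarageYrBlt"]:
    X = df[[feature_name]]
    clf = IsolationForest(max_samples=100, random_state=42).fit(X)
    print(feature_name, "lowest score after clipping:",
          clf.decision_function(X).min())
```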

Output:

[Output images: outlier scores for the example features after clipping]

We demonstrated how Isolation Forest can be used to detect outliers in a dataset, using the housing price dataset as an example. Isolation Forest thrives on subsampled data and does not require building its trees from the complete dataset. The technique runs very quickly since it does not rely on computationally expensive operations such as distance or density calculation; the training step has linear time complexity with a low constant, making it well suited to large-scale data processing applications.

