Isolation Forest

Isolation Forest is an anomaly detection approach designed to locate outliers or abnormalities in a dataset. Unlike earlier approaches that profile normal data points, Isolation Forest identifies anomalies directly. Its core premise is that anomalies are usually rare and distinct from regular instances, which makes them easier to separate. The algorithm builds a set of isolation trees, each constructed by randomly choosing a feature and a split value and partitioning the data until every point is isolated in its own leaf node. Anomalies require fewer partitions to isolate than typical instances, so they can be identified by their shorter average path lengths across all the trees.

Code:

Now, we will try to find and eliminate outliers (anomalies) with the help of Isolation Forest.

Importing Libraries

Reading the Dataset

Output:

The next step is some minor pre-processing before we can run IsolationForest. We remove columns with a high number of NaNs (more than 1000) and fill in the missing values for all remaining features. Finally, we validate that no missing data is left.

Output:

Isolation Algorithm

The Isolation Forest (IF) approach performs best when trees are grown from subsamples of the dataset rather than the complete dataset. This is quite different from virtually all other strategies, which rely on data and generally become more accurate with more of it. Sub-sampling works well in this method because regular instances that lie close to outliers can disrupt the isolation process. In this example, we set max_samples=100, so each base estimator (each isolation tree) is trained on a random subsample of 100 rows.
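The tutorial's original code listing is not reproduced here, so the steps above can be sketched as follows. The column names and the synthetic data are illustrative stand-ins for the housing-price CSV the tutorial reads; adapt the NaN threshold and fill strategy to your own data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the tutorial's housing-price dataset
# (hypothetical column names; the original CSV is not shown).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.normal(10_000, 2_000, 1500),
    "SalePrice": rng.normal(180_000, 40_000, 1500),
    "GarageYrBlt": rng.integers(1950, 2010, 1500).astype(float),
    "MostlyMissing": np.nan,  # a column with far more than 1000 NaNs
})
df.loc[rng.choice(1500, 50, replace=False), "GarageYrBlt"] = np.nan

# Remove columns with a high number of NaNs (>1000).
df = df.loc[:, df.isna().sum() <= 1000]

# Fill the remaining gaps (here: with each column's median) and
# validate that there is no missing data left.
df = df.fillna(df.median(numeric_only=True))
assert df.isna().sum().sum() == 0

# Fit Isolation Forest: each of the trees is grown on a random
# subsample of max_samples=100 rows, not on the full dataset.
iso = IsolationForest(n_estimators=100, max_samples=100, random_state=42)
iso.fit(df)
```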
The stats dataframe contains the original sample values, their anomaly scores, whether IF considers them outliers, and some basic feature statistics such as the minimum, maximum, and median values.

Output:

Let's have a look at the results.
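The construction of the stats dataframe is not shown in this excerpt; a minimal sketch, using a made-up single feature with two planted outliers rather than the tutorial's data, could look like this. Note that scikit-learn's decision_function returns lower scores for more anomalous points, and predict returns -1 for outliers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative feature: normal values plus two planted outliers.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100, 10, 200), [250.0, -40.0]])
X = values.reshape(-1, 1)

iso = IsolationForest(max_samples=100, random_state=42).fit(X)

# One row per sample: value, score, outlier flag, basic statistics.
stats = pd.DataFrame({
    "value": values,
    "score": iso.decision_function(X),  # lower = more anomalous
    "outlier": iso.predict(X) == -1,    # predict() returns -1 for outliers
    "min": values.min(),
    "max": values.max(),
    "median": np.median(values),
})
print(stats.sort_values("score").head())
```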
Next, we utilize pandas' clip() method to cap outliers at chosen bounds. It operates by setting a minimum and maximum value for a particular feature: all observations below the minimum are assigned the minimum, and all observations above the maximum are assigned the maximum. These bounds are simply examples; you can change them as you see fit. We then retrain the IsolationForest classifier to test whether clipping the values improves the outlier ratings. Notice how the outlier scores that IsolationForest produces for the example features decrease (the points look less anomalous) after we clip them.

Output:

We demonstrated how Isolation Forest may be used to detect outliers in a dataset, using a housing price dataset as an example. Isolation Forest thrives on subsampled data and does not require building each tree from the complete dataset. The technique also runs very quickly, since it does not rely on computationally expensive operations such as distance or density calculations. The training step has linear time complexity with a low constant, making it suitable for large-scale data processing applications.
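As a self-contained sketch of the clipping step, the example below uses a hypothetical LotArea feature with two extreme values; the bounds 4,000 and 20,000 are arbitrary illustrations, not the tutorial's settings. Because decision_function assigns lower scores to more anomalous points, the clipped extremes come back with higher (less anomalous) scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical feature with two extreme values appended at the end.
rng = np.random.default_rng(0)
df = pd.DataFrame({"LotArea": np.concatenate(
    [rng.normal(10_000, 2_000, 500), [60_000.0, 90_000.0]])})

before = IsolationForest(max_samples=100, random_state=42) \
    .fit(df).decision_function(df)

# Clip: values below 4,000 become 4,000; values above 20,000 become 20,000.
df_clipped = df.clip(lower=4_000, upper=20_000)

after = IsolationForest(max_samples=100, random_state=42) \
    .fit(df_clipped).decision_function(df_clipped)

# The extreme rows now sit at the clip boundary, so their
# decision_function scores rise (they look less anomalous).
print(before[-2:], after[-2:])
```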