How to use Isolation Forest

I am trying to detect outliers in my dataset and found sklearn's Isolation Forest. I can't understand how to work with it. I fit it on my training data, and it returns a vector of -1 and 1 values.

Can someone explain to me how this works and give an example?

How can I know that the outliers are “real” outliers?

Tuning parameters?

Here is my code:

    from sklearn.ensemble import IsolationForest

    clf = IsolationForest(max_samples=10000, random_state=10)
    clf.fit(x_train)
    y_pred_train = clf.predict(x_train)  # 1 = inlier, -1 = outlier
    y_pred_test = clf.predict(x_test)

Output:

    [1 1 1 ..., -1 1 1]
2 answers

You seem to have many questions; let me try to answer them one by one, to the best of my knowledge.

How does it work? → It relies on the fact that outliers in any dataset are “few and different”, which is quite unlike a typical clustering-based or distance-based algorithm. At a high level, the logic is that outliers take fewer steps to “isolate” than a “normal” point in any dataset. To do this, IF works as follows. Suppose you have a training dataset X with n data points, each having m features. During training, IF builds isolation trees (binary trees) on different features.

For training, you have 3 parameters to tune: the number of isolation trees ('n_estimators' in sklearn's IsolationForest), the number of samples ('max_samples'), and the number of features to draw from X to train each base estimator ('max_features'). 'max_samples' is the number of random samples it draws from the original dataset to build each isolation tree.
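To make the “fewer steps to isolate” idea concrete, here is a minimal sketch in plain Python. It is not sklearn's implementation: it uses a single feature, and the helper names `isolation_depth` and `average_depth` as well as the toy data are made up for illustration. It counts how many random splits are needed before a point is alone in its partition, averaged over many random "trees".

```python
import random

def isolation_depth(points, target, rng, depth=0):
    """Count random splits until `target` is alone in its partition."""
    if len(points) <= 1:
        return depth
    lo, hi = min(points), max(points)
    if lo == hi:  # only duplicates left, cannot split further
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the points on the same side of the split as the target.
    same_side = [p for p in points if (p < split) == (target < split)]
    return isolation_depth(same_side, target, rng, depth + 1)

def average_depth(points, target, n_trees=200, seed=0):
    """Average isolation depth over `n_trees` independent random runs."""
    rng = random.Random(seed)
    total = sum(isolation_depth(points, target, rng) for _ in range(n_trees))
    return total / n_trees

# Toy 1-D data: a tight group around 10 and one far-away value.
data = [9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 9.7, 10.05, 9.95, 50.0]
avg_outlier = average_depth(data, 50.0)
avg_inlier = average_depth(data, 10.0)
print("average depth, outlier:", avg_outlier)
print("average depth, inlier: ", avg_inlier)
```

The far-away value is usually cut off by the very first split, while a value in the middle of the group needs several splits, so its average depth is larger.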

During the testing phase, it finds the path length of the test point in each trained isolation tree and computes the average path length. The longer the path, the more normal the point, and vice versa. Based on the average path length, it calculates an anomaly score, which you can get from decision_function of sklearn's IsolationForest. For sklearn's IsolationForest, a lower score means a more abnormal sample. Based on the anomaly score, you can decide whether a given sample is anomalous by setting a suitable contamination value on the IsolationForest object. Contamination is the proportion of outliers in the dataset, and it determines the threshold. The default contamination value is 0.1 in older scikit-learn releases ('auto' since version 0.22).
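As a rough illustration of the scoring step, the sketch below fits an IsolationForest on synthetic data and compares decision_function scores. The data (a dense cluster plus five far-away points) and the choice of contamination equal to the known share of far points are assumptions made for this example only; with real data that share is usually not known exactly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(10)
x_normal = 0.3 * rng.randn(500, 2)               # dense cluster near the origin
x_far = rng.uniform(low=6, high=8, size=(5, 2))  # a few points far from the cluster
x_train = np.vstack([x_normal, x_far])

clf = IsolationForest(
    n_estimators=100,       # number of isolation trees
    max_samples=256,        # samples drawn to build each tree
    max_features=1.0,       # fraction of features used by each tree
    contamination=5 / 505,  # assumed share of outliers in this toy data
    random_state=10,
)
clf.fit(x_train)

scores = clf.decision_function(x_train)  # lower = more abnormal
labels = clf.predict(x_train)            # -1 = outlier, 1 = inlier

print("median score, cluster:", np.median(scores[:500]))
print("scores, far points:   ", scores[500:])
print("points flagged as -1: ", int((labels == -1).sum()))
```

The far points get clearly lower scores than the bulk of the cluster, and predict labels a point -1 when its decision_function value falls below zero.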

Training parameters → 1. n_estimators, 2. max_samples, 3. max_features. Testing → 1. contamination

-1 represents the outliers (according to the fitted model). See the IsolationForest example in the scikit-learn documentation for a good depiction of the process. If you have some prior knowledge, you can provide more parameters to get a more accurate fit. For example, if you know the contamination (the proportion of outliers in the dataset), you can provide it as an input. The default is 0.1 in older scikit-learn releases ('auto' since version 0.22). See the description of the parameters in the IsolationForest documentation.
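To see what providing the contamination does in practice, here is a small sketch on synthetic Gaussian data (the data and the two contamination values are arbitrary choices for illustration). The same data is fitted twice, and only the number of points labelled -1 changes.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)  # synthetic data with no planted outliers

flagged = {}
for contamination in (0.01, 0.1):
    clf = IsolationForest(contamination=contamination, random_state=0).fit(X)
    # predict returns -1 for points scored below the contamination threshold
    flagged[contamination] = int((clf.predict(X) == -1).sum())

print(flagged)
```

Roughly the requested share of the training points is labelled -1 in each case, so the contamination value controls how many points are reported, not whether they are “real” outliers. Inspecting the lowest-scoring points by hand remains the way to judge that.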

Source: https://habr.com/ru/post/1265997/