Machine Learning Philosophy: Applying Models to Biased Data

I have a machine learning problem and I don't know whether it has a theoretical solution.

I have labeled data (call it dataset D1) with which I built a random forest classification model, and it works well.

Now my main interest is to apply this model to another dataset, D2, which has no labels at all, meaning I cannot use it for training. The only way to measure performance on D2 is to check the proportions of the classes predicted from it.

The problem: D2 is skewed compared to D1 (the features do not have the same means or follow the same distribution). Because of this, the model applied to D2 gives results that are heavily distorted towards one class. I know this is normal, because most of D2 looks like a small subset of D1.

But is there a way to correct this skew? I know, from the nature of my problem, that the proportions of the predicted classes should be less biased. I tried normalization, but it doesn't really help.

I feel like I'm not thinking about this straight :3

+5
2 answers

The issue here is dataset shift. My answer to this consists of three parts.

Disclaimer: There is no free lunch, so you can never be sure without checking performance against real test-set labels. In the worst case you have concept drift in your problem, which makes it impossible to predict the target class. However, there are approaches that can give good results.

Notation:

Features are denoted by X, the target variable by Y, and the trained classifier by f(X) |-> Y. The distribution of X in D1 is written P(X|D1) (a slight abuse of notation).

Class distribution in the test set

You suggested using the distribution of the predicted variable ("check the proportions of the classes predicted from it"). This, however, can only serve as an indicator. I build classifiers in industry to predict whether a machine will fail (predictive maintenance). There are plenty of engineers trying to skew my data: they keep making the machines that produce the data more reliable, so that one class essentially disappears over time. The classifiers, however, remain valid.

There is a very simple answer to the question of how to "fix" the distribution of the target labels in the test set. The idea is to classify all test instances, take the predicted labels, and then sample data points (with replacement) according to the desired distribution of the target variable. You could then also inspect the distribution of the features X, but that will not tell you too much.
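
A minimal sketch of that resampling idea could look like the following; the function name, pred (predicted labels for D2) and target_dist (the desired class proportions) are all assumptions for illustration, not anything from the original post:

    # Hypothetical inputs: d2 is the unlabeled data, pred is a factor of
    # predicted labels for d2, target_dist is the desired class distribution,
    # e.g. c(classA = 0.5, classB = 0.5).
    resample_to_distribution <- function(d2, pred, target_dist, n = nrow(d2)) {
      idx <- unlist(lapply(names(target_dist), function(cl) {
        pool <- which(pred == cl)                    # rows predicted as this class
        if (length(pool) == 0) return(integer(0))
        size <- round(n * target_dist[[cl]])         # how many of them we want
        pool[sample(length(pool), size, replace = TRUE)]
      }))
      d2[idx, , drop = FALSE]
    }

Here pred would come from something like predict(model, D2), and target_dist from your domain knowledge about what the class proportions should be.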

Can the skew itself be a problem? It can, because a classifier is usually trained to minimize an accuracy measure such as F1 or some other statistical criterion. If you know the distribution of D2 in advance, you can provide a cost function that minimizes the expected cost under that distribution. These costs can be used to re-weight the training data, as suggested in the other answer; some training algorithms also offer more sophisticated ways to include this information.
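
As one example of passing such information to the learner, the randomForest package in R accepts class priors via its classwt argument; the sketch below uses iris and invented weights purely for illustration:

    library(randomForest)

    # Illustrative only: give a higher prior to the class you expect to be more
    # frequent in D2 than in D1. The weights below are made up for the example.
    rf <- randomForest(Species ~ ., data = iris,
                       classwt = c(setosa = 1, versicolor = 3, virginica = 1))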

Outlier detection

The question here is whether you can detect that something has changed in the inputs X. This is very important, as it may indicate that you are feeding the model data it should not be trusted on. You can apply fairly simple tests, such as comparing the mean and the distribution of each dimension separately; however, this ignores dependencies between the variables.
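
A very simple per-feature check of this kind is a two-sample Kolmogorov-Smirnov test; this sketch assumes d1 and d2 are data frames with the same numeric columns, and feature_drift is just an illustrative helper name:

    # Compare each numeric feature of D1 and D2 with a two-sample
    # Kolmogorov-Smirnov test; a small p-value suggests that feature's
    # distribution has changed.
    feature_drift <- function(d1, d2) {
      sapply(names(d1), function(col) ks.test(d1[[col]], d2[[col]])$p.value)
    }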

For the following illustrations I am using the iris data set.

[Figure: iris data, Petal.Width vs. Petal.Length, colored by species]

Two techniques come to mind that let you detect that something in the data has changed. The first is based on a PCA transformation. It works only for numeric features, but there are similar ideas for categorical ones. PCA transforms your input into a lower-dimensional space: PCA(X, t) = PCA([X1, ..., Xn], t) = [Comp1, ..., Compm] = Comp, with projection t, where usually m << n. This transformation is still (approximately) invertible, so PCA^-1(Comp, t) = X' and the reconstruction error MSE(X, X') is small. To detect a problem, you can monitor this error: as soon as it increases, you can say that you no longer trust your predictions.
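
A minimal sketch of that monitoring idea (assuming d1 and d2 are numeric data frames with identical columns; the number of components and the cut-off quantile are arbitrary choices):

    # Fit PCA on D1, reconstruct new data from the first k components,
    # and monitor the per-row reconstruction error.
    pca <- prcomp(d1, retx = TRUE, scale. = TRUE)
    k <- 2                                          # number of components kept
    reconstruct <- function(newdata) {
      scores <- predict(pca, newdata)[, 1:k, drop = FALSE]
      back <- scores %*% t(pca$rotation[, 1:k])
      t(t(back) * pca$scale + pca$center)           # undo scaling and centering
    }
    err_d1 <- rowMeans((as.matrix(d1) - reconstruct(d1))^2)
    err_d2 <- rowMeans((as.matrix(d2) - reconstruct(d2))^2)
    suspicious <- err_d2 > quantile(err_d1, 0.99)   # arbitrary cut-off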

If I fit a PCA on all the versicolor and virginica data and plot the reconstruction error for the two petal dimensions (the PCA is computed on all four iris features), I get

[Figure: reconstruction error of the two petal dimensions, colored by species]

However, if versicolor is the new, unseen data (the PCA is fit without it), the results are less convincing.

[Figure: the same reconstruction-error plot with versicolor excluded from the PCA fit]

However, a PCA (or something similar) is typically applied to numeric data anyway, so it can give a good indication without much overhead.

The second technique I know of is based on so-called one-class support vector machines. Where an ordinary support vector machine builds a classifier that tries to separate the two target classes Y, a one-class SVM tries to separate seen from unseen data. Using this technique is quite attractive if you already use an SVM for classification: you essentially get two classifications. The first predicts the target, and the second tells you whether similar data has been seen before.

If I build a one-class classifier on setosa and virginica and color the points by novelty, I get the following plot:

[Figure: iris data colored by the one-class SVM novelty prediction]

As you can see, the versicolor data looks suspicious. In this case it is a new class. However, if these were examples of virginica, they would be drifting dangerously close to the hyperplane.

Semi-supervised and transductive learning

To address your core problem: the idea of transductive learning, a special case of semi-supervised learning, might be interesting. In semi-supervised learning the training set consists of two parts: labeled data and unlabeled data. Semi-supervised methods use all of this data to build the classifier. Transductive learning is the special case in which the unlabeled data is exactly your test data D2. The idea was given by Vapnik as "do not try to solve a more complex problem [building a classifier for all possible data] when you want to solve a simpler problem [predicting the labels for D2]".
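
For completeness, here is a minimal self-training sketch in R, one simple semi-supervised strategy (this is not Vapnik's transductive SVM). The label column "y", the function name and the confidence cut-off are assumptions for the example:

    library(randomForest)

    # d1: labeled data with a factor column y; d2: unlabeled data (same features).
    self_train <- function(d1, d2, rounds = 5, conf_cutoff = 0.9) {
      labeled <- d1
      rf <- NULL
      for (i in seq_len(rounds)) {
        rf <- randomForest(y ~ ., data = labeled)
        if (nrow(d2) == 0) break
        prob <- predict(rf, d2, type = "prob")
        conf <- apply(prob, 1, max)                 # confidence of the best class
        take <- conf >= conf_cutoff
        if (!any(take)) break
        pred_class <- factor(colnames(prob)[max.col(prob)],
                             levels = levels(labeled$y))
        pseudo <- d2[take, , drop = FALSE]
        pseudo$y <- pred_class[take]                # pseudo-label the confident rows
        labeled <- rbind(labeled, pseudo)
        d2 <- d2[!take, , drop = FALSE]
      }
      rf
    }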


R code for the charts

    library(ggplot2)
    library(e1071)

    # Iris data with per-species ellipses
    ggplot(iris) + aes(x = Petal.Width, y = Petal.Length, color = Species) +
      geom_point() + stat_ellipse()

    # One-class SVM trained on setosa and virginica (petal features only)
    train <- iris[iris$Species %in% c("virginica", "setosa"), ]
    ocl <- svm(train[, 3:4], type = "one-classification")
    coloring <- predict(ocl, iris[, 3:4], decision.values = TRUE)

    # Color points by whether the one-class SVM recognizes them as known data
    ggplot(iris) + aes(x = Petal.Width, y = Petal.Length, color = coloring) +
      geom_point() + stat_ellipse()

    # Color points by the decision value (distance to the hyperplane)
    dv <- attr(coloring, "decision.values")
    ggplot(iris) + aes(x = Petal.Width, y = Petal.Length) +
      geom_point(color = rgb(red = 0.8 + 0.1 * dv, green = rep(0, 150),
                             blue = 1 - (0.8 + 0.1 * dv)))

    # PCA for the reconstruction-error plots: fit on versicolor and virginica
    # for the first figure, or on setosa and virginica (versicolor unseen)
    # for the second.
    pca <- prcomp(iris[iris$Species %in% c("virginica", "versicolor"), 1:4],
                  retx = TRUE, scale. = TRUE, tol = 0.4)
    # pca <- prcomp(iris[iris$Species %in% c("virginica", "setosa"), 1:4],
    #               retx = TRUE, scale. = TRUE, tol = 0.2)
    predicted <- predict(pca, iris[, 1:4])
    inverted <- t(t(predicted %*% t(pca$rotation)) * pca$scale + pca$center)

    # Reconstruction error in the two petal dimensions
    recon_error <- as.data.frame(inverted[, 3:4]) - iris[, 3:4]
    ggplot(recon_error) +
      aes(x = Petal.Width, y = Petal.Length, color = iris$Species) +
      geom_point() + stat_ellipse()
+7

There may be a number of factors that could lead to this skewed result:

You seem to indicate that D2 really is skewed compared to D1, so the strongly distorted results may simply be the expected outcome (perhaps the D2 dataset is concentrated in a region of the problem space where one class dominates). Depending on the nature of the data, this may well be the correct result.

Perhaps D1 is over-represented in a certain class. You could try training on fewer cases of that class to encourage the classifier to pick one of the other classes, and see how that affects the outcome (a small downsampling sketch follows this list). I don't know how many training or test cases you have, but if the training set is large and contains more labels of that class than of the others, this may lead to over-classification.

Perhaps you could also shift the training data closer to the means of D2 to see what effect this has on the classification. I have never tried this before, though.
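
To illustrate the undersampling idea from the second point above, a minimal sketch; the label column "y", the class name "big_class" and the 50% fraction are placeholders, not values from the question:

    # Keep only a random half of the over-represented class before retraining.
    # "y" and "big_class" stand in for the real label column and class name.
    set.seed(1)
    big  <- d1[d1$y == "big_class", ]
    rest <- d1[d1$y != "big_class", ]
    keep <- sample(nrow(big), size = round(0.5 * nrow(big)))
    d1_balanced <- rbind(big[keep, ], rest)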

Hope this helps in some way.

+2

Source: https://habr.com/ru/post/1240122/

