Python - Machine Learning

I'm currently trying to understand how machine learning algorithms work, and one thing I really don't get is the glaring discrepancy between the calculated accuracy of the predicted labels and the confusion matrix. I'll try to explain as clearly as possible.

Here is a fragment of the data set (it shows 9 samples out of roughly 4,000 in the real data set, 6 features, and 9 labels; the labels denote categories, not numbers, so they cannot be ordered as 7 > 4 > 1):

    f1      f2     f3     f4      f5   f6   label
    89.18   0.412  9.1    24.17   2.4  1    1
    90.1    0.519  14.3   16.555  3.2  1    2
    83.42   0.537  13.3   14.93   3.4  1    3
    64.82   0.68   9.1    8.97    4.5  2    4
    34.53   0.703  4.9    8.22    3.5  2    5
    87.19   1.045  4.7    5.32    5.4  2    6
    43.23   0.699  14.9   12.375  4.0  2    7
    43.29   0.702  7.3    6.705   4.0  2    8
    20.498  1.505  1.321  6.4785  3.8  2    9

Out of curiosity, I tried a number of algorithms (linear, Gaussian, SVM (SVC, SVR), Bayesian, etc.). As far as I understand from the documentation, in my case it is better to work with classifiers (discrete output) rather than regression (continuous output). Using the usual:

    model.fit(X_train, y_train)
    model.score(X_test, y_test)

I got:

    Lin_Reg: 0.855793988736
    Log_Reg: 0.463251670379
    DTC:     0.400890868597
    KNC:     0.41425389755
    LDA:     0.550111358575
    Gaus_NB: 0.391982182628
    Bay_Rid: 0.855698151574
    SVC:     0.483296213808
    SVR:     0.647914795849

The continuous (regression) algorithms gave the best results. When I built a confusion matrix for Bayesian ridge (I had to convert the float predictions to integers) to check its result, I got the following:

    Pred   l1  l2  l3  l4  l5  l6  l7  l8  l9
    True
    l1     23  66   0   0   0   0   0   0   0
    l2     31  57   1   0   0   0   0   0   0
    l3     13  85  19   0   0   0   0   0   0
    l4      0   0   0   0   1   6   0   0   0
    l5      0   0   0   4   8   7   0   0   0
    l6      0   0   0   1  27  36   7   0   0
    l7      0   0   0   0   2  15   0   0   0
    l8      0   0   0   1   1  30   8   0   0
    l9      0   0   0   1   0   9   1   0   0

This made me realize that the 85% "accuracy" cannot be right. How can this be explained? Is it because of the float-to-int conversion?
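Here is a minimal reproduction of what I am doing, with synthetic data standing in for my real set (the array shapes and label range match my description, everything else is made up):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my data: 4000 samples, 6 features, labels 1..9
rng = np.random.RandomState(0)
X = rng.rand(4000, 6)
y = rng.randint(1, 10, size=4000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = BayesianRidge()
model.fit(X_train, y_train)

# This is the number I have been calling "accuracy"
print(model.score(X_test, y_test))

# Rounding the continuous predictions to integers for the confusion matrix
y_pred = np.rint(model.predict(X_test)).astype(int)
print(accuracy_score(y_test, y_pred))
```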

Any direct answer, link, etc. would be appreciated.

+5
3 answers

You are mixing up two very distinct machine learning concepts here: regression and classification. Regression usually deals with continuous values, e.g. temperature or market value. Classification, on the other hand, can tell you which bird species is in a recording; that is exactly where you would use a confusion matrix. It tells you how many times the algorithm predicted the correct label and where it made mistakes. The scikit-learn library you are using has separate sections for both.

Classification tasks and regression tasks are scored with different metrics, so never assume their scores are comparable. As Jyavad noted, the "coefficient of determination" is very different from accuracy. I would also recommend reading up on precision and recall.
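A quick way to see per-class precision and recall (a sketch on toy true/predicted labels; substitute your own `y_test` and rounded predictions):

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Toy true/predicted labels standing in for a test set
y_true = [1, 1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 2, 1, 2, 2, 3, 3, 1, 3]

# Precision: of the samples predicted as class c, how many really are c.
# Recall:    of the true class-c samples, how many were found.
print(classification_report(y_true, y_pred))
print(precision_score(y_true, y_pred, average=None))
print(recall_score(y_true, y_pred, average=None))
```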

In your case you clearly have a classification problem, and it should be treated as one. Also, note that f6 looks like a discrete set of values.
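Treating it as classification end to end might look like this (a sketch on synthetic data; the classifier choice and sizes are arbitrary, swap in your own arrays):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, 9 unordered class labels
rng = np.random.RandomState(0)
X = rng.rand(400, 6)
y = rng.randint(1, 10, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# For a classifier, .score() really is accuracy -- no rounding tricks needed
print(clf.score(X_test, y_test))
print(confusion_matrix(y_test, clf.predict(X_test)))
```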

If you want to experiment quickly with different approaches, I can recommend, for example, H2O, which, besides a good API, has an excellent user interface and supports massively parallel processing. XGBoost is also excellent.

+4

Take a look at the documentation here.

If you call score() on regression methods, they return the "coefficient of determination R^2" of the prediction rather than accuracy.
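You can see this concretely in a tiny sketch (iris used only as convenient toy data): the regressor's score() matches r2_score, the classifier's matches accuracy_score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

X, y = load_iris(return_X_y=True)

reg = LinearRegression().fit(X, y)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Regressor: score() is the coefficient of determination R^2
print(reg.score(X, y), r2_score(y, reg.predict(X)))

# Classifier: score() is plain accuracy
print(clf.score(X, y), accuracy_score(y, clf.predict(X)))
```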

+3

Take a look at this one.
Use "model.score(X_test, y_test)".

0

Source: https://habr.com/ru/post/1258608/
