Evaluating a statistical model in R

I have a very large dataset (ds). One of its columns is popularity, a factor with levels 'High' and 'Low'.

I split the data 70% / 30% into a training set (ds_tr) and a test set (ds_te).

I created the following model using logistic regression:

 mdl <- glm(formula = popularity ~ . - url, family = "binomial", data = ds_tr)

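Then I computed the predicted probabilities (and repeated this for ds_te):

 # predict() takes `newdata`, not `data`; url is already excluded by the model formula
 y_hat <- predict(mdl, newdata = ds_tr, type = 'response')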
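
The split, fit and predict steps can be sketched on simulated data, since ds itself is not shown. The data frame toy, its column x, the seed and the sample size are all assumptions made for illustration. One point worth keeping in mind: for a factor response, glm with a binomial family treats the first level as failure, so the predicted probabilities refer to the second level (here 'Low').

```r
set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n))
# Hypothetical response: larger x makes 'Low' more likely
toy$popularity <- factor(ifelse(toy$x + rnorm(n) > 0, "Low", "High"),
                         levels = c("High", "Low"))

# 70/30 split into training and test sets
idx    <- sample(seq_len(n), size = round(0.7 * n))
toy_tr <- toy[idx, ]
toy_te <- toy[-idx, ]

fit <- glm(popularity ~ x, family = binomial, data = toy_tr)

# Without newdata, predict() returns fitted values for the training rows;
# pass newdata to score any other rows
p_tr <- predict(fit, type = "response")
p_te <- predict(fit, newdata = toy_te, type = "response")
```

Both p_tr and p_te are probabilities of the 'Low' class, so the same 0.5 threshold can be applied to either set.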

I want to find the precision and the recall corresponding to a cutoff threshold of 0.5, so I did:

 library(ROCR)
 pred <- prediction(y_hat, ds_tr$popularity)
 perf <- performance(pred, "prec", "rec")

The result is an object holding many values:

 str(perf)
 Formal class 'performance' [package "ROCR"] with 6 slots
   ..@ x.name      : chr "Recall"
   ..@ y.name      : chr "Precision"
   ..@ alpha.name  : chr "Cutoff"
   ..@ x.values    :List of 1
   .. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
   ..@ y.values    :List of 1
   .. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
   ..@ alpha.values:List of 1
   .. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...

How can I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?

1 answer

Access the slots of the performance object (using @ together with list indexing).

Create a data frame with every cutoff and its precision and recall:

 probab.cuts <- data.frame(cut  = perf@alpha.values[[1]],
                           prec = perf@y.values[[1]],
                           rec  = perf@x.values[[1]])

You can view all related values.

 probab.cuts 

To select the requested values, take the last row whose cutoff is still above 0.5 (the cutoffs are listed in decreasing order):

 tail(probab.cuts[probab.cuts$cut > 0.5,], 1) 
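
Why the last row above 0.5 is the right one can be sketched in base R on toy data. The vectors score and truth and the data frame toy.cuts are assumptions for illustration, not the asker's data, and the table is built by hand rather than with ROCR. It assumes cutoffs in decreasing order, as in the str output above, and that a score at or above the cutoff counts as a positive prediction.

```r
# Toy scores and 0/1 labels (assumptions, not the asker's data)
score <- c(0.97, 0.90, 0.81, 0.64, 0.55, 0.48, 0.35, 0.22, 0.10)
truth <- c(1,    1,    0,    1,    1,    0,    1,    0,    0)

# One row per distinct score, in decreasing order, treating
# score >= cut as a positive prediction
cuts <- sort(unique(score), decreasing = TRUE)
toy.cuts <- data.frame(
  cut  = cuts,
  prec = sapply(cuts, function(k) sum(truth[score >= k]) / sum(score >= k)),
  rec  = sapply(cuts, function(k) sum(truth[score >= k]) / sum(truth))
)

# Last row whose cutoff is still above 0.5
at_half <- tail(toy.cuts[toy.cuts$cut > 0.5, ], 1)

# Direct thresholding at 0.5 for comparison
pred_pos    <- score > 0.5
direct_prec <- sum(truth[pred_pos]) / sum(pred_pos)
direct_rec  <- sum(truth[pred_pos]) / sum(truth)
```

Under these assumptions the smallest cutoff above 0.5 selects exactly the observations with score > 0.5, so at_half agrees with the directly thresholded precision and recall.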

Manual check:

 tab <- table(ds_tr$popularity, y_hat > 0.5)
 tab[4]/(tab[4]+tab[2]) # recall
 tab[4]/(tab[4]+tab[3]) # precision
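
The positional indices in the manual check depend on the layout of the 2x2 table (rows are the true classes, columns are FALSE/TRUE, stored column by column). A sketch of the same check with named indexing, which may be easier to read, is given here on toy data. The vectors actual and prob are assumptions, and 'Low' is taken as the positive class because it is the second factor level.

```r
# Toy data (assumptions): true classes and predicted probabilities of 'Low'
actual <- factor(c("Low", "Low", "High", "Low", "Low",
                   "High", "Low", "High", "High"),
                 levels = c("High", "Low"))
prob   <- c(0.97, 0.90, 0.81, 0.64, 0.55, 0.48, 0.35, 0.22, 0.10)

# Keep both FALSE and TRUE columns even if one of them is empty
predicted <- factor(prob > 0.5, levels = c(FALSE, TRUE))
tab <- table(actual = actual, predicted = predicted)

tp <- tab["Low",  "TRUE"]   # positives predicted positive
fn <- tab["Low",  "FALSE"]  # positives predicted negative
fp <- tab["High", "TRUE"]   # negatives predicted positive

recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
```

Here tab["Low", "TRUE"] is the same cell as tab[4], tab["Low", "FALSE"] is tab[2], and tab["High", "TRUE"] is tab[3], matching the positional version.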

Source: https://habr.com/ru/post/1239745/

