Random forests and ROC curves in Julia

I am using the ScikitLearn-style API of DecisionTree.jl (see the bottom of the DecisionTree.jl README, which is what I mean by ScikitLearn) to create a random forest model for a binary classification problem on one of the RDatasets datasets. I am also using the MLBase package to evaluate the model.

I have built a random forest model on my data and would like to create an ROC curve for it. I understand what an ROC curve is in theory; I just can't figure out how to create one for a specific model.

From the Wikipedia page, the last part of the first sentence, which I have emphasized in bold italics, is the part that causes my confusion: "In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied." The threshold comes up throughout the rest of the article, but it still confuses me for binary classification problems. What is the threshold, and how do I vary it?

In addition, the MLBase documentation says of its roc function: "Compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and threshold(s)." But it does not mention this threshold anywhere else.

Sample code from my project is given below. Basically, I want to create an ROC curve for the random forest, but I'm not sure how to do it, or whether it is even possible with this setup.

using DecisionTree
using RDatasets
using MLBase

quakes_data = dataset("datasets", "quakes");

# Add in a binary column as feature column for classification
quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)

# Getting features and labels where label = 1 is mag > 5 and label = 2 is mag <= 5
features = convert(Array, quakes_data[:, [1:3;5]]);
labels = convert(Array, quakes_data[:, 6]);
labels[labels.==0] = 2

# Create a random forest model with the tuning parameters I want
r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)

# Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
DecisionTree.fit!(r_f_model, features, labels)

# Apply the trained model to the test features data set (here I haven't partitioned into training and test)
r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))

# Applying the model to the training set and looking at model stats
TrainingROC = roc(labels, r_f_prediction) #getting the stats around the model applied to the train set
#     p::T    # positive in ground-truth
#     n::T    # negative in ground-truth
#     tp::T   # correct positive prediction
#     tn::T   # correct negative prediction
#     fp::T   # (incorrect) positive prediction when ground-truth is negative
#     fn::T   # (incorrect) negative prediction when ground-truth is positive

I also read this question and did not find it useful.

1 answer

Your model outputs a 0/1 label (or true/false, red/blue) for each observation, but under the hood most classifiers compute a continuous score, for example the probability that the observation belongs to class 1 rather than class 0. The 0/1 label is produced by comparing that score to a cutoff: if the score is above it, the observation is labelled 1 (otherwise 0).

What is the threshold? It is exactly that cutoff: a value between 0 and 1 such that any observation whose score exceeds it is predicted to be a 1.

If you set the threshold to 0, everything gets predicted as a 1; if you set it to 1, (almost) everything gets predicted as a 0 (the exact behaviour at the endpoints depends on the convention, but that is the idea).

Moving the threshold is therefore a trade-off between the two kinds of errors: predicting 0-when-really-1 (false negatives) and predicting 1-when-really-0 (false positives).

Start with the threshold at 0 and slide it towards 1. At each threshold, plot the rate of 1-when-really-0 predictions (the false positive rate) on the x axis and the rate of correctly predicted 1s (the true positive rate) on the y axis. Each threshold gives one point, and connecting the points as the threshold varies gives the ROC curve for the model. The AUC is the area under that curve, and is a common single-number summary of how good the classifier is.
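
To make the sweep concrete, here is a small sketch in Julia (my own illustration, not from either package; the function name roc_points is made up). It assumes you already have a vector of scores in [0, 1] and 0/1 ground-truth labels, and collects one (FPR, TPR) point per threshold.

# Sketch: sweep the threshold over given scores and 0/1 ground truth,
# collecting one (false positive rate, true positive rate) point per value.
function roc_points(truth::AbstractVector{<:Integer}, scores::AbstractVector{<:Real};
                    thresholds = 0.0:0.01:1.0)
    fpr = Float64[]
    tpr = Float64[]
    for t in thresholds
        pred = scores .>= t                 # predict 1 when the score clears the threshold
        tp = sum(pred .& (truth .== 1))     # 1-when-really-1
        fp = sum(pred .& (truth .== 0))     # 1-when-really-0
        fn = sum(.!pred .& (truth .== 1))   # 0-when-really-1
        tn = sum(.!pred .& (truth .== 0))   # 0-when-really-0
        push!(tpr, tp / (tp + fn))          # y axis: true positive rate
        push!(fpr, fp / (fp + tn))          # x axis: false positive rate
    end
    return fpr, tpr
end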

There are plenty of good explanations of this with pictures and worked examples, so if the above is not clear, a quick Google search for ROC curves will turn them up.

Hopefully that clears up what the threshold is and how varying it produces the ROC curve.
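
Back to the code in the question: DecisionTree.jl's ScikitLearn-style API also has predict_proba, which returns class probabilities rather than hard labels, and MLBase's roc can take scores together with a threshold (or a range of thresholds) and return one ROC instance per threshold. A rough sketch along those lines follows; the column order of predict_proba and the 0/1 recoding of the ground truth are assumptions you should verify against your package versions.

# Sketch: ROC curve for the forest trained above, using probabilities rather than
# the hard 1/2 predictions. Assumes column 1 of predict_proba corresponds to class 1
# (mag > 5); check r_f_model.classes to confirm on your version of DecisionTree.jl.
probs = DecisionTree.predict_proba(r_f_model, features)
scores = probs[:, 1]

# MLBase's score-based roc expects the positive class coded as 1 and the negative as 0,
# so recode the 1/2 labels used above.
gt = Int.(labels .== 1)

thresholds = 0.0:0.01:1.0
curve = MLBase.roc(gt, scores, thresholds)   # one ROCNums instance per threshold

fpr = false_positive_rate.(curve)   # x axis
tpr = true_positive_rate.(curve)    # y axis
# Plot fpr against tpr with your plotting package of choice to see the ROC curve.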


Source: https://habr.com/ru/post/1657448/

