I am trying to reproduce the following R results in Python. In this particular case, the predictive skill of the R model is lower than that of the Python model, but in my experience it is usually the other way around (which is why I want to reproduce the results in Python), so please ignore that detail here.
The goal is to predict the iris species ("versicolor" = 0 or "virginica" = 1). There are 100 labeled samples, each consisting of 4 flower measurements: sepal length, sepal width, petal length, petal width. I split the data into a training set (60% of the data) and a test set (40% of the data). 10-fold cross-validation is applied to the training set to find the optimal lambda (the regularization strength, exposed as the parameter "C" in scikit-learn).
I use glmnet in R with alpha set to 1 (for a LASSO penalty), and in Python scikit-learn's LogisticRegressionCV with the "liblinear" solver (the only solver that supports an L1 penalty). The scoring metric used during cross-validation is the same in both languages. Nevertheless, the resulting models differ: the intercepts and coefficients found by each vary, if only slightly.
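As far as I understand (my own reading of the two objectives, so treat it as an assumption), glmnet minimizes the mean negative log-likelihood plus lambda * ||beta||_1, while scikit-learn minimizes the summed negative log-likelihood plus (1/C) * ||beta||_1, so the two regularization strengths should relate roughly as C = 1 / (n * lambda). A minimal Python sketch of that conversion (the helper names are mine):

def lambda_to_C(lam, n_samples):
    # glmnet objective:  (1/n) * sum(loss) + lam * ||beta||_1
    # sklearn objective:        sum(loss) + (1/C) * ||beta||_1
    # matching the two penalty weights gives C = 1 / (n * lam)
    return 1.0 / (n_samples * lam)

def C_to_lambda(C, n_samples):
    return 1.0 / (n_samples * C)

print(lambda_to_C(0.05, 60))  # hypothetical lambda on the 60 training rows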
R code
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[, 5])  # species codes: 1 = setosa, 2 = versicolor, 3 = virginica
X <- iris[y != 1, 1:4]      # drop setosa, keep the 4 measurements
y <- y[y != 1] - 2          # recode: versicolor -> 0, virginica -> 1
n_sample <- NROW(X)
w <- 0.6
X_train <- X[1:(w * n_sample), ]              # first 60% of rows
y_train <- y[1:(w * n_sample)]
X_test <- X[((w * n_sample) + 1):n_sample, ]  # remaining 40%
y_test <- y[((w * n_sample) + 1):n_sample]
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
                          nfolds = 10, alpha = 1, family = "binomial", type.measure = "class")
best_s <- model_lambda$lambda.1se  # largest lambda within 1 SE of the CV-optimal one
pred <- as.numeric(predict(model_lambda, newx = as.matrix(X_test), type = "class", s = best_s))
print(best_s)
print(sum(y_test == pred) / NROW(pred))  # test-set accuracy
print(coef(model_lambda, s = best_s))
Python code
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]      # drop setosa
y = y[y != 0] - 1  # recode: versicolor -> 0, virginica -> 1
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)]
y_train = y[:int(w * n_sample)]
X_test = X[int(w * n_sample):]
y_test = y[int(w * n_sample):]
X_train_fit = StandardScaler().fit(X_train)  # standardize with training-set statistics only
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test))  # test-set accuracy
print(clf.intercept_)
print(clf.coef_)
print(clf.C_)
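To make the two fits more directly comparable, I am considering something like the sketch below (untested; the lambda path shown is a placeholder for the real model_lambda$lambda sequence exported from R). It reuses X_train, X_train_transformed and X_train_fit from above, feeds LogisticRegressionCV the glmnet lambda path converted to C values, and rescales the coefficients back to the original feature scale, since glmnet standardizes internally but reports coefficients on the original scale:

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Placeholder for the lambda path searched by cv.glmnet in R
# (e.g. exported with write.csv(model_lambda$lambda, ...))
lambdas = np.logspace(-4, 0, 100)
n_train = len(X_train)
Cs = 1.0 / (n_train * lambdas)  # C = 1 / (n * lambda), as above

clf = LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear',
                           cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(1.0 / (n_train * clf.C_))  # chosen C expressed as a glmnet-style lambda

# The coefficients were fit on standardized features; undo the scaling so
# they line up with what coef() reports in R.
coef_orig = clf.coef_ / X_train_fit.scale_
intercept_orig = clf.intercept_ - (clf.coef_ * X_train_fit.mean_ / X_train_fit.scale_).sum()
print(coef_orig, intercept_orig)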