I am trying to reproduce the following R results in Python. In this particular case, the predictive skill of the R model is lower than that of the Python model, but in my experience it is usually the other way around (which is why I want to reproduce the results in Python), so please ignore that detail here.
The goal is to predict the iris species ("versicolor" = 0 or "virginica" = 1). There are 100 labeled samples, each consisting of 4 flower measurements: sepal length, sepal width, petal length, petal width. I split the data into a training set (60% of the data) and a test set (40% of the data). 10-fold cross-validation is applied to the training set to find the optimal lambda (the regularization strength, exposed as the parameter "C" in scikit-learn).
I use glmnet in R with alpha set to 1 (for a LASSO penalty), and in Python scikit-learn's LogisticRegressionCV with the "liblinear" solver (the only solver that supports an L1 penalty). The scoring metric used during cross-validation is the same in both languages. Nevertheless, the resulting models differ: the intercepts and coefficients found by each vary, if only slightly.
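As far as I understand (my own reading of the two objectives, so treat it as an assumption), glmnet minimizes the mean negative log-likelihood plus lambda * ||beta||_1, while scikit-learn minimizes the summed negative log-likelihood plus (1/C) * ||beta||_1, so the two regularization strengths should relate roughly as C = 1 / (n * lambda). A minimal Python sketch of that conversion (the helper names are mine):

def lambda_to_C(lam, n_samples):
    # glmnet objective:  (1/n) * sum(loss) + lam * ||beta||_1
    # sklearn objective:        sum(loss) + (1/C) * ||beta||_1
    # matching the two penalty weights gives C = 1 / (n * lam)
    return 1.0 / (n_samples * lam)

def C_to_lambda(C, n_samples):
    return 1.0 / (n_samples * C)

print(lambda_to_C(0.05, 60))  # hypothetical lambda on the 60 training rows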
R code
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[, 5])  # species codes: 1 = setosa, 2 = versicolor, 3 = virginica
X <- iris[y != 1, 1:4]      # drop setosa, keep the 4 measurements
y <- y[y != 1] - 2          # recode: versicolor -> 0, virginica -> 1
n_sample <- NROW(X)
w <- 0.6
X_train <- X[1:(w * n_sample), ]              # first 60% of rows
y_train <- y[1:(w * n_sample)]
X_test <- X[((w * n_sample) + 1):n_sample, ]  # remaining 40%
y_test <- y[((w * n_sample) + 1):n_sample]
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
                          nfolds = 10, alpha = 1, family = "binomial", type.measure = "class")
best_s <- model_lambda$lambda.1se  # largest lambda within 1 SE of the CV-optimal one
pred <- as.numeric(predict(model_lambda, newx = as.matrix(X_test), type = "class", s = best_s))
print(best_s)
print(sum(y_test == pred) / NROW(pred))  # test-set accuracy
print(coef(model_lambda, s = best_s))
Python code
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]      # drop setosa
y = y[y != 0] - 1  # recode: versicolor -> 0, virginica -> 1
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)]
y_train = y[:int(w * n_sample)]
X_test = X[int(w * n_sample):]
y_test = y[int(w * n_sample):]
X_train_fit = StandardScaler().fit(X_train)  # standardize with training-set statistics only
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test))  # test-set accuracy
print(clf.intercept_)
print(clf.coef_)
print(clf.C_)
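To make the two fits more directly comparable, I am considering something like the sketch below (untested; the lambda path shown is a placeholder for the real model_lambda$lambda sequence exported from R). It reuses X_train, X_train_transformed and X_train_fit from above, feeds LogisticRegressionCV the glmnet lambda path converted to C values, and rescales the coefficients back to the original feature scale, since glmnet standardizes internally but reports coefficients on the original scale:

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Placeholder for the lambda path searched by cv.glmnet in R
# (e.g. exported with write.csv(model_lambda$lambda, ...))
lambdas = np.logspace(-4, 0, 100)
n_train = len(X_train)
Cs = 1.0 / (n_train * lambdas)  # C = 1 / (n * lambda), as above

clf = LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear',
                           cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(1.0 / (n_train * clf.C_))  # chosen C expressed as a glmnet-style lambda

# The coefficients were fit on standardized features; undo the scaling so
# they line up with what coef() reports in R.
coef_orig = clf.coef_ / X_train_fit.scale_
intercept_orig = clf.intercept_ - (clf.coef_ * X_train_fit.mean_ / X_train_fit.scale_).sum()
print(coef_orig, intercept_orig)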