For some reason, column order seems to matter with the LogisticRegression classifier in scikit-learn, which seems odd to me. I have 9 covariates and a binary outcome, and when I change the order of the columns, call fit(), and then call predict_proba(), the output is different. Toy example below:
logit_model = LogisticRegression(C=1e9, tol=1e-15)
This:
logit_model.fit(df[['column_2','column_1']], df['target'])
logit_model.predict_proba(df[['column_2','column_1']])
array([[ 0.27387109, 0.72612891], ...])
gives a different result from this:
logit_model.fit(df[['column_1','column_2']], df['target'])
logit_model.predict_proba(df[['column_1','column_2']])
array([[ 0.26117794, 0.73882206], ...])
This is unexpected to me, but maybe it just reflects a gap in my knowledge of the internal fitting algorithm. What am I missing?
EDIT: Here is the complete code and data
data: https://s3-us-west-2.amazonaws.com/gjt-personal/test_model.csv
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('test_model.csv',index_col=False)
columns1 = ['col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
columns2 = ['col_2','col_1','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
logit_model = LogisticRegression(C=1e9, tol=1e-15)
logit_model.fit(df[columns1],df['target'])
logit_model.predict_proba(df[columns1])
logit_model.fit(df[columns2],df['target'])
logit_model.predict_proba(df[columns2])
It turns out this has something to do with tol=1e-15: LogisticRegression(C=1e9, tol=1e-15) gives a different result for the two column orders, while LogisticRegression(C=1e9) does not.
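For contrast, here is a minimal sketch using synthetic data (make_classification is a stand-in for the linked CSV, so the numbers are hypothetical) suggesting that the column order itself is not the culprit: with the default tolerance the solver converges, and permuted columns give matching probabilities up to floating-point noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the linked test_model.csv: 9 features, binary target.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
perm = [1, 0, 2, 3, 4, 5, 6, 7, 8]  # swap the first two columns

# Default settings (C=1.0, tol=1e-4): the fit converges cleanly.
m1 = LogisticRegression().fit(X, y)
m2 = LogisticRegression().fit(X[:, perm], y)

# In exact arithmetic the two models are identical up to a coefficient
# permutation, so predicted probabilities should agree to float noise.
diff = np.abs(m1.predict_proba(X) - m2.predict_proba(X[:, perm])).max()
print(diff)
print(m1.n_iter_)  # iterations the solver actually used
```

With an extreme tol=1e-15 (and effectively no regularization from C=1e9), the stopping criterion can never be met exactly, so the solver halts at max_iter at a slightly different point depending on the order of floating-point operations, which is one plausible source of the discrepancy.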