For some reason, column order seems to matter with the LogisticRegression classifier in scikit-learn, which seems odd to me. I have 9 covariates and a binary outcome, and when I change the order of the columns, call fit(), and then call predict_proba(), the output is different. Toy example below:
logit_model = LogisticRegression(C=1e9, tol=1e-15)
This:
logit_model.fit(df[['column_2','column_1']], df['target'])
logit_model.predict_proba(df[['column_2','column_1']])
array([[ 0.27387109, 0.72612891], ...])
gives a different result from this:
logit_model.fit(df[['column_1','column_2']], df['target'])
logit_model.predict_proba(df[['column_1','column_2']])
array([[ 0.26117794, 0.73882206], ...])
This is unexpected to me, but maybe it just reflects a gap in my knowledge of the internal fitting algorithm. What am I missing?
EDIT: Here is the complete code and data
data: https://s3-us-west-2.amazonaws.com/gjt-personal/test_model.csv
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('test_model.csv',index_col=False)
columns1 = ['col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
columns2 = ['col_2','col_1','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
logit_model = LogisticRegression(C=1e9, tol=1e-15)
logit_model.fit(df[columns1],df['target'])
logit_model.predict_proba(df[columns1])
logit_model.fit(df[columns2],df['target'])
logit_model.predict_proba(df[columns2])
It turns out this has something to do with tol=1e-15: LogisticRegression(C=1e9, tol=1e-15) gives a different result for the two column orders, while LogisticRegression(C=1e9) does not.
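For contrast, here is a minimal sketch using synthetic data (make_classification is a stand-in for the linked CSV, so the numbers are hypothetical) suggesting that the column order itself is not the culprit: with the default tolerance the solver converges, and permuted columns give matching probabilities up to floating-point noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the linked test_model.csv: 9 features, binary target.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
perm = [1, 0, 2, 3, 4, 5, 6, 7, 8]  # swap the first two columns

# Default settings (C=1.0, tol=1e-4): the fit converges cleanly.
m1 = LogisticRegression().fit(X, y)
m2 = LogisticRegression().fit(X[:, perm], y)

# In exact arithmetic the two models are identical up to a coefficient
# permutation, so predicted probabilities should agree to float noise.
diff = np.abs(m1.predict_proba(X) - m2.predict_proba(X[:, perm])).max()
print(diff)
print(m1.n_iter_)  # iterations the solver actually used
```

With an extreme tol=1e-15 (and effectively no regularization from C=1e9), the stopping criterion can never be met exactly, so the solver halts at max_iter at a slightly different point depending on the order of floating-point operations, which is one plausible source of the discrepancy.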