The result of sklearn's StandardScaler differs from the manual result

Question

I used sklearn's StandardScaler (mean removal and variance scaling) to scale a DataFrame and compared it with a DataFrame where I "manually" subtracted the mean and divided by the standard deviation. The comparison shows consistent small differences. Can anyone explain why? (I used this dataset: http://archive.ics.uci.edu/ml/datasets/Wine)

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])

cols = list(df.columns)[1:]    # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])

df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)

df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]


    Alcohol    Malic acid   Ash         Alcalinity  Magnesium
0   0.004272    -0.001582   0.000653    -0.003290   0.005384
1   0.000693    -0.001405   -0.002329   -0.007007   0.000051
2   0.000554    0.000060    0.003120    -0.000756   0.000249
3   0.004758    -0.000976   0.001373    -0.002276   0.002619
4   0.000832    0.000640    0.005177    0.001271    0.003606
5   0.004168    -0.001455   0.000858    -0.003628   0.002421

python pandas scikit-learn

Dirk schulz May 27 '17 at 18:24

1 answer

ayhan · Accepted Answer · 2017-05-27T18:33:40+0000

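scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations is divided by the number of observations), whereas pandas computes the sample standard deviation (the denominator is the number of observations - 1) (see the Wikipedia article on standard deviation). This is a correction factor for estimating the population standard deviation, and it is controlled by the delta degrees of freedom (ddof). So by default, numpy and scikit-learn use ddof=0, whereas pandas uses ddof=1 (docs).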

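DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.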
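
As a minimal sketch of the two defaults, the snippet below computes both standard deviations for the same numbers. The values are made up for illustration and are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
series = pd.Series(values)

population_std = np.std(values)  # NumPy default: ddof=0, divides by N
sample_std = series.std()        # pandas default: ddof=1, divides by N - 1

print(population_std)
print(sample_std)

# The two libraries agree once ddof is set to the same value explicitly.
same_with_ddof0 = np.isclose(series.std(ddof=0), population_std)
same_with_ddof1 = np.isclose(np.std(values, ddof=1), sample_std)
print(same_with_ddof0, same_with_ddof1)
```

The two results differ by a constant factor of sqrt(N / (N - 1)), which is why every standardized value in a column is off by the same small relative amount.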

You can change your pandas version to:

df_standardized_manual = (df - df.mean()) / df.std(ddof=0)

This leaves differences on the order of floating-point rounding error:

        Alcohol    Malic acid           Ash  Alcalinity of ash     Magnesium
0 -8.215650e-15 -5.551115e-16  3.191891e-15       0.000000e+00  2.220446e-16
1 -8.715251e-15 -4.996004e-16  3.441691e-15       0.000000e+00  0.000000e+00
2 -8.715251e-15 -3.955170e-16  2.886580e-15      -5.551115e-17  1.387779e-17
3 -8.437695e-15 -4.440892e-16  3.164136e-15      -1.110223e-16  1.110223e-16
4 -8.659740e-15 -3.330669e-16  2.886580e-15       5.551115e-17  2.220446e-16
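
The same comparison can be run without the wine file. This is a rough sketch on synthetic data; the column names "a", "b", "c" and the random values are assumptions made only for illustration. It checks StandardScaler against both manual variants.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the wine dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(loc=10.0, scale=3.0, size=(50, 3)),
                  columns=["a", "b", "c"])

# StandardScaler divides by the population standard deviation (ddof=0).
scaled = pd.DataFrame(StandardScaler().fit_transform(df),
                      index=df.index, columns=df.columns)

manual_ddof1 = (df - df.mean()) / df.std()        # pandas default, N - 1
manual_ddof0 = (df - df.mean()) / df.std(ddof=0)  # divides by N

# Largest absolute disagreement with StandardScaler for each variant.
max_diff_ddof1 = (scaled - manual_ddof1).abs().to_numpy().max()
max_diff_ddof0 = (scaled - manual_ddof0).abs().to_numpy().max()
print(max_diff_ddof1)
print(max_diff_ddof0)
```

With the default ddof=1 the largest difference should be small but clearly non-zero, while with ddof=0 only rounding noise should remain.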
