I used sklearn's StandardScaler (mean removal and variance scaling) to scale a dataframe and compared the result with a dataframe that I standardized "manually" by subtracting the mean and dividing by the standard deviation. The comparison shows consistent small differences. Can anybody explain why? (I used this dataset: http://archive.ics.uci.edu/ml/datasets/Wine)
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Wine data; the file has no header row, so column names are given explicitly.
df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])
cols = list(df.columns)[1:]  # every feature column, i.e. all except "Class"

# Standardize with sklearn.
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])

# Standardize "manually" with pandas.
df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)

# Element-wise difference between the two results (first five columns shown).
df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]
    Alcohol  Malic acid        Ash  Alcalinity of ash  Magnesium
0  0.004272   -0.001582   0.000653          -0.003290   0.005384
1  0.000693   -0.001405  -0.002329          -0.007007   0.000051
2  0.000554    0.000060   0.003120          -0.000756   0.000249
3  0.004758   -0.000976   0.001373          -0.002276   0.002619
4  0.000832    0.000640   0.005177           0.001271   0.003606
5  0.004168   -0.001455   0.000858          -0.003628   0.002421
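
One thing that may be worth checking here is which standard deviation each side uses. pandas' std() defaults to the sample standard deviation (ddof=1, dividing by n - 1), while the StandardScaler documentation describes a biased estimator equivalent to numpy.std(x, ddof=0) (dividing by n). The following is a minimal sketch of that effect using only the Python standard library; the values are made up for illustration and are not taken from the Wine data.

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = statistics.mean(values)

# Population standard deviation: divides by n (the ddof=0 convention).
std_population = statistics.pstdev(values)
# Sample standard deviation: divides by n - 1 (the ddof=1 convention).
std_sample = statistics.stdev(values)

# Standardize the same data under both conventions.
z_population = [(v - mean) / std_population for v in values]
z_sample = [(v - mean) / std_sample for v in values]

# The two results differ by the constant factor sqrt(n / (n - 1)),
# so the element-wise differences are small and proportional to the z-score.
ratio = math.sqrt(n / (n - 1))
differences = [zp - zs for zp, zs in zip(z_population, z_sample)]

print("ratio:", ratio)
print("differences:", differences)
```

If this is the cause, the differences should shrink as n grows, and computing the manual version with df.std(ddof=0) should bring the two dataframes into agreement up to floating-point error.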