Difference between R's scale() and sklearn.preprocessing.scale()

I am currently migrating my data analysis from R to Python. When scaling a dataset in R, I use scale(), which as I understand it computes (x - mean(x)) / sd(x).

To replace this function, I tried sklearn.preprocessing.scale(). From my reading of the documentation, it does the same thing. However, I ran a small test file and found that the two methods return different values. Specifically, the standard deviations do not match. Can anyone explain why the standard deviations "deviate" from each other?

MWE:

    # import packages
    from sklearn import preprocessing
    import numpy
    import rpy2.robjects.numpy2ri
    from rpy2.robjects.packages import importr
    rpy2.robjects.numpy2ri.activate()

    # Set up the R namespace
    R = rpy2.robjects.r

    np1 = numpy.array([[1.0, 2.0], [3.0, 1.0]])
    print "Numpy array:"
    print np1
    print "Scaled numpy array through R.scale()"
    print R.scale(np1)
    print "-------"
    print "Scaled numpy array through preprocessing.scale()"
    print preprocessing.scale(np1, axis=0, with_mean=True, with_std=True)

    scaler = preprocessing.StandardScaler()
    scaler.fit(np1)
    print "Mean of preprocessing.scale():"
    print scaler.mean_
    print "Std of preprocessing.scale():"
    print scaler.std_

Output generated by the MWE confirms the mismatch.

2 answers

This seems to be related to how the standard deviation is calculated.

    >>> import numpy as np
    >>> a = np.array([[1, 2], [3, 1]])
    >>> np.std(a, axis=0)
    array([ 1. ,  0.5])
    >>> np.std(a, axis=0, ddof=1)
    array([ 1.41421356,  0.70710678])

From the numpy.std documentation:

ddof: int, optional

Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.

Apparently, R's scale() uses ddof=1, while sklearn.preprocessing.StandardScaler() uses ddof=0.
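The difference can be reproduced with plain numpy. This is a minimal sketch (the variable names are mine): it scales the example matrix from the question both ways, once with the sample standard deviation (ddof=1, what R uses) and once with the population standard deviation (ddof=0, what scikit-learn uses).

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 1.0]])

# R's scale() divides by the sample standard deviation (divisor n - 1) ...
r_style = (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

# ... while sklearn's scale()/StandardScaler divide by the
# population standard deviation (divisor n).
sklearn_style = (a - a.mean(axis=0)) / a.std(axis=0, ddof=0)
```

With only two rows the two results differ by a factor of sqrt(2), which is exactly the discrepancy visible in the MWE output.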

EDIT (to explain how to use an alternative ddof):

There seems to be no easy way to compute the std with an alternative ddof without overwriting the attributes of the StandardScaler object itself.

    import numpy
    from sklearn.preprocessing import StandardScaler

    sc = StandardScaler()
    sc.fit(data)
    # Now sc.mean_ and sc.std_ hold the mean and standard deviation of the data.
    # Replace sc.std_ with the std computed by numpy using ddof=1:
    sc.std_ = numpy.std(data, axis=0, ddof=1)
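Note that in recent scikit-learn versions the std_ attribute was renamed to scale_ (transform divides by scale_), so the same trick today would look like the sketch below. This assumes the current StandardScaler API; the data array is just the example from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 2.0], [3.0, 1.0]])

sc = StandardScaler()
sc.fit(data)
# Modern attribute name: scale_ (std_ was deprecated and removed).
# Overriding it switches the divisor to the sample standard
# deviation (ddof=1), matching R's scale().
sc.scale_ = np.std(data, axis=0, ddof=1)
scaled = sc.transform(data)
```

After the override, sc.transform produces the same values as R.scale() on this input.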

The R.scale documentation states:

The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is their number. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)

However, sklearn.preprocessing.StandardScaler always scales by the standard deviation.
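The quoted distinction is easy to verify numerically. A small sketch (plain numpy, my own variable names): for a centered vector the root-mean-square of the deviations equals the sample standard deviation, but for the raw, uncentered vector it is generally a different number.

```python
import numpy as np

v = np.array([1.0, 3.0])

# Centered: rms of deviations from the mean == sample standard deviation.
centered = v - v.mean()
rms_centered = np.sqrt(np.sum(centered ** 2) / (len(v) - 1))

# Uncentered: the root-mean-square of the raw values differs.
rms_uncentered = np.sqrt(np.sum(v ** 2) / (len(v) - 1))
```

Here rms_centered equals np.std(v, ddof=1), while rms_uncentered is sqrt(10), so scaling without centering divides by a different quantity.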

In my case, I wanted to replicate R.scale in Python without centering, so I followed @Sid's advice a little differently:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    def get_scale_1d(v):
        # Ported from the R source: root-mean-square of the
        # non-missing values, sqrt(sum(x^2) / (n - 1))
        v = v[~np.isnan(v)]
        return np.sqrt(np.sum(v ** 2) / np.max([1, len(v) - 1]))

    sc = StandardScaler()
    sc.fit(data)
    sc.std_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=data)
    sc.transform(data)

Source: https://habr.com/ru/post/1208356/
