Implementing R's scale function in pandas (Python)?

What is an efficient equivalent of R's scale function in pandas? For instance, how would

 newdf <- scale(df)

be written in pandas? Is there an elegant way to do it with transform?

+4
2 answers

Scaling is very common in machine learning tasks, so it is implemented in the scikit-learn preprocessing module. You can pass a pandas DataFrame directly to its scale function.

The only catch is that the returned object is no longer a DataFrame but a numpy array. That is usually not a real problem if you are going to pass it straight into a machine learning model (e.g. an SVM or logistic regression). If you want to keep a DataFrame, you need a small workaround:

 from sklearn.preprocessing import scale
 from pandas import DataFrame

 newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
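One thing worth checking before relying on this: R's scale divides by the sample standard deviation (n - 1 in the denominator), whereas scikit-learn's scale, as far as I know, uses the population standard deviation (ddof=0), so the two results would differ by a constant factor. The sketch below uses only pandas and imitates the scikit-learn convention with ddof=0 instead of calling scikit-learn; the toy DataFrame and the variable names are made up for illustration. It also shows one way the transform approach asked about in the question might look.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# pandas' std() defaults to ddof=1, the same n - 1 denominator R's scale uses.
r_like = (df - df.mean()) / df.std()

# The same computation written with transform, applied column by column.
r_like_t = df.transform(lambda col: (col - col.mean()) / col.std())

# Assumed scikit-learn convention: population standard deviation (ddof=0).
pop_like = (df - df.mean()) / df.std(ddof=0)

# The two conventions differ by a constant factor of sqrt(n / (n - 1)).
n = len(df)
ratio = (pop_like / r_like).to_numpy()
```

For large n the factor is close to 1, so the difference rarely matters for model fitting, but it does matter if you need to reproduce R's numbers exactly.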

See also here .

+7

I don't know R, but after reading its documentation, it looks like the following will do the trick (albeit a little less generally):

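 import numpy as np

 def scale(y, c=True, sc=True):
     x = y.copy()
     if c:
         # center each column on its mean
         x -= x.mean()
     if sc and c:
         # centered: the RMS equals the sample standard deviation
         x /= x.std()
     elif sc:
         # not centered: divide by the root mean square, sqrt(sum(x^2) / (n - 1))
         x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
     return x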

For a more general version, you probably need to do a type / length check.
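As a rough idea of what such a check could look like: R's scale also accepts a numeric vector with one value per column for center and for scale. The sketch below is a hypothetical generalisation, not part of the original answer; the name scale_general and the helper per_column are invented here, and the behaviour is only my reading of the R documentation quoted further down.

```python
import numpy as np
import pandas as pd

def scale_general(df, center=True, scale=True):
    """Hypothetical sketch: center/scale may be a bool or one value per column."""
    out = df.astype(float)  # astype returns a new object, so df is untouched
    ncol = out.shape[1]

    def per_column(arg, name):
        # Length check: exactly one value per column is required.
        values = np.asarray(arg, dtype=float)
        if values.shape != (ncol,):
            raise ValueError(f"{name} must be a bool or have length {ncol}")
        return pd.Series(values, index=out.columns)

    if isinstance(center, (bool, np.bool_)):
        if center:
            out = out - out.mean()
    else:
        out = out - per_column(center, "center")

    if isinstance(scale, (bool, np.bool_)):
        if scale:
            # RMS of the (possibly centered) columns, n - 1 in the denominator
            rms = np.sqrt(out.pow(2).sum() / (out.count() - 1))
            out = out / rms
    else:
        out = out / per_column(scale, "scale")
    return out

demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
shifted = scale_general(demo, center=[1, 2], scale=False)
default = scale_general(demo)
```

With the defaults this should agree with (demo - demo.mean()) / demo.std(), and a center or scale argument of the wrong length raises a ValueError instead of silently broadcasting.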

EDIT: Added an explanation of the denominator in the elif sc: branch.

From the R docs:

  ... If 'scale' is 'TRUE' then scaling is done by dividing the (centered) columns of 'x' by their standard deviations if 'center' is 'TRUE', and the root mean square otherwise. If 'scale' is 'FALSE', no scaling is done. The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case 'center = TRUE', this is the same as the standard deviation, but in general it is not. 

The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square from that definition: it first squares x (the pow method), then sums over the rows, then divides by the number of non-NaN values in each column (count) minus one, and finally takes the square root.
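To make the steps concrete, here is the same expression split into its parts on a tiny DataFrame with one missing value. The data and the intermediate variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"a": [3.0, 4.0, np.nan], "b": [1.0, 2.0, 2.0]})

squares = x.pow(2)              # element-wise square; NaN stays NaN
sums = squares.sum()            # per-column sum, NaN skipped: a = 25.0, b = 9.0
n = x.count()                   # non-NaN values per column: a = 2, b = 3
rms = np.sqrt(sums.div(n - 1))  # a = sqrt(25 / 1) = 5.0, b = sqrt(9 / 2)
```

Note that column a is divided by 1, not 2, because count ignores the NaN.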

As a side note, the reason I didn't simply calculate the RMS after centering is that the std method can call bottleneck to evaluate this expression faster in the special case where you want the standard deviation rather than the more general RMS.

You could calculate the RMS after centering instead. It is probably worth benchmarking, because now that I am writing this I am not sure which is faster, and I have not compared them.
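A minimal sketch of such a comparison, assuming random normal data of an arbitrary size and using the standard-library timeit module. The function names via_std and via_rms are invented here, and the timings will depend on the machine, the pandas version, and whether bottleneck is installed, so no result is claimed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 5)), columns=list("abcde"))
centered = df - df.mean()

def via_std():
    # Sample standard deviation of the centered columns (ddof=1 by default).
    return centered.std()

def via_rms():
    # Generic RMS formula applied to the same centered columns.
    return np.sqrt(centered.pow(2).sum().div(centered.count() - 1))

# After centering, both routes should give the same numbers.
same = np.allclose(via_std(), via_rms())

t_std = timeit.timeit(via_std, number=50)
t_rms = timeit.timeit(via_rms, number=50)
print(f"std: {t_std:.4f}s  rms: {t_rms:.4f}s  same result: {same}")
```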

+6

Source: https://habr.com/ru/post/1494841/

