Performing group mean and standard deviation using NumPy arrays

Question


I have a dataset (X, Y). The values of my independent variable X are not unique, so there are several duplicates. I want to output new arrays containing: X_unique, the unique X values; Y_mean, the mean of all Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all Y values corresponding to each value in X_unique.

    x = data[:,0]
    y = data[:,1]

+5

python arrays numpy

obtmind Jan 05 '16 at 17:22


3 answers

You can use binned_statistic from scipy.stats, which supports various statistics that are applied bin by bin to a 1D array. To define the bins, we need to sort the data and find the positions where the X value changes, for which np.unique is useful. Putting all this together, here's the implementation -

    import numpy as np
    from scipy.stats import binned_statistic as bstat

    # Sort data by the first column
    sdata = data[data[:,0].argsort()]

    # Unique col-1 elements and the positions where each group starts
    unq_x,breaks = np.unique(sdata[:,0],return_index=True)
    breaks = np.append(breaks,data.shape[0])

    # Use binned statistic to get grouped average and std deviation values
    idx_range = np.arange(data.shape[0])
    avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
    std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)

According to the binned_statistic documentation, the statistic argument can also be a custom function:

function: a user-defined function which takes a 1D array of values and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
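As a rough illustration of the custom-function option, the sketch below reuses the sort-and-break setup from the answer and passes a user-defined callable as the statistic. The helper name value_range and the choice of "max minus min" as the statistic are assumptions made for this example only; they do not come from the original answer.

```python
import numpy as np
from scipy.stats import binned_statistic

# Sample data from the answer: column 0 is X, column 1 is Y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Sort by X and find where each group of equal X values starts
sdata = data[data[:, 0].argsort()]
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
breaks = np.append(breaks, sdata.shape[0])


def value_range(values):
    """Hypothetical custom statistic: spread (max - min) of one group."""
    return np.max(values) - np.min(values)


# Bin the row indices so that each bin covers exactly one group
idx_range = np.arange(sdata.shape[0])
range_y, _, _ = binned_statistic(x=idx_range, values=sdata[:, 1],
                                 statistic=value_range, bins=breaks)

print(np.column_stack((unq_x, range_y)))
```

Any callable that maps a 1D array to a single number should fit the same slot, for example a standard deviation with a different ddof.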

Example input, output -

    In [121]: data
    Out[121]:
    array([[2, 5],
           [2, 2],
           [1, 5],
           [3, 8],
           [0, 8],
           [6, 7],
           [8, 1],
           [2, 5],
           [6, 8],
           [1, 8]])

    In [122]: np.column_stack((unq_x,avg_y,std_y))
    Out[122]:
    array([[ 0.        ,  8.        ,  0.        ],
           [ 1.        ,  6.5       ,  1.5       ],
           [ 2.        ,  4.        ,  1.41421356],
           [ 3.        ,  8.        ,  0.        ],
           [ 6.        ,  7.5       ,  0.5       ],
           [ 8.        ,  1.        ,  0.        ]])

+4

Divakar Jan 05 '16 at 18:07


Pandas works well for this task:

    import pandas
    data = np.random.randint(1,5,20).reshape(10,2)
    pandas.DataFrame(data).groupby(0).mean()

gives

              1
    0
    1  2.666667
    2  3.000000
    3  2.000000
    4  1.500000
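The snippet above only computes the mean, while the question also asks for the standard deviation. A minimal sketch of getting both with pandas follows; the column names "x", "y", "y_mean" and "y_std" are chosen here for illustration. Note that pandas computes the sample standard deviation (ddof=1) by default, whereas np.std defaults to ddof=0, so ddof=0 is passed explicitly to match the NumPy-based answers.

```python
import numpy as np
import pandas as pd

# Sample data from the first answer: column 0 is X, column 1 is Y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

df = pd.DataFrame(data, columns=["x", "y"])  # illustrative column names
grouped = df.groupby("x")["y"]

# ddof=0 matches np.std; the pandas default would be ddof=1
result = pd.DataFrame({"y_mean": grouped.mean(),
                       "y_std": grouped.std(ddof=0)})
print(result)
```

The index of result holds the unique X values in sorted order, so result.index.to_numpy() plays the role of X_unique.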

+1

BM Jan 05 '16 at 19:15


Peter · Accepted Answer · 2016-01-05T18:06:06+0000

    x_unique = np.unique(x)
    y_means = np.array([np.mean(y[x==u]) for u in x_unique])
    y_stds = np.array([np.std(y[x==u]) for u in x_unique])
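The list comprehensions above scan the full array once per unique value, which may become slow when there are many distinct X values. One possible NumPy-only alternative, sketched here as a swapped-in technique rather than the answerer's method, uses np.unique with return_inverse together with np.bincount to accumulate per-group sums in a single pass. The sample data is borrowed from the first answer.

```python
import numpy as np

# Sample data from the first answer: column 0 is X, column 1 is Y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])
x = data[:, 0]
y = data[:, 1].astype(float)

# inverse[i] is the index into x_unique of the group that x[i] belongs to
x_unique, inverse = np.unique(x, return_inverse=True)

# Per-group counts and sums give the group means
counts = np.bincount(inverse)
y_means = np.bincount(inverse, weights=y) / counts

# Population standard deviation (ddof=0, like np.std) from squared deviations
deviations = y - y_means[inverse]
y_stds = np.sqrt(np.bincount(inverse, weights=deviations ** 2) / counts)

print(np.column_stack((x_unique, y_means, y_stds)))
```

Computing the variance from deviations around each group mean, rather than from the difference of mean squares, is a deliberate choice here to limit floating-point cancellation.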
