Computing group-wise mean and standard deviation with NumPy arrays

I have a dataset (X, Y). The values of my independent variable X are not unique, so there are several duplicates. I want to output new arrays containing: X_unique, the unique X values; Y_mean, the mean of all Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all Y values corresponding to each value in X_unique.

    x = data[:,0]
    y = data[:,1]
3 answers
    x_unique = np.unique(x)
    y_means = np.array([np.mean(y[x==u]) for u in x_unique])
    y_stds = np.array([np.std(y[x==u]) for u in x_unique])
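
The list comprehensions above rescan x once per unique value. When there are many groups, a loop-free variant may be worth trying. The following is only a sketch and not part of the original answer: it assumes np.unique with return_inverse=True together with np.bincount, uses the population standard deviation (ddof=0, the same default as np.std), and borrows the sample data shown in another answer on this page.

```python
import numpy as np

# Sample data: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]], dtype=float)
x, y = data[:, 0], data[:, 1]

# inverse[i] is the position in x_unique of the group that x[i] belongs to
x_unique, inverse = np.unique(x, return_inverse=True)

# Group sizes and group sums give the group means
counts = np.bincount(inverse)
y_means = np.bincount(inverse, weights=y) / counts

# Population standard deviation (ddof=0), from squared deviations per group
sq_dev = (y - y_means[inverse]) ** 2
y_stds = np.sqrt(np.bincount(inverse, weights=sq_dev) / counts)

print(np.column_stack((x_unique, y_means, y_stds)))
```

The results should agree with the list-comprehension version; which one is faster depends on the number of groups and the array size, so it is worth timing both on real data.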

You can use binned_statistic from scipy.stats, which applies a chosen statistic to each bin of a 1D array. To define the bins, sort the data by the first column and find the positions where the value changes (the boundaries between groups); np.unique with return_index=True is useful for this. Putting it all together, here is an implementation:

    import numpy as np
    from scipy.stats import binned_statistic as bstat

    # Sort data corresponding to argsort of first column
    sdata = data[data[:,0].argsort()]

    # Unique col-1 elements and positions of breaks (elements are not identical)
    unq_x,breaks = np.unique(sdata[:,0],return_index=True)
    breaks = np.append(breaks,data.shape[0])

    # Use binned statistic to get grouped average and std deviation values
    idx_range = np.arange(data.shape[0])
    avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
    std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)
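
To make the sort-and-break step more concrete, here is a small NumPy-only sketch that prints the intermediate arrays for the sample data used in this answer. Using np.split to cut the sorted y column at the breaks is an assumption made for illustration; the answer itself passes the breaks to binned_statistic as bin edges.

```python
import numpy as np

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Sort rows by x so that equal x values become contiguous
sdata = data[data[:, 0].argsort()]

# breaks[i] is the index in the sorted array where group i starts
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
print(sdata[:, 0])  # [0 1 1 2 2 2 3 6 6 8]
print(breaks)       # [0 1 3 6 7 9]

# Cut the sorted y column at every break except the leading 0
groups = np.split(sdata[:, 1], breaks[1:])

# One mean and one population std (ddof=0) per contiguous group
avg_y = np.array([g.mean() for g in groups])
std_y = np.array([g.std() for g in groups])
```

Each element of groups holds the y values for one unique x, which is the same partition that the bin edges describe in the binned_statistic call.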

According to the binned_statistic documentation, the statistic argument can also be a custom function:

function: a user-defined function which takes a 1D array of values and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
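
As a sketch of what passing a custom function might look like, the example below computes the spread (max minus min) of y within each group. The helper name value_range is hypothetical and chosen only for illustration; the bin construction repeats the approach from this answer on the same sample data.

```python
import numpy as np
from scipy.stats import binned_statistic

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Same bin construction as in the answer: sort by x, then find group starts
sdata = data[data[:, 0].argsort()]
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
breaks = np.append(breaks, sdata.shape[0])
idx_range = np.arange(sdata.shape[0])


def value_range(values):
    """Hypothetical custom statistic: spread (max - min) of one bin."""
    return np.ptp(values)


# Any callable that maps a 1D array to a single number can be passed
range_y, _, _ = binned_statistic(x=idx_range, values=sdata[:, 1],
                                 statistic=value_range, bins=breaks)
print(np.column_stack((unq_x, range_y)))
```

Because np.ptp raises an error on an empty input, empty bins would be filled with NaN, as the documentation excerpt describes; no bin is empty in this example.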

Sample input and output:

    In [121]: data
    Out[121]:
    array([[2, 5],
           [2, 2],
           [1, 5],
           [3, 8],
           [0, 8],
           [6, 7],
           [8, 1],
           [2, 5],
           [6, 8],
           [1, 8]])

    In [122]: np.column_stack((unq_x,avg_y,std_y))
    Out[122]:
    array([[ 0.        ,  8.        ,  0.        ],
           [ 1.        ,  6.5       ,  1.5       ],
           [ 2.        ,  4.        ,  1.41421356],
           [ 3.        ,  8.        ,  0.        ],
           [ 6.        ,  7.5       ,  0.5       ],
           [ 8.        ,  1.        ,  0.        ]])

Pandas is made for this kind of task:

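    import numpy as np
    import pandas

    # 10 random (x, y) rows with integer values from 1 to 4
    data = np.random.randint(1,5,20).reshape(10,2)

    # Group by column 0 (x) and average column 1 (y)
    pandas.DataFrame(data).groupby(0).mean()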

gives

              1
    0
    1  2.666667
    2  3.000000
    3  2.000000
    4  1.500000
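
The snippet above only returns the mean. A minimal sketch that also returns the standard deviation is given below; the column names "x" and "y" are assumptions made for readability. Note that pandas computes the sample standard deviation (ddof=1) by default, so ddof=0 is passed here to match np.std.

```python
import numpy as np
import pandas as pd

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Column names "x" and "y" are hypothetical, chosen for readability
df = pd.DataFrame(data, columns=["x", "y"])
grouped = df.groupby("x")["y"]

y_mean = grouped.mean()
# ddof=0 gives the population std, matching np.std's default
y_std = grouped.std(ddof=0)

# Back to plain NumPy arrays, sorted by unique x
x_unique = y_mean.index.to_numpy()
result = np.column_stack((x_unique, y_mean.to_numpy(), y_std.to_numpy()))
print(result)
```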

Source: https://habr.com/ru/post/1239919/

