NumPy array: replace nan values with column averages

I have a numpy array filled mostly with real numbers, but it has a few nan values in it as well.

How can I replace each nan with the average of the column it is in?

+24
python arrays numpy nan
Sep 08 '13 at 22:24
7 answers

No loops required:

    print(a)
    [[ 0.93230948         nan  0.47773439  0.76998063]
     [ 0.94460779  0.87882456  0.79615838  0.56282885]
     [ 0.94272934  0.48615268  0.06196785         nan]
     [ 0.64940216  0.74414127         nan         nan]]

    # Obtain the mean of each column however you like; nanmean is just convenient.
    col_mean = np.nanmean(a, axis=0)
    print(col_mean)
    [ 0.86726219  0.7030395   0.44528687  0.66640474]

    # Find the indices that need replacing.
    inds = np.where(np.isnan(a))

    # Place the column means at those indices. Align the arrays using take.
    a[inds] = np.take(col_mean, inds[1])

    print(a)
    [[ 0.93230948  0.7030395   0.47773439  0.76998063]
     [ 0.94460779  0.87882456  0.79615838  0.56282885]
     [ 0.94272934  0.48615268  0.06196785  0.66640474]
     [ 0.64940216  0.74414127  0.44528687  0.66640474]]
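For reference, the same steps can be wrapped into a small self-contained function. This is only a sketch: the name `fill_nan_with_column_mean` and the sample array are made up for illustration, and it works on a copy rather than in place, which is a choice of this sketch rather than something the answer above requires.

```python
import numpy as np

def fill_nan_with_column_mean(a):
    """Return a float copy of `a` with each NaN replaced by its column mean.

    Hypothetical helper name; the column mean ignores NaNs (np.nanmean).
    """
    out = np.array(a, dtype=float)        # copy, so the caller's array is untouched
    col_mean = np.nanmean(out, axis=0)    # per-column mean, skipping NaNs
    rows, cols = np.where(np.isnan(out))  # coordinates of every NaN
    out[rows, cols] = col_mean[cols]      # each NaN gets the mean of its own column
    return out

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 9.0]])
filled = fill_nan_with_column_mean(a)
print(filled)
```

Indexing `col_mean[cols]` does the same job as `np.take(col_mean, inds[1])` above.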
+38
Sep 08 '13 at 22:51

Using masked arrays

The standard way to do this using only numpy is to use the masked array module.

SciPy is a fairly heavy package that depends on external libraries, so it is worth having a NumPy-only method. This is borrowed from @DonaldHobson's answer.

Edit: np.nanmean is now a NumPy function. However, it does not handle all-nan columns...

Suppose you have an array a :

    >>> a
    array([[  0.,  nan,  10.,  nan],
           [  1.,   6.,  nan,  nan],
           [  2.,   7.,  12.,  nan],
           [  3.,   8.,  nan,  nan],
           [ nan,   9.,  14.,  nan]])
    >>> import numpy.ma as ma
    >>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)
    array([[  0. ,   7.5,  10. ,   0. ],
           [  1. ,   6. ,  12. ,   0. ],
           [  2. ,   7. ,  12. ,   0. ],
           [  3. ,   8. ,  12. ,   0. ],
           [  1.5,   9. ,  14. ,   0. ]])

Note that the mean of the masked array does not need to have the same shape as a, because it is broadcast implicitly across the rows.

Also note how gracefully the all-nan column is handled. The mean comes out as zero, since you are averaging zero elements. np.nanmean does not handle all-nan columns:

    >>> col_mean = np.nanmean(a, axis=0)
    /home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
      warnings.warn("Mean of empty slice", RuntimeWarning)
    >>> inds = np.where(np.isnan(a))
    >>> a[inds] = np.take(col_mean, inds[1])
    >>> a
    array([[  0. ,   7.5,  10. ,   nan],
           [  1. ,   6. ,  12. ,   nan],
           [  2. ,   7. ,  12. ,   nan],
           [  3. ,   8. ,  12. ,   nan],
           [  1.5,   9. ,  14. ,   nan]])



Explanation

Converting a to a masked array gives you

    >>> ma.array(a, mask=np.isnan(a))
    masked_array(data =
     [[0.0 -- 10.0 --]
     [1.0 6.0 -- --]
     [2.0 7.0 12.0 --]
     [3.0 8.0 -- --]
     [-- 9.0 14.0 --]],
                 mask =
     [[False  True False  True]
     [False False  True  True]
     [False False False  True]
     [False False  True  True]
     [ True False False  True]],
           fill_value = 1e+20)

Taking the mean over the columns then gives the correct answer, normalizing only over the unmasked values:

    >>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
    masked_array(data = [1.5 7.5 12.0 --],
                 mask = [False False False  True],
           fill_value = 1e+20)

Also, notice how nicely the mask handles the column that is all-nan!

Finally, np.where does the replacement.
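If you would rather not depend on whatever value happens to sit under the mask for an all-nan column, the fallback can be made explicit with `ma.filled`. The following is a sketch only: the function name `fill_nan_masked` and its `fallback` parameter are hypothetical, and the sample array is invented for illustration.

```python
import numpy as np
import numpy.ma as ma

def fill_nan_masked(a, fallback=0.0):
    """Replace NaNs with column means computed over the unmasked values.

    Hypothetical helper; `fallback` is the value used for columns that
    contain nothing but NaN (their mean is masked).
    """
    masked = ma.array(a, mask=np.isnan(a))   # hide the NaNs behind a mask
    col_mean = masked.mean(axis=0)           # masked entry where a column is all-NaN
    col_mean = ma.filled(col_mean, fallback) # plain ndarray with an explicit fallback
    return np.where(np.isnan(a), col_mean, a)

a = np.array([[0.0, np.nan, np.nan],
              [2.0, 6.0, np.nan],
              [np.nan, 8.0, np.nan]])
filled = fill_nan_masked(a)
print(filled)
```

Passing a different `fallback` (for example a global mean) changes only what all-nan columns receive.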




Row mean

To replace nan values with the row mean rather than the column mean, a small change is needed so that broadcasting works:

    >>> a
    array([[  0.,   1.,   2.,   3.,  nan],
           [ nan,   6.,   7.,   8.,   9.],
           [ 10.,  nan,  12.,  nan,  14.],
           [ nan,  nan,  nan,  nan,  nan]])
    >>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
    ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)
    >>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
    array([[  0. ,   1. ,   2. ,   3. ,   1.5],
           [  7.5,   6. ,   7. ,   8. ,   9. ],
           [ 10. ,  12. ,  12. ,  12. ,  14. ],
           [  0. ,   0. ,   0. ,   0. ,   0. ]])
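The same shape fix can be had with `keepdims=True`, which keeps the reduced axis as length 1 so the row means line up with the rows. This sketch uses `np.nanmean` instead of a masked array, so, as noted above, an all-nan row would stay nan here; the sample array is made up and has no such row.

```python
import numpy as np

a = np.array([[0.0, 1.0, np.nan],
              [np.nan, 6.0, 8.0]])

# keepdims=True gives shape (2, 1) instead of (2,), so it broadcasts across columns
row_mean = np.nanmean(a, axis=1, keepdims=True)
filled = np.where(np.isnan(a), row_mean, a)

print(row_mean.shape)
print(filled)
```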
+5
Oct 24 '16 at 0:23

This is not very clean, but I can't think of a way to do it other than iterating:

    import numpy as np

    # example
    a = np.arange(16, dtype=float).reshape(4, 4)
    a[2, 2] = np.nan
    a[3, 3] = np.nan

    indices = np.where(np.isnan(a))  # returns arrays of row and column indices
    for row, col in zip(*indices):
        # mean of the non-nan entries in this column
        a[row, col] = np.mean(a[~np.isnan(a[:, col]), col])
+2
Sep 08 '13 at 22:42

Alternative: replace NaNs by interpolating along each column.

    def interpolate_nans(X):
        """Overwrite NaNs with column value interpolations."""
        for j in range(X.shape[1]):
            mask_j = np.isnan(X[:, j])
            X[mask_j, j] = np.interp(np.flatnonzero(mask_j),
                                     np.flatnonzero(~mask_j),
                                     X[~mask_j, j])
        return X

Usage example:

    X_incomplete = np.array([[10,     20,     30    ],
                             [np.nan, 30,     np.nan],
                             [np.nan, np.nan, 50    ],
                             [40,     50,     np.nan]])

    X_complete = interpolate_nans(X_incomplete)

    print(X_complete)
    [[10, 20, 30 ],
     [20, 30, 40 ],
     [30, 40, 50 ],
     [40, 50, 50 ]]

I use this bit of code for time series data, specifically where the columns are attributes and the rows are time-ordered samples.
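One detail worth knowing about this approach is what happens to NaNs at the start or end of a column: `np.interp` does not extrapolate, it holds the nearest known value, which is why the last entry of the third column above comes out as 50. A single-column sketch with made-up values:

```python
import numpy as np

col = np.array([np.nan, 20.0, np.nan, 40.0, np.nan])
mask = np.isnan(col)

filled = col.copy()
# Positions of the NaNs are the query points; positions and values of the
# known entries are the sample points.
filled[mask] = np.interp(np.flatnonzero(mask),
                         np.flatnonzero(~mask),
                         col[~mask])
print(filled)  # leading and trailing NaNs take the nearest known value
```

A column with no valid values at all has nothing to interpolate from, so it would need separate handling.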

+2
Mar 18 '16 at 8:52

If partial is your original data and replace is an array of the same shape containing the averaged values, then this code uses the value from partial wherever one exists and the value from replace otherwise.

    Complete = np.where(np.isnan(partial), replace, partial)
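The answer does not say how replace is built. One possible way, shown here purely as an assumption, is to broadcast the column means up to the shape of partial with `np.broadcast_to`; the sample data is invented.

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])

# Assumption for this sketch: `replace` holds each column's NaN-ignoring mean,
# repeated down the rows so it has the same shape as `partial`.
replace = np.broadcast_to(np.nanmean(partial, axis=0), partial.shape)

complete = np.where(np.isnan(partial), replace, partial)
print(complete)
```

Since `np.where` broadcasts its arguments anyway, passing the 1-D column means directly would give the same result here.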
+2
Aug 29 '16 at 15:18

To extend Donald's answer, here is a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.

    In [231]: a
    Out[231]:
    array([[0, 3, 6],
           [2, 0, 0]])

    In [232]: col_mean = np.nanmean(a, axis=0)
    Out[232]: array([ 1. ,  1.5,  3. ])

    In [228]: np.where(np.equal(a, 0), col_mean, a)
    Out[228]:
    array([[ 1. ,  3. ,  6. ],
           [ 2. ,  1.5,  3. ]])
+1
Oct 23 '16 at 20:25

You can try this built-in function:

    x = np.array([np.inf, -np.inf, np.nan, -128, 128])
    np.nan_to_num(x)
    array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
            -1.28000000e+002,   1.28000000e+002])
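Keep in mind that `np.nan_to_num` substitutes one constant for every nan (zero by default), so on its own it does not produce column averages. A short sketch of choosing that constant, assuming NumPy 1.17 or later, where the `nan` keyword is available:

```python
import numpy as np

x = np.array([np.inf, -np.inf, np.nan, -128.0, 128.0])

# Assumes NumPy >= 1.17 for the `nan` keyword; every NaN becomes -1.0,
# while the infinities still become the largest finite floats.
cleaned = np.nan_to_num(x, nan=-1.0)
print(cleaned)
```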
-2
Mar 09 '15 at 15:35


