Convert Pandas DataFrame to multidimensional ndarray

Question

I have a DataFrame with columns for x, y, z coordinates and a value at that position, and I want to convert this to a 3D ndarray.

To make things more complex, not every coordinate combination exists in the DataFrame (missing ones can simply be filled with NaN in the ndarray).

A simple example:

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2], 
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

The resulting ndarray should look like this:

array([[[  1.,   2.,  nan],
        [  3.,  nan,   4.]],

       [[  5.,   6.,   7.],
        [  8.,   9.,  nan]]])

For two dimensions, this is easy:

array = df.pivot_table(index="y", columns="x", values="value").to_numpy()

However, this method cannot be applied to three or more dimensions.

Could you give me some suggestions?

Bonus points if this also works for more than three dimensions, handles multiple values at the same coordinates (by taking the average), and ensures that all x, y, z coordinates are consecutive (by inserting rows / columns of NaN where a coordinate is missing).

EDIT: A few more explanations:

CSV, x, y, z, . (, 0,1 ) ndarray, () . . .

EDIT: Run times of the suggested solutions:

jakevdp: 1.598s, Divakar: 7.405s, JohnE: 7.867s, Wen: 6.286s.

+4

python numpy pandas

Daniel Sch. Dec 08 '17 at 13:24

3

Here's one NumPy approach:

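import numpy as np

def dataframe_to_array_averaged(df):
    # Integer coordinates, ordered to match the output axes (z, y, x)
    arr = df[['z','y','x']].values
    arr = arr - arr.min(0)  # shift so every axis starts at 0
    out_shp = arr.max(0)+1

    L = np.prod(out_shp)

    val = df['value'].values
    # Collapse each (z, y, x) triple into one flat index
    ids = np.ravel_multi_index(arr.T, out_shp)

    # Per-cell sum divided by per-cell count; empty cells give 0/0 = NaN
    avgs = np.bincount(ids, val, minlength=L)/np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)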

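Here the x, y, z coordinates are offset so that they start at 0. Cells with no matching row end up as 0/0 = NaN, which is the fill value we want, and repeated coordinates are averaged. The same idea carries over to more than three dimensions.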
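To make the averaging step in the function above more concrete, here is a minimal sketch on toy data. The arrays `coords` and `values` are made up for illustration; only `np.ravel_multi_index`, `np.bincount` and `np.errstate` come from NumPy itself.

```python
import numpy as np

# Hypothetical toy data: three (z, y, x) triples in a 2 x 2 x 3 grid,
# with the first triple appearing twice.
coords = np.array([[0, 0, 0],
                   [0, 1, 2],
                   [0, 0, 0]])
values = np.array([1.0, 4.0, 4.0])
shape = (2, 2, 3)

# Each triple becomes one flat index: z*6 + y*3 + x for this shape.
flat_ids = np.ravel_multi_index(tuple(coords.T), shape)

size = int(np.prod(shape))
sums = np.bincount(flat_ids, weights=values, minlength=size)  # sum per cell
counts = np.bincount(flat_ids, minlength=size)                # rows per cell

# Empty cells are 0/0, which yields NaN; silence the warning locally.
with np.errstate(invalid="ignore"):
    averages = (sums / counts).reshape(shape)
```

The duplicated triple is averaged to 2.5, and every cell that never occurs stays NaN.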

Sample run:

In [106]: df
Out[106]: 
   value  x  y  z
0      1  1  1  1  # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1  # <=== this is repeated

In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]: 
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])

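To avoid the warning, we can start from an array of NaN and fill in only the cells that have data:

out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
# Write averages only at cells that occur; all other cells stay NaN
out.flat[unq_ids] = sums[unq_ids] / count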
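For reference, the warning-free variant can be wrapped into one self-contained function. This is only a sketch: the function name and its `coords` / `value` parameters are hypothetical, and it assumes the coordinate columns hold integers.

```python
import numpy as np
import pandas as pd

def dataframe_to_array_averaged_nan(df, coords=("z", "y", "x"), value="value"):
    """Hypothetical helper: average `value` per integer coordinate cell."""
    idx = df[list(coords)].to_numpy()
    idx = idx - idx.min(axis=0)              # shift every axis to start at 0
    shape = tuple(idx.max(axis=0) + 1)
    ids = np.ravel_multi_index(tuple(idx.T), shape)

    sums = np.bincount(ids, weights=df[value].to_numpy())
    unq_ids, counts = np.unique(ids, return_counts=True)

    out = np.full(shape, np.nan)             # cells without data stay NaN
    out.flat[unq_ids] = sums[unq_ids] / counts
    return out

# The sample data from the run above, including the repeated (1, 1, 1) row.
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2, 1],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2, 1],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2, 1],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 4]})
result = dataframe_to_array_averaged_nan(df)
```

Because no division by zero ever happens, no `RuntimeWarning` is raised, and passing a longer tuple of column names extends the same logic to more dimensions.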

+2

Divakar Dec 08 '17 at 13:49

We can use unstack and stack:

np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))


Out[451]: 
array([[[  1.,   2.,  nan],
        [  3.,  nan,   4.]],
       [[  5.,   6.,   7.],
        [  8.,   9.,  nan]]])

0

Wen Dec 08 '17 at 15:14


jakevdp · Accepted Answer · 2017-12-08T13:50:07+0000

You can use groupby, followed by the approach for transforming a Pandas DataFrame with an n-level index into an n-D NumPy array:

grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)

# fill it using Numpy advanced indexing
# (MultiIndex.labels was renamed to MultiIndex.codes in newer pandas)
arr[tuple(grouped.index.codes)] = grouped.values

print(arr)
# [[[  1.   2.  nan]
#   [  3.  nan   4.]]
# 
#  [[  5.   6.   7.]
#   [  8.   9.  nan]]]
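The groupby approach sizes each axis by the coordinate values that actually occur, so a coordinate value missing everywhere does not get its own NaN slice. One possible way to also cover that "bonus" requirement is to reindex the grouped result against the full grid of coordinates. This is a different technique from the answer above, sketched under the assumptions that coordinates are integers and at least two coordinate columns are given; the function name and parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def dataframe_to_ndarray(df, coord_cols, value_col="value"):
    """Hypothetical helper: dense n-D array of mean values, NaN where empty."""
    coord_cols = list(coord_cols)
    grouped = df.groupby(coord_cols)[value_col].mean()

    # Every integer from min to max per axis, so gaps become NaN slices.
    ranges = [np.arange(df[c].min(), df[c].max() + 1) for c in coord_cols]
    full_index = pd.MultiIndex.from_product(ranges, names=coord_cols)

    # reindex inserts NaN for every coordinate combination without data.
    dense = grouped.reindex(full_index)
    return dense.to_numpy().reshape([len(r) for r in ranges])

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
arr = dataframe_to_ndarray(df, ['z', 'y', 'x'])
```

`MultiIndex.from_product` orders its entries with the first level varying slowest, which matches NumPy's default C-order reshape, so `arr[z, y, x]` lines up with the sorted coordinate values.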
