Convert Pandas DataFrame with hierarchical n-level index to nD Numpy array

Question

Is there a good way to convert a DataFrame with an n-level index into an nD Numpy array (aka n-tensor)?


Example

Suppose I created a DataFrame, for example

    from pandas import DataFrame, MultiIndex

    index = range(2), range(3)
    value = range(2 * 3)
    frame = DataFrame(value, columns=['value'],
                      index=MultiIndex.from_product(index)).drop((1, 0))
    print(frame)

which outputs

         value
    0 0      0
      1      1
      2      2
    1 1      4
      2      5

The index is a two-level hierarchical index. I can extract a 2-D NumPy array from the data using

    print(frame.unstack().values)

which outputs

    [[  0.   1.   2.]
     [ nan   4.   5.]]

How does this generalize to an n-level index?

Playing with unstack(), it seems that it can only rearrange the levels of a two-dimensional DataFrame between rows and columns; it cannot add an axis.

I cannot use, for example, frame.values.reshape(x, y, z), because that would require the frame to contain exactly x * y * z rows, which cannot be guaranteed. This is what the drop() call in the example above is meant to demonstrate.
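To make the problem concrete, here is a minimal sketch of the failure, rebuilt from the example frame above (the variable `reshape_failed` is introduced only for illustration):

```python
import pandas as pd

# Same frame as in the example: a 2 x 3 product index with one row dropped.
index = pd.MultiIndex.from_product([range(2), range(3)])
frame = pd.DataFrame({"value": range(6)}, index=index).drop((1, 0))

try:
    frame.values.reshape(2, 3)
    reshape_failed = False
except ValueError:
    # Only 5 rows remain, so they cannot fill a 2 x 3 grid.
    reshape_failed = True

print(len(frame), reshape_failed)
```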

Any suggestions are welcome.

Answer

Edit: This approach is much more elegant (and two orders of magnitude faster) than the one I originally gave below.

    import numpy as np

    # create an empty array of NaN of the right dimensions
    shape = tuple(map(len, frame.index.levels))
    arr = np.full(shape, np.nan)

    # fill it using NumPy advanced indexing
    # (index.codes was named index.labels before pandas 0.24)
    arr[tuple(frame.index.codes)] = frame.values.flat
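As a self-contained sketch, the same idea can be wrapped in a small function. The helper name `multiindex_to_ndarray` and its `fill_value` parameter are hypothetical, and the sketch assumes a unique MultiIndex with no missing labels (a missing label has code -1, which would silently write to the last position along that axis).

```python
import numpy as np
import pandas as pd


def multiindex_to_ndarray(series, fill_value=np.nan):
    """Scatter a MultiIndex-ed Series into a dense array (hypothetical helper)."""
    index = series.index
    # One axis per index level; its length is the number of labels in that level.
    shape = tuple(len(level) for level in index.levels)
    arr = np.full(shape, fill_value, dtype=float)
    # index.codes holds, per level, the integer position of each row's label.
    arr[tuple(index.codes)] = series.to_numpy()
    return arr


# 3-D example: a 2 x 2 x 2 product index with one row dropped.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": range(8)}, index=index).drop((1, 0, 1))

tensor = multiindex_to_ndarray(frame["value"])
print(tensor.shape)
```

The dropped row shows up as NaN at `tensor[1, 0, 1]`; every other cell holds the value stored under the matching index tuple.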

Original solution. Given a setup similar to the one described above, but in 3-D,

    from pandas import DataFrame, MultiIndex
    from itertools import product

    index = range(2), range(2), range(2)
    value = range(2 * 2 * 2)
    frame = DataFrame(value, columns=['value'],
                      index=MultiIndex.from_product(index)).drop((1, 0, 1))
    print(frame)

we have

           value
    0 0 0      0
        1      1
      1 0      2
        1      3
    1 0 0      4
      1 0      6
        1      7

We now take the reshape() route, but with some preprocessing to make sure that every combination of index values is present, so that the length along each dimension is consistent.

First, reindex the DataFrame with the full Cartesian product of all index levels. NaN values will be inserted where rows are missing. This operation can be slow and memory-hungry, depending on the number of dimensions and the size of the DataFrame.

    levels = map(tuple, frame.index.levels)
    index = list(product(*levels))
    frame = frame.reindex(index)
    print(frame)

which outputs

           value
    0 0 0      0
        1      1
      1 0      2
        1      3
    1 0 0      4
        1    NaN
      1 0      6
        1      7

Now reshape() will work as intended.

    # reshape() needs a tuple; a bare map object fails on Python 3
    shape = tuple(map(len, frame.index.levels))
    print(frame.values.reshape(shape))

which outputs

    [[[  0.   1.]
      [  2.   3.]]

     [[  4.  nan]
      [  6.   7.]]]
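The reindex-then-reshape steps can also be collected into one function. This is only a sketch: the helper name `reindex_to_ndarray` is hypothetical, it builds the full index with `MultiIndex.from_product` instead of `itertools.product`, and it assumes the index is unique.

```python
import numpy as np
import pandas as pd


def reindex_to_ndarray(series):
    """Densify a MultiIndex-ed Series via reindex + reshape (hypothetical helper)."""
    levels = series.index.levels
    # Full Cartesian product of the level values, in lexicographic order.
    full_index = pd.MultiIndex.from_product(levels, names=series.index.names)
    # Missing combinations become NaN.
    dense = series.reindex(full_index)
    shape = tuple(len(level) for level in levels)
    return dense.to_numpy(dtype=float).reshape(shape)


index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": range(8)}, index=index).drop((1, 0, 1))

result = reindex_to_ndarray(frame["value"])
print(result.shape)
```

Because the reindexed rows follow the same lexicographic order that reshape() assumes, each value lands in the cell addressed by its index tuple.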

The same thing as a (pretty ugly) one-liner:

    frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
        .reshape(tuple(map(len, frame.index.levels)))

Source: https://habr.com/ru/post/1241659/