Pandas: merge hierarchical data

Question

I am looking for a way to combine data with a complex hierarchy into a pandas DataFrame. The hierarchy comes from different interdependencies in the data. For instance, there are parameters that determine how the data was created, then there are time-dependent observables, spatially dependent observables, and observables that depend on both time and space.

To be more explicit: suppose I have the following data.

import numpy as np
import pandas as pd

# Parameters
t_max = 2
t_step = 15
sites = 4

# Purely time-dependent
t = np.linspace(0, t_max, t_step)
f_t = t**2 - t

# Purely site-dependent
position = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # (x, y)
site_weight = np.arange(sites)

# Time- and site-dependent
occupation = np.arange(t_step*sites).reshape((t_step, sites))

# Time-, site-, and site2-dependent
correlation = np.arange(t_step*sites*sites).reshape((t_step, sites, sites))

(In the end, of course, I would have many such data sets, one for each parameter set.)

Now I would like to store all of this in a pandas DataFrame. I imagine the end result looking something like this:

| ----- parameters ----- | -------------------------------- observables --------------------------------- |
|                        |                                        | ---------- time-dependent ----------- |
|                        | ----------- site-dependent --- )       ( ------------------------ |            |
|                        |                                | - site2-dependent - |                         |
| sites | t_max | t_step | site | r_x | r_y | site weight | site2 | correlation | occupation | f_t | time |

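I suppose the partially overlapping hierarchies may be impossible to represent directly. It is fine if they are implicit, in the sense that I can get, e.g., all site-dependent data by indexing the DataFrame in a specific way.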

Also, please tell me if you think there is a better way of arranging this data in Pandas.

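How do I construct a DataFrame that contains all of the data above and somehow reflects the interdependencies (e.g., f_t depends on time but not on site)? It should be generic enough that observables, possibly with new interdependencies, are easy to add or remove. (For example, a quantity that depends on a second time axis, such as a time-time correlation.)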

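Below I show how far I have gotten on my own. However, I do not think this is the ideal way to achieve the above, especially since it is not general with respect to adding or removing observables.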

Given the data above, I started by defining all the indices and multi-indices that I need.

ind_time = pd.Index(t, name='time')
ind_site = pd.Index(np.arange(sites), name='site')
ind_site_site = pd.MultiIndex.from_product([ind_site, ind_site], names=['site', 'site2'])
ind_time_site = pd.MultiIndex.from_product([ind_time, ind_site], names=['time', 'site'])
ind_time_site_site = pd.MultiIndex.from_product([ind_time, ind_site, ind_site], names=['time', 'site', 'site2'])

`DataFrame`s

Next, I create a DataFrame for each individual chunk of data.

df_parms = pd.DataFrame({'t_max': t_max, 't_step': t_step, 'sites': sites}, index=[0])
df_time = pd.DataFrame({'f_t': f_t}, index=ind_time)
df_position = pd.DataFrame(position, columns=['r_x', 'r_y'], index=ind_site)
df_weight = pd.DataFrame(site_weight, columns=['site weight'], index=ind_site)
df_occupation = pd.DataFrame(occupation.flatten(), index=ind_time_site, columns=['occupation'])
df_correlation = pd.DataFrame(correlation.flatten(), index=ind_time_site_site, columns=['correlation'])

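The index=[0] in df_parms is needed because otherwise pandas complains that only scalar values were passed. In practice I would probably replace it with something more meaningful, such as a time stamp of when the simulation was run, so that it at least conveys some useful information.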
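A minimal sketch of why the `index=[0]` argument matters, assuming a recent pandas version. The parameter values mirror the ones above; the names `scalar_error`, `df_with_index` and `df_from_records` are only for illustration.

```python
import pandas as pd

# Parameter values copied from the question.
params = {'t_max': 2, 't_step': 15, 'sites': 4}

try:
    # Every value is a scalar and no index is given, so pandas cannot
    # tell how many rows the frame should have.
    pd.DataFrame(params)
    scalar_error = None
except ValueError as exc:
    scalar_error = str(exc)

# Two ways around it: pass an explicit index, or wrap the dict in a list
# so that it is read as a single record (one row).
df_with_index = pd.DataFrame(params, index=[0])
df_from_records = pd.DataFrame([params])
```

Both variants produce a one-row frame with the parameters as columns, which is the shape `df_parms` has in the code above.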

With these frames available, I merge all the observables into one big DataFrame.

df_all_but_parms = pd.merge(
  pd.merge(
    pd.merge(
      df_time.reset_index(),
      df_occupation.reset_index(),
      how='outer'
    ),
    df_correlation.reset_index(),
    how='outer'
  ),
  pd.merge(
    df_position.reset_index(),
    df_weight.reset_index(),
    how='outer'
  ),
  how='outer'
)

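This is the part I like least about my current approach. merge only works on pairs of data frames, and it requires them to share at least one column. So I have to be careful about the order in which I join the frames, and if I added an orthogonal observable, I could not merge it with the other data, because they would share no column. Is there a single function that achieves the same result in one call on a list of data frames? I tried concat, but it does not merge the common columns. So I ended up with many duplicate time and site columns.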
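To illustrate the common-column requirement of `pd.merge`, here is a small sketch with two toy frames that share no column. The frames and the helper column `_key` are made up for the example; the constant-key trick is one possible workaround, not necessarily the best one.

```python
import pandas as pd

left = pd.DataFrame({'time': [0.0, 1.0], 'f_t': [0.0, 0.0]})
right = pd.DataFrame({'site': [0, 1, 2], 'weight': [0, 1, 2]})

# Without an `on=` argument, merge joins on the columns both frames share.
# Here there are none, so pandas raises a MergeError (a ValueError subclass).
try:
    pd.merge(left, right, how='outer')
    merge_error = None
except ValueError as exc:
    merge_error = str(exc)

# A constant helper column gives both frames a shared key, which turns
# the merge into a Cartesian product of their rows.
product = pd.merge(
    left.assign(_key=0), right.assign(_key=0), how='outer'
).drop(columns='_key')
```

Here `product` has 2 x 3 = 6 rows, one for every (time, site) combination.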

Finally, I combine the observables with the parameters.

pd.concat([df_parms, df_all_but_parms], axis=1, keys=['parameters', 'observables'])

The end result so far looks like this:

         parameters                 observables                                                                       
              sites  t_max  t_step         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
    0             4      2      15     0.000000  0.000000     0           0      0            0    0    0            0
    1           NaN    NaN     NaN     0.000000  0.000000     0           0      1            1    0    0            0
    2           NaN    NaN     NaN     0.000000  0.000000     0           0      2            2    0    0            0
    3           NaN    NaN     NaN     0.000000  0.000000     0           0      3            3    0    0            0
    4           NaN    NaN     NaN     0.142857 -0.122449     0           4      0           16    0    0            0
    ..          ...    ...     ...          ...       ...   ...         ...    ...          ...  ...  ...          ...
    235         NaN    NaN     NaN     1.857143  1.591837     3          55      3          223    1    1            3
    236         NaN    NaN     NaN     2.000000  2.000000     3          59      0          236    1    1            3
    237         NaN    NaN     NaN     2.000000  2.000000     3          59      1          237    1    1            3
    238         NaN    NaN     NaN     2.000000  2.000000     3          59      2          238    1    1            3
    239         NaN    NaN     NaN     2.000000  2.000000     3          59      3          239    1    1            3

As you can see, this does not work very well. The parameter columns are NaN in every row but the first.
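The NaN values follow from how `pd.concat` with `axis=1` aligns rows by index label. A small sketch with made-up frames; forward-filling is shown only as one assumed remedy.

```python
import pandas as pd

one_row = pd.DataFrame({'sites': [4]}, index=[0])
many_rows = pd.DataFrame({'occupation': [10, 11, 12]}, index=[0, 1, 2])

# axis=1 aligns on the row index: labels 1 and 2 have no partner in
# one_row, so those cells become NaN (and the column is upcast to float).
side_by_side = pd.concat(
    [one_row, many_rows], axis=1, keys=['parameters', 'observables']
)

# Forward-filling repeats the last valid value down each column, so every
# row ends up carrying the parameter value.
filled = side_by_side.ffill()
```

Note that `ffill` acts on every column, so this only makes sense when the observable columns contain no genuine gaps.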

A related question: once this works in pandas, I would like to store the data in hdf5. Is there a recommended way?

Update

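Based on Jeff's answer, I made some progress. Copying the parameters into every observable gives all frames common columns, so they can be merged in any order.

First, I flatten every observable frame and add the parameters as columns.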

all_observables = [df_time, df_position, df_weight, df_occupation, df_correlation]
# Build a list, not a lazy map: it is iterated here and again by reduce below.
flat = [df.reset_index() for df in all_observables]
for df in flat:
    for c in df_parms:
        df[c] = df_parms.loc[0, c]

Then I merge them all with a single reduce.

from functools import reduce  # a builtin in Python 2, an import in Python 3
df_all = reduce(lambda a, b: pd.merge(a, b, how='outer'), flat)

The result:

         time       f_t  sites  t_max  t_step  site  r_x  r_y  site weight  occupation  site2  correlation
0    0.000000  0.000000      4      2      15     0    0    0            0           0      0            0
1    0.000000  0.000000      4      2      15     0    0    0            0           0      1            1
2    0.000000  0.000000      4      2      15     0    0    0            0           0      2            2
3    0.000000  0.000000      4      2      15     0    0    0            0           0      3            3
4    0.142857 -0.122449      4      2      15     0    0    0            0           4      0           16
5    0.142857 -0.122449      4      2      15     0    0    0            0           4      1           17
6    0.142857 -0.122449      4      2      15     0    0    0            0           4      2           18
..        ...       ...    ...    ...     ...   ...  ...  ...          ...         ...    ...          ...
233  1.857143  1.591837      4      2      15     3    1    1            3          55      1          221
234  1.857143  1.591837      4      2      15     3    1    1            3          55      2          222
235  1.857143  1.591837      4      2      15     3    1    1            3          55      3          223
236  2.000000  2.000000      4      2      15     3    1    1            3          59      0          236
237  2.000000  2.000000      4      2      15     3    1    1            3          59      1          237
238  2.000000  2.000000      4      2      15     3    1    1            3          59      2          238
239  2.000000  2.000000      4      2      15     3    1    1            3          59      3          239

After moving the parameters and coordinates into the index, it looks like this:

df_all.set_index(['t_max', 't_step', 'sites', 'time', 'site', 'site2'], inplace=True)

                                             f_t  r_x  r_y  site weight  occupation  correlation
t_max t_step sites time     site site2                                                          
2     15     4     0.000000 0    0      0.000000    0    0            0           0            0
                                 1      0.000000    0    0            0           0            1
                                 2      0.000000    0    0            0           0            2
                                 3      0.000000    0    0            0           0            3
                   0.142857 0    0     -0.122449    0    0            0           4           16
                                 1     -0.122449    0    0            0           4           17
                                 2     -0.122449    0    0            0           4           18
...                                          ...  ...  ...          ...         ...          ...
                   1.857143 3    1      1.591837    1    1            3          55          221
                                 2      1.591837    1    1            3          55          222
                                 3      1.591837    1    1            3          55          223
                   2.000000 3    0      2.000000    1    1            3          59          236
                                 1      2.000000    1    1            3          59          237
                                 2      2.000000    1    1            3          59          238
                                 3      2.000000    1    1            3          59          239
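In a flat frame like this, the lower-dimensional observables are repeated along the axes they do not depend on. A minimal sketch of how they could be recovered with `groupby` on index levels, using a small stand-in for `df_all` (the sizes and the `occupation` formula are made up):

```python
import numpy as np
import pandas as pd

# Small stand-in for df_all: 3 time steps and 2 sites.
t = np.linspace(0, 2, 3)
sites = 2
index = pd.MultiIndex.from_product(
    [t, range(sites), range(sites)], names=['time', 'site', 'site2']
)
df = index.to_frame(index=False)
df['f_t'] = df['time']**2 - df['time']
df['occupation'] = df['time'] * 10 + df['site']
df = df.set_index(['time', 'site', 'site2'])

# f_t is repeated for every (site, site2) pair; keep one copy per time.
f_t = df.groupby(level='time')['f_t'].first()

# occupation is repeated for every site2; keep one copy per (time, site).
occupation = df.groupby(level=['time', 'site'])['occupation'].first()
```

This keeps the hierarchy implicit: the frame itself stays flat, and each observable is reduced to its natural index on demand.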

python merge pandas dataframe

Lemming, Jul 15 '14 at 9:25

Jeff · Accepted Answer · 2014-07-15T12:51:48+0000

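I think you should do something like this, putting df_parms in as your index. That way you can easily concatenate more frames with different parameters.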

In [67]: pd.set_option('display.max_rows', 10)

In [68]: dfx = df_all_but_parms.copy()

(Copy each parameter into its own column, so that every row carries it.)

In [69]: for c in df_parms.columns:
             dfx[c] = df_parms.loc[0,c]

In [70]: dfx
Out[70]: 
         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight  sites  t_max  t_step
0    0.000000  0.000000     0           0      0            0    0    0            0      4      2      15
1    0.000000  0.000000     0           0      1            1    0    0            0      4      2      15
2    0.000000  0.000000     0           0      2            2    0    0            0      4      2      15
3    0.000000  0.000000     0           0      3            3    0    0            0      4      2      15
4    0.142857 -0.122449     0           4      0           16    0    0            0      4      2      15
..        ...       ...   ...         ...    ...          ...  ...  ...          ...    ...    ...     ...
235  1.857143  1.591837     3          55      3          223    1    1            3      4      2      15
236  2.000000  2.000000     3          59      0          236    1    1            3      4      2      15
237  2.000000  2.000000     3          59      1          237    1    1            3      4      2      15
238  2.000000  2.000000     3          59      2          238    1    1            3      4      2      15
239  2.000000  2.000000     3          59      3          239    1    1            3      4      2      15

[240 rows x 12 columns]

(Then set the parameters as the index.)

In [71]: dfx.set_index(['sites','t_max','t_step'])
Out[71]: 
                        time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
sites t_max t_step                                                                                 
4     2     15      0.000000  0.000000     0           0      0            0    0    0            0
            15      0.000000  0.000000     0           0      1            1    0    0            0
            15      0.000000  0.000000     0           0      2            2    0    0            0
            15      0.000000  0.000000     0           0      3            3    0    0            0
            15      0.142857 -0.122449     0           4      0           16    0    0            0
...                      ...       ...   ...         ...    ...          ...  ...  ...          ...
            15      1.857143  1.591837     3          55      3          223    1    1            3
            15      2.000000  2.000000     3          59      0          236    1    1            3
            15      2.000000  2.000000     3          59      1          237    1    1            3
            15      2.000000  2.000000     3          59      2          238    1    1            3
            15      2.000000  2.000000     3          59      3          239    1    1            3

[240 rows x 9 columns]
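With the parameters in the index, frames from several parameter sets can be stacked into one. A minimal sketch, assuming a hypothetical `run` function that stands in for one simulation and returns only `f_t`; the parameter values are arbitrary.

```python
import numpy as np
import pandas as pd

def run(t_max, t_step):
    """Hypothetical stand-in for one simulation: f_t indexed by time."""
    t = np.linspace(0, t_max, t_step)
    return pd.DataFrame({'f_t': t**2 - t}, index=pd.Index(t, name='time'))

parameter_sets = [(2, 3), (4, 5)]  # (t_max, t_step) pairs, chosen arbitrarily

# Tuple keys become outer index levels; names= labels those levels
# followed by the inner 'time' level.
combined = pd.concat(
    [run(t_max, t_step) for t_max, t_step in parameter_sets],
    keys=parameter_sets,
    names=['t_max', 't_step', 'time'],
)

# All rows belonging to one parameter set:
one_run = combined.loc[(4, 5), :]
```

Selecting on the outer levels then returns the data of a single run, indexed by time only.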
