When to use multiindexing vs. xarray in pandas

Question

The pandas pivot table documentation seems to recommend handling more than two dimensions of data with multi-indexing:

    In [1]: import pandas as pd

    In [2]: import numpy as np

    In [3]: import pandas.util.testing as tm; tm.N = 3

    In [4]: def unpivot(frame):
       ...:     N, K = frame.shape
       ...:     data = {'value' : frame.values.ravel('F'),
       ...:             'variable' : np.asarray(frame.columns).repeat(N),
       ...:             'date' : np.tile(np.asarray(frame.index), K)}
       ...:     return pd.DataFrame(data, columns=['date', 'variable', 'value'])
       ...:

    In [5]: df = unpivot(tm.makeTimeDataFrame())

    In [6]: df
    Out[6]:
             date variable     value    value2
    0  2000-01-03        A  0.462461  0.924921
    1  2000-01-04        A -0.517911 -1.035823
    2  2000-01-05        A  0.831014  1.662027
    3  2000-01-03        B -0.492679 -0.985358
    4  2000-01-04        B -1.234068 -2.468135
    5  2000-01-05        B  1.725218  3.450437
    6  2000-01-03        C  0.453859  0.907718
    7  2000-01-04        C -0.763706 -1.527412
    8  2000-01-05        C  0.839706  1.679413
    9  2000-01-03        D -0.048108 -0.096216
    10 2000-01-04        D  0.184461  0.368922
    11 2000-01-05        D -0.349496 -0.698993

    In [7]: df['value2'] = df['value'] * 2

    In [8]: df.pivot('date', 'variable')
    Out[8]:
                   value                                  value2            \
    variable           A         B         C         D         A         B
    date
    2000-01-03 -1.558856 -1.144732 -0.234630 -1.252482 -3.117712 -2.289463
    2000-01-04 -1.351152 -0.173595  0.470253 -1.181006 -2.702304 -0.347191
    2000-01-05  0.151067 -0.402517 -2.625085  1.275430  0.302135 -0.805035

    variable           C         D
    date
    2000-01-03 -0.469259 -2.504964
    2000-01-04  0.940506 -2.362012
    2000-01-05 -5.250171  2.550861

I thought xarray was created to handle multidimensional data sets, such as:

    In [9]: import xarray as xr

    In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop('variable', 1)) for var in np.unique(df.variable)]))
    Out[10]:
    <xarray.DataArray ()>
    array({'A':         date     value    value2
    0 2000-01-03  0.462461  0.924921
    1 2000-01-04 -0.517911 -1.035823
    2 2000-01-05  0.831014  1.662027, 'C':         date     value    value2
    6 2000-01-03  0.453859  0.907718
    7 2000-01-04 -0.763706 -1.527412
    8 2000-01-05  0.839706  1.679413, 'B':         date     value    value2
    3 2000-01-03 -0.492679 -0.985358
    4 2000-01-04 -1.234068 -2.468135
    5 2000-01-05  1.725218  3.450437, 'D':          date     value    value2
    9  2000-01-03 -0.048108 -0.096216
    10 2000-01-04  0.184461  0.368922
    11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why has xarray not completely replaced multiindexing?

python pandas data-structures multi-index xarray

kilojoules Mar 18 '17 at 15:35

1 answer

Tkanno · Accepted Answer · 2017-07-18T16:24:06+0000

There does seem to be a shift toward xarray for working with multidimensional arrays. pandas is dropping support for its 3D Panel data structure, and the xarray documentation states its goals and mission:

xarray aims to provide data analysis tools as powerful as pandas but designed to work with homogeneous N-dimensional arrays instead of tabular data ...
... Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists, especially geoscientists who already know and love netCDF

The main advantage of xarray over plain numpy is that it supports labels across multiple dimensions, the same way pandas does. If you are working with 3-dimensional data, multi-indexing and xarray can be used interchangeably. As the number of dimensions in your dataset grows, xarray becomes much more manageable. I cannot comment on how the two compare in terms of efficiency or speed.
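
As a rough illustration of that interchangeability, the sketch below builds a small multi-indexed DataFrame and converts it to xarray and back. The data is made up (it only loosely mirrors the date/variable layout from the question), and it assumes both pandas and xarray are installed. The conversions use `DataFrame.to_xarray` and `Dataset.to_dataframe`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: 3 dates x 4 variables, as in the question's layout.
dates = pd.date_range("2000-01-03", periods=3, freq="D")
variables = ["A", "B", "C", "D"]
index = pd.MultiIndex.from_product([dates, variables], names=["date", "variable"])

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.standard_normal(len(index))}, index=index)
df["value2"] = df["value"] * 2

# pandas view: one row per (date, variable) pair, selected through the MultiIndex.
b_pandas = df.xs("B", level="variable")["value"]

# xarray view: each index level becomes a named dimension of a Dataset
# (DataFrame.to_xarray requires xarray to be installed).
ds = df.to_xarray()
b_xarray = ds["value"].sel({"variable": "B"})

# Round trip back to a DataFrame indexed by (date, variable).
df_back = ds.to_dataframe()

print(ds["value"].dims)   # names of the dimensions derived from the index levels
print(b_xarray.values)
```

With only two index levels the two forms are about equally convenient. The named-dimension form tends to pay off once more levels are added, since each one becomes its own axis rather than another level to slice through.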
