Pandas - Combining Multiple DataFrames

Question

Pandas - Combining Multiple DataFrames

This is a frequent question. I just can't combine everything together. The goal is to create one DataFrame (guessing using MultiIndex), which can be accessed as follows:

ticker = 'GOLD' date = pd.to_datetime('1978/03/31') current_bar = df.ix[ticker].ix[date]

Can I just say: current_bar.Last?

Anyway, here are the files and how I upload them.

 In [108]: df = pd.read_csv('GOLD.csv', parse_dates='Date', index_col='Date') In [109]: df Out[109]: Exp Last Volume Date 1978-03-30 198002 995.6 54 1978-03-31 198002 999.5 78 In [110]: df2 = pd.read_csv('SPX.csv', parse_dates='Date', index_col='Date') In [111]: df2 Out[111]: Exp Last Volume Date 1978-03-30 198003 215.5 25 1978-03-31 198003 214.1 99

Ideally, I want this to look like this (I think):

 ticker GOLD SPX values Exp Last Volume Exp Last Volume Date 1978-03-30 198002 995.6 54 198003 215.5 25 1978-03-31 198002 999.5 78 198003 214.1 99

I think my questions are:
- How to make this hierarchical (actual data have more than 20 identical columns for each file)
- How do I merge files (I have about 100 that everyone needs in 1 DataFrame)
- Is my assumption that I can simply do: current_bar.Last to get the values right?

Thank you very much.

+1

python pandas

trubby317 Jul 20 '14 at 18:49

source share

1 answer

unutbu · Accepted Answer · 2014-07-20T19:11:55+0000

You can use pd.concat to concatenate DataFrames. (A join joins DataFrames, and a join joins DataFrames based on common indexes or columns.) When you specify the keys parameter, you get a hierarchical index:

 import pandas as pd df = pd.read_csv('GOLD.csv', parse_dates='Date', index_col='Date', sep='\s+') df2 = pd.read_csv('SPX.csv', parse_dates='Date', index_col='Date', sep='\s+') result = pd.concat([df, df2], keys=['GOLD', 'SPX'], names=['ticker']).unstack('ticker') result = result.reorder_levels([1, 0], axis=1).sortlevel(level=0, axis=1) print(result)

gives

 ticker GOLD SPX Exp Last Volume Exp Last Volume Date 1978-03-30 198002 995.6 54 198003 215.5 25 1978-03-31 198002 999.5 78 198003 214.1 99

result['Last'] gives a DataFrame:

 In [147]: result['Last'] Out[147]: ticker GOLD SPX Date 1978-03-30 995.6 215.5 1978-03-31 999.5 214.1

I would recommend avoiding the syntax result.Last because it is too close to result.Last , which returns the DataFrame method.

To process more files, you can use the following code:

 import pandas as pd dfs = list() for filename in filenames: df = pd.read_csv(filename, parse_dates='Date', index_col='Date') # compute moving_mean dfs.append(df) keys = [filename[:-4] for filename in filenames] result = pd.concat(dfs, keys=keys, names=['ticker']).unstack('ticker')

Note that this requires enough memory to store a list of all DataFrames in memory plus enough memory to store result .

Pandas - Combining Multiple DataFrames

More articles: