How can I read in a CSV file as a MultiIndexed DataFrame when the column heading does not repeat?

Question

I have several .csv files that I would like to read as MultiIndexed DataFrames, but the spanning column heading does not repeat, so I end up with two header rows rather than a proper MultiIndex.

File test.csv:

A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2

When I run the following,

import pandas as pd

df = pd.read_csv('test.csv', header=[0,1])
print(df)

The returned structure is not what I am looking for:

   A Unnamed: 1_level_0  B Unnamed: 3_level_0  C Unnamed: 5_level_0
  a1                 a2 b1                 b2 c1                 c2
0  1                  1  1                  1  1                  1
1  2                  2  2                  2  2                  2

I need a MultiIndex in which each value in the first header row spans the columns beneath it, like this:

   A     B     C 
  a1 a2 b1 b2 c1 c2
0  1  1  1  1  1  1
1  2  2  2  2  2  2

Is there any way to read in the csv as-is to get the desired structure? If not, is the most efficient approach simply to modify the csv files so that they explicitly repeat the values of the outer header, like this?

A,A,B,B,C,C
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2


python pandas

Kelly moran Jan 22 '16 at 22:23


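One possible solution is to use read_csv with header=None, forward-fill the first row with fillna, build the MultiIndex from the first two rows, and then drop those rows and reset_index: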

import io

import pandas as pd

temp = """A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=",", index_col=None, header=None)
print(df)
#    0    1   2    3   4    5
#0   A  NaN   B  NaN   C  NaN
#1  a1   a2  b1   b2  c1   c2
#2   1    1   1    1   1    1
#3   2    2   2    2   2    2

# forward-fill the blank cells of the outer header row
df.iloc[0, :] = df.iloc[0, :].ffill()
print(df)
#    0   1   2   3   4   5
#0   A   A   B   B   C   C
#1  a1  a2  b1  b2  c1  c2
#2   1   1   1   1   1   1
#3   2   2   2   2   2   2

print(list(zip(df.iloc[0, :], df.iloc[1, :])))
#[('A', 'a1'), ('A', 'a2'), ('B', 'b1'), ('B', 'b2'), ('C', 'c1'), ('C', 'c2')]

# use the two header rows as column levels, then drop them from the data
df.columns = pd.MultiIndex.from_tuples(list(zip(df.iloc[0, :], df.iloc[1, :])))
df = df.iloc[2:].reset_index(drop=True)

print(df)
#   A     B     C   
#  a1 a2 b1 b2 c1 c2
#0  1  1  1  1  1  1
#1  2  2  2  2  2  2
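
One side effect of reading everything with header=None is that the data values come back as strings, because they share columns with the header text. Below is a minimal sketch of a variant that reads the two header rows and the data rows separately, so the numeric dtypes are inferred as usual. The helper name read_sparse_header_csv and its open_source argument are hypothetical, and the sketch assumes exactly two header rows.

```python
import io

import pandas as pd

CSV_TEXT = """A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""


def read_sparse_header_csv(open_source):
    """Hypothetical helper. open_source is a zero-argument callable that
    returns a fresh path or file-like object each time it is called."""
    # Read only the two header rows, as plain strings.
    header = pd.read_csv(open_source(), header=None, nrows=2, dtype=str)
    # Forward-fill the blank cells of the outer header row (A, NaN -> A, A).
    outer = header.iloc[0].ffill().tolist()
    inner = header.iloc[1].tolist()
    # Read the data rows on their own so pandas can infer numeric dtypes.
    data = pd.read_csv(open_source(), header=None, skiprows=2)
    data.columns = pd.MultiIndex.from_arrays([outer, inner])
    return data


df = read_sparse_header_csv(lambda: io.StringIO(CSV_TEXT))
print(df)
print(df.dtypes)
```

For a file on disk, the callable could simply be lambda: 'test.csv', since the file is opened twice.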


jezrael Jan 23 '16 at 6:21

vahndi · Accepted Answer · 2016-01-23T04:41:13+0000

The column index can also be repaired after reading the file, in Python.

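Starting from the df read with header=[0, 1] in the question, first create a Series holding the level-0 value of the MultiIndex for each column, by indexing the level with its codes (called labels in pandas before 0.24):

level_0 = pd.Series(df.columns.levels[0][df.columns.codes[0]])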

Then replace the 'Unnamed: *' entries with None and forward-fill them:

level_0[level_0.str.startswith('Unnamed: ')] = None
level_0 = level_0.ffill()

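Finally, use the filled values as level 0 of a new MultiIndex for the DataFrame columns, keeping level 1 as it is:

# from_arrays is used because the MultiIndex constructor in current pandas
# rejects duplicate level values such as ['A', 'A', 'B', 'B', 'C', 'C']
df.columns = pd.MultiIndex.from_arrays([level_0.values,
                                        df.columns.get_level_values(1)])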
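
Put together, the steps above can be sketched end to end as follows. This is an illustrative consolidation rather than the answer's exact code: it uses get_level_values and Series.mask in place of the levels/labels indexing, and it assumes the placeholder names always start with 'Unnamed: ', as they do in the question's output.

```python
import io

import pandas as pd

CSV_TEXT = """A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""

# Read both header rows; blank outer cells become "Unnamed: N_level_0".
df = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1])

# Per-column values of the outer level, placeholders included.
level_0 = pd.Series(df.columns.get_level_values(0))
# Blank out the placeholders, then forward-fill from the last real label.
level_0 = level_0.mask(level_0.str.startswith("Unnamed: ")).ffill()

# Rebuild the column index from the repaired outer level and the inner level.
df.columns = pd.MultiIndex.from_arrays(
    [level_0.tolist(), df.columns.get_level_values(1).tolist()]
)

print(df)
print(df["B"])  # selecting an outer label now returns both of its sub-columns
```

Because the data rows were parsed under a real header, the values keep their numeric dtypes with this approach.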
