How can I read in a CSV file as a MultiIndexed DataFrame when the column heading does not repeat?

I have several .csv files that I would like to read as MultiIndexed DataFrames, but the spanning column heading does not repeat, so I end up with placeholder labels in the top header level instead of the MultiIndex I want.

File test.csv:

A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2

When I run the following,

import pandas as pd

df = pd.read_csv('test.csv', header=[0,1])
print(df)

The returned structure is not what I am looking for:

   A Unnamed: 1_level_0  B Unnamed: 3_level_0  C Unnamed: 5_level_0
  a1                 a2 b1                 b2 c1                 c2
0  1                  1  1                  1  1                  1
1  2                  2  2                  2  2                  2

I need a MultiIndex in which each top-level header spans the columns beneath it, like this:

   A     B     C 
  a1 a2 b1 b2 c1 c2
0  1  1  1  1  1  1
1  2  2  2  2  2  2

Is there a way to read the CSV as-is and get the desired structure? If not, is the most efficient approach simply to modify the CSV files so that they explicitly repeat the values of the outer header, like this?

A,A,B,B,C,C
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2
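
For reference, here is a minimal sketch of what the repeated-header variant should give. It assumes the data is held in an in-memory string (the names csv_text and df_repeated are only illustrative) rather than in test.csv:

```python
import io

import pandas as pd

# Same data as above, with the outer header repeated explicitly.
csv_text = """A,A,B,B,C,C
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""

# With the outer header repeated, header=[0, 1] alone builds the MultiIndex.
df_repeated = pd.read_csv(io.StringIO(csv_text), header=[0, 1])
print(df_repeated.columns.tolist())
```

If this behaves as expected, every column is addressed by a (outer, inner) pair such as ('B', 'b2').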

You can fix the header in Python after reading the file.

Build a Series from the first level of the column MultiIndex, in column order:

level_0 = pd.Series(df.columns.get_level_values(0))

Replace the 'Unnamed: *' entries with None and forward-fill them:

level_0[level_0.str.startswith('Unnamed: ')] = None
level_0 = level_0.ffill()

Finally, rebuild the MultiIndex from the filled first level and the original second level, and assign it to the DataFrame:

df.columns = pd.MultiIndex.from_arrays([level_0.values,
                                        df.columns.get_level_values(1)])
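
The steps above can be collected into one helper. This is only a sketch: the function name fill_spanning_header and the in-memory csv_text are illustrative assumptions, and it relies on read_csv labelling the blank header cells 'Unnamed: ...' as shown in the question.

```python
import io

import pandas as pd


def fill_spanning_header(df):
    """Return a copy of df whose top column level has the 'Unnamed: ...'
    placeholders replaced by the nearest label to their left."""
    top = pd.Series(df.columns.get_level_values(0))
    # Blank out the placeholders, then carry the last real label forward.
    top = top.mask(top.str.startswith('Unnamed: ')).ffill()
    out = df.copy()
    out.columns = pd.MultiIndex.from_arrays(
        [top.values, df.columns.get_level_values(1)])
    return out


# Hypothetical stand-in for test.csv.
csv_text = """A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""

df = fill_spanning_header(pd.read_csv(io.StringIO(csv_text), header=[0, 1]))
print(df)
```

Note that a blank cell in the very first column has nothing to its left to inherit from, so it would stay empty with this approach.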

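Alternatively, call read_csv with header=None, forward-fill the first row, build the MultiIndex from the first two rows, and finally drop those rows and call reset_index: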

import pandas as pd
import io

temp=u"""A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=",", index_col=None, header=None)
print(df)
#    0    1   2    3   4    5
#0   A  NaN   B  NaN   C  NaN
#1  a1   a2  b1   b2  c1   c2
#2   1    1   1    1   1    1
#3   2    2   2    2   2    2

# forward-fill the blanks in the outer header row
df.iloc[0, :] = df.iloc[0, :].ffill()
print(df)
#    0   1   2   3   4   5
#0   A   A   B   B   C   C
#1  a1  a2  b1  b2  c1  c2
#2   1   1   1   1   1   1
#3   2   2   2   2   2   2

print(list(zip(df.iloc[0, :], df.iloc[1, :])))
#[('A', 'a1'), ('A', 'a2'), ('B', 'b1'), ('B', 'b2'), ('C', 'c1'), ('C', 'c2')]

# use the two header rows as column labels, then drop them from the data
df.columns = pd.MultiIndex.from_tuples(list(zip(df.iloc[0, :], df.iloc[1, :])))
df = df.iloc[2:].reset_index(drop=True)

print(df)
#   A     B     C   
#  a1 a2 b1 b2 c1 c2
#0  1  1  1  1  1  1
#1  2  2  2  2  2  2
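
One caveat with reading everything with header=None is that the header text and the data share columns, so the remaining values are likely to stay strings. A possible variant, sketched here under the assumption that the header always occupies exactly the first two rows, reads the header rows and the data separately so that the data keeps its numeric dtypes (the names header, top and data are only illustrative):

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv.
csv_text = """A,,B,,C,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2"""

# Read only the two header rows and forward-fill the outer one.
header = pd.read_csv(io.StringIO(csv_text), header=None, nrows=2)
top = header.iloc[0].ffill()

# Read the data rows on their own so that pandas infers numeric dtypes.
data = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=2)
data.columns = pd.MultiIndex.from_arrays(
    [top.to_numpy(), header.iloc[1].to_numpy()])

print(data.dtypes)
```

With a real file, both read_csv calls would take the filename instead of io.StringIO(csv_text).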

Source: https://habr.com/ru/post/1625577/

