Handling a Variable Number of Columns with Pandas - Python

I have a dataset that looks like this (no more than 5 columns per row, but some rows may have fewer):

    1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2
    1,2,3,4
    ....

I am trying to use pandas read_table to read this into a five-column DataFrame. I would like to read it in without additional preprocessing.

If I try

    import pandas as pd
    my_cols = ['A', 'B', 'C', 'D', 'E']
    my_df = pd.read_table(path, sep=',', header=None, names=my_cols)

I get an error - "column names have 5 fields, data have 3 fields."

Is there a way to make pandas populate NaN for missing columns when reading data?

+42
python pandas
Mar 06 '13 at 8:52
3 answers

One way that seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

    >>> !cat ragged.csv
    1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2
    1,2,3,4
    >>> my_cols = ["A", "B", "C", "D", "E"]
    >>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
       A  B   C   D   E
    0  1  2   3 NaN NaN
    1  1  2   3   4 NaN
    2  1  2   3   4   5
    3  1  2 NaN NaN NaN
    4  1  2   3   4 NaN

Note that this approach requires you to provide names for the columns you want. It is not as general as some other methods, but it works well enough when it applies.
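For anyone who wants to try this without creating a file, here is a minimal sketch of the same call on an in-memory copy of the data. The `io.StringIO` buffer and the `missing_per_column` name are stand-ins introduced for illustration, and behavior may differ between pandas versions.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for ragged.csv, so the sketch is self-contained.
ragged = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

my_cols = ["A", "B", "C", "D", "E"]

# Passing names tells pandas there is no header row; rows with fewer
# than five fields are padded with NaN on the right.
df = pd.read_csv(ragged, names=my_cols, engine="python")

# Count the missing values in each column to confirm the padding.
missing_per_column = df.isna().sum().to_dict()
print(df)
print(missing_per_column)
```

The list of names has to be at least as long as the widest row, which is why knowing the five-column upper bound matters here.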

+48
Mar 06 '13 at 15:55

I would also be interested to know whether this is possible; from the documentation, it does not seem to be. What you could probably do is read the file line by line and concatenate each line onto a DataFrame:

    import pandas as pd

    df = pd.DataFrame()
    with open(filepath, 'r') as f:
        for line in f:
            df = pd.concat(
                [df, pd.DataFrame([tuple(line.strip().split(','))])],
                ignore_index=True
            )

It works, but not in the most elegant way, I think ...
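A possible variation on the loop above, offered as a sketch and not as the answer's own method: collect all the rows first and build the DataFrame once, which avoids copying the frame on every iteration. The `source` buffer is a hypothetical stand-in for the opened file, and using the standard `csv` module is a swapped-in choice.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for open(filepath, newline="").
source = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

# csv.reader yields one list of strings per line; the lists may differ in length.
rows = list(csv.reader(source))

# Building the frame from the full list pads the shorter rows with missing values.
df = pd.DataFrame(rows)
print(df)
```

As in the loop version, the values stay as strings and the columns are labeled 0 through 4.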

+7
Mar 06 '13 at 9:58 am

OK. Not sure how efficient this is, but here's what I did. I would like to hear if there is a better way to do this. Thanks!

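    from pandas import DataFrame

    list_of_dicts = []
    labels = ['A', 'B', 'C', 'D', 'E']
    # `path` is the file from the question; the original snippet iterated over an already-open `file`
    with open(path) as file:
        for line in file:
            line = line.rstrip('\n')
            # zip stops at the shorter sequence, so short rows simply lack the trailing keys
            list_of_dicts.append(dict(zip(labels, line.split(','))))
    # keys missing from a row become NaN in the resulting frame
    frame = DataFrame(list_of_dicts)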
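One thing to be aware of with the dict-and-zip approach: `line.split(',')` produces strings, so the resulting columns hold text, not numbers. Below is a rough sketch of one way to convert them afterwards. The `lines` list is a hypothetical stand-in for the file contents, and passing `columns=labels` is an added assumption to pin the column order.

```python
import pandas as pd

labels = ["A", "B", "C", "D", "E"]

# Hypothetical stand-in for the lines of the input file.
lines = ["1,2,3\n", "1,2,3,4\n", "1,2,3,4,5\n", "1,2\n", "1,2,3,4\n"]

# Same idea as above: zip truncates to the shorter sequence,
# so short rows produce dicts with fewer keys.
list_of_dicts = [dict(zip(labels, line.rstrip("\n").split(","))) for line in lines]

# columns=labels keeps all five columns in a fixed order.
frame = pd.DataFrame(list_of_dicts, columns=labels)

# Convert each column from strings to numbers; missing entries stay NaN.
frame = frame.apply(pd.to_numeric)
print(frame.dtypes)
```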
+1
Mar 06 '13 at 15:40