Read_csv with missing / incomplete header or irregular number of columns

Question

I have file.csv with ~ 15k lines that look like

    SAMPLE_TIME, POS, OFF, HISTOGRAM
    2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
    2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
    2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
    2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0

I want to import it into a pandas.DataFrame, with an arbitrary name given to every column that has no header, something like this:

    SAMPLE_TIME,         POS,       OFF, HISTOGRAM  1   2   3   4   5    6
    2015-07-15 16:41:56, 0-0-0-0-3, 1,   2,         0,  5,  59, 4,  0,   0,
    2015-07-15 16:42:55, 0-0-0-0-3, 1,   0,         0,  5,  0,  6,  0,   nan
    2015-07-15 16:43:55, 0-0-0-0-3, 1,   0,         0,  5,  0,  7,  nan  nan
    2015-07-15 16:44:56, 0-0-0-0-3, 1,   2,         0,  5,  0,  0,  2,   nan

I could not get it to import. I tried other solutions, for example passing an explicit header, but without success. The only way I could get it to work was to add the header to the .csv file by hand, which defeats the goal of automation!

Then I tried this solution:

    lines = list(csv.reader(open('file.csv')))
    header, values = lines[0], lines[1:]

This reads the file correctly and gives me a list of ~15k elements, where each element is the list of correctly parsed fields of one line. But when I try to do this:

    data = {h:v for h,v in zip (header, zip(*values))}
    df = pd.DataFrame.from_dict(data)

or that:

    data2 = {h:v for h,v in zip (str(xrange(16)), zip(*values))}
    df2 = pd.DataFrame.from_dict(data)

then the columns without a header disappear and the column order is completely mixed up. Any idea of a possible solution?

+5

python python-2.7 pandas csv dataframe

Insanebot Dec 18 '15 at 14:48

4 answers

You can split the HISTOGRAM column into a new DataFrame and concat it to the original.

    print df
               SAMPLE_TIME,        POS,  OFF,  \
    0  2015-07-15 16:41:56  0-0-0-0-3,    1,
    1  2015-07-15 16:42:55  0-0-0-0-3,    1,
    2  2015-07-15 16:43:55  0-0-0-0-3,    1,
    3  2015-07-15 16:44:56  0-0-0-0-3,    1,

                                     HISTOGRAM
    0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
    1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
    2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
    3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0

    #create new dataframe from column HISTOGRAM
    h = pd.DataFrame([ x.split(',') for x in df['HISTOGRAM'].tolist()])
    print h
       0  1  2   3  4  5  6  7  8  9 10 11 12  13 14    15    16    17    18    19
    0  2  0  5  59  0  0  0  0  0  2  0  0  0   0  0     0     0     0     0
    1  0  0  5   9  0  0  0  0  0  2  0  0  0  50  0        None  None  None  None
    2  0  0  5   5  0  0  0  0  0  2  0  0  0   0  4     0     0     0        None
    3  2  0  5   0  0  0  0  0  0  2  0  0  0   6  0     0     0     0  None  None

    #append to original, rename 0 column
    df = pd.concat([df, h], axis=1).rename(columns={0:'HISTOGRAM'})
    print df
                                     HISTOGRAM HISTOGRAM  1  2   3  4  5 ... 10  \
    0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,         2  0  5  59  0  0 ...  0
    1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,         0  0  5   9  0  0 ...  0
    2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,         0  0  5   5  0  0 ...  0
    3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0         2  0  5   0  0  0 ...  0

      11 12  13 14    15    16    17    18    19
    0  0  0   0  0     0     0     0     0
    1  0  0  50  0        None  None  None  None
    2  0  0   0  4     0     0     0        None
    3  0  0   6  0     0     0     0  None  None

    [4 rows x 24 columns]
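The answer above does not show how df ended up with the whole histogram in a single column. The following is a minimal Python 3 sketch of one way to get there and then split it, not the answerer's own code. It assumes the first three fields never contain a comma, uses an in-memory stand-in for the file (`raw`), and swaps the list comprehension for pandas' `Series.str.split(expand=True)`.

```python
import io
import pandas as pd

# Two sample rows copied from the question; the in-memory "file" is an
# assumption for illustration only.
raw = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,\n"
)

# Split each line on the first three commas only, so the histogram
# stays together in the fourth field.
header = [name.strip() for name in next(raw).split(",")]
rows = [[field.strip() for field in line.rstrip("\n").split(",", 3)]
        for line in raw]
df = pd.DataFrame(rows, columns=header)

# Drop the trailing comma, then expand the histogram into one column per bin.
hist = df["HISTOGRAM"].str.rstrip(",").str.split(",", expand=True)
hist = hist.apply(pd.to_numeric)  # bins missing from short rows become NaN
hist.columns = ["HISTOGRAM"] + list(range(1, hist.shape[1]))

# Replace the raw text column with the expanded numeric columns.
result = pd.concat([df.drop(columns="HISTOGRAM"), hist], axis=1)
print(result)
```

Stripping the trailing comma first avoids the extra empty-string column that shows up in the output above.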

+3

jezrael Dec 18 '15 at 15:11

How about this? I made a CSV from your sample data.

First I import the lines:

    with open('test.csv','rb') as f:
        lines = list(csv.reader(f))
        headers, values = lines[0], lines[1:]

To generate nice header names, use this line:

    headers = [i or ind for ind, i in enumerate(headers)]

Because of the way (I assume) csv works, headers will contain an empty string for every column without a name. Empty strings evaluate as False, so this comprehension returns the column number for each column without a header.

Then just build the df:

    df = pd.DataFrame(values,columns=headers)

which looks like this:

            SAMPLE_TIME        POS OFF HISTOGRAM  4  5   6  7  8  9  \
    0  15/07/2015 16:41  0-0-0-0-3   1         2  0  5  59  0  0  0
    1  15/07/2015 16:42  0-0-0-0-3   1         0  0  5   9  0  0  0
    2  15/07/2015 16:43  0-0-0-0-3   1         0  0  5   5  0  0  0
    3  15/07/2015 16:44  0-0-0-0-3   1         2  0  5   0  0  0  0

      ... 12 13 14 15  16 17 18 19 20 21
    0 ...  2  0  0  0   0  0  0  0  0  0
    1 ...  2  0  0  0  50  0
    2 ...  2  0  0  0   0  4  0  0  0
    3 ...  2  0  0  0   6  0  0  0  0

    [4 rows x 22 columns]
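In the raw sample from the question the header line has only four fields, so `csv.reader` would return four header names rather than empty strings for the rest. A minimal Python 3 sketch of the same idea for that case follows; it pads the header with positional numbers and pads short rows with None. The shortened sample rows and the name `raw` are assumptions for illustration.

```python
import csv
import io
import pandas as pd

# Shortened sample in the same shape as the question's file (assumed data).
raw = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5\n"
)

# skipinitialspace drops the blank that follows each delimiter.
lines = list(csv.reader(raw, skipinitialspace=True))
headers, values = lines[0], lines[1:]

# Pad the header with positional numbers up to the widest row.
width = max(len(row) for row in values)
headers = headers + list(range(len(headers), width))

# Pad short rows with None so every row has the same length.
values = [row + [None] * (width - len(row)) for row in values]

df = pd.DataFrame(values, columns=headers)
print(df)
```

All values stay strings here because `csv.reader` does no type conversion; `pd.to_numeric` can be applied to the histogram columns afterwards if numbers are needed.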

-1

greg_data Dec 18 '15 at 15:02

Assuming your data is in the foo.csv file, you can do the following. It has been tested against Pandas 0.17

    df = pd.read_csv('foo.csv',
                     names=['sample_time', 'pos', 'off', 'histogram',
                            '1', '2', '3', '4', '5', '6', '7', '8', '9',
                            '10', '11', '12', '13', '14', '15', '16', '17'],
                     skiprows=1)
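The hard-coded list only works if the number of histogram columns is known in advance. As a hedged sketch, the names could instead be generated from the widest row in the file. The example below assumes no field contains a quoted comma, uses a shortened in-memory sample (`raw`) in place of foo.csv, and adds `skipinitialspace=True`, which the answer above does not use.

```python
import io
import pandas as pd

# Shortened stand-in for foo.csv (assumed data, same layout as the question).
raw = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5\n"
)

# First pass: skip the header line and find the widest data row.
next(raw)
width = max(line.count(",") + 1 for line in raw)

# Named columns first, then '1', '2', ... for the unnamed histogram bins.
base = ["sample_time", "pos", "off", "histogram"]
names = base + [str(i) for i in range(1, width - len(base) + 1)]

# Second pass: let pandas parse the file; short rows are filled with NaN.
raw.seek(0)
df = pd.read_csv(raw, names=names, skiprows=1, skipinitialspace=True)
print(df)
```

Reading the file twice costs little for ~15k lines and avoids guessing the column count.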

-2

tornesi Dec 18 '15 at 14:59

Padraic cunningham · Accepted Answer · 2015-12-18T15:39:05+0000

You can create columns based on the length of the first actual row:

    import pandas as pd
    from tempfile import TemporaryFile

    with open("out.txt") as f, TemporaryFile("w+") as t:
        # read the header line and count the fields in the first data row
        h, ln = next(f), len(next(f).split(","))
        header = h.strip().split(",")
        # rewind, then skip the header line again
        f.seek(0), next(f)
        # append ln numeric names for the columns that have no header
        header += range(ln)
        print(pd.read_csv(f, names=header))

Which will give you:

               SAMPLE_TIME         POS   OFF   HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56   0-0-0-0-3     1           2  0  5  59  0
    1  2015-07-15 16:42:55   0-0-0-0-3     1           0  0  5   9  0
    2  2015-07-15 16:43:55   0-0-0-0-3     1           0  0  5   5  0
    3  2015-07-15 16:44:56   0-0-0-0-3     1           2  0  5   0  0

       4  5 ...  13   14   15   16   17   18   19   20   21   22
    0  0  0 ...   0    0    0    0    0  NaN  NaN  NaN  NaN  NaN
    1  0  0 ...   0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    2  0  0 ...   4    0    0    0  NaN  NaN  NaN  NaN  NaN  NaN
    3  0  0 ...   0    0    0    0  NaN  NaN  NaN  NaN  NaN  NaN

    [4 rows x 27 columns]

Or you can clean the file before passing it to pandas:

    import pandas as pd
    from tempfile import TemporaryFile

    with open("in.csv") as f, TemporaryFile("w+") as t:
        for line in f:
            t.write(line.replace(" ", ""))
        t.seek(0)
        ln = len(line.strip().split(","))
        header = t.readline().strip().split(",")
        header += range(ln)
        print(pd.read_csv(t,names=header))

Which gives you:

              SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ... 11  \
    0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...  0
    1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...  0
    2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...  0
    3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...  0

       12  13   14   15   16   17   18   19   20
    0   0   0    0    0    0    0  NaN  NaN  NaN
    1  50   0  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    2   0   4    0    0    0  NaN  NaN  NaN  NaN
    3   6   0    0    0    0  NaN  NaN  NaN  NaN

    [4 rows x 25 columns]
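Removing every space also removes the one inside the timestamp, which is why the output above shows 2015-07-1516:41:56. A possible alternative, sketched here under the assumption that the only unwanted blanks are the ones right after each comma, is to leave the file untouched and pass `skipinitialspace=True` to `read_csv`. The shortened sample and the name `raw` are illustrative, not part of the original answer.

```python
import io
import pandas as pd

# Shortened stand-in for in.csv (assumed data, same layout as the question).
raw = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0\n"
)

# Strip the blanks from the header names by hand.
header = [name.strip() for name in raw.readline().split(",")]
header += range(2)  # two unnamed histogram columns in this tiny sample

# Rewind and let pandas skip the header line and the blank after each comma.
raw.seek(0)
df = pd.read_csv(raw, names=header, skiprows=1, skipinitialspace=True)
print(df)
```

The timestamp keeps its internal space because only whitespace that directly follows a delimiter is skipped.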

Or drop the columns that are all NaN:

    print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))

Which gives you:

               SAMPLE_TIME         POS   OFF   HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56   0-0-0-0-3     1           2  0  5  59  0
    1  2015-07-15 16:42:55   0-0-0-0-3     1           0  0  5   9  0
    2  2015-07-15 16:43:55   0-0-0-0-3     1           0  0  5   5  0
    3  2015-07-15 16:44:56   0-0-0-0-3     1           2  0  5   0  0

       4  5 ...  8  9  10  11  12  13   14   15   16   17
    0  0  0 ...  2  0   0   0   0   0    0    0    0    0
    1  0  0 ...  2  0   0   0  50   0  NaN  NaN  NaN  NaN
    2  0  0 ...  2  0   0   0   0   4    0    0    0  NaN
    3  0  0 ...  2  0   0   0   6   0    0    0    0  NaN

    [4 rows x 22 columns]
