read_csv with a missing / incomplete header or an irregular number of columns

I have a file.csv with ~15k lines that look like this:

    SAMPLE_TIME, POS, OFF, HISTOGRAM
    2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
    2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
    2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
    2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0

I want to import it into a pandas.DataFrame, with an arbitrary name assigned to every column that has no header, something like this:

    SAMPLE_TIME,          POS,        OFF,  HISTOGRAM  1   2   3    4   5    6
    2015-07-15 16:41:56,  0-0-0-0-3,  1,    2,         0,  5,  59,  4,  0,   0,
    2015-07-15 16:42:55,  0-0-0-0-3,  1,    0,         0,  5,  0,   6,  0,   nan
    2015-07-15 16:43:55,  0-0-0-0-3,  1,    0,         0,  5,  0,   7,  nan  nan
    2015-07-15 16:44:56,  0-0-0-0-3,  1,    2,         0,  5,  0,   0,  2,   nan

I could not get it to import. I tried other approaches, for example passing an explicit header, but with no luck. The only way I could make it work was to add the header to the .csv file by hand, which defeats the purpose of automation!


Then I tried this approach:

    lines = list(csv.reader(open('file.csv')))
    header, values = lines[0], lines[1:]

It reads the file correctly, giving me a list of ~15k elements. Each element is a list of strings, where each string is a correctly parsed data field from the file. But when I try to do this:

    data = {h: v for h, v in zip(header, zip(*values))}
    df = pd.DataFrame.from_dict(data)

or that:

    data2 = {h: v for h, v in zip(str(xrange(16)), zip(*values))}
    df2 = pd.DataFrame.from_dict(data)

then the columns without a header disappear and the column order gets completely mixed up. Any idea of a possible solution?

4 answers

You can create columns based on the length of the first actual row:

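    import pandas as pd

    with open("out.txt") as f:
        # Read the header line, then count the fields in the first data row.
        h, ln = next(f), len(next(f).split(","))
        header = h.strip().split(",")
        # Rewind and skip the header line so pandas only sees the data rows.
        f.seek(0), next(f)
        # Name the extra columns 0, 1, 2, ...
        header += range(ln)
        print(pd.read_csv(f, names=header))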

Which gives you:

               SAMPLE_TIME         POS   OFF   HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56   0-0-0-0-3     1           2  0  5  59  0
    1  2015-07-15 16:42:55   0-0-0-0-3     1           0  0  5   9  0
    2  2015-07-15 16:43:55   0-0-0-0-3     1           0  0  5   5  0
    3  2015-07-15 16:44:56   0-0-0-0-3     1           2  0  5   0  0

       4  5 ...   13  14  15  16  17  18  19  20  21  22
    0  0  0 ...    0   0   0   0   0 NaN NaN NaN NaN NaN
    1  0  0 ...    0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2  0  0 ...    4   0   0   0 NaN NaN NaN NaN NaN NaN
    3  0  0 ...    0   0   0   0 NaN NaN NaN NaN NaN NaN

    [4 rows x 27 columns]

Or you can clean up the file before passing it to pandas:

    import pandas as pd
    from tempfile import TemporaryFile

    with open("in.csv") as f, TemporaryFile("w+") as t:
        # Copy the file with every space removed.
        for line in f:
            t.write(line.replace(" ", ""))
        t.seek(0)
        # `line` is still the last row of the file here.
        ln = len(line.strip().split(","))
        header = t.readline().strip().split(",")
        header += range(ln)
        print(pd.read_csv(t, names=header))

Which gives you:

              SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
    0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0
    1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0
    2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0
    3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0

       12  13  14  15  16  17  18  19  20
    0   0   0   0   0   0   0 NaN NaN NaN
    1  50   0 NaN NaN NaN NaN NaN NaN NaN
    2   0   4   0   0   0 NaN NaN NaN NaN
    3   6   0   0   0   0 NaN NaN NaN NaN

    [4 rows x 25 columns]

Or drop the columns that are entirely NaN:

 print(pd.read_csv(f, names=header).dropna(axis=1,how="all")) 

Gives you:

               SAMPLE_TIME         POS   OFF   HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56   0-0-0-0-3     1           2  0  5  59  0
    1  2015-07-15 16:42:55   0-0-0-0-3     1           0  0  5   9  0
    2  2015-07-15 16:43:55   0-0-0-0-3     1           0  0  5   5  0
    3  2015-07-15 16:44:56   0-0-0-0-3     1           2  0  5   0  0

       4  5 ...   8  9  10  11  12  13  14  15  16  17
    0  0  0 ...   2  0   0   0   0   0   0   0   0   0
    1  0  0 ...   2  0   0   0  50   0 NaN NaN NaN NaN
    2  0  0 ...   2  0   0   0   0   4   0   0   0 NaN
    3  0  0 ...   2  0   0   0   6   0   0   0   0 NaN

    [4 rows x 22 columns]
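Both variants above size the header from a single row (the first or the last one), so the number of spare all-NaN columns depends on which row is sampled. Here is a minimal sketch of a variation, assuming the file is small enough to read twice from memory: scan every row for the widest one, then generate exactly that many names. `read_ragged_csv` is a hypothetical helper name, not something pandas provides, and the sample data is made up.

```python
import csv
import io

import pandas as pd


def read_ragged_csv(source):
    """Read CSV text whose data rows are longer than the header row.

    `source` is any open text file object (hypothetical helper).
    """
    text = source.read()
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header = rows[0]
    # The widest row decides how many columns the frame needs.
    width = max(len(row) for row in rows)
    # Keep the real names, then number the unnamed columns 0, 1, 2, ...
    names = header + list(range(width - len(header)))
    frame = pd.read_csv(io.StringIO(text), names=names,
                        skiprows=1, skipinitialspace=True)
    # A trailing comma produces an empty last field; drop columns
    # that end up empty in every row.
    return frame.dropna(axis=1, how="all")


sample = io.StringIO("A, B\n1, 2, 3, 4\n5, 6, 7\n")
print(read_ragged_csv(sample))
```

Short rows are padded with NaN by `read_csv` itself, so no temporary file is needed.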

You can split the HISTOGRAM column into a new DataFrame and concat it onto the original.

    print(df)
              SAMPLE_TIME,        POS,  OFF,  \
    0  2015-07-15 16:41:56  0-0-0-0-3,    1,
    1  2015-07-15 16:42:55  0-0-0-0-3,    1,
    2  2015-07-15 16:43:55  0-0-0-0-3,    1,
    3  2015-07-15 16:44:56  0-0-0-0-3,    1,

                                     HISTOGRAM
    0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
    1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
    2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
    3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0

    # create a new dataframe from column HISTOGRAM
    h = pd.DataFrame([x.split(',') for x in df['HISTOGRAM'].tolist()])
    print(h)
       0  1  2   3  4  5  6  7  8  9 10 11 12  13 14    15    16    17    18    19
    0  2  0  5  59  0  0  0  0  0  2  0  0  0   0  0     0     0     0     0
    1  0  0  5   9  0  0  0  0  0  2  0  0  0  50  0        None  None  None  None
    2  0  0  5   5  0  0  0  0  0  2  0  0  0   0  4     0     0     0        None
    3  2  0  5   0  0  0  0  0  0  2  0  0  0   6  0     0     0     0  None  None

    # append to the original, rename column 0
    df = pd.concat([df, h], axis=1).rename(columns={0: 'HISTOGRAM'})
    print(df)
                                     HISTOGRAM HISTOGRAM  1  2   3  4  5 ...  10  \
    0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,         2  0  5  59  0  0 ...   0
    1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,         0  0  5   9  0  0 ...   0
    2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,         0  0  5   5  0  0 ...   0
    3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0         2  0  5   0  0  0 ...   0

      11 12  13 14    15    16    17    18    19
    0  0  0   0  0     0     0     0     0
    1  0  0  50  0        None  None  None  None
    2  0  0   0  4     0     0     0        None
    3  0  0   6  0     0     0     0  None  None

    [4 rows x 24 columns]
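The same split can also be sketched with the vectorised string methods, which additionally avoids the duplicate HISTOGRAM column and the empty strings left by trailing commas. This is only an illustration on a made-up two-row frame; it assumes the histogram is already available as one comma-joined string column, as in the answer above.

```python
import pandas as pd

# Hypothetical two-row stand-in for the frame shown above.
df = pd.DataFrame({
    "SAMPLE_TIME": ["2015-07-15 16:41:56", "2015-07-15 16:42:55"],
    "HISTOGRAM": ["2,0,5,59,", "0,0,5,"],
})

# Drop the trailing comma, then split into one column per value.
parts = df["HISTOGRAM"].str.rstrip(",").str.split(",", expand=True)
# Convert the strings to numbers; the padding in short rows becomes NaN.
parts = parts.apply(pd.to_numeric, errors="coerce")
# Call the first bin HISTOGRAM and number the rest 1, 2, 3, ...
parts.columns = ["HISTOGRAM"] + list(range(1, parts.shape[1]))

# Replace the original string column with the split columns.
result = pd.concat([df.drop(columns="HISTOGRAM"), parts], axis=1)
print(result)
```

Dropping the original string column before the `concat` keeps every column name unique, which makes later lookups such as `result["HISTOGRAM"]` unambiguous.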

So how about this? I made a csv from your sample data.

When I import the lines:

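    import csv

    # Python 3: open in text mode with newline='' (the original used 'rb' on Python 2).
    with open('test.csv', newline='') as f:
        lines = list(csv.reader(f))
    headers, values = lines[0], lines[1:]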

To generate nice header names, use this line:

 headers = [i or ind for ind, i in enumerate(headers)] 

Because of the way (I assume) csv works, headers should contain a bunch of empty strings for the unnamed columns. Empty strings evaluate to False, so this comprehension returns a numbered name for each column without a header.

Then just build the df:

 df = pd.DataFrame(values,columns=headers) 

which looks like this:

            SAMPLE_TIME        POS OFF HISTOGRAM  4  5   6  7  8  9  \
    0  15/07/2015 16:41  0-0-0-0-3   1         2  0  5  59  0  0  0
    1  15/07/2015 16:42  0-0-0-0-3   1         0  0  5   9  0  0  0
    2  15/07/2015 16:43  0-0-0-0-3   1         0  0  5   5  0  0  0
    3  15/07/2015 16:44  0-0-0-0-3   1         2  0  5   0  0  0  0

      ... 12 13 14 15  16 17 18 19 20 21
    0 ...  2  0  0  0   0  0  0  0  0  0
    1 ...  2  0  0  0  50  0
    2 ...  2  0  0  0   0  4  0  0  0
    3 ...  2  0  0  0   6  0  0  0  0

    [4 rows x 22 columns]
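The comprehension only helps when the header row already contains one empty string per unnamed column, which is the case if the file was re-saved by a spreadsheet but possibly not for the raw file from the question. As a rough sketch under that assumption, the header can be padded to the widest row first. `name_columns` is a hypothetical helper and the inline data is made up.

```python
import csv
import io

import pandas as pd


def name_columns(headers, width):
    """Pad `headers` to `width` entries and number the unnamed ones."""
    padded = headers + [""] * (width - len(headers))
    # An empty string is falsy, so `or` falls back to the position.
    return [name or position for position, name in enumerate(padded)]


raw = io.StringIO(
    "SAMPLE_TIME,POS\n"
    "2015-07-15 16:41:56,0-0-0-0-3,1,2\n"
    "2015-07-15 16:42:55,0-0-0-0-3,1\n"
)
lines = list(csv.reader(raw))
headers, values = lines[0], lines[1:]

# Pad short rows with None so that every row has the same length.
width = max(len(row) for row in values)
values = [row + [None] * (width - len(row)) for row in values]

df = pd.DataFrame(values, columns=name_columns(headers, width))
print(df)
```

Note that every value is still a string here, because `csv.reader` does no type conversion.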

Assuming your data is in a file called foo.csv, you can do the following. It has been tested against pandas 0.17.

    df = pd.read_csv('foo.csv',
                     names=['sample_time', 'pos', 'off', 'histogram',
                            '1', '2', '3', '4', '5', '6', '7', '8', '9',
                            '10', '11', '12', '13', '14', '15', '16', '17'],
                     skiprows=1)
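The list of names does not have to be typed out by hand. A small sketch that builds the same 21 names programmatically and reads from an in-memory sample instead of foo.csv; the two sample rows are shortened, made-up data, and `skipinitialspace=True` is an addition here to strip the blank after each comma.

```python
import io

import pandas as pd

# Four real names followed by '1' .. '17', as in the call above.
names = ["sample_time", "pos", "off", "histogram"] + [str(i) for i in range(1, 18)]

# Shortened stand-in for foo.csv.
sample = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5\n"
    "2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0\n"
)

# skiprows=1 skips the incomplete header line; short rows are padded with NaN.
df = pd.read_csv(sample, names=names, skiprows=1, skipinitialspace=True)
print(df.shape)
```

The list has to be at least as long as the widest row in the file, so the upper bound of the `range` may need adjusting for the real data.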

Source: https://habr.com/ru/post/1238646/

