How to use square brackets as quotation marks in pandas.read_csv

Say I have a text file that looks like this:

```
Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
```

What I would like to do is read that file using pandas.read_csv, but the second line throws an error. Here is the code I'm using now:

```python
import pandas as pd

df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I tried setting the quotechar to "[", but this just absorbs everything until the next opening bracket, and adding a closing bracket ("[]") leads to the error "string of length 2 found". Any insight would be greatly appreciated. Thanks!
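For reference, the dead end above is easy to reproduce: pandas requires `quotechar` to be a single character, so the two-character "[]" is rejected before parsing even begins. A minimal sketch with the sample data inlined (the exact exception type may vary by pandas version, so both candidates are caught):

```python
import io

import pandas as pd

data = ('Item,Date,Time,Location\n'
        '1,01/01/2016,13:41,[45.2344:-78.25453]\n'
        '2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]\n')

# quotechar must be exactly one character; "[]" is refused up front
try:
    pd.read_csv(io.StringIO(data), quotechar='[]')
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(rejected)  # the two-character quote never reaches the parser
```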

Update

Three main solutions were proposed: 1) give the data frame a long range of column names so that all the data is read in, then post-process it; 2) find the values in square brackets and put quotation marks around them; or 3) replace the first few commas on each line with semicolons.

I don't think option 3 is viable in general (although it works well for my data), because a) what if one of my columns contains quoted values with commas in them, and b) what if the column with square brackets is not the last column? That leaves solutions 1 and 2. I find solution 2 more readable, but solution 1 was faster at 1.38 seconds, compared to 3.02 seconds for solution 2. The tests were run on a text file with 18 columns and more than 208,000 rows.

3 answers

I think you can first replace the first 3 occurrences of , in each line of the file with ; , and then use sep=";" in read_csv:

```python
import io

import pandas as pd

with open('file2.csv', 'r') as f:
    lines = f.readlines()

fo = io.StringIO()
fo.writelines(line.replace(',', ';', 3) for line in lines)
fo.seek(0)

df = pd.read_csv(fo, sep=';')
print(df)
```

```
  Item        Date   Time                            Location
0    1  01/01/2016  13:41                 [45.2344:-78.25453]
1    2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2    3  01/10/2016  01:27                 [51.2344:-86.24432]
```

Or you can try this more involved approach, since the core problem is that the separator between the values inside the lists is the same as the separator between the other columns' values.

So you need post processing:

```python
import io

import pandas as pd

temp = u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
# after testing, replace io.StringIO(temp) with the filename

# names=range(10) is an estimated maximum number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print(df)
```

```
         0           1      2                    3               4  \
0     Item        Date   Time             Location             NaN
1        1  01/01/2016  13:41  [45.2344:-78.25453]             NaN
2        2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242
3        3  01/10/2016  01:27  [51.2344:-86.24432]             NaN

                 5    6    7    8    9
0              NaN  NaN  NaN  NaN  NaN
1              NaN  NaN  NaN  NaN  NaN
2  41.2342:-81242]  NaN  NaN  NaN  NaN
3              NaN  NaN  NaN  NaN  NaN
```
```python
# drop columns that contain only NaN
df = df.dropna(how='all', axis=1)
# use the first row as the column names
df.columns = df.iloc[0, :]
# drop the first row
df = df[1:]
# clear the columns' name
df.columns.name = None

# position of the column Location
print(df.columns.get_loc('Location'))
```

```
3
```

```python
# df1 holds the Location values
df1 = df.iloc[:, df.columns.get_loc('Location'):]
print(df1)
```

```
              Location                 NaN              NaN
1  [45.2344:-78.25453]                 NaN              NaN
2   [43.3423:-79.23423      41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]                 NaN              NaN
```

```python
# combine the values into one column
df['Location'] = df1.apply(
    lambda x: ', '.join([e for e in x if isinstance(e, str)]), axis=1)

# subset of the desired columns
print(df[['Item', 'Date', 'Time', 'Location']])
```

```
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]
```

I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a fairly simple preprocessing step:

```python
import io
import re

import pandas as pd

# regular expression to capture the contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')

with open('path/to/file.txt', 'r') as fi:
    # replace the brackets with quotes, pipe into a file-like object
    fo = io.StringIO()
    fo.writelines(re.sub(location_regex, r'"\1"', line) for line in fi)

# rewind the file-like object to the beginning
fo.seek(0)

# read the transformed CSV into a data frame
df = pd.read_csv(fo)
print(df)
```

This gives you a result like

```
            Date_Time  Item                             Location
0 2016-01-01 13:41:00     1                  [45.2344:-78.25453]
1 2016-01-03 19:11:00     2  [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00     3                  [51.2344:-86.24432]
```

Edit: If memory is not a problem, you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.

```python
# regular expression to capture the contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)

with open('path/to/file.csv', 'r') as fi:
    data = re.sub(location_regex, r'"\1"', fi.read())

df = pd.read_csv(io.StringIO(data))
```

If you know in advance that the only brackets in the document are the ones surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify even further (Max offers a line-by-line version of this, but I think the iteration is unnecessary):

```python
with open('/path/to/file.csv', 'r') as fi:
    data = fi.read().replace('[', '"').replace(']', '"')

df = pd.read_csv(io.StringIO(data))
```

Below are the timing results I got with a data set of 200,000 rows by 3 columns. Each time is averaged over 10 trials.

  • jezrael's post-processing data-frame solution: 2.19 s
  • line-by-line regex: 1.36 s
  • bulk regex: 0.39 s
  • bulk string replace: 0.14 s
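For anyone who wants to reproduce the comparison of the two bulk variants, here is a sketch of a benchmark harness. A small in-memory sample stands in for the file, so the absolute timings will not match the numbers above; the row count is illustrative only:

```python
import io
import re
import timeit

import pandas as pd

# in-memory stand-in for the file on disk
row = '1,01/01/2016,13:41,[43.3423:-79.23423,41.2342:-81242]\n'
raw = 'Item,Date,Time,Location\n' + row * 10000

location_regex = re.compile(r'\[([^\[\]]+)\]')

def bulk_regex():
    # one regex substitution over the whole file contents
    return pd.read_csv(io.StringIO(location_regex.sub(r'"\1"', raw)))

def bulk_replace():
    # two plain string replacements over the whole file contents
    return pd.read_csv(io.StringIO(raw.replace('[', '"').replace(']', '"')))

for fn in (bulk_regex, bulk_replace):
    elapsed = timeit.timeit(fn, number=3) / 3
    print('%s: %.3f s per run' % (fn.__name__, elapsed))
```

Both variants strip the brackets and quote the contents, so they produce identical data frames; only the preprocessing cost differs.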

We can use a simple trick: quote the balanced square brackets with double quotes:

```python
import re

import pandas as pd
import six

data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""

print('{0:-^70}'.format('original data'))
print(data)

data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)

df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))
pd.set_option('display.expand_frame_repr', False)
print(df)
```

Output:

```
----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
   Item        Date   Time                                           Location        junk
0     1  01/01/2016  13:41                                [45.2344:-78.25453]  [aaaa,bbb]
1     2  01/03/2016  19:11                 [43.3423:-79.23423,41.2342:-81242]   [0,1,2,3]
2     3  01/10/2016  01:27                                [51.2344:-86.24432]     [12,13]
3     4  01/30/2016  05:55  [51.2344:-86.24432,41.2342:-81242,55.5555:-81242]  [45,55,65]
```

UPDATE: if you are sure that all square brackets are balanced, we do not need a regex at all:

```python
import io

import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)
```
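Once the file reads cleanly, you may still want the coordinates as numbers rather than bracketed strings. A sketch of one way to split each Location cell into (lat, lon) float pairs; the helper name `parse_location` and the inlined sample are my own additions, not part of the answer above:

```python
import io

import pandas as pd

raw = '''Item,Date,Time,Location
1,01/01/2016,13:41,"[45.2344:-78.25453]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]"
3,01/10/2016,01:27,"[51.2344:-86.24432]"'''

df = pd.read_csv(io.StringIO(raw))

def parse_location(cell):
    """Turn '[lat:lon,lat:lon,...]' into a list of (lat, lon) float tuples."""
    pairs = cell.strip('[]').split(',')
    return [tuple(float(v) for v in pair.split(':')) for pair in pairs]

df['Location'] = df['Location'].map(parse_location)
print(df['Location'].iloc[1])  # → [(43.3423, -79.23423), (41.2342, -81242.0)]
```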

Source: https://habr.com/ru/post/1244914/

