How to use the Python csv module to split data separated by a double pipe

I have data that looks like this:

"1234"||"abcd"||"a1s1" 

I am trying to read and write using Python csv reader and writer. Since the csv module delimiter is limited to one char, is there a way to extract data in a clean way? I cannot afford to delete empty columns, as this is a massive array of data that will be processed over time. Any thoughts will be helpful.

+6
4 answers

Both the documentation and experimentation confirm that only single-character delimiters are allowed.

Since csv.reader accepts any object that supports the iterator protocol, you can use a generator expression to replace each || with |, and then pass that generator to the reader:

    import csv

    def read_this_funky_csv(source):
        # Be sure to pass a source object that supports iteration
        # (e.g. a file object, or a list of CSV text lines).
        return csv.reader((line.replace('||', '|') for line in source), delimiter='|')

This code is fairly memory-efficient because it processes one CSV line at a time, as long as each line yielded by your CSV source fits in RAM :)
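As a rough usage sketch of the generator approach above: the function name `read_double_pipe_csv` and the sample data are made up for illustration, and `io.StringIO` merely stands in for an open file. It assumes that no field contains a literal pipe character, since the replacement step cannot tell a pipe inside a quoted field from a delimiter.

```python
import csv
import io

def read_double_pipe_csv(source):
    """Return a csv.reader over an iterable of '||'-delimited lines."""
    # Collapse each '||' into '|' lazily, one line at a time.
    return csv.reader((line.replace('||', '|') for line in source), delimiter='|')

# io.StringIO stands in for an open file (assumption for this demo).
sample = io.StringIO('"1234"||"abcd"||"a1s1"\n"5678"||""||"x y"\n')
rows = list(read_double_pipe_csv(sample))
print(rows)  # [['1234', 'abcd', 'a1s1'], ['5678', '', 'x y']]
```

Note that the empty middle column of the second row survives as an empty string, so column positions stay aligned.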

+12
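    >>> import csv
    >>> reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
    >>> for row in reader:
    ...     # Every odd-indexed field sits between the two pipes and must be empty.
    ...     assert not ''.join(row[1::2])
    ...     # Keep only the even-indexed fields, which hold the real data.
    ...     row = row[0::2]
    ...     print(row)
    ...
    ['1234', 'abcd', 'a1s1']
    >>>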
+2

Unfortunately, the delimiter is represented by a single character in the underlying C implementation, so in Python it cannot be anything other than a single character. The good news is that you can ignore the values that are empty:

    import csv

    reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
    # Iterate through the reader.
    for x in reader:
        # Use a numeric range here to ensure that you eliminate the
        # right things.
        for i in range(len(x)):
            # Odd indexes will be discarded.
            if i % 2 == 0:
                # x[i] where i % 2 == 0 represents the values you want.
                print(x[i])

There are other ways to do this (you could wrap it in a function, for one), but this gives you the necessary logic.
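One way the function mentioned above might look, as a minimal sketch: the name `split_double_pipe` is hypothetical, and the check on the odd-indexed fields is an added safeguard rather than part of the original answer. It assumes every delimiter in the data is exactly a double pipe.

```python
import csv

def split_double_pipe(lines):
    """Yield the data fields of each '||'-delimited line."""
    for row in csv.reader(lines, delimiter='|'):
        # Odd-indexed fields sit between the two pipes of each '||'
        # and should always be empty; anything else means a stray '|'.
        if any(row[1::2]):
            raise ValueError('unexpected single pipe in row: %r' % (row,))
        # Even-indexed fields are the real values, empty columns included.
        yield row[0::2]

rows = list(split_double_pipe(['"1234"||"abcd"||"a1s1"', '"a"||""||"c"']))
print(rows)  # [['1234', 'abcd', 'a1s1'], ['a', '', 'c']]
```

Because it is a generator, rows are produced one at a time, so it can be fed an open file object directly.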

+1

If your data literally looks like the example (fields never contain "||" and are always quoted) and you can tolerate the quotation marks or are willing to strip them later, just use .split:

    >>> '"1234"||"abcd"||"a1s1"'.split('||')
    ['"1234"', '"abcd"', '"a1s1"']
    >>> list(s[1:-1] for s in '"1234"||"abcd"||"a1s1"'.split('||'))
    ['1234', 'abcd', 'a1s1']

The csv module is only needed if the delimiter can appear inside fields, or if you want it to strip the optional quotation marks around the fields.
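To illustrate that distinction, here is a small comparison on a made-up line whose first field contains the delimiter inside quotes. The sample string is an assumption chosen for the demo, not data from the question.

```python
import csv

# Hypothetical line: the first quoted field itself contains '||'.
line = '"a||b"||"c"'

# Plain str.split cuts through the quoted field.
naive = line.split('||')

# csv.reader respects the quotes; the odd-indexed empty fields are dropped.
parsed = next(csv.reader([line], delimiter='|'))[0::2]

print(naive)   # ['"a', 'b"', '"c"']
print(parsed)  # ['a||b', 'c']
```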

+1

Source: https://habr.com/ru/post/890565/