Parsing a TSV file

I need to parse a file in TSV format (tab-separated values). I use a regex to split the file into lines, but I cannot find a satisfactory one to parse each line. So far I have come up with this:

(?<g>("[^"]+")+|[^\t]+)

But this fails when an item contains more than two consecutive double quotes.

Here's how the file is formatted: items are separated by tabs. If an item contains a tab, it is enclosed in double quotation marks. If an item contains a double quote, that quote is doubled. But sometimes an item contains 4 consecutive double quotes, and the above expression splits it into 2 different items.

Examples (\t denotes a tab character):

item1ok\t"item""2""oK"

is correctly parsed into 2 elements: item1ok and item"2"oK (after trimming the unnecessary quotes), but:

item1oK\t"item""""2oK"

is parsed into 3 elements: item1oK, item and 2oK (again after trimming).
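
The behaviour can be reproduced with a short sketch. It is written in Python rather than C#, so the named group is spelled (?P<g>...) instead of (?<g>...); the helper name split_fields is made up for illustration, and it is an assumption that the .NET engine behaves the same way for this pattern.

```python
import re

# The pattern from the question, with the named group rewritten for Python.
pattern = re.compile(r'(?P<g>("[^"]+")+|[^\t]+)')

def split_fields(line):
    """Return every substring the pattern matches, in order."""
    return [m.group('g') for m in pattern.finditer(line)]

good = 'item1ok\t"item""2""oK"'
bad = 'item1oK\t"item""""2oK"'

print(split_fields(good))  # ['item1ok', '"item""2""oK"']
print(split_fields(bad))   # ['item1oK', '"item"', '"""2oK"']
```

The second line splits because "[^"]+" requires at least one non-quote character between quotes, so the repetition stops at the first pair of adjacent quotes.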

Does anyone know how to make the regex handle this case? Or is there another solution for simple TSV parsing? (I am doing this in C#.)

+3
4 answers

You can use TextFieldParser. It ships in a Visual Basic assembly, but you can use it from C# as well by referencing the Microsoft.VisualBasic assembly; the class lives in the Microsoft.VisualBasic.FileIO namespace.


+7

Rather than parsing CSV/TSV by hand (for example with String.Split), consider an existing library such as "Fast CSV Reader" or "FileHelpers".


+6
0

Not C#, but here is a regex-based approach (in Python):

import re

txt = 'item1ok\t"item""2""oK"\titem1oK\t"item""""2oK"\tsomething else'
# Raw string: otherwise \t becomes a literal tab, which re.X silently ignores.
regex = r'''
(?:                    # definition of a field
 "((?:[^"]|"")*)"   # either a double-quoted field (allowing doubled "")
 |                  # or
 ([^"\t]*)          # an unquoted field: no double quotes, no tabs
)                      # end of field
(?:$|\t)               # each field is followed by a tab (or end of input)
'''
r = re.compile(regex, re.X)
# Collect each match, replacing "" with " in quoted fields.
# findall also yields an empty match at the very end of the input, so drop it.
columns = [t[0].replace('""', '"') if t[0] != '' else t[1] for t in r.findall(txt)][:-1]
print(columns)
# prints: ['item1ok', 'item"2"oK', 'item1oK', 'item""2oK', 'something else']
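
If a regex is not a requirement, the same format (tab delimiter, double-quoted fields, doubled quotes inside them) can also be handled by Python's standard csv module. This is only a sketch: the function name parse_tsv_line is made up for illustration, and it assumes each line is already available as a separate string.

```python
import csv

def parse_tsv_line(line):
    """Split one TSV line into fields using the standard csv module."""
    # delimiter='\t' switches the reader from commas to tabs;
    # doublequote=True (the default) turns "" inside a quoted field into ".
    reader = csv.reader([line], delimiter='\t', quotechar='"', doublequote=True)
    return next(reader)

txt = 'item1ok\t"item""2""oK"\titem1oK\t"item""""2oK"\tsomething else'
columns = parse_tsv_line(txt)
print(columns)
```

For a whole file, passing the open file object to csv.reader with the same arguments yields one list of fields per record.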
-1

Source: https://habr.com/ru/post/1736158/

