I need to parse a file in TSV format (tab-separated values). I use a regex to split the file into lines, but I cannot find a satisfactory one to parse each line. So far I have come up with this:
(?<g>("[^"]+")+|[^\t]+)
But this does not work if an item in the line has more than two consecutive double quotes.
Here's how the file is formatted: elements are separated by tabs. If an item contains a tab, it is enclosed in double quotes. If an item contains a double quote, that quote is doubled. But sometimes an element contains 4 consecutive double quotes, and the above expression splits that element into 2 different ones.
Examples:
item1ok	"item""2""oK"
is correctly parsed into 2 elements: item1ok and item"2"oK (after trimming the unnecessary quotes), but:
item1oK	"item""""2oK"
is parsed into 3 elements: item1oK, item and 2oK (after trimming again).
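To pin down the behaviour I am after, here is a minimal sketch of the rules above as a character-by-character splitter instead of a regex. The class and method names (`TsvLine`, `Split`) are hypothetical, and it assumes an opening quote only counts at the very start of a field and that a line never contains an unterminated quoted field.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the names are illustrative, not from any library.
public static class TsvLine
{
    // Splits one line into fields: fields are separated by tabs, a field may
    // be wrapped in double quotes, and "" inside a quoted field is one literal ".
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"')
                {
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        current.Append('"'); // doubled quote -> one literal quote
                        i++;                 // skip the second quote of the pair
                    }
                    else
                    {
                        inQuotes = false;    // single quote closes the field
                    }
                }
                else
                {
                    current.Append(c);       // tabs are kept while inside quotes
                }
            }
            else if (c == '"' && current.Length == 0)
            {
                inQuotes = true;             // assumption: opening quote only at field start
            }
            else if (c == '\t')
            {
                fields.Add(current.ToString()); // unquoted tab ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());      // last field has no trailing tab
        return fields;
    }
}
```

With this sketch, `TsvLine.Split("item1oK\t\"item\"\"\"\"2oK\"")` returns two fields, `item1oK` and `item""2oK`, which is the result I would like the regex to produce as well.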
Does anyone know how to make the regex match this case? Or is there another solution for simple TSV parsing? (I am doing this in C#.)