Parse CSV with empty fields, escaped quotes, and commas using awk

I am happily using gawk with FPAT. Here is the script I use for my examples:

    #!/usr/bin/gawk -f
    BEGIN {
        FPAT="([^,]*)|(\"[^\"]+\")"
    }
    {
        for (i=1; i<=NF; i++) {
            printf "Record #%s, field #%s: %s\n", NR, i, $i
        }
    }

Simple, without quotes

It works well.

    $ echo 'a,b,c,d' | ./test.awk
    Record #1, field #1: a
    Record #1, field #2: b
    Record #1, field #3: c
    Record #1, field #4: d

With quotes

It works well.

    $ echo '"a","b",c,d' | ./test.awk
    Record #1, field #1: "a"
    Record #1, field #2: "b"
    Record #1, field #3: c
    Record #1, field #4: d

With empty columns and quotation marks

It works well.

    $ echo '"a","b",,d' | ./test.awk
    Record #1, field #1: "a"
    Record #1, field #2: "b"
    Record #1, field #3:
    Record #1, field #4: d

With escaped quotation marks, empty columns and quotation marks

It works well.

    $ echo '"""a"": aaa","b",,d' | ./test.awk
    Record #1, field #1: """a"": aaa"
    Record #1, field #2: "b"
    Record #1, field #3:
    Record #1, field #4: d

With a column containing escaped quotes and ending with a comma

Fails.

    $ echo '"""a"": aaa,","b",,d' | ./test.awk
    Record #1, field #1: """a"": aaa
    Record #1, field #2: ","
    Record #1, field #3: b"
    Record #1, field #4:
    Record #1, field #5: d

Expected Result:

    $ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk
    Record #1, field #1: """a"": aaa,"
    Record #1, field #2: "b"
    Record #1, field #3:
    Record #1, field #4: d

Is there a regex for FPAT that will make this work, or is it just not supported by awk?

The pattern I need is: a " followed by anything except a single ". A bracket expression matches one character at a time, so it cannot treat the two-character sequence "" as a unit.

I think there might be a lookaround option, but I'm not good enough to make it work.

1 answer

Since awk's FPAT does not support lookarounds, you need to be explicit in your pattern. This will do:

 FPAT="[^,\"]*|\"([^\"]|\"\")*\"" 

Explanation:

    [^,\"]*     # match 0 or more times any character except , and "
    |           # OR
    \"          # match '"'
    ([^\"]      # followed by 0 or more anything but '"'
    |           # OR
    \"\"        # '""'
    )*
    \"          # ending with '"'

Now test it:

    $ cat tst.awk
    BEGIN {
        FPAT="[^,\"]*|\"([^\"]|\"\")*\""
    }
    {
        for (i=1; i<=NF; i++){
            printf "Record #%s, field #%s: %s\n", NR, i, $i
        }
    }
    $ echo '"""a"": aaa,","b",,d' | awk -f tst.awk
    Record #1, field #1: """a"": aaa,"
    Record #1, field #2: "b"
    Record #1, field #3:
    Record #1, field #4: d

Source: https://habr.com/ru/post/1273085/

