CSV parsing with embedded double quotes

I wrote a simple CSV parser. But after reading the wiki page on CSV formats, I noticed some “extensions” to the basic format, in particular embedded commas enclosed in double quotes. I managed to parse those, but there is a second problem: embedded double quotes.

Example:

12345,"ABC, ""IJK"" XYZ" → [12345] and [ABC, "IJK" XYZ]

I can't seem to find the right way to distinguish an embedded double quote from one that closes the field. So my question is: what is the correct method or algorithm for parsing CSV input like the above?

+4
5 answers

The way I usually think about it is to treat a quoted value as either a single unquoted value or a sequence of double-quoted segments that are joined by quote characters to form one value. That is:

  • to parse the next atom in a row:
    • read up to the first non-whitespace character
    • if the current character is not a quote:
      • mark the current position
      • read up to the next comma or newline
      • return the text between the mark and the character before the comma (trimming whitespace if desired)
    • if the current character is a quote:
      • create an empty string buffer
      • while the current character is a quote:
        • mark the current position + 1 (to skip the quote character)
        • read up to the next quote
        • if the buffer is not empty, append a quote to it
        • append the text between the mark and the character before the current position to the buffer (to strip both quotes)
        • advance one character (past the quote just read)
      • read up to the next comma or newline
      • return the buffer

Essentially, split the quoted string into its double-quoted segments and then join them back together with a quote character. Thus "ABC, ""IJK"" XYZ" becomes the three segments [ABC, ], [IJK] and [ XYZ], which in turn become ABC, "IJK" XYZ.
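The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's code: the names `parse_atom` and `parse_row` are made up for illustration, it assumes a comma delimiter and a single row of input, and it joins the segments with `'"'.join(...)` instead of testing whether the buffer is empty, so that a field starting with an escaped quote also keeps its quote.

```python
def parse_atom(line, pos=0):
    """Parse one field starting at pos.

    Returns (value, index just past the delimiter that ended the field).
    """
    n = len(line)
    # Read up to the first non-whitespace character.
    while pos < n and line[pos] in " \t":
        pos += 1
    if pos >= n or line[pos] != '"':
        # Unquoted field: everything up to the next comma or newline.
        mark = pos
        while pos < n and line[pos] not in ",\n":
            pos += 1
        return line[mark:pos].strip(), pos + 1
    # Quoted field: collect the text between each pair of quotes.
    segments = []
    while pos < n and line[pos] == '"':
        mark = pos + 1                  # skip the opening quote
        end = line.find('"', mark)      # read up to the next quote
        if end == -1:
            raise ValueError("unterminated quoted field")
        segments.append(line[mark:end])
        pos = end + 1                   # advance past the quote just read
    # Read up to the next comma or newline.
    while pos < n and line[pos] not in ",\n":
        pos += 1
    return '"'.join(segments), pos + 1


def parse_row(line):
    """Split one CSV row into fields using parse_atom."""
    fields, pos = [], 0
    while True:
        value, pos = parse_atom(line, pos)
        fields.append(value)
        # The delimiter just consumed sits at pos - 1; stop unless it was a comma.
        if pos > len(line) or line[pos - 1] != ",":
            return fields


print(parse_row('12345,"ABC, ""IJK"" XYZ"'))
```

With the row from the question, `parse_row` yields the two fields 12345 and ABC, "IJK" XYZ.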

+5

I would do this with one character of lookahead: if you are scanning a quoted string and find a double quote, look at the next character to see whether it is also a double quote. If so, the pair represents a single double-quote character in the output. If it is some other character, you are at the end of the quoted string (and hopefully the next character is a comma!). Be sure to handle the end-of-line condition when looking at the next character.
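A minimal Python sketch of this lookahead idea, under a few assumptions of mine: the function name `parse_row_lookahead` is hypothetical, the input is a single row, the delimiter is a comma, and whitespace around fields is kept as-is.

```python
def parse_row_lookahead(line):
    """Split one CSV row, resolving quotes with one character of lookahead."""
    fields, buf = [], []
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                # Peek at the next character, guarding against end of line.
                if i + 1 < n and line[i + 1] == '"':
                    buf.append('"')    # doubled quote: one literal quote
                    i += 1             # consume the second quote as well
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True           # opening quote
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        elif ch not in "\r\n":
            buf.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


print(parse_row_lookahead('12345,"ABC, ""IJK"" XYZ"'))
```

The `i + 1 < n` check is the end-of-line guard the answer warns about: a quote that is the last character of the row is treated as a closing quote.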

+2

A doubled double quote ( "" ) is a literal double quote, and a single double quote ( " ) is used to enclose text (including commas).

Here is a regular expression for a CSV field, if that makes the task easier:

 ([^",\n][^,\n]*)|"((?:[^"]|"")+)" 

Group 1 will contain the field if it is not quoted; group 2 will contain the field if it is quoted, minus the surrounding quotation marks. In the latter case, just replace all instances of "" with " .
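As an illustration, here is how that pattern might be applied with Python's `re` module. The names `FIELD` and `fields_from_regex` are mine, not part of the answer. Note that the pattern as written requires at least one character per field, so empty fields (and an empty quoted field such as "") are skipped, and a space before an opening quote makes the field count as unquoted.

```python
import re

# The pattern from the answer, unchanged.
FIELD = re.compile(r'([^",\n][^,\n]*)|"((?:[^"]|"")+)"')


def fields_from_regex(line):
    """Extract the fields of one CSV row using the pattern above."""
    out = []
    for m in FIELD.finditer(line):
        if m.group(2) is not None:
            # Quoted field: collapse each doubled quote into a single one.
            out.append(m.group(2).replace('""', '"'))
        else:
            # Unquoted field: use it as-is.
            out.append(m.group(1))
    return out


print(fields_from_regex('12345,"ABC, ""IJK"" XYZ"'))
```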

+1

If you find a double quote, you need to look for the matching double quote at the end of the word or string. If you cannot find one, it is an error. The same goes for a single quote.

I suggest you try Flex / Bison to write a parser for the CSV file. Both tools help you generate the parser; you can then take the generated C files and call the parser from your C++ program. In Flex, you create a scanner that finds your tokens, such as an unquoted word or a quoted “word”. In Bison, you define the syntax.

+1

I suggest reading “Stop Rolling Your Own CSV Parser” and the CSV RFC (RFC 4180). The first is really just someone who wants you to use their C# CSV parser, but it still explains many of the problems.

Your parser should process one character at a time. I used a two-bool strategy for my parser in D. Each quote toggles whether you are inside a quoted cell or not. When you hit a quote inside a quoted cell, you set a flag and turn quoting off. If the next character is also a quote, quoting is turned back on, a quote is added to the result, and the flag is cleared. If the next character is not a quote, the flag is cleared and quoting stays off.
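A rough Python sketch of that two-flag approach, as I read the description (the original parser was written in D and is not shown here). The function name `parse_row_two_flags` and the flag names `quoted` and `pending_quote` are assumptions for illustration, and the sketch handles a single comma-delimited row.

```python
def parse_row_two_flags(line):
    """Split one CSV row using two booleans and no lookahead."""
    fields, buf = [], []
    quoted = False         # currently inside a quoted cell
    pending_quote = False  # previous character was a quote seen while quoted
    for ch in line:
        if pending_quote:
            pending_quote = False
            if ch == '"':
                # Two quotes in a row: a literal quote, and quoting resumes.
                quoted = True
                buf.append('"')
                continue
            # Otherwise the earlier quote closed the cell; handle ch normally.
        if quoted:
            if ch == '"':
                quoted = False        # tentatively leave the quoted cell
                pending_quote = True  # the next character decides what it was
            else:
                buf.append(ch)
        elif ch == '"':
            quoted = True
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        elif ch not in "\r\n":
            buf.append(ch)
    if quoted:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


print(parse_row_two_flags('12345,"ABC, ""IJK"" XYZ"'))
```

Compared with peeking at the next character, this variant defers the decision by one step, which suits input that arrives one character at a time.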

+1

Source: https://habr.com/ru/post/1332622/

