Strict definition for reading / writing CSV files

I wrote my own CSV reader/writer in C to store records in a character column of an ODBC database. Unfortunately, I kept finding edge cases that broke my implementation, and I have come to the conclusion that my real problem is that I never strictly defined the rules for CSV. I read RFC 4180, but it seems incomplete and leaves ambiguities unresolved.

For example, should "" be read as an empty token or as a double quote? Do quotes match from the outside in or from left to right? What should I do with an input string that has unmatched quotes? The real mess starts when I have nested tokens, which doubles the escaped quote characters.

I really need a definitive CSV standard that I can implement in code. Every time I feel like I have covered every corner case, I find another one. I am sure this problem has been solved many times by minds superior to mine, so has anyone written a strict definition of CSV that I can implement in code? I understand that C is not an ideal language here, but at this point I have no choice regarding the compiler. I also cannot use a third-party library unless it compiles as C90. Boost is not an option, since my compiler does not support C++. I considered XML instead of CSV, but it seems too bloated for storing several tokens in a database field of 256 characters. Has anyone produced a definitive CSV specification?

+6
3 answers

There is no standard (see the Wikipedia article, in particular http://en.wikipedia.org/wiki/Comma-separated_values#Lack_of_a_standard ), so to use CSV you need to follow the general principle of being conservative in what you generate and liberal in what you accept. In particular:

  • Do not use quotation marks for empty fields. Just write an empty field (two adjacent delimiters or a delimiter in the first / last position of the line).
  • Quote any field that contains a quotation mark, a comma, or a newline.
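
As a rough illustration of the rules above, here is a minimal C90 sketch of a field writer. The name csv_write_field and its buffer-based interface are hypothetical, chosen only for this example. It also doubles any embedded quotation mark inside a quoted field, which is the escaping convention from RFC 4180 rather than something stated in the list above.

```c
#include <string.h>

/* Hypothetical helper (not from any library): writes one CSV field
 * into out, which holds cap bytes. Quotes the field only when it
 * contains a quote, comma, CR or LF; embedded quotes are doubled
 * (RFC 4180 convention, assumed here).
 * Returns the number of characters written, or -1 if out is too small. */
static int csv_write_field(const char *s, char *out, size_t cap)
{
    size_t n = 0;
    const char *p;
    int need_quotes;

    if (cap == 0)
        return -1;

    need_quotes = (strpbrk(s, "\",\r\n") != NULL);

    if (need_quotes) {
        if (n + 1 >= cap) return -1;
        out[n++] = '"';                 /* opening quote */
    }
    for (p = s; *p != '\0'; p++) {
        if (*p == '"') {                /* escape a quote by doubling it */
            if (n + 1 >= cap) return -1;
            out[n++] = '"';
        }
        if (n + 1 >= cap) return -1;
        out[n++] = *p;
    }
    if (need_quotes) {
        if (n + 1 >= cap) return -1;
        out[n++] = '"';                 /* closing quote */
    }
    out[n] = '\0';
    return (int)n;
}
```

With this sketch an empty string produces an empty field, a,b is written as "a,b", and a value containing a quotation mark is written quoted with that mark doubled. Joining fields with commas and terminating the record is left to the caller.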
+1

Find the most reputable CSV library you trust and read its source. CSV is not so complex that you cannot work out its rules from a thorough reading of an existing implementation. I have been happy with opencsv for Java. There are Perl implementations too, and so on.

0

According to RFC 4180, fields must be parsed from left to right in order to interpret double quotes correctly. In some contexts "" is an escaped double quote (when it appears inside a quoted field); otherwise it is either an empty string or two literal double quotes (when it appears inside the value of a non-empty unquoted field).

For example, consider a file with 4 records (1 column each):

    "field""value" CRLF
    "" CRLF
    field""value CRLF
    "field value" extra CRLF
  • "field""value" - read as field"value
  • "" - should be read as an empty string
  • field""value - read as field""value
  • "field value" extra - can be read as field value extra or you can reject it

Record 4 is, strictly speaking, an invalid field, so you are free to either accept it or reject it.

When you start reading a field, check whether the first character is a double quote. If it is, the field is quoted, and you need to keep reading until you find the unescaped closing double quote. While doing so, treat newlines and commas as ordinary data, because the field is quoted: it ends only when the closing double quote is reached.

If the first character is not a double quote, then all double quotes in the field value should be treated as literal double quotes. In this case you reach the end of the field when you encounter a comma or a newline character.
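
The left-to-right procedure described above can be sketched in C90 roughly as follows. This is only an illustration under stated assumptions: the name csv_read_field, the NUL-terminated input and the fixed output buffer are all hypothetical, and the sketch takes the strict option of rejecting a quoted field that is followed by extra text (as in record 4).

```c
#include <stddef.h>

/* Hypothetical helper (not from any library): reads one field that
 * starts at p and copies its decoded value into out (cap bytes).
 * Returns a pointer to the character that ended the field (',', CR,
 * LF or the terminating NUL), or NULL if the field is malformed or
 * does not fit in out. */
static const char *csv_read_field(const char *p, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return NULL;

    if (*p == '"') {                    /* quoted field */
        p++;
        for (;;) {
            if (*p == '\0')
                return NULL;            /* closing quote never found */
            if (*p == '"') {
                if (p[1] == '"') {
                    p++;                /* "" is an escaped quote: keep one */
                } else {
                    p++;                /* closing quote: field is done */
                    break;
                }
            }
            if (n + 1 >= cap) return NULL;
            out[n++] = *p++;            /* commas and newlines are data here */
        }
        /* Strict choice: only a delimiter may follow the closing quote. */
        if (*p != ',' && *p != '\r' && *p != '\n' && *p != '\0')
            return NULL;
    } else {                            /* unquoted field: quotes are literal */
        while (*p != ',' && *p != '\r' && *p != '\n' && *p != '\0') {
            if (n + 1 >= cap) return NULL;
            out[n++] = *p++;
        }
    }
    out[n] = '\0';
    return p;
}
```

A caller would invoke this in a loop, skipping the returned comma to reach the next field and treating CR, LF or NUL as the end of the record. A lenient parser could instead append the trailing text after a closing quote rather than returning NULL.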

Based on this, I would recommend always quoting all fields when writing records, and writing a proper parser to read them back. This way you can store any data in your CSV files (even multi-line text with embedded quotes), and your format will be unambiguous. When reading a CSV file, I would refuse to process any file that cannot be parsed correctly. Since this is a database, you can expect users not to edit the records by hand unless they know what they are doing.

0

Source: https://habr.com/ru/post/946640/

