How to bulk insert from CSV when some fields have a new line character?

I have a CSV dump from another database that looks like this (id, name, notes):

1001, John Smith, 15 Main Street
1002, Jane Smith, "2010 Rockliffe Dr.
Pleasantville, IL
USA"
1003, Bill Karr, 2820 West Ave.

The last field may contain carriage returns and commas, in which case it is surrounded by double quotes. And I need to keep these returns and commas.

I use this code to import CSV into a table:

BULK INSERT CSVTest FROM 'c:\csvfile.csv' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n' ) 

SQL Server 2005 BULK INSERT does not understand that a carriage return inside quotes is not a row terminator.
How can I work around this?


UPDATE:
It seems that the only way to keep line breaks inside a field is to use a different row separator. So I want to mark every row-separating line break by putting a pipe in front of it. How can I change my CSV to look like this?

1001, John Smith, 15 Main Street|
1002, Jane Smith, "2010 Rockliffe Dr.
Pleasantville, IL
USA"|
1003, Bill Karr, 2820 West Ave.|

6 answers

OK, here is a small Java program that I ended up writing to solve the problem.
Comments, corrections and optimizations are welcome.

    import java.io.*;

    public class PreBulkInsert {
        public static void main(String[] args) {
            if (args.length < 3) {
                System.out.println("Usage:");
                System.out.println("  java PreBulkInsert input_file output_file separator_character");
                System.exit(0);
            }
            try {
                // True while the lines read so far leave a quoted field open.
                boolean firstQuoteFound = false;
                int fromIndex;
                int lineCounter = 0;
                String str;
                BufferedReader in = new BufferedReader(new FileReader(args[0]));
                BufferedWriter out = new BufferedWriter(new FileWriter(args[1]));
                String newRowSeparator = args[2];

                while ((str = in.readLine()) != null) {
                    // Every double quote on the line toggles the inside-quotes state.
                    fromIndex = -1;
                    do {
                        fromIndex = str.indexOf('"', fromIndex + 1);
                        if (fromIndex > -1) firstQuoteFound = !firstQuoteFound;
                    } while (fromIndex > -1);

                    // A line that ends outside quotes ends a record: append the separator.
                    if (!firstQuoteFound)
                        out.write(str + newRowSeparator + "\r\n");
                    else
                        out.write(str + "\r\n");
                    lineCounter++;
                }
                out.close();
                in.close();
                System.out.println("Done! Total of " + lineCounter + " lines were processed.");
            } catch (IOException e) {
                System.out.println(e.getMessage());
                System.exit(1);
            }
        }
    }

Bulk operations on SQL Server do not specifically support CSV, although they can import CSV files that are carefully formatted. My suggestion would be to enclose all field values in quotation marks; BULK INSERT might then allow carriage returns within a field value. If it does not, your next option might be an Integration Services package.

For more information, see Preparing Data for Bulk Export or Import.


You can massage these line breaks with a script before importing. For example, you can use GNU sed to mark the end of each record:

    $ more file
    1001,John Smith,15 Main Street
    1002,Jane Smith,"2010 Rockliffe Dr.
    Pleasantville, IL
    USA"
    1003,Bill Karr,"2820 West Ave"

    $ sed '/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ }' file
    1001,John Smith,15 Main Street|
    1002,Jane Smith,"2010 Rockliffe Dr.
    Pleasantville, IL
    USA"|
    1003,Bill Karr,"2820 West Ave"|

Then you can bulk insert.

Edit:

Save this: /"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ } in a file, say myformat.sed. Then run it on the command line:

c:\test> sed.exe -f myformat.sed myfile


According to the source of all knowledge (Wikipedia), CSV uses newlines to separate records. So you have an invalid CSV.

My suggestion is that you write a Perl program to process your file and add each entry to the database.

If you are not a Perl person, then you could use a programming-for-hire site or see if some SO person will write the parsing section of the program for you.
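
The approach suggested here hinges on a parser that treats line breaks inside double quotes as data rather than as record ends. Below is a minimal sketch of that parsing step only, written in Java rather than Perl to match the program posted earlier in this thread. The class name QuotedCsvParser and its parse method are made up for illustration, fields are not trimmed, and the database insert itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from any answer in this thread.
final class QuotedCsvParser {

    // Splits CSV text into records. A double-quoted field may contain commas,
    // line breaks, and doubled quotes ("") that stand for one literal quote.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and line breaks are kept as data
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n' || c == '\r') {
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;                   // treat CRLF as a single line break
                }
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        // Flush a final record that has no trailing line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Each inner list could then be bound to the parameters of an INSERT statement, one execution per record, so the embedded line breaks and commas reach the table unchanged.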

Added:

Possible Solution

Since the OP says the input file can be modified, I would change every newline that does not follow a " into a reserved character sequence, for example XXX.

This can be an automated replace in many editors. On Windows, UltraEdit includes a regexp search/replace feature.

Then import into the DBMS, since you will no longer have embedded newlines.

Then use SQL REPLACE to change XXX back to newlines.
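
For files too large or too irregular for an editor, the replacement step could also be scripted. The sketch below is one possible Java version under stated assumptions: the class name InlineNewlineEscaper and the method escape are made up for illustration, and the token (XXX here) must never occur in the real data. Note that it swaps in a different rule from the one described above. Instead of testing whether a newline follows a quote, it tracks whether the current position is inside a quoted field, because records whose last field is unquoted also end without a quote.

```java
// Hypothetical helper for illustration; the token passed in (for example "XXX")
// is assumed not to occur anywhere in the real data.
final class InlineNewlineEscaper {

    // Replaces every line break that falls inside a double-quoted field with
    // the given token. Line breaks that end a record are copied unchanged.
    static String escape(String csv, String token) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\r' || c == '\n')) {
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++; // consume CRLF as a single line break
                }
                out.append(token);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

After the import, the token would be turned back into line breaks on the SQL side, as described in the last step above.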


You cannot import this unless the CSV is in a valid format. So you need to either fix the dump, or correct the unwanted newline characters manually using search and replace.


If you have control over the contents of the CSV file, you can replace the in-field line breaks (CRLF) with a non-linebreak character (perhaps just CR or LF), and then run a script after the import to replace them with CRLF again.

This is how MS Office products (Excel, Access) deal with this problem.
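
If the file cannot be regenerated at the source, the same substitution could be applied in a streaming pass. The following is a rough Java sketch, assuming that rows end with CRLF and that quoted fields use plain double quotes. The class name InFieldCrlfToLf and the method convert are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper for illustration only.
final class InFieldCrlfToLf {

    // Copies CSV text from in to out, turning each CRLF inside a quoted field
    // into a lone LF. CRLFs outside quotes (row ends) pass through unchanged.
    static void convert(Reader in, Writer out) {
        try {
            boolean inQuotes = false;
            boolean pendingCr = false; // a CR seen inside quotes, held until the next char
            int ch;
            while ((ch = in.read()) != -1) {
                if (pendingCr) {
                    pendingCr = false;
                    if (ch == '\n') {
                        out.write('\n'); // CRLF inside a field becomes LF
                        continue;
                    }
                    out.write('\r');     // a lone CR is kept as it was
                }
                if (ch == '"') {
                    inQuotes = !inQuotes;
                    out.write(ch);
                } else if (ch == '\r' && inQuotes) {
                    pendingCr = true;
                } else {
                    out.write(ch);
                }
            }
            if (pendingCr) {
                out.write('\r');         // input ended right after a held CR
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For real files the reader and writer would normally be wrapped in BufferedReader and BufferedWriter, since this sketch reads one character at a time.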


Source: https://habr.com/ru/post/1305062/

