Split file in bash after unescaped line feed

Question

With the standard command-line tools, it is easy to split a large file (in my case a MySQL dump, and therefore a TSV file) into smaller parts using the split command. The command can also split a file after every n lines (the -l option). However, it does not distinguish between escaped and unescaped newline characters and can therefore split one table row into two incomplete parts.

Example (TSV with 2 columns)

 cool 2014-12-15 17:31:00
 do not censor it ...^M\\n
 2016-01-24 22:33:00
 watch out ari, you've got compeition! hahah 2001-12-05 19:11:01
 Oh God, the poor guy! xD\\n
 Can't wait to watch this! 2011-07-11 22:01:20
 wish i could do that.\\n
 2001-02-07 00:24:11
 Funny! I will use this reason when I drink something in other houses 2015-06-10 12:20:00

As you can see, there are two columns (the first contains a comment and the second a date), separated by a tab. I have only visualized the escaped newlines; tabs and unescaped newlines are not shown. If you put these lines in a file and split it (for example, split example.tsv -l 1), you get 9 files, although there are only 6 comments (3 of them contain an escaped newline). This happens because an escaped newline is treated like any other newline that merely happens to follow a backslash. This is a serious problem for me, because splitting the file can leave incomplete table rows in the output files.
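The mismatch can be reproduced with a small sketch. It assumes a hypothetical sample file built with printf (the path, the part_ prefix, and the shortened comments are illustrative, modeled on the example above) and compares the number of physical lines, the number of lines that do not end in a backslash, and the number of pieces split produces.

```shell
#!/bin/sh
# Hypothetical sample: 6 rows, 3 of which contain an escaped newline,
# i.e. a backslash immediately followed by a real line feed.
workdir=$(mktemp -d)
sample="$workdir/example.tsv"
{
  printf 'cool\t2014-12-15 17:31:00\n'
  printf 'do not censor it\\\n\t2016-01-24 22:33:00\n'
  printf 'watch out\t2001-12-05 19:11:01\n'
  printf 'poor guy\\\ncannot wait\t2011-07-11 22:01:20\n'
  printf 'wish i could do that\\\n\t2001-02-07 00:24:11\n'
  printf 'funny\t2015-06-10 12:20:00\n'
} > "$sample"

# Physical lines: every line feed counts, escaped or not.
physical=$(awk 'END { print NR }' "$sample")

# Complete rows: only lines that do NOT end in a backslash finish a row.
logical=$(awk '!/\\$/ { n++ } END { print n + 0 }' "$sample")

# split -l 1 cuts at every physical line, so it produces one piece per line.
split -l 1 "$sample" "$workdir/part_"
set -- "$workdir"/part_*
pieces=$#

echo "physical lines: $physical, complete rows: $logical, split pieces: $pieces"
rm -rf "$workdir"
```

The number of pieces follows the physical line count rather than the row count, which is exactly the behavior described above.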

Is it possible to make split ignore escaped newlines, or does anyone know another command that can do this?

+5

split bash escaping

NaN Dec 08 '17 at 17:56

1 answer

John1024 · Accepted Answer · 2017-12-08T19:15:53+0000

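The following command splits the file into pieces of roughly n lines (20 here; set n to whatever you need) but never splits directly after a line that ends with a backslash. Note that the test c>n starts a new piece only once the current one already holds more than n lines, so change it to c>=n if a piece may end after exactly n lines: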

 awk -vn=20 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

How it works

-vn=20
This creates an awk variable n that we use to decide when to split the file.
NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"}
Each time we need to start a new file, we (a) reset the line counter c to zero, (b) close the previous file, and (c) determine the name of the next file.
We need to start a new file when (i) we are on the first line of input, NR==1, or (ii) the line counter c exceeds the limit n and the previous line did not end with \.
c++; print>f; last=$0
This increments the line counter c, prints the current line to file f, and updates last to the value of the current line.
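The steps above may be easier to follow when the one-liner is spread out. The following is a sketch of the same logic wrapped in a hypothetical shell function, split_unescaped; the function name, the optional prefix argument, and the variable names limit, count, parts, and out are illustrative and not part of the original answer.

```shell
#!/bin/sh
# split_unescaped LIMIT INPUT [PREFIX]
# Hypothetical wrapper: writes PREFIX1.out, PREFIX2.out, ... (PREFIX defaults
# to "file") and never ends a piece on a line that ends with a backslash.
split_unescaped() {
  awk -v limit="$1" -v prefix="${3:-file}" '
    # Start a new piece on the first input line, or once the current piece
    # holds more than "limit" lines and the previous line was not escaped.
    NR == 1 || (count > limit && last !~ /\\$/) {
      count = 0
      if (out != "") close(out)      # release the finished piece
      out = prefix (++parts) ".out"  # name of the next piece
    }
    {
      count++        # lines written to the current piece
      print > out    # append the current line to the piece
      last = $0      # remember it for the backslash test
    }
  ' "$2"
}
```

For example, split_unescaped 2 file behaves like the command above with n=2. Like the one-liner, this sketch looks only at the final character of the previous line, so a row whose last field ends in an escaped backslash (\\) would be mistaken for a continuation.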

Example

Try this test file:

 $ cat file
 text1 2014-12-15 17:31:01
 text2\
 2014-12-15 17:31:02
 text3 2014-12-15 17:31:03
 text4a\
 text4b\
 2014-12-15 17:31:04
 text5 2014-12-15 17:31:05

Now run our command. To keep the example short, we set n=2:

 $ awk -vn=2 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

After running the command, new files appear in the directory:

 $ ls
 file file1.out file2.out file3.out

The new files contain the original content, split once a file holds more than 2 lines, except that no split occurs directly after a line ending with \:

 $ cat file1.out
 text1 2014-12-15 17:31:01
 text2\
 2014-12-15 17:31:02
 $ cat file2.out
 text3 2014-12-15 17:31:03
 text4a\
 text4b\
 2014-12-15 17:31:04
 $ cat file3.out
 text5 2014-12-15 17:31:05
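For a large dump it may be worth checking the result mechanically instead of reading the pieces by eye. The sketch below uses a hypothetical helper, check_parts (the name and its two arguments are assumptions for illustration): it verifies that no piece ends on a line with a trailing backslash and that the pieces, concatenated in numeric order, reproduce the original file.

```shell
#!/bin/sh
# check_parts ORIGINAL PREFIX
# Hypothetical helper: walks PREFIX1.out, PREFIX2.out, ... in numeric order.
# Returns 0 only if every piece ends on a complete row and the concatenation
# of all pieces is byte-identical to ORIGINAL.
check_parts() {
  original=$1
  prefix=$2
  i=1
  rebuilt=$(mktemp)
  while [ -f "$prefix$i.out" ]; do
    # A piece whose last line ends in a backslash stops in the middle of a row.
    if tail -n 1 "$prefix$i.out" | grep -q '\\$'; then
      echo "$prefix$i.out ends with an escaped newline" >&2
      rm -f "$rebuilt"
      return 1
    fi
    cat "$prefix$i.out" >> "$rebuilt"
    i=$((i + 1))
  done
  cmp -s "$original" "$rebuilt"
  status=$?
  rm -f "$rebuilt"
  return "$status"
}
```

With the files from the example, check_parts file file should succeed. The loop counts upward rather than relying on a glob such as file*.out, because a glob sorts file10.out before file2.out and would reassemble the pieces in the wrong order.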
