How to remove \n characters from a file?

I have a question that should make most people go "WTF?", but I have it nonetheless.

I have a bunch of data files from a provider. They are in a custom flat-file format that is claimed to be CSV, except that it is not comma separated and the values are not quoted. So, not really CSV at all.

    foo,bar,baz
    alice,bob,chris

And so on, except much longer and less interesting. The problem is that some records have embedded newlines (!!!):

    foo,bar
    rab,baz
    alice,bob,chris

That is supposed to be two records of three fields each. Normally I would just say "No, this is stupid," but out of curiosity I looked closer and found that the embedded newline is actually a different sequence than the real end-of-record line ending:

    foo,bar\n
    rab,baz\r\n
    alice,bob,chris\r\n

Note the \n on the first line. I have determined that this holds for every embedded newline I have found. So I basically need to run s/\n$// (I tried this specific command; it did nothing).

Note: I don't actually care about the contents of the fields, so replacing the newline with nothing is fine. I just need every line in the file to have the same number of fields (ideally, in the same positions).

I have an existing solution in a tool that I wrote for processing files:

    Guid g = Guid.NewGuid();
    string data = File.ReadAllText(file, Encoding.GetEncoding("Latin1"));
    data = data.Replace("\r\n", g.ToString()); // just so I have a unique placeholder
    data = data.Replace("\n", "");
    data = data.Replace(g.ToString(), "\r\n");

However, this fails on files larger than about a gigabyte. (Also, I have not profiled it, but I suspect it is dog slow.)

I have the following tools at my disposal:

  • cygwin tools (sed, grep, etc.)
  • .NET

What is the best way to do this?

+4
4 answers

Instead of reading the whole thing into memory as one (potentially huge) string, consider a stream-based approach.

Open an input stream and read it a piece at a time (character by character in the code here), making your replacements as needed. Open an output stream and write the modified data to it. Something like:

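    using System.IO;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            // The question reads its data as Latin-1, so use that encoding here
            // too instead of StreamReader's UTF-8 default.
            var latin1 = Encoding.GetEncoding("Latin1");

            using (var inFs = File.OpenRead(@"C:\input.txt"))
            using (var reader = new StreamReader(inFs, latin1))
            using (var outFs = File.Create(@"C:\output.txt"))
            using (var writer = new StreamWriter(outFs, latin1))
            {
                int cur;
                char last = '\0';
                while ((cur = reader.Read()) != -1)
                {
                    char c = (char)cur;
                    // Write everything except a '\n' that does not follow a '\r'.
                    if (c != '\n' || last == '\r')
                        writer.Write(c);
                    last = c;
                }
            }
        }
    }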
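The same streaming idea can probably be sketched with the Cygwin tools mentioned in the question. This is a minimal sketch, assuming every real record ends in \r\n and no field legitimately ends in \r; the function name join_broken_records is made up for illustration:

```shell
# Hypothetical helper: print each input line unchanged and emit the line feed
# only when the line ends in CR, so a lone LF is dropped.
join_broken_records() {
    LC_ALL=C awk '{ printf "%s", $0; if (/\r$/) printf "\n" }'
}

# Demo on the sample from the question; od -c makes the CR/LF bytes visible.
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' | join_broken_records | od -c
```

On real data this would be run as join_broken_records < dirtyfile > cleanfile. awk works one line at a time, so memory use should not grow with the file size.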
+5

That is an awful lot of code to do something so simple.

Try this instead.

 tr -d '\n' <dirtyfile >cleanfile 
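One caveat: tr -d '\n' deletes every line feed, including the one in each \r\n pair, so the output is a single line with a bare \r between records. Here is a sketch of one way to put the terminators back, assuming the data contains no stray \r of its own (clean_crlf is a made-up name):

```shell
# Hypothetical helper built around the tr command above.
clean_crlf() {
    # 1. drop every LF (embedded ones and the LF of each CRLF),
    # 2. turn each remaining CR into LF so records are lines again,
    # 3. re-append CR so every record ends in CRLF.
    tr -d '\n' | tr '\r' '\n' | LC_ALL=C awk '{ printf "%s\r\n", $0 }'
}

# Demo on the sample from the question; od -c makes the CR/LF bytes visible.
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' | clean_crlf | od -c
```

On real data: clean_crlf < dirtyfile > cleanfile.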
+2

Here is a StreamReader subclass that seems to do what I want. Note that it is probably highly domain specific, so it may or may not be useful:

    class BadEOLStreamReader : StreamReader
    {
        private int pushback = -1;

        public BadEOLStreamReader(string file, Encoding encoding)
            : base(file, encoding)
        {
        }

        public override int Peek()
        {
            // Peek must not consume the pushed-back character.
            return pushback != -1 ? pushback : base.Peek();
        }

        public override int Read()
        {
            if (pushback != -1)
            {
                var r = pushback;
                pushback = -1;
                return r;
            }

            var ret = base.Read();
            while (ret == '\r' || ret == '\n')
            {
                if (ret == '\n')
                {
                    // Lone LF: skip it.
                    ret = base.Read();
                    continue;
                }

                var next = base.Read();
                if (next == '\n')
                {
                    // Real CRLF: return the CR now, push back the LF.
                    pushback = next;
                    return ret;
                }

                // Lone CR: skip it and re-examine the character after it.
                ret = next;
            }
            return ret;
        }
    }
0

EDIT: after some tests, the awk solution turned out to be faster.

The standard file/input filters on UNIX/Linux/Cygwin have a hard time with binary files. To do this with filters, you need to convert your file to hex, edit it with sed (or awk, see the second solution below), and convert it back to the original data. This should do it:

    xxd -c1 -p file.txt | sed -n -e '1{h}' -e '${x;G;p;d}' \
        -e '2,${x;G;/^0d\n0a$/{P;b};/\n0a$/{P;s/.*//;x;b};P}' | xxd -r -p

Well, this is not easy to understand, so let's start with the simple parts:

  • xxd -c1 -p file.txt converts file.txt from binary to hex, one byte per line.
  • xxd -r -p reverses the conversion.
  • The sed script deletes any \n (0a in hex) that is not preceded by a \r (0d in hex).

The idea behind the sed part is to store the previous byte in hold space and deal with both the previous and current bytes:

  • On the first line, save the line (a byte) in hold space.
  • On the last line, print both bytes in the correct order (x;G;p) and stop the script (d).
  • For the lines in between, after putting the current byte in hold space and both bytes (previous and current) in pattern space (x;G), three cases are possible:
    • If it is \r\n, print the \r, keep the \n in hold space for the next cycle, and end this cycle (b).
    • If it ends with \n (meaning it did not start with \r), print the previous byte, store an empty line in hold space, and end this cycle (b).
    • Otherwise, print the first byte (the previous one).

In awk it's easier to understand:

 xxd -c1 -p file.txt | awk 'NR > 1 && $0 == "0a" && p != "0d" {$0 = ""} NR > 1 {print p} {p = $0} END{print p}' | xxd -r -p 

It can be tested with:

    printf "foo,bar\nrab,baz\r\nalice,bob,chris\r\n" | xxd -c1 -p | sed -n -e '1{h}' -e '${x;G;p;d}' \
        -e '2,${x;G;/^0d\n0a$/{P;b};/\n0a$/{P;s/.*//;x;b};P}' | xxd -r -p

or

 printf "foo,bar\nrab,baz\r\nalice,bob,chris\r\n" | xxd -c1 -p | awk 'NR > 1 && $0 == "0a" && p != "0d" {$0 = ""} NR > 1 {print p} {p = $0} END{print p}' | xxd -r -p 
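For text in a single-byte encoding, such as the Latin-1 files in the question, the hex round trip may be avoidable. sed strips the trailing line feed from each line before the script runs, so a pattern like s/\n$// never has anything to match; the broken lines are instead recognizable as the ones that do not end in \r. A sketch, with strip_lone_lf as a made-up name, that joins each such line with the one after it:

```shell
# Hypothetical helper: join every line that does not end in CR with the next one.
strip_lone_lf() {
    cr=$(printf '\r')   # literal CR, so the script does not rely on \r support in sed
    # :a         loop label
    # /CR$/!{    only for lines NOT ending in CR:
    #   $!N      append the next input line (unless this is the last one)
    #   s/\n//   delete the embedded line feed that N introduced
    #   ta       if a line feed was deleted, re-check the joined line
    LC_ALL=C sed -e ':a' -e '/'"$cr"'$/!{' -e '$!N' -e 's/\n//' -e 'ta' -e '}'
}

# Demo on the same test input; od -c makes the CR/LF bytes visible.
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' | strip_lone_lf | od -c
```

This assumes the file has no NUL bytes and that every genuine record ends in \r\n.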
0

Source: https://habr.com/ru/post/1443071/

