I have a question that should make most people go "WTF?", but I have it nonetheless.
I have a bunch of data files from a provider. They are in a custom flat-file format that claims to be CSV, except that it is not comma separated and the values are not quoted. So, not really CSV at all.
foo,bar,baz
alice,bob,chris
And so on, except much longer and less interesting. The problem is that some records have embedded newlines (!!!):
foo,bar
rab,baz
alice,bob,chris
These are supposed to be two records of three fields each. Normally I would just say "No, this is stupid," but I happened to look closer and found that the embedded newline is actually a different line ending from the real end-of-record sequence:
foo,bar\n
rab,baz\r\n
alice,bob,chris\r\n
Note the \n on the first line. I have determined that this holds for every embedded newline I have found. So I basically need to do s/\n$// (I tried this specific command; it did nothing).
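A likely reason the substitution does nothing, assuming it was run through sed: sed reads its input one line at a time and removes the trailing newline before applying the script, so the pattern space never contains a \n for s/\n$// to match. The check below is only a sketch; show_sed_noop is a made-up helper name, and the sample bytes mirror the example above.

```shell
#!/bin/sh
# Hypothetical helper: run the substitution from the question over stdin.
show_sed_noop() {
    sed 's/\n$//'
}

# Dump the bytes before and after the substitution; the two dumps are identical,
# because sed never sees the line terminator inside the pattern space.
printf 'foo,bar\nrab,baz\r\n' | od -c
printf 'foo,bar\nrab,baz\r\n' | show_sed_noop | od -c
```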
Note: I don't actually care about the contents of the fields, so replacing an embedded newline with nothing is fine. I just need each line in the file to have the same number of fields (ideally in the same place).
I have an existing solution in a tool that I wrote for processing files:
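Guid g = Guid.NewGuid(); // unique placeholder for the real CRLF terminators
string data = File.ReadAllText(file, Encoding.GetEncoding("Latin1"));
data = data.Replace("\r\n", g.ToString());
data = data.Replace("\n", "");             // drop the embedded bare LFs
data = data.Replace(g.ToString(), "\r\n"); // restore the real terminators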
However, this fails for files larger than a gigabyte. (Also, I have not profiled it, but I suspect it is dog slow as well.)
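Since reading the whole file into memory is what breaks, a streaming pass seems like the natural direction. Here is a minimal sketch using standard Unix tools, under these assumptions: every real record ends in \r\n, a bare \r never appears inside a field, and the tools pass bytes through unchanged (binary mode). strip_embedded_lf is a hypothetical name, not an existing command.

```shell
#!/bin/sh
# Hypothetical helper: remove bare LFs while keeping CRLF record terminators.
# Reads stdin and writes stdout; every stage streams, so memory use stays small.
strip_embedded_lf() {
    cr=$(printf '\r')        # a literal CR for the final sed stage
    tr -d '\n' |             # 1. delete every LF; each record now ends in a lone CR
        tr '\r' '\n' |       # 2. turn each remaining CR into LF, one record per line
        sed "s/\$/$cr/"      # 3. append CR to each line, restoring CRLF endings
}
```

Usage would be along the lines of strip_embedded_lf < input.dat > output.dat, where both file names are placeholders. The embedded newline is replaced with nothing, so foo,bar\nrab,baz\r\n becomes foo,barrab,baz\r\n.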
I have the following tools at my disposal:
- cygwin tools (sed, grep, etc.)
- .NET
What is the best way to do this?