Dos2unix: binary character 0x04 found on line 1703

I downloaded a file ('CRS 2013 data.txt') from the OECD site http://stats.oecd.org/Index.aspx?datasetcode=CRS1 by choosing Export -> Related files. I want to work with this file on Ubuntu (14.04 LTS).

When I run:

 dos2unix CRS\ 2013\ data.txt

I see:

 dos2unix: Binary symbol 0x0004 found at line 1703
 dos2unix: Skipping binary file CRS 2013 data.txt

I check the file encoding:

 file --mime-encoding CRS\ 2013\ data.txt 

and get:

 CRS 2013 data.txt: utf-16le 

I run:

 iconv -l | grep utf-16le

which does not return anything, so I run:

 iconv -l | grep UTF-16LE

which returns:

 UTF-16LE// 

Then I run:

 iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt 

and check:

 file --mime-encoding crs_2013_data_temp.txt 

and get:

 crs_2013_data_temp.txt: utf-8 

Then I try:

 dos2unix crs_2013_data_temp.txt 

and get:

 dos2unix: Binary symbol 0x04 found at line 1703
 dos2unix: Skipping binary file crs_2013_data_temp.txt

Then I try to force it:

 dos2unix -f crs_2013_data_temp.txt 

This works, in the sense that dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÃ" Ťa and à Śajnià Ťe ".

My question is: why? Is it because dos2unix cannot see the BOM? Because the BOM is missing? Did I do the conversion incorrectly? How do I convert this file properly so that I can read it?

2 answers

The 0x0004 character you see in your file has nothing to do with the BOM (which, by the way, is fine). It is the EOT (End of Transmission) character from the C0 control set, and it has sat at that code point since 7-bit ASCII was the new hotness. (It is also the familiar Unix Control-D EOF sequence.)
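To confirm where the character sits, one option is to search for the byte directly. This is only a sketch: it builds a small made-up file (sample.txt, not the real OECD data) and assumes a grep that treats the file as text, which GNU grep should do as long as the file has no NUL bytes.

```shell
# Hypothetical sample: line 2 contains an EOT (0x04) byte, lines end in CRLF.
printf 'first line\r\nbad\004byte\r\nlast line\r\n' > sample.txt

# Store the EOT byte in a variable (portable alternative to bash's $'\004').
eot=$(printf '\004')

# Collect the line numbers of all lines containing the EOT byte.
eot_lines=$(grep -n "$eot" sample.txt | cut -d: -f1)
echo "EOT found on line(s): $eot_lines"
```

Note that this works on the UTF-8 copy of the data; on the original UTF-16 file, every other byte is NUL and grep would likely treat it as binary.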

Unfortunately, the pre-dos2unix method of applying tr to the file to remove carriage returns will not work directly, since the file is UTF-16. But since iconv works for you, you can use it to convert to UTF-8 (which tr can handle), and then run this tr command:

 tr -d '\r' < crs_2013_data_temp.txt > crs_2013_data_unix.txt 

to get a text file with Unix line endings. You will need to keep an eye on any tools you feed the file to, however, to make sure they do not choke on the Ctrl-D / EOT character; if they do, you can use

 tr -d '\004' < crs_2013_data_unix.txt > crs_2013_data_clean.txt 

to get rid of it.

How did it get there? I blame the Belgians for letting it slip into the data they provided to the OECD, which they probably entered using cat - > file or some similar means. In addition, some text editors try to be too helpful by hiding control characters, even though other tools bail out when they see them, because they assume you have just fed them a binary file that had been pretending to be text.
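If you already know you want both the carriage returns and the EOT character gone, the steps above can be collapsed into a single pipeline. A minimal sketch, assuming iconv supports UTF-16LE on your system; the file names sample_utf16.txt and sample_clean.txt are made up for the demonstration:

```shell
# Hypothetical input: CRLF line endings plus one EOT byte, encoded as UTF-16LE.
printf 'a\r\nb\004c\r\n' | iconv -f UTF-8 -t UTF-16LE > sample_utf16.txt

# Convert to UTF-8, then delete every CR (\r) and EOT (\004) byte in one pass.
iconv -f UTF-16LE -t UTF-8 sample_utf16.txt | tr -d '\r\004' > sample_clean.txt

# Inspect the result byte by byte to check that only LF line endings remain.
od -c sample_clean.txt
```

Deleting bytes with tr is safe here because, in UTF-8, the bytes 0x0D and 0x04 never occur inside a multi-byte character.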


I think this command is ok for your problem:

 cat file | tr -d "\r" > new_file 

Source: https://habr.com/ru/post/986159/

