data.table::fread does not like missing values in the first column

Is this a bug in data.table::fread (version 1.9.2), or an inappropriate user expectation / error?

Consider this trivial example, where I have a table of TAB-separated values, some of which may be missing. If values are missing in the first column, fread gets upset, but if the missing values are elsewhere, I get back the data.table I expect:

    # Data with a missing value in the first column (third row)
    # and in the last column (second row). Fields are TAB-separated:
    #   12   876  19
    #   23   39
    #        15   20
    fread("12\t876\t19\n23\t39\t\n\t15\t20")
    # Error in fread("12\t876\t19\n23\t39\t\n\t15\t20") :
    #   Not positioned correctly after testing format of header row. ch=' '

    # Data with missing values in the last column, rows two and three:
    fread("12\t876\t19\n23\t39\t\n15\t20\t")
    #    V1  V2 V3
    # 1: 12 876 19
    # 2: 23  39 NA
    # 3: 15  20 NA
    # Returns as expected.

Is this a bug, or is it not possible to have missing values in the first column (or is my data malformed)?

1 answer

I believe this is the same bug that I reported here.

The most recent version that I know works with this type of input is Rev. 1180. You can check out and build that version by adding @1180 to the end of the svn checkout URL.

 svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/@1180 

If you are not familiar with checking out and building packages, see here.

But there have been many great features, bug fixes, and improvements since Rev. 1180. (The development version at the time of this writing is Rev. 1272.) So the better solution is to replace the R/fread.R and src/fread.c files with the versions from Rev. 1180 or earlier, and then rebuild the package.

You can find these files online without checking them out here (sorry, I cannot figure out how to post links containing "*", so you need to copy / paste):

fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable

fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
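
The replacement described above could be scripted roughly as follows. This is only a sketch under assumptions that are not part of the answer: the checkout directory name `datatable`, the use of `curl` to fetch the two files, and the `run` helper with its `DRY_RUN` variable are all illustrative, and it assumes the checkout contains the `pkg/` directory that appears in the URLs. By default the script only prints the commands; run it with `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Sketch only: the directory name, curl, and DRY_RUN are assumptions.
set -eu

# Print each command; execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Base of the viewvc "checkout" URLs quoted in the answer.
BASE='http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg'

rebuild_with_old_fread() {
    # 1. Check out the current development tree into ./datatable.
    run svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/ datatable
    # 2. Overwrite the two fread files with the older revisions.
    run curl -fsSL -o datatable/pkg/R/fread.R \
        "$BASE/R/fread.R?revision=988&root=datatable"
    run curl -fsSL -o datatable/pkg/src/fread.c \
        "$BASE/src/fread.c?revision=1159&root=datatable"
    # 3. Install the patched package from its source directory.
    run R CMD INSTALL datatable/pkg
}

rebuild_with_old_fread
```

Installing from the source directory needs a working compiler toolchain, since src/fread.c has to be compiled.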

After you rebuild the package, you can read your tsv file.

    > fread("12\t876\t19\n23\t39\t\n\t15\t20")
       V1  V2 V3
    1: 12 876 19
    2: 23  39 NA
    3: NA  15 20

The downside is that the old version of fread() does not pass a newer test: it cannot read fields with quotes in the middle.

    > fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
    Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") :
      Not positioned correctly after testing format of header row. ch=','

With newer versions of fread, you get this:

    > fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
          A       B       C
    1:  1.2 Foo"Bar a"b"c"d
    2: fo"o     bar   b,az"

So, for now, which version "works" depends on whether you are more likely to have missing values in the first column or quotes in fields. For me it is the former, so I'm still using the old code.


Source: https://habr.com/ru/post/969764/

