I have a huge file with 7 million records and 160 variables. I found that fread() and read.csv.ffdf() are two ways to handle such big data. But when I filter these two data sets with dplyr, I get different results. Below is a small section of my data -
sample_data
   AGE AGE_NEONATE AMONTH AWEEKEND
2   18          NA      5        0
3   32          NA     11        0
4   67          NA      7        0
5   37          NA      6        1
6   57          NA      5        0
7   50          NA      6        0
8   59          NA     12        0
9   44          NA      9        0
10  40          NA      9        0
11  27          NA      3        0
12  59          NA      8        0
13  44          NA      7        0
14  81          NA     10        0
15  59          NA      6        1
16  32          NA     10        0
17  90          NA     12        1
18  69          NA      7        0
19  62          NA     11        1
20  85          NA      6        1
21  43          NA     10        0
Code 1 -
library(data.table)   # fread()
library(dplyr)

sample_data <- fread("/user/sample_data.csv", stringsAsFactors = TRUE)
age_filter  <- sample_data %>% filter(!is.na(AGE), between(as.numeric(AGE), 65, 95))
Result 1 -
  AGE AGE_NEONATE AMONTH AWEEKEND
1  67          NA      7        0
2  81          NA     10        0
3  90          NA     12        1
4  69          NA      7        0
5  85          NA      6        1
Code 2 -
library(ff)        # read.csv.ffdf()
library(ffbase2)   # tbl_ffdf()
library(dplyr)

# Read the file without treating its first line as a header
sample_data <- read.csv.ffdf(file = "C:/Users/sample_data.csv", header = FALSE, fill = TRUE)

# Promote the first row to column names and drop it from the data
header.true <- function(df) {
  names(df) <- as.character(unlist(df[1, ]))
  df[-1, ]
}

sample_data <- tbl_ffdf(sample_data)
sample_data <- header.true(sample_data)

age_filter <- sample_data %>% filter(!is.na(AGE), between(as.numeric(AGE), 65, 95))
Result 2 -
  AGE AGE_NEONATE AMONTH AWEEKEND
1  81                 10        0
2  90                 12        1
3  85                  6        1
I know that the first code gives the correct results. What am I doing wrong in the second code?
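For reference, this is the kind of check I could run to see how each reader types the AGE column (dt_version and ff_version are just throwaway names for the two reads above; I have not posted its output here):

library(data.table)
library(ff)

# How fread() types the columns
dt_version <- fread("/user/sample_data.csv", stringsAsFactors = TRUE)
str(dt_version$AGE)

# How read.csv.ffdf() types them; the first rows are pulled into RAM as a data.frame
ff_version <- read.csv.ffdf(file = "C:/Users/sample_data.csv", header = FALSE, fill = TRUE)
str(ff_version[1:5, ])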