I have a huge file with 7 million records and 160 variables. I found that fread() and read.csv.ffdf() are two ways to handle such big data. But when I filter these two data sets with dplyr, I get different results. Below is a small section of my data -
sample_data
   AGE AGE_NEONATE AMONTH AWEEKEND
2   18          NA      5        0
3   32          NA     11        0
4   67          NA      7        0
5   37          NA      6        1
6   57          NA      5        0
7   50          NA      6        0
8   59          NA     12        0
9   44          NA      9        0
10  40          NA      9        0
11  27          NA      3        0
12  59          NA      8        0
13  44          NA      7        0
14  81          NA     10        0
15  59          NA      6        1
16  32          NA     10        0
17  90          NA     12        1
18  69          NA      7        0
19  62          NA     11        1
20  85          NA      6        1
21  43          NA     10        0
Code 1 -
library(data.table)   # fread()
library(dplyr)

sample_data <- fread("/user/sample_data.csv", stringsAsFactors = TRUE)
age_filter  <- sample_data %>% filter(!is.na(AGE), between(as.numeric(AGE), 65, 95))
Result 1 -
  AGE AGE_NEONATE AMONTH AWEEKEND
1  67          NA      7        0
2  81          NA     10        0
3  90          NA     12        1
4  69          NA      7        0
5  85          NA      6        1
Code 2 -
library(ff)        # read.csv.ffdf()
library(ffbase2)   # tbl_ffdf()
library(dplyr)

# Read the file without treating its first line as a header
sample_data <- read.csv.ffdf(file = "C:/Users/sample_data.csv", header = FALSE, fill = TRUE)

# Promote the first row to column names and drop it from the data
header.true <- function(df) {
  names(df) <- as.character(unlist(df[1, ]))
  df[-1, ]
}

sample_data <- tbl_ffdf(sample_data)
sample_data <- header.true(sample_data)

age_filter <- sample_data %>% filter(!is.na(AGE), between(as.numeric(AGE), 65, 95))
Result 2 -
  AGE AGE_NEONATE AMONTH AWEEKEND
1  81                 10        0
2  90                 12        1
3  85                  6        1
I know that the first code gives the correct results. What am I doing wrong in the second code?
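For reference, this is the kind of check I could run to see how each reader types the AGE column (dt_version and ff_version are just throwaway names for the two reads above; I have not posted its output here):

library(data.table)
library(ff)

# How fread() types the columns
dt_version <- fread("/user/sample_data.csv", stringsAsFactors = TRUE)
str(dt_version$AGE)

# How read.csv.ffdf() types them; the first rows are pulled into RAM as a data.frame
ff_version <- read.csv.ffdf(file = "C:/Users/sample_data.csv", header = FALSE, fill = TRUE)
str(ff_version[1:5, ])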