Subsetting an ffdf object (subset.ff vs ffwhich)

I am subsetting large ffdf objects, and I noticed that subset.ff generates a lot of NAs. I tried an alternative using ffwhich, which is much faster and generates no NAs. Here is my test:

    library(ffbase)
    # deals is the ffdf I would like to subset
    unique(deals$COMMODITY)
    ff (open) integer length=7 (7) levels: CASH CO2 COAL ELEC GAS GCERT OIL
      [1]   [2]   [3]   [4]   [5]   [6]   [7]
     CASH   CO2  COAL  ELEC   GAS GCERT   OIL

    # Using subset.ff
    started.at=proc.time()
    deals0 <- subset.ff(deals,deals$COMMODITY %in% c("CASH","COAL","CO2","ELEC","GCERT"))
    cat("Finished in",timetaken(started.at),"\n")
    Finished in 12.640sec

    # NAs are generated
    unique(deals0$COMMODITY)
    ff (open) integer length=8 (8) levels: CASH CO2 COAL ELEC GAS GCERT OIL <NA>
      [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]
     CASH   CO2  COAL  ELEC   GAS GCERT   OIL    NA

    # Subset using ffwhich
    started.at=proc.time()
    idx <- ffwhich(deals,COMMODITY %in% c("CASH","COAL","CO2","ELEC","GCERT"))
    deals1 <- deals[idx,]
    cat("Finished in",timetaken(started.at),"\n")
    Finished in 3.130sec

    # No NAs are generated
    unique(deals1$COMMODITY)
    ff (open) integer length=7 (7) levels: CASH CO2 COAL ELEC GAS GCERT OIL
      [1]   [2]   [3]   [4]   [5]   [6]   [7]
     CASH   CO2  COAL  ELEC   GAS GCERT   OIL

Any idea why this is happening?

1 answer

subset.ff probably uses "[" with your criterion but does not include the !is.na(.) clause. The default behavior of "[" is to return an entry for every position where the criterion vector is TRUE or NA, and the NA positions come back as NA. The regular subset function adds the !is.na(.) condition, but perhaps the authors of ffbase did not carry this over.
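The difference described above can be seen in base R alone. The following is a minimal sketch on a plain numeric vector, not on an ffdf; the names x and crit are made up for illustration, and it makes no claim about how subset.ff or ffwhich are implemented internally.

```r
# Illustrative data: a plain vector with one missing value
x <- c(1, NA, 3, 4)

# Logical criterion; comparing NA yields NA, so crit is FALSE, NA, TRUE, TRUE
crit <- x > 2

# "[" with a logical index keeps TRUE positions and returns NA for NA positions
by_bracket <- x[crit]

# Adding !is.na(.) drops the NA positions
by_bracket_no_na <- x[crit & !is.na(crit)]

# subset() applies the same !is.na(.) filter itself
by_subset <- subset(x, crit)

# which() returns only the TRUE positions, so indexing by it also drops NAs
by_which <- x[which(crit)]

print(by_bracket)
print(by_bracket_no_na)
print(by_subset)
print(by_which)
```

Indexing through which() is the base R analogue of the ffwhich approach in the question: the index contains only positions where the criterion is TRUE, so no NA rows can appear.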


Source: https://habr.com/ru/post/1447967/

