In R, using data.table, how do I exclude rows by key value while keeping a row that has NA in an integer column?

I use data.table quite a lot. It works well, but I find that it takes me a long time to translate my syntax so that it takes advantage of binary search.

In the following data table, how does one select all rows, including the row where cpt is NA, but exclude the rows where cpt is 23456 or 10000?

    cpt <- c(23456,23456,10000,44555,44555,NA)
    description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant",
                     "castration","orchidectomy","miscellaneous procedure")
    cpt.desc <- data.table(cpt,description)
    setkey(cpt.desc,cpt)

The next line works, but I think it uses a vector scan instead of a binary search (or binary exclusion). Is there a way to drop rows using binary methods?

 cpt.desc[!cpt %in% c(23456,10000),] 
1 answer

Only a partial answer, because I'm new to data.table. A self-join works for numbers, but the same approach doesn't work for character strings. I am sure one of the data.table experts knows what to do.

    library(data.table)
    n <- 1000000
    cpt.desc <- data.table(
      cpt=rep(c(23456,23456,10000,44555,44555,NA),n),
      description=rep(c("tonsillectomy","tonsillectomy in >12 year old","brain transplant",
                        "castration","orchidectomy","miscellaneous procedure"),n))

    # Note: J() turns each argument into its own column, so J(23456,45555)
    # is a one-row, two-column table rather than two key values.
    # That is presumably why a third column is dropped with [,-3] below.

    # Added on revision. Not very elegant, though. Faster by a factor of 3,
    # but probably better scaling
    setkey(cpt.desc,cpt)
    system.time(a<-cpt.desc[-cpt.desc[J(23456,45555),which=TRUE]])
    system.time(b<-cpt.desc[!(cpt %in% c(23456,45555))] )
    str(a)
    str(b)
    identical(as.data.frame(a),as.data.frame(b))

    # A self-join works OK with numbers
    setkey(cpt.desc,cpt)
    system.time(a<-cpt.desc[cpt %in% c(23456,45555),])
    system.time(b<-cpt.desc[J(23456,45555)])
    str(a)
    str(b)
    identical(as.data.frame(a),as.data.frame(b)[,-3])

    # But the same fails with characters
    setkey(cpt.desc,description)
    system.time(a<-cpt.desc[description %in% c("castration","orchidectomy"),])
    system.time(b<-cpt.desc[J("castration","orchidectomy"),])
    identical(as.data.frame(a),as.data.frame(b)[,-3])
    str(a)
    str(b)
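For what it's worth, the exclusion asked about in the question can probably be written as a keyed "not-join", by prefixing the join in `i` with `!`. The sketch below assumes a data.table version recent enough to support that form, and it passes the values to exclude as a single vector so that `J()` builds one column. The names `excluded`, `by_join` and `by_scan` are made up for illustration. I have not benchmarked it, so treat it as a starting point rather than a definitive answer.

```r
library(data.table)

# Same small table as in the question, keyed on cpt.
cpt <- c(23456, 23456, 10000, 44555, 44555, NA)
description <- c("tonsillectomy", "tonsillectomy in >12 year old",
                 "brain transplant", "castration", "orchidectomy",
                 "miscellaneous procedure")
cpt.desc <- data.table(cpt, description)
setkey(cpt.desc, cpt)

# Values to exclude, kept in ONE vector so J() yields a single key column.
excluded <- c(23456, 10000)

# Not-join: keep the rows of cpt.desc whose key matches nothing in `excluded`.
# The NA row matches neither value, so it is kept.
by_join <- cpt.desc[!J(excluded)]

# Vector-scan version from the question, for comparison.
by_scan <- cpt.desc[!cpt %in% excluded]

print(by_join)
```

On the large table from the answer, the two forms could be timed with `system.time()` in the same way as above.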

Source: https://habr.com/ru/post/1391738/

