Idiom for dropping a single column from a data.table

I need to drop one column from a data.table containing several hundred columns.

With a data.frame, I would use subset to do this conveniently:

 > dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
 > subset(dat,select=c(-z))
             x y
  1: 0.1969049 a
  2: 0.7916696 a
  3: 0.9095970 b
  4: 0.3529506 b
  5: 0.4923602 c
  6: 0.5993034 c
  7: 0.1559861 d
  8: 0.9929333 d
  9: 0.3980169 e
 10: 0.1921226 e

Obviously this still works, but it doesn't seem like a very data.table-like idiom. I could manually build a list of the column names I want to keep, which seems a bit more data.table-like:

 > dat[,list(x,y)]
             x y
  1: 0.1969049 a
  2: 0.7916696 a
  3: 0.9095970 b
  4: 0.3529506 b
  5: 0.4923602 c
  6: 0.5993034 c
  7: 0.1559861 d
  8: 0.9929333 d
  9: 0.3980169 e
 10: 0.1921226 e

But then I need to build such a list, which is clunky.

Is subset the right way to conveniently drop a column or two, or does it incur a performance penalty? If not, what is the better way?

Edit

Benchmarks:

 > dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
 > microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
 Unit: milliseconds
                          expr       min        lq    median        uq      max
 1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
 2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740

But really, where it might matter is memory, if subset copies the entire data.table.

+6
2 answers

If you want to permanently delete a column, use := NULL

 dat[, z := NULL] 

If the names of the columns to drop are stored in a character vector, wrap the variable in () so that it is evaluated as a character vector rather than treated as a column name.

 toDrop <- c('z')
 dat[, (toDrop) := NULL]
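To make the "permanently" part concrete, here is a minimal sketch on a small made-up table (the column values and the names `alias` and `kept` are only illustrative). It assumes the usual data.table reference semantics: a plain assignment does not copy the table, while `copy()` does.

```r
library(data.table)

# Small illustrative table; the values are arbitrary
dat <- data.table(x = 1:4, y = letters[1:4], z = 4:1)

kept  <- copy(dat)  # independent deep copy, unaffected by later := calls
alias <- dat        # plain assignment: a second name for the same table

toDrop <- c("y", "z")     # names of the columns to remove
dat[, (toDrop) := NULL]   # removes both columns by reference

names(dat)    # only x is left
names(alias)  # the alias lost the columns as well
names(kept)   # the copy still has x, y and z
```

So if the original table is still needed afterwards, take a `copy()` before using `:= NULL`.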

If you want to limit the columns available to .SD, you can pass the .SDcols argument:

 dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')] 
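As a runnable sketch of the same pattern, with `mean` standing in for `somefunction` and a small invented table (the grouping by `y` and the extra column `w` are assumptions added only for illustration):

```r
library(data.table)

# Illustrative data; w is an extra numeric column added for the example
dat <- data.table(
  x = c(1, 2, 3, 4),
  y = c("a", "a", "b", "b"),
  z = c(10, 20, 30, 40),
  w = c(0.5, 1.5, 2.5, 3.5)
)

# Everything except the grouping column y and the unwanted column z
keep <- setdiff(names(dat), c("y", "z"))

# .SD only contains the columns named in .SDcols, so z is never touched
res <- dat[, lapply(.SD, mean), by = y, .SDcols = keep]
res
```

The result has one row per value of `y` and the columns `y`, `x` and `w`; `dat` itself is left unchanged.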

However, data.table inspects the j expression and only fetches the columns you actually use. See FAQ 1.12:

When you write X[Y, sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses.

It does not try to load all of the data for .SD (unless .SD appears in your j expression).


subset.data.table processes the call and ultimately evaluates dat[, c('x','y'), with=FALSE]

Using := NULL should be essentially instantaneous; however, it permanently removes the column.
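A quick way to check the non-destructive route on a toy table (values invented for illustration; `viaSubset` and `viaJ` are just names chosen here): both forms return a new table and leave the original alone.

```r
library(data.table)

# Toy table; the values are arbitrary
dat <- data.table(x = c(0.1, 0.2, 0.3), y = c("a", "b", "c"), z = c(1, 2, 3))

viaSubset <- subset(dat, select = c(-z))      # the form used in the question
viaJ      <- dat[, c("x", "y"), with = FALSE] # explicit column selection

all.equal(viaSubset, viaJ)  # both give the same two-column table
names(dat)                  # dat still has x, y and z
```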

+9

I think this is what you are looking for.

 dat[, !"z"] 
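A short sketch of a few variants of this negation form on an invented table. The `with = FALSE` versions are the explicit spelling; whether the bare `dat[, !"z"]` works without it may depend on the installed data.table version, so treat that as an assumption about a reasonably recent release.

```r
library(data.table)

# Invented table for illustration
dat <- data.table(x = 1:3, y = c("a", "b", "c"), z = 3:1)

one <- dat[, !"z"]                          # drop a single column
several <- dat[, !c("x", "z"), with = FALSE] # drop several columns

toDrop <- "z"                                # names held in a variable
fromVar <- dat[, !toDrop, with = FALSE]

names(one)      # x and y
names(several)  # y only
names(dat)      # unchanged: x, y and z
```

Unlike := NULL, each of these returns a new table and leaves `dat` untouched.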

Here is a benchmark on the large data set from your edit.

 Unit: milliseconds
                        expr       min        lq    median       uq      max neval
 subset(dat, select = c(-z))  53.37435  56.82514  61.81279 100.3458 339.1400   100
           dat[, list(x, y)] 191.46678 354.39905 412.06421 451.3933 678.3981   100
                 dat[, !"z"]  53.49184  57.31756  62.15506 112.7063 398.0107   100
+1

Source: https://habr.com/ru/post/944670/

