Subsetting columns with data.table in R

I am trying to subset a data set by selecting some columns from a data.table. However, my code does not work with some forms of the column selection.

Here is an example data.table:

 library(data.table)
 DT <- data.table(
   ID = 1:50,
   Capacity = sample(100:1000, size = 50, replace = FALSE),
   Code = sample(LETTERS[1:4], 50, replace = TRUE),
   State = rep(c("Alabama","Indiana","Texas","Nevada"), 50))

Here is the code for the working subset:

 DT[,1:2] 

And here is a piece of code that doesn't work. Note that this works with a data.frame, but not with a data.table.

 DT[,seq(1:2)] 

I need something along the lines of the second form, because I am subsetting based on the output of grep(), which gives the same result as the second form. What am I doing wrong?

Thanks!

3 answers

In recent versions of data.table, numbers can be used in j to indicate columns. This behavior includes formats such as DT[,1:2] to indicate the numerical range of columns. (Note that this syntax does not work in older versions of data.table).

So why does DT[,1:2] work, but DT[,seq(1:2)] does not? The answer is buried in the code for data.table:::`[.data.table`, which includes the lines:

  if (!missing(j)) {
    jsub = replace_dot_alias(substitute(j))
    root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
    if (root == ":" ||
        (root %chin% c("-", "!") && is.call(jsub[[2L]]) && jsub[[2L]][[1L]] == "(" &&
         is.call(jsub[[2L]][[2L]]) && jsub[[2L]][[2L]][[1L]] == ":") ||
        (!length(all.vars(jsub)) && root %chin% c("", "c", "paste", "paste0", "-", "!") &&
         missing(by))) {
      with = FALSE
    }

We see here that data.table automatically sets with = FALSE for you when it detects the : operator in j. It has no equivalent handling for seq, so we need to specify with = FALSE ourselves if we want to use the seq syntax.

 DT[,seq(1:2), with=FALSE] 
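The check on the unevaluated j can be illustrated in base R. This is only a sketch: root_of is a hypothetical helper written for this example, not part of data.table, and it mimics just the first two lines of the excerpt above.

```r
# Hypothetical helper (not a data.table function): returns the name of the
# outermost call in the unevaluated argument, like `root` in the excerpt.
root_of <- function(j) {
  jsub <- substitute(j)  # capture the expression without evaluating it
  if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
}

r_colon <- root_of(1:2)                  # outermost call is `:`
r_seq   <- root_of(seq(1:2))             # outermost call is `seq`
r_c     <- root_of(c("ID", "Capacity"))  # outermost call is `c`

# seq(1:2) evaluates to the same integer vector as 1:2; only the
# unevaluated expression differs, and that is what data.table inspects.
same_value <- identical(seq(1:2), 1:2)
```

Because the root of seq(1:2) is "seq" rather than ":", the condition in the excerpt is not met and j is evaluated as an ordinary expression.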

The lesson I learned is to use list instead of c :

  DT[ ,list(ID,Capacity)]
  #---------------------------
       ID Capacity
    1:  1      483
    2:  2      703
    3:  3      924
    4:  4      267
    5:  5      588
   ---
  196: 46      761
  197: 47      584
  198: 48      402
  199: 49      416
  200: 50      130

This lets you drop those annoying quotes, and it also moves you toward viewing the j argument as an expression evaluated in the environment of the data.table itself.
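A minimal sketch of what "j as an evaluated expression" means in practice. The table below is a small stand-in with made-up values, and the CapacityK column name is invented for the illustration.

```r
library(data.table)

# Stand-in table; values are made up for the example
DT <- data.table(ID = 1:3, Capacity = c(483L, 703L, 924L))

# Column names act as variables inside j, so they can be used in expressions
res <- DT[, list(ID, CapacityK = Capacity / 1000)]

# .() is data.table's alias for list() inside j
res2 <- DT[, .(ID, CapacityK = Capacity / 1000)]
```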

To "get" named columns by number, use the mget function together with the names function. R "names" are language elements, i.e. data objects on the search path from the current environment. Column names of data frames are not actually R names. So you need a function that takes a character value and makes the interpreter treat it as a fully qualified name. The `[.data.table` syntax for the j argument handles column names as language objects rather than as character values, unlike `[.data.frame`:

 DT[ ,mget(names(DT)[c(1,2)])]
      ID Capacity
   1:  1      483
   2:  2      703
   3:  3      924
   4:  4      267
   5:  5      588
  ---
 196: 46      761
 197: 47      584
 198: 48      402
 199: 49      416
 200: 50      130
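The same idea can be combined with grep(), which is the use case in the question. This is a sketch on a small stand-in table with made-up values; the pattern "^C" is just an illustration.

```r
library(data.table)

# Stand-in table; values are made up for the example
DT <- data.table(ID = 1:3,
                 Capacity = c(483L, 703L, 924L),
                 Code = c("A", "B", "C"))

# value = TRUE makes grep() return the matching names instead of positions
cols <- grep("^C", names(DT), value = TRUE)

# mget() looks up each name and returns a list, which j turns into columns
res <- DT[, mget(cols)]
```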

The main issue here is that data.table treats columns as objects (variables) inside j, so you cannot use the same syntax as with a data.frame, i.e. quoted names or numbers.

Therefore, in older versions of data.table, DT[,c("ID", "Capacity")] does not work, for the same reason that DT[,seq(1:2)] does not work: j is evaluated as an expression and its value is returned as is. (Recent versions detect a call to c() that contains no variables and set with = FALSE automatically.)

However, adding with = FALSE makes data.table interpret j the way data.frame would.

Therefore, DT[,c("ID", "Capacity"), with=FALSE] and DT[,seq(1:2), with=FALSE] now give you what you want.

       ID Capacity
    1:  1      913
    2:  2      602
    3:  3      861
    4:  4      967
    5:  5      374
   ---
  196: 46      163
  197: 47      254
  198: 48      390
  199: 49      853
  200: 50      486
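For the grep() use case from the question, a sketch of three ways to select columns by computed positions might look like this. The table is a small stand-in with made-up values; the `..` prefix and .SDcols are assumed to be available, which requires a reasonably recent data.table.

```r
library(data.table)

# Stand-in table; values are made up for the example
DT <- data.table(ID = 1:4,
                 Capacity = c(483L, 703L, 924L, 267L),
                 Code = c("A", "B", "C", "D"),
                 State = c("Alabama", "Indiana", "Texas", "Nevada"))

# grep() returns the integer positions of the matching column names
idx <- grep("^C", names(DT))

# Option 1: with = FALSE makes j behave like data.frame column selection
res1 <- DT[, grep("^C", names(DT)), with = FALSE]

# Option 2: the .. prefix tells data.table to look up idx outside the table
res2 <- DT[, ..idx]

# Option 3: .SD restricted to the chosen columns via .SDcols
res3 <- DT[, .SD, .SDcols = idx]
```

All three return a data.table containing the Capacity and Code columns; which one to prefer is mostly a matter of taste.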

EDIT: as directed by @Rich Scriven


Source: https://habr.com/ru/post/1263125/

