Convert *some* column classes in data.table

I want to convert a subset of data.table columns to a new class. There is a popular question here (Convert column classes in data.table), but its answer creates a new object rather than modifying the original one.

Take this example:

dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')

What is the best way to convert only the columns listed in cols to, for example, a factor? With a plain data.frame you can do this:

 dat[, cols] <- lapply(dat[, cols], factor) 

but this does not work for a data.table, and neither does this:

 dat[, .SD := lapply(.SD, factor), .SDcols = cols] 

A comment by Matt Dowle on a related question (from December 2013) suggests the following, which works great but seems a little less elegant:

 for (j in cols) set(dat, j = j, value = factor(dat[[j]])) 

Is there currently a better data.table answer (i.e. one that is shorter and does not leave a counter variable behind), or should I just use the above followed by rm(j)?

+14
2 answers

Besides the option suggested by Matt Dowle, another way to change the column classes is as follows:

 dat[, (cols) := lapply(.SD, factor), .SDcols=cols] 

With the := operator, the data is updated by reference. Check that it worked:

 > sapply(dat,class)
        ID   Quarter     value 
  "factor"  "factor" "numeric" 
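To make the "by reference" claim concrete, here is a minimal sketch that builds the question's toy data directly as a data.table and compares the table's memory address before and after the update. It assumes the data.table package is installed; the names addr_before, classes and same_object are only illustrative.

```r
library(data.table)

# Same toy data as in the question, but created as a data.table
dat <- data.table(ID = c(rep("A", 5), rep("B", 5)),
                  Quarter = c(1:5, 1:5),
                  value = rnorm(10))
cols <- c("ID", "Quarter")

# Memory address of the table before the update
addr_before <- address(dat)

# The parentheses around cols tell data.table to use the names stored in
# cols on the left-hand side, rather than creating a column called "cols"
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]

# Column classes after the update
classes <- sapply(dat, class)

# TRUE when the table object itself was not copied by the update
same_object <- identical(addr_before, address(dat))
print(classes)
print(same_object)
```

Only the columns named in cols are replaced; the value column is left untouched.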

As suggested by @MattDowle in the comments, you can also use the for(...) set(...) combination as follows:

 for (col in cols) set(dat, j = col, value = factor(dat[[col]])) 

which gives the same result. A third option:

 for (col in cols) dat[, (col) := factor(dat[[col]])] 

On small datasets, the for(...) set(...) option is about three times faster than the lapply option (though this hardly matters, precisely because the dataset is small). On large datasets (for example, 2 million rows), all of these approaches take about the same amount of time. For testing on a larger dataset, I used:

 dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)), Quarter=c(1:1e6, 1:1e6), value=rnorm(10)) 
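One way to reproduce such a comparison is sketched below with base R's system.time. The helper make_dat is hypothetical (it is not part of the original answer) and fills the value column completely instead of recycling 10 values. Timings depend heavily on the machine and the data.table version, so no expected numbers are given.

```r
library(data.table)

# Hypothetical helper: a fresh 2 * n row table for each timing run,
# so every approach starts from unconverted columns
make_dat <- function(n) {
  data.table(ID = rep(c("A", "B"), each = n),
             Quarter = rep(seq_len(n), 2),
             value = rnorm(2 * n))
}

n <- 1e6                      # reduce this for a quicker run
cols <- c("ID", "Quarter")

d1 <- make_dat(n)
t_lapply <- system.time(
  d1[, (cols) := lapply(.SD, factor), .SDcols = cols]
)

d2 <- make_dat(n)
t_set <- system.time(
  for (col in cols) set(d2, j = col, value = factor(d2[[col]]))
)

d3 <- make_dat(n)
t_loop <- system.time(
  for (col in cols) d3[, (col) := factor(d3[[col]])]
)

# Elapsed seconds for each approach
elapsed <- c(lapply = t_lapply[["elapsed"]],
             set    = t_set[["elapsed"]],
             loop   = t_loop[["elapsed"]])
print(elapsed)
```

A single system.time call is noisy; repeating each run several times gives a more reliable picture.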

Sometimes you need a different conversion (for example, when numeric values are stored as a factor). In that case, use something like this:

 dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=cols] 
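The detour through as.character matters because calling as.integer directly on a factor returns its internal level codes, not the numbers shown as labels. A small sketch with made-up data (the table dt and its columns x and y are purely illustrative):

```r
library(data.table)

# Numbers accidentally stored as factors (illustrative data)
dt <- data.table(x = factor(c("10", "20", "10")),
                 y = factor(c("3", "1", "2")))

# Direct conversion yields the level codes: 1 2 1
codes <- as.integer(dt$x)

# Converting via character recovers the labels as numbers
num_cols <- c("x", "y")
dt[, (num_cols) := lapply(.SD, function(v) as.integer(as.character(v))),
   .SDcols = num_cols]

print(codes)   # 1 2 1
print(dt$x)    # 10 20 10
print(dt$y)    # 3 1 2
```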


WARNING: The following explanation is not the data.table way of doing things. The data is not updated by reference, because a copy is created and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.

If you want to change the column classes the same way as with a data.frame, you have to add with = FALSE as follows:

 dat[, cols] <- lapply(dat[, cols, with = FALSE], factor) 

Check if this works:

 > sapply(dat,class)
        ID   Quarter     value 
  "factor"  "factor" "numeric" 

If you do not add with = FALSE, data.table evaluates cols in dat[, cols] as an expression and returns its value, the character vector of column names. Compare the output of dat[, cols] and dat[, cols, with=FALSE]:

 > dat[, cols]
 [1] "ID"      "Quarter"
 > dat[, cols, with=FALSE]
     ID Quarter
  1:  A       1
  2:  A       2
  3:  A       3
  4:  A       4
  5:  A       5
  6:  B       1
  7:  B       2
  8:  B       3
  9:  B       4
 10:  B       5
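The difference can also be checked programmatically. The sketch below uses a tiny made-up table and, as an addition that is not part of the original answer, the `..` prefix, which more recent data.table releases accept as an alternative to with = FALSE; if your version does not support it, drop that line.

```r
library(data.table)

# Tiny illustrative table
dat <- data.table(ID = c("A", "B"), Quarter = 1:2, value = c(0.5, 1.5))
cols <- c("ID", "Quarter")

# j is evaluated as an expression: returns the character vector itself
as_vector <- dat[, cols]

# j is treated as a vector of column names: returns a two-column data.table
as_table <- dat[, cols, with = FALSE]

# Assumed to be available in your data.table version: look cols up
# in the calling scope and select those columns
dotdot <- dat[, ..cols]

print(as_vector)
print(names(as_table))
```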
+26

You can use .SDcols:

dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]

+1

Source: https://habr.com/ru/post/1236536/

