For those unfamiliar, one-hot encoding simply refers to converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables, where each new column corresponds to one of the levels of the original column. An example will explain it better:
    library(data.table)

    dt <- data.table(
      ID=1:5,
      Color=factor(c("green", "red", "red", "blue", "green"), levels=c("blue", "green", "red", "purple")),
      Shape=factor(c("square", "triangle", "square", "triangle", "cirlce"))
    )

    dt
    #    ID Color    Shape
    # 1:  1 green   square
    # 2:  2   red triangle
    # 3:  3   red   square
    # 4:  4  blue triangle
    # 5:  5 green   cirlce

    # one hot encode the colors
    color.binarized <- dcast(dt[, list(V1=1, ID, Color)], ID ~ Color, fun=sum, value.var="V1", drop=c(TRUE, FALSE))

    # Prepend Color_ in front of each one-hot-encoded feature
    setnames(color.binarized, setdiff(colnames(color.binarized), "ID"), paste0("Color_", setdiff(colnames(color.binarized), "ID")))

    # one hot encode the shapes
    shape.binarized <- dcast(dt[, list(V1=1, ID, Shape)], ID ~ Shape, fun=sum, value.var="V1", drop=c(TRUE, FALSE))

    # Prepend Shape_ in front of each one-hot-encoded feature
    setnames(shape.binarized, setdiff(colnames(shape.binarized), "ID"), paste0("Shape_", setdiff(colnames(shape.binarized), "ID")))

    # Join one-hot tables with original dataset
    dt <- dt[color.binarized, on="ID"]
    dt <- dt[shape.binarized, on="ID"]

    dt
    #    ID Color    Shape Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
    # 1:  1 green   square          0           1         0            0            0            1              0
    # 2:  2   red triangle          0           0         1            0            0            0              1
    # 3:  3   red   square          0           0         1            0            0            1              0
    # 4:  4  blue triangle          1           0         0            0            0            0              1
    # 5:  5 green   cirlce          0           1         0            0            1            0              0
This is something I do a lot, and as you can see, it is rather tedious (especially when my data has many factor columns). Is there an easier way to do this with data.table? In particular, I assumed dcast would let me one-hot encode multiple columns at once, but when I try something like
    dcast(dt[, list(V1=1, ID, Color, Shape)], ID ~ Color + Shape,
          fun=sum, value.var="V1", drop=c(TRUE, FALSE))
I get one column for every combination of Color and Shape, rather than a separate set of indicator columns for each factor:
       ID blue_cirlce blue_square blue_triangle green_cirlce green_square green_triangle red_cirlce red_square red_triangle purple_cirlce purple_square purple_triangle
    1:  1           0           0             0            0            1              0          0          0            0             0             0               0
    2:  2           0           0             0            0            0              0          0          0            1             0             0               0
    3:  3           0           0             0            0            0              0          0          1            0             0             0               0
    4:  4           0           0             1            0            0              0          0          0            0             0             0               0
    5:  5           0           0             0            1            0              0          0          0            0             0             0               0
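For comparison, here is a minimal sketch of the result I am after for several columns at once, written as a plain loop rather than a single dcast call. The helper name `one_hot` and its `cols` argument are hypothetical and not part of data.table. It assumes the selected columns are factors and adds one 0/1 integer column per level, named `<column>_<level>`. It sidesteps dcast entirely, so it is a workaround under those assumptions, not the multi-column dcast behaviour asked about above.

```r
library(data.table)

# Hypothetical helper (not part of data.table): adds one 0/1 integer
# column per level of each factor column listed in `cols`.
# By default, every factor column of `dt` is encoded.
one_hot <- function(dt, cols = names(dt)[vapply(dt, is.factor, logical(1))]) {
  out <- copy(dt)  # copy so the caller's table is not modified by reference
  for (col in cols) {
    # levels() also returns unused levels, so "purple" still gets a column
    for (lvl in levels(dt[[col]])) {
      indicator <- as.integer(dt[[col]] == lvl)  # 1 where the row has this level
      set(out, j = paste(col, lvl, sep = "_"), value = indicator)
    }
  }
  out
}

# Same example data as in the question, before any encoding
dt <- data.table(
  ID = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

encoded <- one_hot(dt)
```

Because the new columns follow the order of `levels()`, unused levels such as `purple` get an all-zero column without needing `drop=c(TRUE, FALSE)`. Note that a missing value in a factor column yields `NA` rather than 0 in its indicator columns.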