I have a script that reads data from a CSV file into data.table , and then splits the text in one column into several new columns. I am currently using lapply and strsplit for this. Here is an example:
library("data.table") df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"), VALUE = 1:6) dt = as.data.table(df) # split PREFIX into new columns dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1)) dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2)) dt # PREFIX VALUE PX PY # 1: A_B 1 AB # 2: A_C 2 AC # 3: A_D 3 AD # 4: B_A 4 BA # 5: B_C 5 BC # 6: B_D 6 BD
In the above example, the PREFIX column is split into two new columns PX and PY by the character "_".
Although this works fine, I was wondering if there is a better (more efficient) way to do this using data.table . My real data sets have> = 10M + rows, so time / memory efficiency is becoming very important.
UPDATE:
Following @Frank's suggestion, I created a larger test case and used the suggested commands, but stringr::str_split_fixed takes a lot longer than the original method.
library("data.table") library("stringr") system.time ({ df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000), VALUE = rep(1:6, 1000000)) dt = data.table(df) }) # user system elapsed # 0.682 0.075 0.758 system.time({ dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))] }) # user system elapsed # 738.283 3.103 741.674 rm(dt) system.time ( { df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000), VALUE = rep(1:6, 1000000) ) dt = as.data.table(df) }) # user system elapsed # 0.123 0.000 0.123 # split PREFIX into new columns system.time ({ dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1)) dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2)) }) # user system elapsed # 33.185 0.000 33.191
Thus, the str_split_fixed method takes about 20X times.