Split text string in data.table columns

I have a script that reads data from a CSV file into data.table , and then splits the text in one column into several new columns. I am currently using lapply and strsplit for this. Here is an example:

 library("data.table") df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"), VALUE = 1:6) dt = as.data.table(df) # split PREFIX into new columns dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1)) dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2)) dt # PREFIX VALUE PX PY # 1: A_B 1 AB # 2: A_C 2 AC # 3: A_D 3 AD # 4: B_A 4 BA # 5: B_C 5 BC # 6: B_D 6 BD 

In the above example, the PREFIX column is split into two new columns PX and PY by the character "_".

Although this works fine, I was wondering if there is a better (more efficient) way to do this using data.table . My real data sets have> = 10M + rows, so time / memory efficiency is becoming very important.




UPDATE:

Following @Frank's suggestion, I created a larger test case and used the suggested commands, but stringr::str_split_fixed takes a lot longer than the original method.

 library("data.table") library("stringr") system.time ({ df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000), VALUE = rep(1:6, 1000000)) dt = data.table(df) }) # user system elapsed # 0.682 0.075 0.758 system.time({ dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))] }) # user system elapsed # 738.283 3.103 741.674 rm(dt) system.time ( { df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000), VALUE = rep(1:6, 1000000) ) dt = as.data.table(df) }) # user system elapsed # 0.123 0.000 0.123 # split PREFIX into new columns system.time ({ dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1)) dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2)) }) # user system elapsed # 33.185 0.000 33.191 

Thus, the str_split_fixed method takes about 20X times.

+48
r data.table
Aug 09 '13 at 19:49
source share
4 answers

Update: From version 1.9.6 (according to CRAN from September 15) we can use the tstrsplit() function to get the results directly (and in a more efficient way):

 require(data.table) ## v1.9.6+ dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)] # PREFIX VALUE PX PY # 1: A_B 1 AB # 2: A_C 2 AC # 3: A_D 3 AD # 4: B_A 4 BA # 5: B_C 5 BC # 6: B_D 6 BD 

tstrsplit() is basically a wrapper for transpose(strsplit()) , where the transpose() function, also recently implemented, wraps the list. See ?tstrsplit() and ?transpose() ?tstrsplit() for an example.

See the history of old answers.

+63
Aug 10 '13 at 1:00
source share

I am adding an answer for those who do not use data.table v1.9.5 , and also want a one-line solution.

 dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '-'))) ] 
+12
Apr 01 '15 at 7:48
source share

Using splitstackshape :

 library(splitstackshape) cSplit(df, splitCols = "PREFIX", sep = "_", direction = "wide", drop = FALSE) # PREFIX VALUE PREFIX_1 PREFIX_2 # 1: A_B 1 AB # 2: A_C 2 AC # 3: A_D 3 AD # 4: B_A 4 BA # 5: B_C 5 BC # 6: B_D 6 BD 
+5
May 16 '16 at 20:07
source share

With tidir solution:

 separate(df,col = "PREFIX",into = c("PX", "PY"), sep = "_") 
0
Jul 30 '17 at 13:12
source share



All Articles