Speed up `strsplit` when possible outputs are known

Question

I have a large data frame with a factor column that I need to split into three factor columns by splitting the factor names on a separator. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):

    data <- readRDS("data.rds")
    data.df <- reshape2:::melt.array(data)
    head(data.df)
    ##  Time Location    Class Replicate Population
    ##1    1        1 LIDE.1.S         1 0.03859605
    ##2    2        1 LIDE.1.S         1 0.03852957
    ##3    3        1 LIDE.1.S         1 0.03846853
    ##4    4        1 LIDE.1.S         1 0.03841260
    ##5    5        1 LIDE.1.S         1 0.03836147
    ##6    6        1 LIDE.1.S         1 0.03831485

    Rprof("str.out")
    cl <- which(names(data.df)=="Class")
    Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
    colnames(Classes) <- c("Species", "SizeClass", "Infected")
    data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
    Rprof(NULL)

    head(data.df)
    ##  Time Location Species SizeClass Infected Replicate Population
    ##1    1        1    LIDE         1        S         1 0.03859605
    ##2    2        1    LIDE         1        S         1 0.03852957
    ##3    3        1    LIDE         1        S         1 0.03846853
    ##4    4        1    LIDE         1        S         1 0.03841260
    ##5    5        1    LIDE         1        S         1 0.03836147
    ##6    6        1    LIDE         1        S         1 0.03831485

    summaryRprof("str.out")
    $by.self
                     self.time self.pct total.time total.pct
    "strsplit"            1.34    50.00       1.34     50.00
    "<Anonymous>"         1.16    43.28       1.16     43.28
    "do.call"             0.04     1.49       2.54     94.78
    "unique.default"      0.04     1.49       0.04      1.49
    "data.frame"          0.02     0.75       0.12      4.48
    "is.factor"           0.02     0.75       0.02      0.75
    "match"               0.02     0.75       0.02      0.75
    "structure"           0.02     0.75       0.02      0.75
    "unlist"              0.02     0.75       0.02      0.75

    $by.total
                           total.time total.pct self.time self.pct
    "do.call"                    2.54     94.78      0.04     1.49
    "strsplit"                   1.34     50.00      1.34    50.00
    "<Anonymous>"                1.16     43.28      1.16    43.28
    "cbind"                      0.14      5.22      0.00     0.00
    "data.frame"                 0.12      4.48      0.02     0.75
    "as.data.frame.matrix"       0.08      2.99      0.00     0.00
    "as.data.frame"              0.08      2.99      0.00     0.00
    "as.factor"                  0.08      2.99      0.00     0.00
    "factor"                     0.06      2.24      0.00     0.00
    "unique.default"             0.04      1.49      0.04     1.49
    "unique"                     0.04      1.49      0.00     0.00
    "is.factor"                  0.02      0.75      0.02     0.75
    "match"                      0.02      0.75      0.02     0.75
    "structure"                  0.02      0.75      0.02     0.75
    "unlist"                     0.02      0.75      0.02     0.75
    "[.data.frame"               0.02      0.75      0.00     0.00
    "["                          0.02      0.75      0.00     0.00

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 2.68

Is there any way to speed up this operation? I note that there are only a small number (<5) of each of the Species, SizeClass, and Infected categories, and I know what these are in advance.

Notes:

stringr::str_split_fixed performs this task, but is no faster.
The data frame is actually generated initially by calling reshape::melt on an array in which Class and its associated levels are one of the dimensions. If there is a faster way to get from there to here, great.
data.rds is available at http://dl.getdropbox.com/u/3356641/data.rds

+6

performance r reshape2 strsplit stringr

Noam Ross May 20, '13 at 12:39

3 answers

You can get a decent speedup by extracting just the desired parts of the string with gsub, instead of splitting everything and then putting it back together:

    data <- readRDS("~/Downloads/data.rds")
    data.df <- reshape2:::melt.array(data)

    # using `strsplit`
    system.time({
      cl <- which(names(data.df)=="Class")
      Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
      colnames(Classes) <- c("Species", "SizeClass", "Infected")
      data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
    })
       user  system elapsed
      3.349   0.062   3.411

    # using `gsub`
    system.time({
      data.df$Class <- as.character(data.df$Class)
      data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class, perl = TRUE)
      data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class, perl = TRUE)
      data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class, perl = TRUE)
    })
       user  system elapsed
      0.812   0.037   0.848
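As a quick sanity check of what each capture group returns, here is a small sketch on toy values. The three strings are made up for illustration and only mimic the "Species.SizeClass.Infected" shape shown in the question. Note that the pattern assumes the middle field is all digits; `gsub` returns a string unchanged when the pattern does not match.

```r
# Toy values in the "Species.SizeClass.Infected" shape (made up for illustration)
x <- c("LIDE.1.S", "LIDE.2.I", "UMCA.1.S")
pattern <- "(\\w+)\\.(\\d+)\\.(\\w+)"

# Each call keeps one capture group and discards the rest of the string
species   <- gsub(pattern, "\\1", x, perl = TRUE)
sizeclass <- gsub(pattern, "\\2", x, perl = TRUE)
infected  <- gsub(pattern, "\\3", x, perl = TRUE)

# Compare against splitting on the literal dot
ref <- do.call(rbind, strsplit(x, ".", fixed = TRUE))
stopifnot(identical(species,   ref[, 1]),
          identical(sizeclass, ref[, 2]),
          identical(infected,  ref[, 3]))
```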

+3

Schaunw May 20, '13 at 1:04

It looks like you have a factor, so work on the levels and then map back to the rows. Use fixed=TRUE in strsplit, changing the separator to split=".".

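    # split only the levels, not every row
    Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
    colnames(Classes) <- c("Species", "SizeClass", "Infected")
    # indexing by the factor uses its integer codes, i.e. one row per observation
    df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
    cbind(data.df, df0)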
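To see why this gives the same answer as splitting every row, here is a minimal sketch on a toy factor. The values are made up and `Class` here only stands in for `data.df$Class`. It writes the integer codes out explicitly with `as.integer()`, which is what indexing by a factor does implicitly.

```r
# Toy factor column standing in for data.df$Class (values are made up)
Class <- factor(c("LIDE.1.S", "LIDE.2.I", "LIDE.1.S", "UMCA.1.S", "LIDE.2.I"))

# Split only the levels: 3 strings are split instead of 5
lev <- do.call(rbind, strsplit(levels(Class), ".", fixed = TRUE))
colnames(lev) <- c("Species", "SizeClass", "Infected")

# Expand back to one row per observation via the factor's integer codes
by_levels <- lev[as.integer(Class), , drop = FALSE]

# Reference: split every row, as in the question
by_rows <- do.call(rbind, strsplit(as.character(Class), ".", fixed = TRUE))
colnames(by_rows) <- colnames(lev)

stopifnot(identical(by_levels, by_rows))
```

With millions of rows and only a handful of levels, the split itself becomes negligible and the cost is dominated by the row indexing.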

+2

Martin Morgan May 20, '13 at 4:19

Ricardo Saporta · Accepted Answer · 2013-05-20T00:55:13+0000

This should probably give a decent speedup:

    library(data.table)
    DT <- data.table(data.df)
    # Class is a factor, so convert to character before calling strsplit
    DT[, c("Species", "SizeClass", "Infected") := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]

Reasons for the speedup:

data.table pre-allocates memory for columns.
Every column assignment in a data.frame reassigns the entirety of the data; data.table, in contrast, does not.
The by statement lets you run the strsplit task once per unique value.
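The last point does not depend on data.table. As a rough base-R sketch of the same idea for a plain character column (toy values, made up for illustration; `x` stands in for `as.character(data.df$Class)`), each distinct string is split once and `match()` maps the pieces back to the rows:

```r
# Toy character column (made-up values)
x <- c("LIDE.1.S", "LIDE.2.I", "LIDE.1.S", "UMCA.1.S", "LIDE.2.I", "LIDE.1.S")

# Split each distinct value exactly once
u <- unique(x)
parts <- do.call(rbind, strsplit(u, ".", fixed = TRUE))
colnames(parts) <- c("Species", "SizeClass", "Infected")

# match() gives, for every row, the position of its value in u
out <- as.data.frame(parts[match(x, u), , drop = FALSE], stringsAsFactors = TRUE)
```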

Here is a good quick method for the whole process.

    # Save the new col names as a character vector
    newCols <- c("Species", "SizeClass", "Infected")

    # split the string, then convert the new cols to factors
    DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
    DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]

    # remove the old column. This is instantaneous.
    DT[, Class := NULL]

    ## Have a look:
    DT[, lapply(.SD, class)]
    #       Time Location Replicate Population Species SizeClass Infected
    # 1: integer  integer   integer    numeric  factor    factor   factor

    DT