Speed up `strsplit` when possible outputs are known

I have a large data frame with a factor column, which I need to split into three factor columns by splitting the factor levels on a separator. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):

    data <- readRDS("data.rds")
    data.df <- reshape2:::melt.array(data)
    head(data.df)
    ##  Time Location    Class Replicate Population
    ##1    1        1 LIDE.1.S         1 0.03859605
    ##2    2        1 LIDE.1.S         1 0.03852957
    ##3    3        1 LIDE.1.S         1 0.03846853
    ##4    4        1 LIDE.1.S         1 0.03841260
    ##5    5        1 LIDE.1.S         1 0.03836147
    ##6    6        1 LIDE.1.S         1 0.03831485

    Rprof("str.out")
    cl <- which(names(data.df)=="Class")
    Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
    colnames(Classes) <- c("Species", "SizeClass", "Infected")
    data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
    Rprof(NULL)

    head(data.df)
    ##  Time Location Species SizeClass Infected Replicate Population
    ##1    1        1    LIDE         1        S         1 0.03859605
    ##2    2        1    LIDE         1        S         1 0.03852957
    ##3    3        1    LIDE         1        S         1 0.03846853
    ##4    4        1    LIDE         1        S         1 0.03841260
    ##5    5        1    LIDE         1        S         1 0.03836147
    ##6    6        1    LIDE         1        S         1 0.03831485

    summaryRprof("str.out")
    $by.self
                     self.time self.pct total.time total.pct
    "strsplit"            1.34    50.00       1.34     50.00
    "<Anonymous>"         1.16    43.28       1.16     43.28
    "do.call"             0.04     1.49       2.54     94.78
    "unique.default"      0.04     1.49       0.04      1.49
    "data.frame"          0.02     0.75       0.12      4.48
    "is.factor"           0.02     0.75       0.02      0.75
    "match"               0.02     0.75       0.02      0.75
    "structure"           0.02     0.75       0.02      0.75
    "unlist"              0.02     0.75       0.02      0.75

    $by.total
                           total.time total.pct self.time self.pct
    "do.call"                    2.54     94.78      0.04     1.49
    "strsplit"                   1.34     50.00      1.34    50.00
    "<Anonymous>"                1.16     43.28      1.16    43.28
    "cbind"                      0.14      5.22      0.00     0.00
    "data.frame"                 0.12      4.48      0.02     0.75
    "as.data.frame.matrix"       0.08      2.99      0.00     0.00
    "as.data.frame"              0.08      2.99      0.00     0.00
    "as.factor"                  0.08      2.99      0.00     0.00
    "factor"                     0.06      2.24      0.00     0.00
    "unique.default"             0.04      1.49      0.04     1.49
    "unique"                     0.04      1.49      0.00     0.00
    "is.factor"                  0.02      0.75      0.02     0.75
    "match"                      0.02      0.75      0.02     0.75
    "structure"                  0.02      0.75      0.02     0.75
    "unlist"                     0.02      0.75      0.02     0.75
    "[.data.frame"               0.02      0.75      0.00     0.00
    "["                          0.02      0.75      0.00     0.00

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 2.68

Is there any way to speed up this operation? Note that there are only a small number (<5) of distinct values in each of the Species, SizeClass, and Infected categories, and I know what these are in advance.

Notes:

  • stringr::str_split_fixed performs this task, but is no faster
  • The data frame is actually generated initially by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there is a faster way to get from there to here, great.
  • data.rds at http://dl.getdropbox.com/u/3356641/data.rds
+6
3 answers

This should probably give you a decent speedup:

    library(data.table)
    DT <- data.table(data.df)
    # Class is a factor, so convert it to character before splitting
    DT[, c("Species", "SizeClass", "Infected")
          := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]

Reasons for the speedup:

  • data.table pre-allocates memory for columns
  • every column assignment in a data.frame reassigns the entirety of the data (data.table, in contrast, does not)
  • the by argument lets you run the strsplit task only once per unique value.

Here is a nice, quick method for the whole process.

    # Save the new col names as a character vector
    newCols <- c("Species", "SizeClass", "Infected")

    # split the string, then convert the new cols to factors
    DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
    DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]

    # remove the old column. This is instantaneous.
    DT[, Class := NULL]

    ## Have a look:
    DT[, lapply(.SD, class)]
    #       Time Location Replicate Population Species SizeClass Infected
    # 1: integer  integer   integer    numeric  factor    factor   factor

    DT
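To try the idea without downloading data.rds, here is a minimal sketch on a small made-up table. The toy values and the name `toy` are illustrative assumptions, not taken from the real data, and `fixed = TRUE` is used in place of the escaped regex only as a convenience.

```r
library(data.table)

# Hypothetical toy data standing in for the melted data frame
toy <- data.table(
  Time       = 1:6,
  Class      = factor(c("LIDE.1.S", "LIDE.1.S", "LIDE.2.I",
                        "QUAG.1.S", "QUAG.1.S", "LIDE.2.I")),
  Population = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
)

newCols <- c("Species", "SizeClass", "Infected")

# Within each group Class is a single value, so strsplit runs once per
# unique level rather than once per row
toy[, (newCols) := as.list(strsplit(as.character(Class), ".", fixed = TRUE)[[1]]),
    by = Class]

# Convert the new character columns to factors and drop the original column
toy[, (newCols) := lapply(.SD, factor), .SDcols = newCols]
toy[, Class := NULL]

print(toy)
```

With three unique Class values in six rows, strsplit is called three times here; on the real data the saving should scale with the ratio of rows to unique levels.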
+5

You can get a decent increase in speed by simply extracting the desired parts of the string with gsub, instead of splitting everything and trying to put it back together:

    data <- readRDS("~/Downloads/data.rds")
    data.df <- reshape2:::melt.array(data)

    # using `strsplit`
    system.time({
      cl <- which(names(data.df)=="Class")
      Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
      colnames(Classes) <- c("Species", "SizeClass", "Infected")
      data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
    })

       user  system elapsed
      3.349   0.062   3.411

    # re-create the original data frame, since the block above replaced Class
    data.df <- reshape2:::melt.array(data)

    # using `gsub`
    system.time({
      data.df$Class <- as.character(data.df$Class)
      data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
        perl = TRUE)
      data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class,
        perl = TRUE)
      # overwrite Class last; it ends up holding the species part
      data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class,
        perl = TRUE)
    })

       user  system elapsed
      0.812   0.037   0.848
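As a quick sanity check, the sketch below compares the capture-group extraction with a plain split on a few made-up strings. The sample values and the names `x`, `by_regex` and `by_split` are assumptions for illustration. The pattern is the one from the answer, so it also assumes the middle field is always made of digits.

```r
# Hypothetical sample values in the same "Species.SizeClass.Infected" form
x <- c("LIDE.1.S", "LIDE.2.I", "QUAG.1.S")
pattern <- "(\\w+)\\.(\\d+)\\.(\\w+)"

# One regex call per field, each keeping a different capture group
by_regex <- cbind(
  Species   = gsub(pattern, "\\1", x, perl = TRUE),
  SizeClass = gsub(pattern, "\\2", x, perl = TRUE),
  Infected  = gsub(pattern, "\\3", x, perl = TRUE)
)

# Reference result from splitting on the literal dot
by_split <- do.call(rbind, strsplit(x, ".", fixed = TRUE))
colnames(by_split) <- colnames(by_regex)

# TRUE when every extracted field matches the split result
print(all(by_regex == by_split))
```

If a value does not match the pattern (for example a non-numeric size class), gsub returns it unchanged rather than failing, so a check like this is worth running on the real levels first.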
+3

It looks like you have a factor, so work on the levels and then map back to the rows. Use fixed=TRUE in strsplit, changing the separator to split=".".

    Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
    colnames(Classes) <- c("Species", "SizeClass", "Infected")
    df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
    cbind(data.df, df0)
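The map-back step relies on a factor's integer codes being positions in its levels. Here is a minimal sketch on a made-up factor. The values and the names `parts` and `expanded` are assumptions for illustration, and `as.integer()` is written out explicitly only to make the indexing visible.

```r
# Hypothetical toy factor standing in for data.df$Class
Class <- factor(c("LIDE.1.S", "LIDE.1.S", "QUAG.2.I", "LIDE.1.S", "QUAG.2.I"))

# Split only the unique levels (2 strings here, not 5)
parts <- do.call(rbind, strsplit(levels(Class), ".", fixed = TRUE))
colnames(parts) <- c("Species", "SizeClass", "Infected")

# as.integer(Class) gives each row's position in levels(Class),
# so indexing by it expands the small lookup table back to full length
expanded <- as.data.frame(parts[as.integer(Class), , drop = FALSE],
                          stringsAsFactors = TRUE)
print(expanded)
```

Because the split happens on the levels only, its cost depends on the number of distinct classes rather than on the number of rows.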
+2

Source: https://habr.com/ru/post/945388/

