This old question is often used as the target of cheating (tagged r-faq ). To date, there have been three answers, suggesting 6 different approaches, but does not have a benchmark , since the leadership which of the approaches is the fastest 1 .
Reference decisions include
- Matthew Lundberg's approach to R base, but modified according to Rich Scriven's comment ,
- Jaap two
data.table methods and two dplyr / tidyr , - Ananda
splitstackshape solution , - and two additional variations of the Jaap
data.table methods.
Altogether, 8 different methods were compared with 6 different data frame sizes using the microbenchmark package (see code below).
The sample data provided by the OP consists of only 20 rows. To create large data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10,000 and 100,000 times, which gives problems up to 2 million rows.
Test results

Test results show that for sufficiently large data frames, all data.table methods are faster than any other method. For data frames containing more than 5,000 rows, the Jaap data.table 2 method and the DT3 variant are the fastest, values faster than the slowest methods.
Remarkably, the timings of the two tidyverse methods and the splistackshape solution splistackshape so similar that it is difficult to distinguish the curves in the diagram. They are the slowest of the compared methods in all sizes of data frames.
For smaller data frames, the Matt base R solution and the data.table 4 method data.table have less overhead than other methods.
The code
director <- c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", "Anne Fontaine", "Anthony Harvey") AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B", "B", "B", "B", "A") library(data.table) library(magrittr)
Define a function for test runs of problem size n
run_mb <- function(n) { # compute number of benchmark runs depending on problem size `n` mb_times <- scales::squish(10000L / n , c(3L, 100L)) cat(n, " ", mb_times, "\n") # create data DF <- data.frame(director = rep(director, n), AB = rep(AB, n)) DT <- as.data.table(DF) # start benchmarks microbenchmark::microbenchmark( matt_mod = { s <- strsplit(as.character(DF$director), ',') data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))}, jaap_DT1 = { DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB ][!is.na(director)]}, jaap_DT2 = { DT[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]}, jaap_dplyr = { DF %>% dplyr::mutate(director = strsplit(as.character(director), ",")) %>% tidyr::unnest(director)}, jaap_tidyr = { tidyr::separate_rows(DF, director, sep = ",")}, cSplit = { splitstackshape::cSplit(DF, "director", ",", direction = "long")}, DT3 = { DT[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)][, director := NULL][ , setnames(.SD, "V1", "director")]}, DT4 = { DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), by = .(AB)]}, times = mb_times ) }
Run a performance test for different size problems
# define vector of problem sizes n_rep <- 10L^(0:5) # run benchmark for different problem sizes mb <- lapply(n_rep, run_mb)
Prepare data for printing
mbl <- rbindlist(mb, idcol = "N") mbl[, n_row := NROW(director) * n_rep[N]] mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)] mba[, expr := forcats::fct_reorder(expr, -median_time)]
Create chart
library(ggplot2) ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + geom_point() + geom_smooth(se = FALSE) + scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + xlab("number of rows") + ylab("median of execution time [ms]") + ggtitle("microbenchmark results") + theme_bw()
Session and package version information (excerpt)
devtools::session_info() #Session info # version R version 3.3.2 (2016-10-31) # system x86_64, mingw32 #Packages # data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2) # dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1) # forcats 0.2.0 2017-01-23 CRAN (R 3.3.2) # ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2) # magrittr * 1.5 2014-11-22 CRAN (R 3.3.0) # microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3) # scales 0.4.1 2016-11-09 CRAN (R 3.3.2) # splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3) # tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2)
1 My curiosity was caused by this violent commentary Excellent! Orders are faster! to tidyverse answer to a question that was closed as a duplicate of this question.