Split comma-separated rows in a column into separate rows

I have a data frame, for example:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A')) 

As you can see, some entries in the director column are several names separated by commas. I would like to break these records into separate rows, keeping the values ​​of another column. As an example, the first row in the above data frame should be split into two rows with the same name in the director column and “A” in the AB column.

+89
string split r r-faq
Dec 08
source share
5 answers

This old question is often used as the target of cheating (tagged r-faq ). To date, there have been three answers, suggesting 6 different approaches, but does not have a benchmark , since the leadership which of the approaches is the fastest 1 .

Reference decisions include

  • Matthew Lundberg's approach to R base, but modified according to Rich Scriven's comment ,
  • Jaap two data.table methods and two dplyr / tidyr ,
  • Ananda splitstackshape solution ,
  • and two additional variations of the Jaap data.table methods.

Altogether, 8 different methods were compared with 6 different data frame sizes using the microbenchmark package (see code below).

The sample data provided by the OP consists of only 20 rows. To create large data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10,000 and 100,000 times, which gives problems up to 2 million rows.

Test results

enter image description here

Test results show that for sufficiently large data frames, all data.table methods are faster than any other method. For data frames containing more than 5,000 rows, the Jaap data.table 2 method and the DT3 variant are the fastest, values ​​faster than the slowest methods.

Remarkably, the timings of the two tidyverse methods and the splistackshape solution splistackshape so similar that it is difficult to distinguish the curves in the diagram. They are the slowest of the compared methods in all sizes of data frames.

For smaller data frames, the Matt base R solution and the data.table 4 method data.table have less overhead than other methods.

The code

 director <- c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", "Anne Fontaine", "Anthony Harvey") AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B", "B", "B", "B", "A") library(data.table) library(magrittr) 

Define a function for test runs of problem size n

 run_mb <- function(n) { # compute number of benchmark runs depending on problem size `n` mb_times <- scales::squish(10000L / n , c(3L, 100L)) cat(n, " ", mb_times, "\n") # create data DF <- data.frame(director = rep(director, n), AB = rep(AB, n)) DT <- as.data.table(DF) # start benchmarks microbenchmark::microbenchmark( matt_mod = { s <- strsplit(as.character(DF$director), ',') data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))}, jaap_DT1 = { DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB ][!is.na(director)]}, jaap_DT2 = { DT[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]}, jaap_dplyr = { DF %>% dplyr::mutate(director = strsplit(as.character(director), ",")) %>% tidyr::unnest(director)}, jaap_tidyr = { tidyr::separate_rows(DF, director, sep = ",")}, cSplit = { splitstackshape::cSplit(DF, "director", ",", direction = "long")}, DT3 = { DT[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)][, director := NULL][ , setnames(.SD, "V1", "director")]}, DT4 = { DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), by = .(AB)]}, times = mb_times ) } 

Run a performance test for different size problems

 # define vector of problem sizes n_rep <- 10L^(0:5) # run benchmark for different problem sizes mb <- lapply(n_rep, run_mb) 

Prepare data for printing

 mbl <- rbindlist(mb, idcol = "N") mbl[, n_row := NROW(director) * n_rep[N]] mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)] mba[, expr := forcats::fct_reorder(expr, -median_time)] 

Create chart

 library(ggplot2) ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + geom_point() + geom_smooth(se = FALSE) + scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + xlab("number of rows") + ylab("median of execution time [ms]") + ggtitle("microbenchmark results") + theme_bw() 

Session and package version information (excerpt)

 devtools::session_info() #Session info # version R version 3.3.2 (2016-10-31) # system x86_64, mingw32 #Packages # data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2) # dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1) # forcats 0.2.0 2017-01-23 CRAN (R 3.3.2) # ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2) # magrittr * 1.5 2014-11-22 CRAN (R 3.3.0) # microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3) # scales 0.4.1 2016-11-09 CRAN (R 3.3.2) # splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3) # tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2) 



1 My curiosity was caused by this violent commentary Excellent! Orders are faster! to tidyverse answer to a question that was closed as a duplicate of this question.

+64
Apr 15 '17 at 22:21
source share

Several alternatives:

1) in two ways with data.table :

 library(data.table) # method 1 (preferred) setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB ][!is.na(director)] # method 2 setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director) ][,.(director = V1, AB)] 

2) dplyr / tidyr : alternatively, you can also use dplyr / tidyr :

 library(dplyr) library(tidyr) v %>% mutate(director = strsplit(as.character(director), ",")) %>% unnest(director) 

3) only with tidyr : with tidyr 0.5.0 (and later) you can also just use separate_rows :

 separate_rows(v, director, sep = ",") 

You can use the convert = TRUE parameter to automatically convert numbers to numeric columns.

4) with base R:

 # if 'director' is a character-column: stack(setNames(strsplit(df$director,','), df$AB)) # if 'director' is a factor-column: stack(setNames(strsplit(as.character(df$director),','), df$AB)) 
+78
Jul 20 '15 at 10:56
source share

By naming the source data.frame v file, we have the following:

 > s <- strsplit(as.character(v$director), ',') > data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length))) director AB 1 Aaron Blaise A 2 Bob Walker A 3 Akira Kurosawa B 4 Alan J. Pakula A 5 Alan Parker A 6 Alejandro Amenabar B 7 Alejandro Gonzalez Inarritu B 8 Alejandro Gonzalez Inarritu B 9 Benicio Del Toro B 10 Alejandro González Iñárritu A 11 Alex Proyas B 12 Alexander Hall A 13 Alfonso Cuaron B 14 Alfred Hitchcock A 15 Anatole Litvak A 16 Andrew Adamson B 17 Marilyn Fox B 18 Andrew Dominik B 19 Andrew Stanton B 20 Andrew Stanton B 21 Lee Unkrich B 22 Angelina Jolie B 23 John Stevenson B 24 Anne Fontaine B 25 Anthony Harvey A 

Note the use of rep to create a new AB column. Here sapply returns the number of names in each of the source strings.

+47
Dec 08
source share

A later side, but another generic alternative is to use cSplit from my splitstackshape package, which has a direction argument. Set this parameter to "long" to get the specified result:

 library(splitstackshape) head(cSplit(mydf, "director", ",", direction = "long")) # director AB # 1: Aaron Blaise A # 2: Bob Walker A # 3: Akira Kurosawa B # 4: Alan J. Pakula A # 5: Alan Parker A # 6: Alejandro Amenabar B 
+29
03 Feb '15 at 4:34
source share
 devtools::install_github("yikeshu0611/onetree") library(onetree) dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",") head(dd) director AB 1 Aaron Blaise A 2 Bob Walker A 3 Akira Kurosawa B 4 Alan J. Pakula A 5 Alan Parker A 6 Alejandro Amenabar B 
0
Jul 26 '19 at 18:21
source share



All Articles