Separate rows with delimiters in a column and insert as new rows

I have a data frame:

    +-----+-------+
    | V1  | V2    |
    +-----+-------+
    | 1   | a,b,c |
    | 2   | a,c   |
    | 3   | b,d   |
    | 4   | e,f   |
    | .   | .     |
    +-----+-------+

Each value in V2 is a string of letters separated by commas. I would like to split V2 at each comma and insert the resulting pieces as new rows. For example, the desired result would be:

    +----+----+
    | V1 | V2 |
    +----+----+
    | 1  | a  |
    | 1  | b  |
    | 1  | c  |
    | 2  | a  |
    | 2  | c  |
    | 3  | b  |
    | 3  | d  |
    | 4  | e  |
    | 4  | f  |
    +----+----+

I tried using strsplit() to split V2 first and then converting the resulting list into a data frame. This did not work. Any help would be appreciated.

+46
r dataframe reshape data-manipulation strsplit
Mar 11 '13 at 19:47
6 answers

Here is another way to do this.

    df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"),
                     header = FALSE, sep = "|", stringsAsFactors = FALSE)

    df
    ##   V1    V2
    ## 1  1 a,b,c
    ## 2  2   a,c
    ## 3  3   b,d
    ## 4  4   e,f

    s <- strsplit(df$V2, split = ",")
    data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
    ##   V1 V2
    ## 1  1  a
    ## 2  1  b
    ## 3  1  c
    ## 4  2  a
    ## 5  2  c
    ## 6  3  b
    ## 7  3  d
    ## 8  4  e
    ## 9  4  f
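The same idea can be wrapped in a small helper. The following is only a minimal sketch in base R, assuming the column to split is character (or can be coerced to it) and that R 3.2.0 or later is available for lengths() and trimws(). The function name split_rows and its arguments are hypothetical and not part of any package.

```r
# Hypothetical helper: expand a delimited column into one row per item.
split_rows <- function(df, col, sep = ",") {
  # Split each cell into its pieces; fixed = TRUE treats sep literally.
  pieces <- strsplit(as.character(df[[col]]), split = sep, fixed = TRUE)
  # Repeat each row index once per piece found in that row.
  idx <- rep(seq_len(nrow(df)), lengths(pieces))
  out <- df[idx, , drop = FALSE]
  # trimws() drops stray spaces around items such as "a, b".
  out[[col]] <- trimws(unlist(pieces))
  rownames(out) <- NULL
  out
}

df <- data.frame(V1 = 1:4, V2 = c("a,b,c", "a,c", "b,d", "e,f"),
                 stringsAsFactors = FALSE)
res <- split_rows(df, "V2")
```

Indexing the whole data frame by the repeated row numbers, rather than repeating V1 alone, carries any additional columns along as well.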
+36
Mar 11 '13 at 19:58

As of December 2014, this can be done using the unnest function from Hadley Wickham's tidyr package (see the release notes at http://blog.rstudio.org/2014/12/08/tidyr-0-2-0/ )

    > library(tidyr)
    > library(dplyr)
    > mydf

      V1    V2
    2  1 a,b,c
    3  2   a,c
    4  3   b,d
    5  4   e,f
    6  .     .

    > mydf %>%
        mutate(V2 = strsplit(as.character(V2), ",")) %>%
        unnest(V2)

       V1 V2
    1   1  a
    2   1  b
    3   1  c
    4   2  a
    5   2  c
    6   3  b
    7   3  d
    8   4  e
    9   4  f
    10  .  .
+50
Dec 08 '14 at 15:07

Here is a data.table solution:

    d.df <- read.table(header=T, text="V1 | V2
    1 | a,b,c
    2 | a,c
    3 | b,d
    4 | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE)
    require(data.table)
    d.dt <- data.table(d.df, key="V1")
    out <- d.dt[, list(V2 = unlist(strsplit(V2, ","))), by=V1]

    #    V1 V2
    # 1:  1  a
    # 2:  1  b
    # 3:  1  c
    # 4:  2  a
    # 5:  2  c
    # 6:  3  b
    # 7:  3  d
    # 8:  4  e
    # 9:  4  f

    > sapply(out$V2, nchar) # (or simply nchar(out$V2))
    # a b c a c b d e f
    # 1 1 1 1 1 1 1 1 1
+18
Mar 11 '13 at 20:08

With tidyr 0.5.0 and later, you can use separate_rows instead of strsplit + unnest.

For example:

    library(tidyr)
    (df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"),
                      header = FALSE, sep = "|", stringsAsFactors = FALSE))

      V1    V2
    1  1 a,b,c
    2  2   a,c
    3  3   b,d
    4  4   e,f

    separate_rows(df, V2)

gives:

      V1 V2
    1  1  a
    2  1  b
    3  1  c
    4  2  a
    5  2  c
    6  3  b
    7  3  d
    8  4  e
    9  4  f

See link: https://blog.rstudio.org/2016/06/13/tidyr-0-5-0/

+13
Jun 22 '16 at 22:01

You could consider cSplit with direction = "long" from my splitstackshape package.

Usage:

    library(splitstackshape)
    cSplit(mydf, "V2", ",", "long")
    ##    V1 V2
    ## 1:  1  a
    ## 2:  1  b
    ## 3:  1  c
    ## 4:  2  a
    ## 5:  2  c
    ## 6:  3  b
    ## 7:  3  d
    ## 8:  4  e
    ## 9:  4  f



Old answer ....

Here is one approach using base R. It assumes that we start with a data.frame named "mydf". It uses read.csv to read in the second column as a separate data.frame, which we combine with the first column from your original data. Finally, reshape converts the data to a long form.

    temp <- data.frame(Ind = mydf$V1,
        read.csv(text = as.character(mydf$V2), header = FALSE))
    temp1 <- reshape(temp, direction = "long", idvar = "Ind",
        timevar = "time", varying = 2:ncol(temp), sep = "")
    temp1[!temp1$V == "", c("Ind", "V")]
    #     Ind V
    # 1.1   1 a
    # 2.1   2 a
    # 3.1   3 b
    # 4.1   4 e
    # 1.2   1 b
    # 2.2   2 c
    # 3.2   3 d
    # 4.2   4 f
    # 1.3   1 c

Another pretty direct alternative:

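    # lapply (rather than sapply) always returns a list, even when
    # every row happens to contain the same number of items
    stack(
      setNames(
        lapply(strsplit(mydf$V2, ","),
               function(x) gsub("^\\s|\\s$", "", x)), mydf$V1))
      values ind
    1      a   1
    2      b   1
    3      c   1
    4      a   2
    5      c   2
    6      b   3
    7      d   3
    8      e   4
    9      f   4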
+11
Mar 11 '13 at 19:55

Here is another data.table solution, one that does not rely on the existence of any unique field in the source data.

    library(data.table)
    DT = data.table(read.table(header=T, text="blah | splitme
    T | a,b,c
    T | a,c
    F | b,d
    F | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE))

    DT[,.( blah
         , splitme
         , splitted=unlist(strsplit(splitme, ","))
         ),by=seq_len(nrow(DT))]

The important part is by=seq_len(nrow(DT)): this is the "fake" unique identifier on which the splitting occurs. It's tempting to use by=.I instead, since it should be defined the same way, but .I seems to be a magical thing that changes its value, so it is better to stick with by=seq_len(nrow(DT)).

There are three columns in the output. We simply name the two existing columns and then compute the third from the split:

    .( blah     # first column of original
     , splitme  # second column of original
     , splitted = unlist(strsplit(splitme, ","))
     )
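The row-number trick is not specific to data.table. Below is a minimal base R sketch of the same idea, assuming the same blah and splitme columns as in the example; the object names dat, s and res are illustrative only.

```r
dat <- data.frame(blah = c(TRUE, TRUE, FALSE, FALSE),
                  splitme = c("a,b,c", "a,c", "b,d", "e,f"),
                  stringsAsFactors = FALSE)

# One character vector of pieces per original row.
s <- strsplit(dat$splitme, ",", fixed = TRUE)

# Row numbers act as the unique ID, so the duplicated values in
# blah do not cause rows to be merged.
res <- dat[rep(seq_len(nrow(dat)), lengths(s)), ]
res$splitted <- unlist(s)
rownames(res) <- NULL
```

As in the data.table version, both original columns are kept and the split values land in a third column.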
+2
Jul 14 '16 at 17:55


