R: Break an unbalanced list in the data.frame column

Question


Suppose you have a data frame with the following structure:

df <- data.frame(a=c(1,2,3,4), b=c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"))

where column b is a semicolon-delimited list (the number of items varies by row). The ideal data.frame would be:

    id,job,jobNum
    1,job1,1
    1,job2,2
    ...
    3,job6,3
    4,job9,1
    4,job10,2
    4,job11,3

I have a partial solution that takes almost 2 hours (on 170 thousand rows):

    # Split the column by the semicolon. Results in a list.
    df$allJobs <- strsplit(df$b, ";", fixed=TRUE)

    # Function to reshape column that is a list as a data.frame
    simpleStack <- function(data){
      start <- as.data.frame.list(data)
      names(start) <- c("id", "job")
      return(start)
    }

    # plyr!
    system.time(df2 <- ddply(df, .(id), simpleStack))

This looks like a scaling issue, because if I run

 system.time(df2 <- ddply(df[1:4000,c("id", "allJobs")], .(id), simpleStack))

it takes only 9 seconds. Converting each row to its own data.frame first with sapply (using a different function) is fast, but the "rbind" needed to combine them takes even longer.

+4

r dataframe plyr

Mike Jan 18 '11 at 2:00 p.m.


2 answers

cSplit from my splitstackshape package is designed to handle this kind of data.

Here it is in action on this problem:

    df <- data.frame(a=c(1,2,3,4), b=c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"))

    # install.packages("splitstackshape")
    library(splitstackshape)
    cSplit(df, "b", ";", "long", makeEqual = FALSE)
    #    a b_new
    # 1: 1  job1
    # 2: 1  job2
    # 3: 2 job1a
    # 4: 3  job4
    # 5: 3  job5
    # 6: 3  job6
    # 7: 4  job9
    # 8: 4 job10
    # 9: 4 job11

You can also use strsplit within "dplyr" and then follow up with unnest from "tidyr", for example:

    library(dplyr)
    library(tidyr)

    df %>%
      mutate(b = strsplit(as.character(b), ";", fixed = TRUE)) %>%
      unnest(b)
    #   a     b
    # 1 1  job1
    # 2 1  job2
    # 3 2 job1a
    # 4 3  job4
    # 5 3  job5
    # 6 3  job6
    # 7 4  job9
    # 8 4 job10
    # 9 4 job11
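Neither output above includes the jobNum column the question asks for. A minimal sketch of one way to add it to the dplyr/tidyr pipeline, assuming both packages are installed and that each value of a appears in only one input row (otherwise the numbering would continue across rows that share an id). The name df2 is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(a = c(1, 2, 3, 4),
                 b = c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"))

df2 <- df %>%
  # split each string into a list element, one character vector per row
  mutate(b = strsplit(as.character(b), ";", fixed = TRUE)) %>%
  # expand the list column so that each job gets its own row
  unnest(b) %>%
  # number the jobs within each id (assumes ids are unique per input row)
  group_by(a) %>%
  mutate(jobNum = row_number()) %>%
  ungroup()

print(df2)
```

The group_by/row_number step is the part that differs from the answer's pipeline; the split and unnest steps are unchanged.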

+6

A5C1D2H2I1M1N2O1R2T1 Aug 26 '13 at 8:48


Richie Cotton · Accepted Answer · 2011-01-18T14:09:22+0000

    # Split by ; as before
    allJobs <- strsplit(df$b, ";", fixed=TRUE)

    # Replicate a by the number of jobs in each case
    n <- sapply(allJobs, length)
    id <- rep(df$a, times = n)

    # Turn allJobs into a vector
    job <- unlist(allJobs)

    # Retrieve position of each job
    jobNum <- unlist(lapply(n, seq_len))

    # Combine into a data frame
    df2 <- data.frame(id = id, job = job, jobNum = jobNum)
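To check this approach end to end, here is a small sketch that wraps the same vectorised steps in a function and runs it on the question's sample data. The function name split_jobs and its arguments are hypothetical, not part of the original answer. The as.character() call is an added assumption to cover the case where b is a factor (the data.frame default before R 4.0), and lengths() and sequence() stand in for the sapply and lapply calls.

```r
# Hypothetical helper wrapping the steps of the answer; names are illustrative.
split_jobs <- function(ids, jobs, sep = ";") {
  # as.character() guards against factor columns (an added assumption)
  parts <- strsplit(as.character(jobs), sep, fixed = TRUE)
  n <- lengths(parts)                 # number of jobs in each row
  data.frame(
    id = rep(ids, times = n),         # repeat each id once per job
    job = unlist(parts),              # flatten the list into one vector
    jobNum = sequence(n),             # position of each job within its row
    stringsAsFactors = FALSE
  )
}

df <- data.frame(a = c(1, 2, 3, 4),
                 b = c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"))

df2 <- split_jobs(df$a, df$b)
print(df2)
```

Because every step works on whole vectors, no per-row data.frame is built and no rbind is needed, which is the part the question reports as slow.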
