Suppose you have a data frame with the following structure:
df <- data.frame(id=c(1,2,3,4), b=c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"), stringsAsFactors=FALSE)
where column b is a semicolon-delimited list with a different number of items in each row. The ideal data.frame would be:
id,job,jobNum
1,job1,1
1,job2,2
...
3,job6,3
4,job9,1
4,job10,2
4,job11,3
I have a partial solution that takes almost 2 hours on the full data (about 170 thousand rows):
# Split the column by the semicolon. Results in a list.
df$allJobs <- strsplit(df$b, ";", fixed=TRUE)
This looks like a scaling issue, because if I run
system.time(df2 <- ddply(df[1:4000,c("id", "allJobs")], .(id), simpleStack))
it takes only 9 seconds. Converting each row to its own data.frame with sapply (using another function) is fast, but the "rbind" needed to combine them afterwards takes even longer.
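
For comparison, here is a minimal base R sketch that skips the per-id stacking and the final rbind altogether: it repeats each id once per split item and flattens the list in a single step. It assumes the key column is named id (as in the desired output) and that b is a character column rather than a factor. This is only one possible way to build the long format, not a drop-in replacement for simpleStack, and I have not timed it on the 170 thousand row data.

```r
# Sample data; assumes the key column is called id and b is character
df <- data.frame(id = c(1, 2, 3, 4),
                 b = c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"),
                 stringsAsFactors = FALSE)

# Split every row at once; the result is a list of character vectors
allJobs <- strsplit(df$b, ";", fixed = TRUE)

# Number of jobs per row (lengths() needs R >= 3.2.0;
# sapply(allJobs, length) works on older versions)
n <- lengths(allJobs)

# Build the long data frame in one step, with no per-row rbind
df2 <- data.frame(id = rep(df$id, times = n),  # repeat each id once per job
                  job = unlist(allJobs),        # flatten the list in row order
                  jobNum = sequence(n),         # 1..n[i] within each id
                  stringsAsFactors = FALSE)

print(df2)
```

Because rep, unlist and sequence each run once over the whole column, the work should grow roughly linearly with the number of rows instead of repeatedly copying a growing data frame.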