Join data.frame columns based on name type

Question

Let's say I have the following data.frame, which associates each R package name with the CRAN task view it belongs to:

    dictionary <- data.frame(
        task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
        package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"))
    #                    task.view         package
    #   High.Performance.Computing            Rcpp
    #   High.Performance.Computing HadoopStreaming
    #   High.Performance.Computing           rJava
    #             Machine.Learning           e1071
    #             Machine.Learning            nnet
    #             Machine.Learning           RWeka

Then I count the number of times each package is referenced by each of four packages written by students:

    package.referals <- data.frame(
        Rcpp = c(1, 0, 1, 1),
        HadoopStreaming = c(1, 0, 0, 0),
        rJava = c(1, 0, 0, 1),
        e1071 = c(1, 1, 1, 1),
        nnet = c(1, 0, 0, 0),
        RWeka = c(1, 0, 0, 1),
        row.names = paste("student pkg", 1:4))
    #               Rcpp HadoopStreaming rJava e1071 nnet RWeka
    # student pkg 1    1               1     1     1    1     1
    # student pkg 2    0               0     0     1    0     0
    # student pkg 3    1               0     0     1    0     0
    # student pkg 4    1               0     1     1    0     1

How can I collapse the columns of the package.referals data.frame above so that there is one column per task view, using the package-to-task-view relationships in dictionary?

For example, I would like the result to be:

    data.frame(
        High.Performance.Computing = c(3, 0, 1, 2),
        Machine.Learning = c(3, 1, 1, 2),
        row.names = paste("student pkg", 1:4))
    #               High.Performance.Computing Machine.Learning
    # student pkg 1                          3                3
    # student pkg 2                          0                1
    # student pkg 3                          1                1
    # student pkg 4                          2                2

I tried the following, but I got stuck when trying to reshape it into the output I would like (summing by task view and transposing):

    library(data.table)

    # column names of package.referals data.frame
    package.referals.colnames <- names(package.referals)

    # a data.table of my task view and package relations, keyed by package name
    dictionary.dt <- data.table(dictionary, key = "package")

    # a data.table of my package.referals data.frame, transposed, and keyed by package name
    package.referals.dt <- data.table(package = package.referals.colnames,
                                      t(package.referals), key = "package")

    # Joining data.tables so that the package name and corresponding task view are on the same line
    dt <- package.referals.dt[J(dictionary.dt)]
    setkey(dt, "task.view")
    #            package student pkg 1 student pkg 2 student pkg 3 student pkg 4                  task.view
    # 1: HadoopStreaming             1             0             0             0 High.Performance.Computing
    # 2:            Rcpp             1             0             1             1 High.Performance.Computing
    # 3:           rJava             1             0             0             1 High.Performance.Computing
    # 4:           e1071             1             1             1             1           Machine.Learning
    # 5:            nnet             1             0             0             0           Machine.Learning
    # 6:           RWeka             1             0             0             1           Machine.Learning

r dataframe data.table

Tony breyal Oct 11 '13 at 14:46

3 answers

You can rename the columns of package.referals to their task views by matching them against dictionary, and then apply rowSums to the columns that share a name:

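    # Replace each package name with the task view it belongs to
    names(package.referals) <- dictionary$task.view[match(names(package.referals), dictionary$package)]

    # Sum the columns sharing a task view; drop = FALSE keeps a data.frame
    # even when a task view has only one package, so rowSums() still works
    sapply(unique(names(package.referals)), function(x)
        rowSums(package.referals[, names(package.referals) %in% x, drop = FALSE]))
    #              High.Performance.Computing Machine.Learning
    #student pkg 1                          3                3
    #student pkg 2                          0                1
    #student pkg 3                          1                1
    #student pkg 4                          2                2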
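As a variation on the same idea, here is a minimal base R sketch that looks up the task views without overwriting the column names and lets rowsum() do the grouped summation. The `views` and `result` names are only illustrative, and the data is copied from the question; it has not been tested beyond this example.

```r
# Example data from the question.
dictionary <- data.frame(
  task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
  package   = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"),
  stringsAsFactors = FALSE
)
package.referals <- data.frame(
  Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1),
  e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1),
  row.names = paste("student pkg", 1:4)
)

# Task view of each column, in column order (illustrative helper vector).
views <- dictionary$task.view[match(names(package.referals), dictionary$package)]

# rowsum() adds up matrix rows within each group, so transpose first
# (packages become rows), sum by task view, then transpose back.
result <- t(rowsum(t(as.matrix(package.referals)), group = views))
result
```

Because rowsum() works on a matrix, this also copes with a task view that contains a single package, and package.referals itself is left unchanged.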

Simon O'Hanlon Oct 11 '13 at 15:08

You can also put all the information into a single data.frame and then aggregate:

    dictionary <- data.frame(
        task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
        package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"))

    package.referals <- data.frame(
        Rcpp = c(1, 0, 1, 1),
        HadoopStreaming = c(1, 0, 0, 0),
        rJava = c(1, 0, 0, 1),
        e1071 = c(1, 1, 1, 1),
        nnet = c(1, 0, 0, 0),
        RWeka = c(1, 0, 0, 1),
        row.names = paste("student pkg", 1:4))

    # transpose for easier manipulation: one row per package
    pack.ref <- as.data.frame(t(package.referals))

    # add a column with the task view of each package; match() is used instead of
    # grep() so that package names are compared exactly rather than as patterns
    pack.ref$task.view <- as.character(
        dictionary$task.view[match(colnames(package.referals), dictionary$package)])

    # sum the four student columns within each task view, then transpose back
    DF <- as.data.frame(t(aggregate(pack.ref[, 1:4], by = list(pack.ref$task.view), sum)))
    DF
    #                                      V1               V2
    #Group.1       High.Performance.Computing Machine.Learning
    #student pkg 1                          3                3
    #student pkg 2                          0                1
    #student pkg 3                          1                1
    #student pkg 4                          2                2
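One thing to watch in the lookup step: a package name passed to grep() is treated as a pattern, so it can hit more than one dictionary entry, while match() compares whole strings. The short sketch below shows the difference; the `pkgs` vector is hypothetical and not part of the question's data.

```r
# Hypothetical package list in which one name is a prefix of another.
pkgs <- c("Rcpp", "RcppArmadillo", "rJava")

# grep() treats "Rcpp" as a regular expression and finds every name containing it.
grep_hits <- grep("Rcpp", pkgs)

# match() compares whole strings and returns the first exact match only.
exact_hit <- match("Rcpp", pkgs)

grep_hits
exact_hit
```

With the six packages in the question both calls happen to give the same answer, so this only matters once the dictionary grows.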

alexis_laz Oct 11 '13 at 15:23

juba · Accepted Answer · 2013-10-11T15:10:51+0000

Here is a solution with reshape2 and base R:

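    library(reshape2)

    # keep the row names as a regular column so they survive melting
    package.referals$id <- rownames(package.referals)
    # long format: one row per (id, package) pair
    pkgr <- melt(package.referals, id.vars = "id", variable.name = "package")
    # keep only the pairs where the package is actually referenced
    pkgr <- pkgr[pkgr$value > 0, ]
    # attach the task view of each package
    df <- merge(pkgr, dictionary, all.x = TRUE)
    # count the remaining rows per student package and task view
    table(df$id, df$task.view)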

If you really want to use data.table instead of merge, you can replace the last two lines (the merge and table calls) as follows:

    library(data.table)

    pkgr <- data.table(pkgr, key = "package")
    dictionary <- data.table(dictionary, key = "package")
    # keyed join on package
    df <- pkgr[dictionary]
    table(df$id, df$task.view)
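Note that table() counts rows, which equals the sum here only because every value in the question's data is 0 or 1. If a package could be referenced more than once, summing the values is safer. A minimal base R sketch of that variant, assuming the same data as the question and using xtabs() to sum a `value` column (the `long`, `df` and `result` names are illustrative only):

```r
# Example data from the question.
dictionary <- data.frame(
  task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
  package   = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"),
  stringsAsFactors = FALSE
)
package.referals <- data.frame(
  Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1),
  e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1),
  row.names = paste("student pkg", 1:4)
)

# Long format in base R: one row per (id, package) pair with its value.
long <- as.data.frame(as.table(as.matrix(package.referals)), responseName = "value")
names(long)[1:2] <- c("id", "package")

# Attach the task view of each package.
df <- merge(long, dictionary, by = "package", all.x = TRUE)

# xtabs() with a left-hand side sums that column within each cell,
# so zero rows do not need to be filtered out first.
result <- xtabs(value ~ id + task.view, data = df)
result
```

This keeps a student package in the output even when all of its counts for a task view are zero.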
