Combining frequency tables into a single data frame

I have a list in which each element is a frequency table of words, obtained by calling table() on a different text sample. Each table therefore has a different length. I now want to convert the list into a single data frame in which each column is a word and each row is a text sample. Here is a dummy example of my data:

    t1 <- table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))
    t2 <- table(strsplit(tolower("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"), "\\W"))
    t3 <- table(strsplit(tolower("Ask not what your country can do for you - ask what you can do for your country"), "\\W"))
    myList <- list(t1, t2, t3)

This gives the following structure:

    > class(myList[[3]])
    [1] "table"
    > myList[[3]]

                ask     can country      do     for     not    what     you    your
          2       2       2       2       2       2       1       2       2       2
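The printed table has ten counts but only nine visible words: the first column belongs to the empty string, which arises because splitting on "\\W" yields empty tokens wherever two non-word characters are adjacent (here " - "). The following is a minimal sketch, assuming empty tokens are not wanted; the helper name count_words is hypothetical and not part of the question.

```r
# Hypothetical helper: split on runs of non-word characters and drop empty tokens
count_words <- function(text) {
  tokens <- strsplit(tolower(text), "\\W+")[[1]]
  table(tokens[tokens != ""])
}

sentence <- "Ask not what your country can do for you - ask what you can do for your country"

t3_raw   <- table(strsplit(tolower(sentence), "\\W"))  # as in the question
t3_clean <- count_words(sentence)

names(t3_raw)[1]          # the empty-string token created by " - "
"" %in% names(t3_clean)   # no empty token after cleaning
```

Whether to drop these tokens before or after combining the tables is a design choice; dropping them early keeps an unnamed column out of the final data frame.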

Now I need to convert this list (myList) into a single data frame. I thought I could do it with plyr, following what is done here (http://ryouready.wordpress.com/2009/01/23/r-combining-vectors-or-data-frames-of-unequal-length-to-one-data-frame/), e.g.

    library(plyr)
    l <- myList
    do.call(rbind.fill, l)

But it seems that my "table" objects do not play well with this. I tried converting them to data frames as well as to vectors, but neither worked.

3 answers
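    # one data frame per table, tagged with the index of its sample
    freqs.list <- mapply(data.frame, Words = seq_along(myList), myList,
                         SIMPLIFY = FALSE, MoreArgs = list(stringsAsFactors = FALSE))
    # stack into long format, then spread to one Freq column per sample
    freqs.df <- do.call(rbind, freqs.list)
    res <- reshape(freqs.df, timevar = "Words", idvar = "Var1", direction = "wide")
    head(res)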

1. zoo. The zoo package has a multi-way merge function that can do this compactly. lapply converts each component of myList into a zoo object, and then we simply merge them all:

    # optionally add nice names to the list
    names(myList) <- paste("t", seq_along(myList), sep = "")

    library(zoo)
    fz <- function(x) with(as.data.frame(x, stringsAsFactors = FALSE), zoo(Freq, Var1))
    out <- do.call(merge, lapply(myList, fz))

The above returns a multivariate zoo series in which the "times" are "a", "ago", etc. If a data frame is wanted instead, it is just a matter of as.data.frame(out).

2. Reduce. Here is a second solution. It uses Reduce from base R.

    merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
    out <- Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))

    # optionally add nice names
    colnames(out)[-1] <- paste("t", seq_along(myList), sep = "")
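The Reduce result has one row per word and NA where a word is missing from a sample, whereas the question asks for one row per sample. A possible post-processing step is sketched below in base R; the toy myList and the name wide are assumptions made for illustration, not part of the answer.

```r
# Toy stand-in for myList: three small frequency tables (illustrative data)
myList <- list(table(c("a", "b", "a")),
               table(c("b", "c")),
               table(c("a", "c", "c", "d")))

merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
out <- Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))
colnames(out)[-1] <- paste("t", seq_along(myList), sep = "")

# Words absent from a sample come back as NA; treat them as zero counts
counts <- as.matrix(out[-1])
counts[is.na(counts)] <- 0

# Transpose so that rows are samples and columns are words
wide <- as.data.frame(t(counts))
colnames(wide) <- out[[1]]
wide
```

Replacing NA with 0 assumes that a missing word simply did not occur in that sample, which holds for frequency tables like these.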

3. xtabs. This adds names to the list, then extracts the frequencies, names and groups as three long vectors and puts them back together using xtabs:

    names(myList) <- paste("t", seq_along(myList))
    xtabs(Freq ~ Names + Group, data.frame(
      Freq  = unlist(lapply(myList, unname)),
      Names = unlist(lapply(myList, names)),
      Group = rep(names(myList), sapply(myList, length))
    ))
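xtabs returns a contingency table, not a data frame. One way to get the layout the question asks for is sketched here, assuming toy data in place of myList: the formula is written as Group + Names (the reverse of the order above) so that samples become rows, and as.data.frame.matrix keeps the wide shape. The names long, tab and wide are illustrative only.

```r
# Toy stand-in for myList, already named (illustrative data)
myList <- list(t1 = table(c("a", "b", "a")),
               t2 = table(c("b", "c")),
               t3 = table(c("a", "c", "c", "d")))

# Long format: one row per (sample, word) pair
long <- data.frame(
  Freq  = unlist(lapply(myList, as.vector), use.names = FALSE),
  Names = unlist(lapply(myList, names), use.names = FALSE),
  Group = rep(names(myList), lengths(myList))
)

# Samples as rows, words as columns; absent words get a count of 0
tab  <- xtabs(Freq ~ Group + Names, long)
wide <- as.data.frame.matrix(tab)
wide
```

Calling plain as.data.frame on the table would instead return the long format again, which is why the matrix method is used here.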

Benchmark

Comparing the solutions with the rbenchmark package gives the following, which indicates that the zoo solution is the fastest on the sample data and arguably the simplest.

    > t1 <- table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))
    > t2 <- table(strsplit(tolower("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"), "\\W"))
    > t3 <- table(strsplit(tolower("Ask not what your country can do for you - ask what you can do for your country"), "\\W"))
    > myList <- list(t1, t2, t3)
    >
    > library(rbenchmark)
    > library(zoo)
    > names(myList) <- paste("t", seq_along(myList), sep = "")
    >
    > benchmark(xtabs = {
    +   names(myList) <- paste("t", seq_along(myList))
    +   xtabs(Freq ~ Names + Group, data.frame(
    +     Freq = unlist(lapply(myList, unname)),
    +     Names = unlist(lapply(myList, names)),
    +     Group = rep(names(myList), sapply(myList, length))
    +   ))
    + },
    + zoo = {
    +   fz <- function(x) with(as.data.frame(x, stringsAsFactors=FALSE), zoo(Freq, Var1))
    +   do.call(merge, lapply(myList, fz))
    + },
    + Reduce = {
    +   merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
    +   Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))
    + },
    + reshape = {
    +   freqs.list <- mapply(data.frame,Words=seq_along(myList),myList,SIMPLIFY=FALSE,MoreArgs=list(stringsAsFactors=FALSE))
    +   freqs.df <- do.call(rbind,freqs.list)
    +   reshape(freqs.df,timevar="Words",idvar="Var1",direction="wide")
    + }, replications = 10, order = "relative", columns = c("test", "replications", "relative"))
         test replications relative
    2     zoo           10 1.000000
    4 reshape           10 1.090909
    1   xtabs           10 1.272727
    3  Reduce           10 1.272727

ADDED: second solution.

ADDED: third solution.

ADDED: reference.


Here is an inelegant way that does the job. I am sure there is a one-liner, but I do not know what it is:

    myList <- list(t1 = t1, t2 = t2, t3 = t3)
    myList <- lapply(myList, as.data.frame, stringsAsFactors = FALSE)
    Words <- unique(unlist(lapply(myList, function(x) x[, 1])))
    DFmerge <- data.frame(Words = Words)
    for (i in 1:3) {
      DFmerge <- merge(DFmerge, myList[[i]], by.x = "Words", by.y = "Var1", all.x = TRUE)
    }
    colnames(DFmerge) <- c("Words", "t1", "t2", "t3")

After looking around a bit, here is another way, which gives a result closer to the one in the linked blog post: [Edit: works now]

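    library(plyr)  # provides rbind.fill
    myList <- list(t1 = t1, t2 = t2, t3 = t3)
    myList <- lapply(myList, function(x) {
      # one-row data frame: one column per word
      A <- as.data.frame(matrix(unlist(x), nrow = 1))
      colnames(A) <- names(x)
      A[, colnames(A) != ""]  # drop the empty-string token
    })
    do.call(rbind.fill, myList)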

It is also ugly, so hopefully a better answer will come along.
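For comparison, a more compact base R variant is sketched below. It is not taken from any of the answers: it collects the full vocabulary and then looks up each sample's counts by position with match. The toy myList and the names words, mat and res are assumptions for illustration.

```r
# Toy stand-in for myList (illustrative data)
myList <- list(t1 = table(c("a", "b", "a")),
               t2 = table(c("b", "c")),
               t3 = table(c("a", "c", "c", "d")))

# Full vocabulary across all samples
words <- sort(unique(unlist(lapply(myList, names))))

# One row per sample: pick each word's count by position, NA where absent
mat <- t(sapply(myList, function(x) as.vector(x)[match(words, names(x))]))
mat[is.na(mat)] <- 0
colnames(mat) <- words

res <- as.data.frame(mat)
res
```

match is used instead of indexing by name because name-based indexing never matches an empty-string name, which matters if empty tokens are still present in the tables.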


Source: https://habr.com/ru/post/908263/

