(apologies, I was not sure what the best title for this post would be, feel free to edit).
Suppose I have the following relational structure between words and their type (for example, a dictionary):
    dictionary <- data.frame(level1 = c(rep("Positive", 3), rep("Negative", 3)),
                             level2 = c("happy", "fantastic", "great", "sad", "rubbish", "awful"))
and we calculated their occurrences in seven documents (i.e., the term-document matrix):
    set.seed(42)
    range <- 0:3
    df <- data.frame(row.names = c("happy", "fantastic", "great", "sad", "rubbish", "awful"),
                     doc1 = sample(x = range, size = 6, replace = TRUE),
                     doc2 = sample(x = range, size = 6, replace = TRUE),
                     doc3 = sample(x = range, size = 6, replace = TRUE),
                     doc4 = sample(x = range, size = 6, replace = TRUE),
                     doc5 = sample(x = range, size = 6, replace = TRUE),
                     doc6 = sample(x = range, size = 6, replace = TRUE),
                     doc7 = sample(x = range, size = 6, replace = TRUE))
Then I can easily calculate how often two words appear in the same document (i.e., a match or adjacency matrix):
    # binary to indicate a co-occurrence
    df[df > 0] <- 1
    # sum co-occurrences
    m <- as.matrix(df) %*% t(as.matrix(df))
    #           happy fantastic great sad rubbish awful
    # happy         5         4     5   4       4     4
    # fantastic     4         5     5   4       4     4
    # great         5         5     7   6       6     6
    # sad           4         4     6   6       5     5
    # rubbish       4         4     6   5       6     5
    # awful         4         4     6   5       5     6
Question: How can I restructure my match matrix so that I consider the type of word (level1) in the dictionary, and not just the words themselves (level2)?
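That is, I would like to obtain the following, where each cell is the sum of the corresponding block of the word-level matrix:

    data.frame(row.names = c("Positive", "Negative"),
               Positive = c(5+4+5+4+5+5+5+5+7, 4+4+6+4+4+6+4+4+6),
               Negative = c(4+4+4+4+4+4+6+6+6, 6+5+5+5+6+5+5+5+6))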
What I have done so far: I had hoped to adapt the approach from this question: "Combine columns based on data.frame by type of name".
However, although I can collapse the rows:
    library(data.table)
    dt <- data.table(m)
    dt[, level1 := c(rep("Positive", 3), rep("Negative", 3))]
    dt[, lapply(.SD, sum), by = "level1"]
I cannot figure out how to collapse the columns in the same way.
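For what it is worth, one direction that might work is to apply base R's `rowsum()` twice: once to collapse the rows by type, then once more on the transpose for the columns. This is only a sketch under assumptions. The matrix `m` is typed in by hand from the output shown above so that the example does not depend on the random seed, and the names `groups`, `by_row`, and `by_type` are made up for illustration.

```r
# Dictionary from the question (stringsAsFactors = FALSE keeps plain character columns)
dictionary <- data.frame(level1 = c(rep("Positive", 3), rep("Negative", 3)),
                         level2 = c("happy", "fantastic", "great", "sad", "rubbish", "awful"),
                         stringsAsFactors = FALSE)

# Co-occurrence matrix copied by hand from the output printed in the question
words <- dictionary$level2
m <- matrix(c(5, 4, 5, 4, 4, 4,
              4, 5, 5, 4, 4, 4,
              5, 5, 7, 6, 6, 6,
              4, 4, 6, 6, 5, 5,
              4, 4, 6, 5, 6, 5,
              4, 4, 6, 5, 5, 6),
            nrow = 6, byrow = TRUE, dimnames = list(words, words))

# Look up the type of each word via the dictionary (hypothetical helper variable)
groups <- dictionary$level1[match(rownames(m), dictionary$level2)]

# rowsum() adds up rows that share a group label;
# reorder = FALSE keeps the groups in order of first appearance (Positive, then Negative)
by_row <- rowsum(m, group = groups, reorder = FALSE)

# Transpose, collapse again, and transpose back to collapse the columns
by_type <- t(rowsum(t(by_row), group = groups, reorder = FALSE))

by_type
#          Positive Negative
# Positive       45       42
# Negative       42       48
```

The same result should also follow from a 2 x 6 indicator matrix `G` via `G %*% m %*% t(G)`, which may be preferable if the grouping has to be applied to many matrices.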