Building a list of word co-occurrence edges in R

I have a set of sentences, and I want to build a list of undirected word co-occurrence edges and see the frequency of each edge. I looked at the tm package but did not find such a function. Is there a package or script I can use? Many thanks!

Note: a word does not co-occur with itself. A word that appears two or more times in a sentence is counted as co-occurring with each other word only once in that sentence.
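To make the note concrete, here is a minimal base R sketch for a single sentence, `a b b e` (the variable names are illustrative only). Duplicated words are dropped with `unique` before pairing, so the repeated `b` adds no extra edges and never pairs with itself.

```r
# One sentence, split into words on single spaces.
words <- strsplit("a b b e", " ")[[1]]

# unique() removes the repeated "b", so there is no self-pair and no
# double counting. combn() lists every unordered pair; each column is one edge.
pairs <- combn(unique(words), 2)

# One row per undirected edge contributed by this sentence.
edges <- data.frame(word1 = pairs[1, ], word2 = pairs[2, ],
                    stringsAsFactors = FALSE)
edges
```

Counting how often each such row occurs across all sentences gives the edge frequencies asked for.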

DF:

    sentence_id text
    1           a b c d e
    2           a b b e
    3           b c d
    4           a e
    5           a
    6           a a a

OUTPUT

    word1 word2 freq
    a     b     2
    a     c     1
    a     d     1
    a     e     3
    b     c     2
    b     d     2
    b     e     2
    c     d     2
    c     e     1
    d     e     1
+5
3 answers

This is rather convoluted, so there is probably a better approach:

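    dat <- read.csv(text="sentence_id, text
    1, a b c d e
    2, a b b e
    3, b c d
    4, a e", header=TRUE)

    library(qdap)       # assumed source of bag_o_words(); not loaded in the original
    library(qdapTools)  # mtabulate(), matrix2df()
    library(tidyr)      # gather() and the %>% pipe

    # Word-by-sentence incidence matrix (TRUE if the word occurs in the sentence)
    x <- t(mtabulate(with(dat, by(text, sentence_id, bag_o_words))) > 0)

    # Word-by-word co-occurrence counts; keep only the lower triangle
    out <- x %*% t(x)
    out[upper.tri(out, diag=TRUE)] <- NA

    # Reshape to one row per edge
    out2 <- matrix2df(out, "word1") %>%
        gather(word2, freq, -word1) %>%
        na.omit()

    rownames(out2) <- NULL
    out2

    ##    word1 word2 freq
    ## 1      b     a    2
    ## 2      c     a    1
    ## 3      d     a    1
    ## 4      e     a    3
    ## 5      c     b    2
    ## 6      d     b    2
    ## 7      e     b    2
    ## 8      d     c    2
    ## 9      e     c    1
    ## 10     e     d    1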

Base R only solution

    # Split each sentence into a trimmed vector of words
    out <- lapply(with(dat, split(text, sentence_id)), function(x) {
        strsplit(gsub("^\\s+|\\s+$", "", as.character(x)), "\\s+")[[1]]
    })

    nms <- sort(unique(unlist(out)))

    # Per-sentence word counts
    out2 <- lapply(out, function(x) {
        as.data.frame(table(x), stringsAsFactors = FALSE)
    })

    # Merge the counts into one word-by-sentence table
    dat2 <- data.frame(x = nms)

    for(i in seq_along(out2)) {
        m <- merge(dat2, out2[[i]], all.x = TRUE)
        names(m)[i + 1] <- dat[["sentence_id"]][i]
        dat2 <- m
    }

    dat2[is.na(dat2)] <- 0
    x <- as.matrix(dat2[, -1]) > 0

    # Word-by-word co-occurrence counts; keep only the lower triangle
    out3 <- x %*% t(x)
    out3[upper.tri(out3, diag=TRUE)] <- NA
    dimnames(out3) <- list(dat2[[1]], dat2[[1]])

    # Reshape to one row per edge
    out4 <- na.omit(data.frame(
        word1 = rep(rownames(out3), ncol(out3)),
        word2 = rep(colnames(out3), each = nrow(out3)),
        freq = c(unlist(out3)),
        stringsAsFactors = FALSE)
    )

    row.names(out4) <- NULL

    out4
+2

This is very closely related to @TylerRinker's answer, but using different tools.

    library(splitstackshape)
    library(reshape2)

    # `d` is the sample data frame with columns sentence_id and text
    temp <- crossprod(
      as.matrix(
        cSplit_e(d, "text", " ", type = "character",
                 fill = 0, drop = TRUE)[-1]))
    temp[upper.tri(temp, diag = TRUE)] <- NA
    melt(temp, na.rm = TRUE)
    #      Var1   Var2 value
    # 2  text_b text_a     2
    # 3  text_c text_a     1
    # 4  text_d text_a     1
    # 5  text_e text_a     3
    # 8  text_c text_b     2
    # 9  text_d text_b     2
    # 10 text_e text_b     2
    # 14 text_d text_c     2
    # 15 text_e text_c     1
    # 20 text_e text_d     1

The "text_" parts of "Var1" and "Var2" can be easily removed using sub or gsub.
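As a minimal sketch of that clean-up, assume the melted result has been stored in a data frame. The name `edges` is hypothetical, and the first rows are rebuilt by hand here so the snippet runs without the packages above.

```r
# Hypothetical stand-in for the first rows of the melted result;
# the name `edges` is an assumption, not part of the original answer.
edges <- data.frame(
  Var1  = c("text_b", "text_c", "text_d"),
  Var2  = c("text_a", "text_a", "text_a"),
  value = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Drop the leading "text_" prefix from both label columns.
# as.character() guards against factor columns, which melt() may return.
edges$Var1 <- sub("^text_", "", as.character(edges$Var1))
edges$Var2 <- sub("^text_", "", as.character(edges$Var2))

edges
```

`sub` is enough here because the prefix occurs only once per label, at the start; `gsub` would behave the same with this anchored pattern.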

+1

Here's a base R way:

    d <- read.table(text='sentence_id text
    1           "a b c d e"
    2           "a b b e"
    3           "b c d"
    4           "a e"', header=TRUE, as.is=TRUE)

    # For each sentence, pair up its unique words and tabulate the pairs
    result.vec <- table(unlist(lapply(d$text, function(text) {
        pairs <- combn(unique(scan(text=text, what='', sep=' ')), m=2)
        interaction(pairs[1,], pairs[2,])
    })))
    # a.b b.b c.b d.b a.c b.c c.c d.c a.d b.d c.d d.d a.e b.e c.e d.e
    #   2   0   0   0   1   2   0   0   1   2   2   0   3   2   1   1

    # Split the "word1.word2" names into columns and drop unused combinations
    result <- subset(data.frame(do.call(rbind, strsplit(names(result.vec), '\\.')),
                                freq=as.vector(result.vec)), freq > 0)
    with(result, result[order(X1, X2),])
    #    X1 X2 freq
    # 1   a  b    2
    # 5   a  c    1
    # 9   a  d    1
    # 13  a  e    3
    # 6   b  c    2
    # 10  b  d    2
    # 14  b  e    2
    # 11  c  d    2
    # 15  c  e    1
    # 16  d  e    1
0

Source: https://habr.com/ru/post/1207750/

