tm package: findAssocs vs. cosine

I am new here, and my question is mathematical rather than programmatic: I would like a second opinion on whether my approach makes sense.

I was trying to find associations between the words in my corpus using the function findAssocs from the tm package. It seems to work reasonably well on data available through packages, such as the New York Times and US Congress data sets, but I was disappointed with its performance on my own, messier data set. It seems prone to distortion by a rare document that contains several repetitions of the same words, which appears to create a strong association between them. I found that the cosine measure gives a better picture of how the terms are related, although in the literature it is usually used to measure the similarity of documents, not terms. Let us use the USCongress data from the RTextTools package to demonstrate what I mean:

First, I'll set everything up...

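library(tm)          # Corpus, TermDocumentMatrix, findAssocs, removeSparseTerms
library(RTextTools)  # USCongress data
library(data.table)  # %like% (presumably the operator used below)
library(stringr)     # str_count
library(lsa)         # cosine (presumably the function used below)

data(USCongress)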

text = as.character(USCongress$text)

corp = Corpus(VectorSource(text)) 

parameters = list(minDocFreq        = 1, 
                  wordLengths       = c(2,Inf), 
                  tolower           = TRUE, 
                  stripWhitespace   = TRUE, 
                  removeNumbers     = TRUE, 
                  removePunctuation = TRUE, 
                  stemming          = TRUE, 
                  stopwords         = TRUE, 
                  tokenize          = NULL, 
                  weighting         = function(x) weightSMART(x,spec="ltn"))

tdm = TermDocumentMatrix(corp,control=parameters)
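
To make the weighting in the control list more concrete, here is a minimal base-R sketch of what the SMART spec "ltn" is usually taken to mean: logarithmic term frequency, multiplied by inverse document frequency, with no length normalisation. The helper `ltn_weight`, the toy matrix, and the choice of log base 2 are assumptions for illustration, not a copy of what tm does internally.

```r
# Toy term-document count matrix (hypothetical data): rows are terms, columns are documents
counts <- matrix(c(1, 0, 50,
                   0, 2, 50,
                   1, 1,  1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("foreign", "govern", "bill"),
                                 paste0("doc", 1:3)))

# Hypothetical helper: "l" (log tf) x "t" (idf) x "n" (no normalisation)
ltn_weight <- function(m, base = 2) {
  n_docs   <- ncol(m)
  doc_freq <- rowSums(m > 0)                            # documents containing each term
  tf  <- ifelse(m > 0, 1 + log(m, base = base), 0)      # dampened term frequency
  idf <- log(n_docs / doc_freq, base = base)            # one idf value per term (row)
  tf * idf                                              # idf is recycled down each column
}

weighted <- ltn_weight(counts)
print(round(weighted, 2))
```

Under these assumptions a term that occurs in every document ("bill") gets weight 0, and 50 repetitions in one document count for about 6.6 times a single occurrence rather than 50 times.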

Now take two words, "government" and "foreign":

# "government": appears in 37 documents, 43 times in total
length(which(text %like% " government"))
sum(str_count(text,"government"))

# "foreign": appears in 49 documents, 56 times in total
length(which(text %like% "foreign"))
sum(str_count(text,"foreign"))

# the two words occur together in 3 documents
length(which(text[which(text %like% "government")] %like% "foreign"))

# looking for "foreign" and "government"
head(as.data.frame(findAssocs(tdm,"foreign",0.1)),n=10)

             findAssocs(tdm, "foreign", 0.1)
countri                                 0.34
lookthru                                0.30
tuberculosi                             0.26
carryforward                            0.24
cor                                     0.24
malaria                                 0.23
hivaid                                  0.20
assist                                  0.19
coo                                     0.19
corrupt                                 0.19

# they do not appear to be associated

Now I add one document in which "foreign government" is repeated 50 times:

text[4450] = gsub("(.*)",paste(rep("\\1",50),collapse=" "),"foreign government")
corp = Corpus(VectorSource(text)) 
tdm = TermDocumentMatrix(corp,control=parameters)

#running the association again:
head(as.data.frame(findAssocs(tdm,"foreign",0.1)),n=10)

             findAssocs(tdm, "foreign", 0.1)
govern                                  0.30
countri                                 0.29
lookthru                                0.26
tuberculosi                             0.22
cor                                     0.21
carryforward                            0.20
malaria                                 0.19
hivaid                                  0.17
assist                                  0.16
coo                                     0.16

Now "govern" tops the list with 0.30, and that comes down to a single document.
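
The effect can be reproduced without tm. This is a minimal sketch on made-up vectors, assuming that findAssocs reports Pearson correlations between term vectors (its cut-off argument is a correlation limit); the names `foreign`, `govern`, `before` and `after` are only illustrative.

```r
# Toy term vectors over 10 documents (hypothetical data, not USCongress):
# the two terms never occur in the same document
foreign <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
govern  <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0)

# Pearson correlation before the rogue document is added
before <- cor(foreign, govern)

# Append one document in which both terms are repeated 50 times
after <- cor(c(foreign, 50), c(govern, 50))

print(round(c(before = before, after = after), 3))
```

With raw counts the single extra document moves the correlation from negative to almost 1. In the real example the jump is smaller, presumably because the logarithmic term weighting dampens the 50 repetitions.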

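A different option is the cosine measure. It is normally applied to documents, but it can just as well be computed between two term vectors, that is, between two rows of the Term Document Matrix: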

cosine(as.vector(tdm["government",]),as.vector(tdm["foreign",]))
     [,1]
[1,]    0
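
For reference, the cosine of two term vectors is their dot product divided by the product of their lengths. The sketch below writes that out in base R; `cosine_sim` is a hypothetical helper, and the `cosine` call above presumably comes from the lsa package.

```r
# Hypothetical helper: cosine similarity of two numeric vectors
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Toy term vectors over four documents (made-up weights)
term_a <- c(1, 2, 0, 0)
term_b <- c(0, 0, 3, 1)   # never shares a document with term_a
term_c <- c(2, 4, 0, 0)   # same documents as term_a, twice the weight

no_overlap   <- cosine_sim(term_a, term_b)
proportional <- cosine_sim(term_a, term_c)
print(c(no_overlap = no_overlap, proportional = proportional))
```

For non-negative weights, a cosine of exactly 0 means the two terms share no document at all, and proportional rows give 1 regardless of how large the weights are.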

Now compare the 2 methods on a reduced matrix:

tdm.reduced = removeSparseTerms(tdm,0.98)

# Pairwise cosine similarity between all terms (rows) of a term-document matrix
Proximity = function(tdm){ 
  d = dim(tdm)[1] 
  r = matrix(0,d,d,dimnames=list(rownames(tdm),rownames(tdm))) 
  for(i in seq_len(d)){ 
    for(j in i:d){   # upper triangle only, the matrix is symmetric
      r[i,j] = cosine(as.vector(tdm[i,]),as.vector(tdm[j,])) 
      r[j,i] = r[i,j] 
    } 
  } 
  diag(r) = 0   # ignore the similarity of each term with itself
  return(r) 
}

rmat = Proximity(tdm.reduced)
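
The double loop can be slow for many terms. A possible alternative, shown here only as a sketch on a small made-up matrix, is to scale every row to unit length and take all row dot products at once. `cosine_matrix` and `toy` are hypothetical names; running it on `tdm.reduced` would rely on `as.matrix` working for that object, which I have not checked here.

```r
# Hypothetical helper: all pairwise cosine similarities between the rows of m
cosine_matrix <- function(m) {
  m     <- as.matrix(m)
  norms <- sqrt(rowSums(m^2))     # Euclidean length of each row
  sim   <- tcrossprod(m / norms)  # scale rows to unit length, then all row dot products
  diag(sim) <- 0                  # ignore self-similarity, as Proximity does
  sim
}

# Toy term-document matrix (made-up weights): rows are terms
toy <- matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                0, 0, 2, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("fund", "use", "bill"), NULL))

sim <- cosine_matrix(toy)
print(round(sim, 2))
```

A row that is all zeros would give NaN because of the division by a zero length, so such terms would have to be dropped first.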

# findAssocs method
head(as.data.frame(sort(findAssocs(tdm.reduced,"fund",0),decreasing=T)),n=10)

        sort(findAssocs(tdm.reduced, "fund", 0), decreasing = T)
use                                                         0.11
feder                                                       0.10
insur                                                       0.09
author                                                      0.07
project                                                     0.05
provid                                                      0.05
fiscal                                                      0.04
govern                                                      0.04
secur                                                       0.04
depart                                                      0.03

# cosine method
head(as.data.frame(round(sort(rmat[,"fund"],decreasing=T),2)),n=10)

       round(sort(rmat[, "fund"], decreasing = T), 2)
use                                              0.15
feder                                            0.14
bill                                             0.14
provid                                           0.13
author                                           0.12
insur                                            0.11
state                                            0.10
secur                                            0.09
purpos                                           0.09
amend                                            0.09



One option for working with cosine similarity is the skmeans package, which runs Spherical K-Means on a TDM.



Source: https://habr.com/ru/post/1523745/

