I am new here, and my question is mathematical rather than programmatic: I would like a second opinion on whether my approach makes sense.
I tried to find associations between words using the function findAssocs from the tm package. It seems to work quite well on the data sets that ship with packages, such as the New York Times and US Congress data, but I was disappointed with its results on my own, messier data set: the scores appear to be distorted by a single rare document containing many repetitions of the same words, which creates a seemingly strong association between them. I found that the cosine measure gives a better picture of how the terms are related, although in the literature it is usually used to measure the similarity of documents rather than terms. Let me use the USCongress data from the RTextTools package to demonstrate what I mean.
First, I'll set everything up (besides tm and RTextTools, I use data.table for %like%, stringr for str_count, and lsa for cosine):
library(tm)
library(RTextTools)   # for the USCongress data
library(data.table)   # for %like%
library(stringr)      # for str_count
library(lsa)          # for cosine
data(USCongress)
text = as.character(USCongress$text)
corp = Corpus(VectorSource(text))
parameters = list(minDocFreq = 1,
                  wordLengths = c(2, Inf),
                  tolower = TRUE,
                  stripWhitespace = TRUE,
                  removeNumbers = TRUE,
                  removePunctuation = TRUE,
                  stemming = TRUE,
                  stopwords = TRUE,
                  tokenize = NULL,
                  weighting = function(x) weightSMART(x, spec = "ltn"))
tdm = TermDocumentMatrix(corp,control=parameters)
Next, let's count how often "government" and "foreign" occur, and how often they appear in the same document:
length(which(text %like% " government"))
sum(str_count(text,"government"))
length(which(text %like% "foreign"))
sum(str_count(text,"foreign"))
length(which(text[which(text %like% "government")] %like% "foreign"))
head(as.data.frame(findAssocs(tdm,"foreign",0.1)),n=10)
findAssocs(tdm, "foreign", 0.1)
countri 0.34
lookthru 0.30
tuberculosi 0.26
carryforward 0.24
cor 0.24
malaria 0.23
hivaid 0.20
assist 0.19
coo 0.19
corrupt 0.19
Now let's replace one document with the phrase "foreign government" repeated 50 times and rebuild the matrix:
text[4450] = paste(rep("foreign government", 50), collapse = " ")
corp = Corpus(VectorSource(text))
tdm = TermDocumentMatrix(corp,control=parameters)
head(as.data.frame(findAssocs(tdm,"foreign",0.1)),n=10)
findAssocs(tdm, "foreign", 0.1)
govern 0.30
countri 0.29
lookthru 0.26
tuberculosi 0.22
cor 0.21
carryforward 0.20
malaria 0.19
hivaid 0.17
assist 0.16
coo 0.16
Now "govern" jumps to the top of the association list, even though the two terms co-occur only in that one pasted-in document.
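The effect of a single document with heavy repetition on the correlation score can be reproduced with toy indicator vectors (a minimal base-R sketch, not the real term-document matrix):

```r
# Two terms that never co-occur across 100 documents.
x <- c(rep(1, 5), rep(0, 95))              # 'foreign' in docs 1-5
y <- c(rep(0, 5), rep(1, 5), rep(0, 90))   # 'govern' in docs 6-10
cor(x, y)    # slightly negative: no co-occurrence at all

# Append one document in which both terms are repeated 50 times.
x2 <- c(x, 50)
y2 <- c(y, 50)
cor(x2, y2)  # jumps close to 1, driven entirely by that single document
```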
My understanding is that findAssocs is based on the Pearson correlation between the term rows of the term-document matrix. Correlation is sensitive to outliers: a single document in which two terms are heavily repeated acts as an outlier and inflates the score, and since the vectors are centered, the many documents where both terms are absent also contribute. Cosine similarity, by contrast, treats each row of the term-document matrix as a vector in document space and measures only the angle between two such vectors, ignoring joint absences. For example, I can compute the cosine between two term vectors directly:
cosine(as.vector(as.matrix(tdm["government",])), as.vector(as.matrix(tdm["foreign",])))
[,1]
[1,] 0
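To see one way the two measures can disagree, here is a minimal base-R sketch (toy vectors, not the USCongress matrix; cosine_sim is my own helper, equivalent to lsa::cosine for vectors): cosine is exactly zero for two terms that never share a document, while the Pearson correlation, which is what findAssocs reports as I understand it, is generally not zero, because centering makes the shared zeros contribute.

```r
x <- c(2, 1, 0, 0, 0, 0, 0, 0)   # term appears only in docs 1-2
y <- c(0, 0, 3, 1, 0, 0, 0, 0)   # term appears only in docs 3-4

# Cosine of two vectors: dot product over the product of their norms.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine_sim(x, y)  # exactly 0: no shared document
cor(x, y)         # non-zero (slightly negative here) despite no co-occurrence
```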
To compare the two measures across all terms, I removed sparse terms (to keep the matrix manageable) and wrote a function that computes the cosine between every pair of term vectors:
tdm.reduced = removeSparseTerms(tdm,0.98)
Proximity = function(tdm){
  m = as.matrix(tdm)   # dense matrix; rows are term vectors
  d = nrow(m)
  r = matrix(0, d, d, dimnames = list(rownames(m), rownames(m)))
  for(i in 1:d){
    for(j in i:d){     # upper triangle only, then mirror
      r[i, j] = cosine(m[i, ], m[j, ])
      r[j, i] = r[i, j]
    }
  }
  diag(r) = 0          # zero out self-similarities
  return(r)
}
rmat = Proximity(tdm.reduced)
head(as.data.frame(sort(findAssocs(tdm.reduced,"fund",0),decreasing=T)),n=10)
sort(findAssocs(tdm.reduced, "fund", 0), decreasing = T)
use 0.11
feder 0.10
insur 0.09
author 0.07
project 0.05
provid 0.05
fiscal 0.04
govern 0.04
secur 0.04
depart 0.03
head(as.data.frame(round(sort(rmat[,"fund"],decreasing=T),2)),n=10)
round(sort(rmat[, "fund"], decreasing = T), 2)
use 0.15
feder 0.14
bill 0.14
provid 0.13
author 0.12
insur 0.11
state 0.10
secur 0.09
purpos 0.09
amend 0.09
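To quantify how close the two rankings are, one can simply intersect the two top-10 lists (term strings copied from the outputs above):

```r
# Top 10 terms for "fund" by findAssocs (correlation) and by cosine.
assoc_top  <- c("use", "feder", "insur", "author", "project",
                "provid", "fiscal", "govern", "secur", "depart")
cosine_top <- c("use", "feder", "bill", "provid", "author",
                "insur", "state", "secur", "purpos", "amend")
length(intersect(assoc_top, cosine_top))  # 6 of the 10 terms appear in both
```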
As you can see, the two measures produce broadly similar rankings, but the cosine seems less affected by a single anomalous document. So my question is: does it make mathematical sense to use cosine similarity between term vectors as a measure of term association, in place of the correlation that findAssocs uses, or is there a pitfall I am missing?
Thanks in advance!