Calculate the correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that the words "Barack" and "Obama" appear together more often (i.e. have a positive correlation) than other pairs.

It should not be that difficult, but to be honest, I only know how to calculate the correlation between two numbers, not between two words in a text.

  • How can I best approach this issue?
  • How can I calculate the relationship between words?

I thought about using conditional probabilities, since, for example, "Barack Obama" is much more probable than "Obama Barack"; however, the problem I am trying to solve is more fundamental and does not depend on word ordering.

+4
4 answers

The Ngram Statistics Package is designed specifically for this task. It has an online document describing the association measures it uses. I have not used the package myself, so I cannot comment on its reliability or requirements.

+3

Well, an easy way to approach your question is to tabulate your data in a 2x2 contingency matrix:

                obama | not obama
    barack        A   |     B
    not barack    C   |     D

count every occurring bigram into the matrix, and then run, for example, a simple chi-square test on it.
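For illustration, here is a minimal Python sketch of that test; the counts A-D are invented, and the choice of scipy.stats.chi2_contingency is mine, not the answerer's:

    # Hypothetical bigram counts for the 2x2 table above (values are made up).
    from scipy.stats import chi2_contingency

    table = [[15,   5],    # barack:     (obama, not obama)
             [ 3, 977]]    # not barack: (obama, not obama)

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
    # A small p-value means the bigram counts deviate from independence,
    # i.e. "barack" and "obama" co-occur more often than chance predicts.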

+1
source

I don't know how this is usually done, but I can think of one rough way to define a notion of correlation that captures the adjacency of words.

Suppose the text has length N; that is, it is an array

 text[0], text[1], ..., text[N-1] 

Suppose the following words appear in the text:

 word[0], word[1], ..., word[k] 

For each vocabulary word word[i], define a vector of length N-1

 X[i] = array(); // of length N-1 

as follows: the jth entry of the vector is 1 if word[i] is either the jth word or the (j+1)th word of the text, and zero otherwise.

 // compute the vector X[i]
 for (j = 0; j <= N-2; j++) {
     if (text[j] == word[i] OR text[j+1] == word[i])
         X[i][j] = 1;
     else
         X[i][j] = 0;
 }

You can then calculate the correlation coefficient between word[a] and word[b] as the dot product of X[a] and X[b] (note that this dot product is the number of times the two words are adjacent), divided by the product of the vectors' lengths (the length of X[i] is the square root of the number of its nonzero entries, which is roughly twice the number of occurrences of word[i]). Call this quantity COR(X[a], X[b]). Clearly COR(X[a], X[a]) = 1, and COR(X[a], X[b]) is larger when word[a] and word[b] are frequently adjacent.

This can be generalized from "adjacent" to other notions of nearness; for example, we could use windows of 3 words instead (or 4, 5, etc.). You could also add weights, and probably do many other things if necessary. You would have to experiment to find out what is useful, if anything.
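To make this concrete, here is a minimal Python sketch of the adjacency-correlation idea, including the window generalization; the function names and the example text are my own additions, not part of the original answer:

    import math
    from collections import defaultdict

    def adjacency_vectors(text, window=2):
        """For each word, build the 0/1 vector over windows of `window` consecutive tokens.
        window=2 reproduces the adjacency definition above."""
        n = len(text)
        X = defaultdict(lambda: [0] * (n - window + 1))
        for j in range(n - window + 1):
            for w in set(text[j:j + window]):
                X[w][j] = 1
        return X

    def cor(X, a, b):
        """Dot product of X[a] and X[b], normalized by the vectors' lengths."""
        dot = sum(x * y for x, y in zip(X[a], X[b]))
        na = math.sqrt(sum(X[a]))
        nb = math.sqrt(sum(X[b]))
        return dot / (na * nb) if na and nb else 0.0

    text = "barack obama met the press and obama spoke".split()
    X = adjacency_vectors(text)
    print(cor(X, "barack", "obama"))   # 0.5: the words are adjacent once
    print(cor(X, "barack", "press"))   # 0.0: never adjacent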

+1

This problem sounds like bigram analysis: a bigram is a sequence of two "tokens" in a body of text. See this Wikipedia article, which has more links on the more general n-gram problem.

If you want to do a full analysis, you would most likely take every pair of words and do a frequency analysis. For example, the sentence "Barack Obama is a Democratic candidate for president" has 8 words, so there are 8 choose 2 = 28 possible pairs.

Then you can ask statistical questions like "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include "Barack", but only in one of them is it paired with "Obama".

Do the same for every other possible pair of words (for example, "in how many pairs does 'candidate' pair with 'a'?"), and you have a basis for comparison.
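As a quick sketch of this pair counting in Python (the helper name and the use of itertools.combinations are my choices, not the answerer's):

    from itertools import combinations
    from collections import Counter

    def pair_counts(sentence):
        """Count every pair of words in the sentence (sentence order preserved within a pair)."""
        words = sentence.lower().split()
        return Counter(combinations(words, 2))

    pairs = pair_counts("Barack Obama is a Democratic candidate for president")
    print(sum(pairs.values()))             # 28 pairs for 8 words
    print(pairs[("barack", "obama")])      # 1: "Obama" follows "Barack" once
    print(sum(c for (w1, w2), c in pairs.items()
              if "barack" in (w1, w2)))    # 7 pairs include "barack"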

0

Source: https://habr.com/ru/post/1440077/
