Computing Pointwise Mutual Information (PMI) for n-grams in Python

I have a large corpus of n-grams and several external n-grams. I want to calculate the PMI of each external n-gram based on the counts from this corpus.

Are there any tools for this, or can someone provide me with a piece of code in Python that can do this?

The problem is that my n-grams are 2-grams, 3-grams, 4-grams and 5-grams, so calculating the probabilities for 3-grams and longer really takes a lot of time.

+4
1 answer

If I understand your problem correctly, you want to compute things like log [ P("x1 x2 x3 x4 x5") / (P("x1") P("x2") ... P("x5")) ], where P measures the probability that any given 5-gram or 1-gram is a given thing (and is basically a ratio of counts, possibly with Laplace-style offsets). So, make one pass through your corpus and count the occurrences of (1) each 1-gram and (2) each n-gram (use a dict for the latter); then for each external n-gram you do a few dict lookups, a bit of arithmetic, and you're done. One pass through the corpus at the start, then a fixed amount of work per external n-gram.
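
For concreteness, here is a minimal Python sketch of that one-pass scheme, assuming a word-tokenized corpus; the function names, the placeholder example corpus, and the crude add-one smoothing are my own choices, not anything from the question:

```python
# Minimal sketch: one pass over the corpus counting 1-grams through 5-grams,
# after which PMI for any external n-gram is just a few dict lookups.
import math
from collections import Counter

def count_ngrams(corpus_tokens, max_n=5):
    """Count every 1-gram up to max_n-gram in a single pass over the tokens."""
    counts = Counter()
    totals = Counter()  # number of n-gram positions in the corpus, per n
    for i in range(len(corpus_tokens)):
        for n in range(1, max_n + 1):
            if i + n <= len(corpus_tokens):
                counts[tuple(corpus_tokens[i:i + n])] += 1
                totals[n] += 1
    return counts, totals

def pmi(ngram, counts, totals, smoothing=1.0):
    """log [ P(w1..wn) / (P(w1) * ... * P(wn)) ], with simple add-one offsets."""
    n = len(ngram)
    log_joint = math.log((counts[tuple(ngram)] + smoothing) / (totals[n] + smoothing))
    log_indep = sum(
        math.log((counts[(w,)] + smoothing) / (totals[1] + smoothing))
        for w in ngram
    )
    return log_joint - log_indep

# Usage: one pass to build the counts, then constant work per external n-gram.
tokens = "the quick brown fox jumps over the lazy dog".split()
counts, totals = count_ngrams(tokens)
print(pmi(("quick", "brown"), counts, totals))
```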

(Note: I'm not actually sure how to define PMI for more than two random variables; maybe it's something like log [ P(a) P(b) P(c) P(abc) / (P(ab) P(bc) P(ca)) ]. But if it's anything at all along those lines, you can do it the same way: iterate over your corpus counting lots of things, and then all the probabilities you need are just ratios of counts, possibly with Laplace-ish corrections.)

If your corpus is so large that you can't fit the n-gram dict into memory, then split it into memory-sized chunks, compute the n-gram dicts for each chunk, and store them on disk in a form that lets you get at any given n-gram's entry reasonably efficiently; then, for each external n-gram, go through the chunks and add up the counts.
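
A rough sketch of that chunked variant, reusing `count_ngrams` from the sketch above; the pickle format, the file naming, and the assumption that the corpus arrives as an iterable of memory-sized token lists are all mine:

```python
# Sketch: count per chunk, save each chunk's dict to disk, then load each
# chunk once and accumulate counts for all query n-grams.
import pickle
from collections import Counter

def build_chunk_dicts(chunks, max_n=5, prefix="ngram_counts"):
    """chunks: iterable of memory-sized token lists. Returns paths to saved dicts."""
    paths = []
    for k, tokens in enumerate(chunks):
        counts, totals = count_ngrams(tokens, max_n)  # from the sketch above
        path = f"{prefix}_{k}.pkl"
        with open(path, "wb") as f:
            pickle.dump((counts, totals), f)
        paths.append(path)
    return paths

def accumulate_counts(external_ngrams, paths):
    """Load each chunk dict once and add up counts for every external n-gram."""
    sums = {tuple(g): 0 for g in external_ngrams}
    grand_totals = Counter()
    for path in paths:
        with open(path, "rb") as f:
            counts, totals = pickle.load(f)
        grand_totals.update(totals)
        for g in sums:
            sums[g] += counts.get(g, 0)
    return sums, grand_totals
```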

What form? Up to you. One simple option: store the counts in lexicographic order of the n-gram (note: if you're working with words rather than letters, you may want to start by turning the words into numbers, which takes one preliminary pass over your corpus); then finding an n-gram is a binary search or something of the kind, which on chunks around 1 GB in size would mean somewhere around 15-20 seeks per chunk; you could add some extra indexing to reduce that. Or: use a hash table on disk, with Berkeley DB or something similar; in that case you can skip the chunking. Or, if the alphabet is small (e.g., these are letter n-grams rather than word n-grams, and you're processing plain English text), just store the counts in a big array with direct lookup; but in that case you can probably fit the whole thing in memory anyway.
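
If you want the hash-table-on-disk option without pulling in Berkeley DB, the standard-library dbm module is one possible stand-in; the file name and the space-joined key encoding below are my own illustration, not a recommendation:

```python
# Sketch of "hash table on disk" using the standard-library dbm module
# instead of Berkeley DB; keys and values must be bytes.
import dbm

def write_counts_db(counts, path="ngram_counts.db"):
    """Store {ngram tuple: count} in an on-disk key/value store."""
    with dbm.open(path, "c") as db:
        for ngram, c in counts.items():
            db[" ".join(ngram).encode("utf-8")] = str(c).encode("utf-8")

def lookup_count(ngram, path="ngram_counts.db"):
    """Fetch one n-gram's count without loading the whole table into memory."""
    with dbm.open(path, "r") as db:
        raw = db.get(" ".join(ngram).encode("utf-8"))
    return int(raw) if raw is not None else 0
```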

+5

Source: https://habr.com/ru/post/1342796/

