How do people use n-grams for sentiment analysis, given that memory requirements grow rapidly as n grows?

I am trying to do a sentiment analysis on tweets using Python.

For starters, I applied the n-gram model. So let's say our training data is:

I am a good kid He is a good kid, but he didn't get along with his sister much 

Unigrams:

 <i, am, a, good, kid, he, is, but, didnt, get, along, with, his, sister, much> 

Bigrams:

 <(i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much)> 

Trigrams:

 <(i am a), (am a good), (a good kid), .........> 

The final feature vector:

 <i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much, (i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much), (i am a), (am a good), (a good kid), .........> 
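A simplified sketch of how such a combined feature list can be built in plain Python (the tokenization and function name are just illustrative):

```python
from itertools import chain

def ngrams(tokens, n):
    """Return all contiguous n-grams from a list of tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "i am a good kid he is a good kid but he didnt get along with his sister much"
tokens = text.split()

# Combined feature list: unigrams, bigrams and trigrams, as in the example above.
features = list(chain.from_iterable(ngrams(tokens, n) for n in (1, 2, 3)))
print(features[:5])   # [('i',), ('am',), ('a',), ('good',), ('kid',)]
```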

When I do this for a big training set of 8,000 or so records, the dimensionality of the feature vector becomes huge, and my computer (16 GB of RAM) crashes.

So when the hundreds of papers out there mention using "n-grams" as features, what exactly are they talking about? Am I doing something wrong?

Do people always do some kind of feature selection for n-grams? If so, what kind of feature selection should I look into?

I am using scikit-learn for this.
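For reference, scikit-learn's CountVectorizer can extract all of these n-gram features in one pass and returns a scipy sparse matrix rather than dense vectors, which is what keeps memory under control. A minimal sketch using the two example sentences above:

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "I am a good kid",
    "He is a good kid, but he didn't get along with his sister much",
]

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams at once.
# The token_pattern keeps one-letter tokens like "i" and "a", which the
# default pattern would drop.
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(tweets)   # scipy sparse matrix, not a dense array

print(X.shape)   # (n_documents, n_features)
print(X.nnz)     # number of non-zero entries actually stored
print(vectorizer.get_feature_names_out()[:5])
```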

2 answers

If you are storing your final feature vector exactly as you wrote it, I think I can suggest some improvements.

The memory problem arises because the features (text strings) are stored repeatedly, and so are the tokens. Consider this process instead:

First, all distinct features are stored (and assigned an index).

For instance,

1 - feature1 - (i)

2 - feature2 - (am a)

...

This builds the so-called feature space.

In general, there may be thousands of features or even more, but that should be fine. Then each record can be rewritten as a sequence of numbers, for example:

Entry1 ----- <1, 1, 1, 0, ..., a_n>, where the first 1 means feature 1 (i) occurs once in this entry, and a_n is the number of occurrences of feature n.

Since there are many features and the entries are short, each vector contains mostly zeros. We can therefore rewrite the previous vector as follows:

Entry1 ---- {1: 1, 2: 1, 3: 1}, which means that features 1, 2 and 3 of Entry1 have value 1, and all other features are zero. Compact, right?

In the end, each entry is represented as a short vector, and you get a large sparse matrix for your corpus. Your corpus may now look like this:

{1: 1, 2: 1, 3: 1}

{2: 1, 29: 1, 1029: 1, 20345: 1}

...

For 8,000 entries you need far less than 16 GB of RAM.
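A minimal sketch of this indexing scheme in Python (the function names are mine, purely for illustration):

```python
from collections import defaultdict

def build_feature_index(all_features):
    """Assign a unique integer index to every distinct feature."""
    index = {}
    for feat in all_features:
        if feat not in index:
            index[feat] = len(index) + 1   # 1-based, as in the example above
    return index

def to_sparse(entry_features, index):
    """Represent one entry as {feature_index: count}, omitting the zeros."""
    counts = defaultdict(int)
    for feat in entry_features:
        counts[index[feat]] += 1
    return dict(counts)

# entry1 would be the combined n-gram list of one tweet
entry1 = ["i", "am", "a", "good", "kid", ("i", "am"), ("am", "a")]
index = build_feature_index(entry1)
print(to_sparse(entry1, index))   # {1: 1, 2: 1, 3: 1, ...}
```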


Furthermore, if you have too many distinct tokens (and therefore too many features), you can drop features whose frequency falls below a threshold, say 3 occurrences, when constructing the feature space. The size of the feature space can be cut in half or even more.
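A sketch of such a frequency cut-off (the helper name is illustrative):

```python
from collections import Counter

def prune_rare_features(entries, min_count=3):
    """Keep only features that occur at least `min_count` times in the whole corpus."""
    totals = Counter()
    for feats in entries:
        totals.update(feats)
    kept = {f for f, c in totals.items() if c >= min_count}
    return [[f for f in feats if f in kept] for feats in entries]
```

scikit-learn's CountVectorizer offers a similar knob through its min_df parameter, which drops features that appear in fewer than a given number of documents.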


As inspectorG4dget points out in the comments, you rarely go for high-order n-grams, e.g. n = 5 or n = 6, because you do not have enough training data to make them useful. In other words, almost all of your 6-grams will have a count of 1. Also, to quote inspectorG4dget's comment:

When these papers talk about n-grams, they are not talking about an unbounded n; they ALWAYS talk about a specific n (the value can be found in the results or experiments section)

So memory is usually not the biggest problem. With a really large corpus, you split it across a cluster and combine the results at the end. You can partition based on how much memory each node in the cluster has, or, if you process the corpus as a stream, you can stop and flush the partial results (to a central node) every time memory fills up.

There are several optimizations you can make. If the corpus is held in memory, then each n-gram only needs to be an index of its first occurrence in the corpus; the string itself does not need to be repeated.
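A rough sketch of that idea, with names of my own choosing:

```python
tokens = "he is a good kid but he didnt get along with his sister much".split()

# Instead of storing the n-gram strings themselves, store (start_position, n)
# pairs that point back into the token list already held in memory.
def ngram_refs(tokens, n):
    return [(i, n) for i in range(len(tokens) - n + 1)]

def materialize(ref, tokens):
    start, n = ref
    return tuple(tokens[start:start + n])

refs = ngram_refs(tokens, 3)
print(materialize(refs[0], tokens))   # ('he', 'is', 'a')
```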

The second optimization, if you don't mind making multiple passes, is to use the (n-1)-gram results to skip parts of a sentence that fall below your threshold. For instance, if you are only interested in n-grams that occur at least 3 times, and the 4-gram "He is a smart" only scored 2 in the 4-gram analysis, then when you come across the 5-gram "He is a smart dog" you can throw it away, because you know it occurs at most twice. This trades extra CPU time for memory.
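A sketch of this pruning, assuming the (n-1)-gram counts from the previous pass are available (function and variable names are illustrative):

```python
def count_ngrams(sentences, n, prev_counts=None, threshold=3):
    """Count n-grams, skipping any whose leading (n-1)-gram already fell below the threshold."""
    counts = {}
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if prev_counts is not None:
                prefix = gram[:-1]
                # If the (n-1)-gram prefix occurs fewer than `threshold` times,
                # the n-gram cannot reach the threshold either; skip it.
                if prev_counts.get(prefix, 0) < threshold:
                    continue
            counts[gram] = counts.get(gram, 0) + 1
    return counts

sentences = [s.split() for s in
             ["he is a smart dog", "he is a smart kid", "he is a good kid"]]
c4 = count_ngrams(sentences, 4)
c5 = count_ngrams(sentences, 5, prev_counts=c4, threshold=3)
```

Only the leading (n-1)-gram is checked here; checking the trailing (n-1)-gram as well would prune even more aggressively.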

