I am trying to do a sentiment analysis on tweets using Python.
For starters, I applied the n-gram model. So let's say our training data is:

I am a good kid
He is a good kid, but he didn't get along with his sister much
Unigrams:
<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much>
Bigrams:
<(i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much)>
Trigrams:
<(i am a), (am a good), (a good kid), .........>
The final feature vector:
<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much, (i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much), (i am a), (am a good), (a good kid), .........>
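To make the construction concrete, here is roughly how I build these features by hand (a minimal sketch; the `tokenize`/`ngrams`/`to_vector` helper names are my own, not from any library):

```python
from itertools import chain

def tokenize(text):
    # lowercase and strip punctuation, so "didn't" becomes "didnt" as above
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def ngrams(tokens, n):
    # all contiguous n-token slices, e.g. n=2 gives ("i", "am"), ("am", "a"), ...
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentences = [
    "I am a good kid",
    "He is a good kid, but he didn't get along with his sister much",
]

# collect every unigram, bigram and trigram from the training data into one vocabulary
vocabulary = []
for sentence in sentences:
    tokens = tokenize(sentence)
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            if gram not in vocabulary:
                vocabulary.append(gram)

# each tweet then becomes one vector with a dimension per vocabulary entry
def to_vector(text):
    tokens = tokenize(text)
    grams = set(chain.from_iterable(ngrams(tokens, n) for n in (1, 2, 3)))
    return [1 if gram in grams else 0 for gram in vocabulary]

print(len(vocabulary))            # size of the feature vector
print(to_vector("I am a good kid"))
```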
When I do this for a larger training set of around 8,000 records, the dimensionality of the feature vector becomes so huge that my computer (16 GB of RAM) crashes.
So when people mention using "n-grams" as features in the hundreds of documents out there, what are they actually talking about? Am I doing something wrong?
Do people always perform some feature selection on their n-grams? If so, what kind of feature selection should I look into?
I am using scikit-learn for this.
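Concretely, my pipeline looks something like the sketch below (the data and parameters are illustrative, and the final `.toarray()` conversion is my guess at where the memory blowup happens, not something I have confirmed):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "I am a good kid",
    "He is a good kid, but he didn't get along with his sister much",
    # ... roughly 8,000 records in the real data
]

# unigrams + bigrams + trigrams, matching the feature vector described above
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)   # returns a scipy sparse matrix

print(X.shape)                         # (n_tweets, n_ngram_features)

# suspected problem step: densifying the sparse matrix into a full
# (n_tweets, n_features) array, which exhausts the 16 GB of RAM
X_dense = X.toarray()
```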