I'm currently trying to classify tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the "$" prefix (for example: $AAPL). I've based my Python script on this blog post: Twitter Sentiment Analysis using Python and NLTK. So far I've been getting fairly good results, but I feel there is a lot of room for improvement.
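For context, here is a minimal sketch of the kind of pipeline I'm using (the training tweets and labels below are made-up placeholders; the feature extraction is the simple bag-of-words approach from that blog post):

```python
import nltk

# Made-up placeholder data -- my real training set is labelled $AAPL tweets.
train_tweets = [
    ("$AAPL is going to the moon", "positive"),
    ("losing money on $AAPL again", "negative"),
    ("$AAPL trading sideways today", "neutral"),
]

def extract_features(tweet):
    # Simple bag-of-words features: one boolean per lower-cased token.
    return {"contains(%s)" % word.lower(): True for word in tweet.split()}

train_set = [(extract_features(text), label) for text, label in train_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(extract_features("$AAPL looks strong")))
classifier.show_most_informative_features(10)
```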
For my word selection method, I decided to implement a tf-idf scoring step to pick out the most informative words. However, having done this, I felt that the results were not that impressive.
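Roughly, this is what I mean (a simplified sketch of the scoring I implemented, not my exact code):

```python
import math
from collections import Counter

def tfidf_scores(tokenized_tweets):
    """Score each word by its summed tf-idf over the tweet collection."""
    n_docs = len(tokenized_tweets)
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized_tweets:
        term_freq = Counter(tokens)
        for word, count in term_freq.items():
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += (count / len(tokens)) * idf
    return scores

# train_tweets as in the first sketch: (text, label) pairs.
tokenized = [text.lower().split() for text, label in train_tweets]
best_words = {w for w, score in tfidf_scores(tokenized).most_common(1000)}
# Then only keep features for words in best_words when building the feature dicts.
```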
I then applied the technique described in the following blog post: Text Classification for Sentiment Analysis – Eliminating Low Information Features. The results were very similar to those obtained with the tf-idf scoring, which led me to examine my classifier's "most informative features" list in more detail. That's when I realized I have a bigger problem:
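That technique boils down to scoring each word with a chi-square test and keeping only the top-scoring ones. Roughly what I did, adapted to my three labels (again a sketch, not my exact code):

```python
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# train_tweets as in the first sketch: (text, label) pairs.
for text, label in train_tweets:
    for word in text.lower().split():
        word_fd[word] += 1
        label_word_fd[label][word] += 1

total_word_count = word_fd.N()

word_scores = {}
for word, freq in word_fd.items():
    # Sum the chi-square score of the word against each label.
    score = 0
    for label in label_word_fd.conditions():
        score += BigramAssocMeasures.chi_sq(
            label_word_fd[label][word],
            (freq, label_word_fd[label].N()),
            total_word_count)
    word_scores[word] = score

best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:1000])
```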
Tweets and "real" language do not use the same grammar and wording. In a normal text, many articles and verbs would be singled out by tf-idf or a stop word list. However, in a tweet corpus, some extremely uninformative words, such as "the", "and", "is", etc., occur in much the same way as words that are crucial for categorizing the text correctly. I can't simply remove all words shorter than 3 letters, because some uninformative features are longer than that, and some informative ones are shorter.
If I could avoid it, I would rather not use a stop word list, because of the need to update it frequently. However, if that is my only option, I guess I will have to go with it.
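For what it's worth, if I do end up going that route, I would probably start from NLTK's built-in list rather than maintaining my own (just an option I'm considering, not something I've settled on):

```python
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

english_stops = set(stopwords.words('english'))

def extract_features(tweet):
    # Same bag-of-words features as before, minus English stop words.
    words = [w.lower() for w in tweet.split()]
    return {"contains(%s)" % w: True for w in words if w not in english_stops}
```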
So, to summarize my question: does anyone know how to truly get the most informative words from a specific kind of source, namely tweets?
EDIT: I am classifying into three groups: positive, negative, and neutral. Also, I was wondering: for TF-IDF, should I only cut off the words with low scores, or also some of the words with higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?
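To make the second part concrete, this is what I mean by cutting both ends (the 50% / 1% cutoffs below are arbitrary numbers just to illustrate the question, not values I'm proposing):

```python
# word_scores maps each word to its tf-idf (or chi-square) score.
ranked = sorted(word_scores, key=word_scores.get)

low_cut = int(len(ranked) * 0.50)   # drop the lowest-scoring 50% (arbitrary)
high_cut = int(len(ranked) * 0.99)  # and the top-scoring 1% (arbitrary)

best_words = set(ranked[low_cut:high_cut])
```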