I'm currently trying to classify tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the "$" prefix (for example: $AAPL). I've based my Python script on this blog post: Twitter Sentiment Analysis using Python and NLTK. So far I've been getting fairly good results, but I feel there is a lot of room for improvement.
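For context, here is a minimal sketch of the kind of pipeline I'm using (the training tweets and labels below are made-up placeholders; the feature extraction is the simple bag-of-words approach from that blog post):

```python
import nltk

# Made-up placeholder data -- my real training set is labelled $AAPL tweets.
train_tweets = [
    ("$AAPL is going to the moon", "positive"),
    ("losing money on $AAPL again", "negative"),
    ("$AAPL trading sideways today", "neutral"),
]

def extract_features(tweet):
    # Simple bag-of-words features: one boolean per lower-cased token.
    return {"contains(%s)" % word.lower(): True for word in tweet.split()}

train_set = [(extract_features(text), label) for text, label in train_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(extract_features("$AAPL looks strong")))
classifier.show_most_informative_features(10)
```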
For my word selection method, I decided to implement a tf-idf scoring step to pick out the most informative words. However, having done this, I felt that the results were not that impressive.
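Roughly, this is what I mean (a simplified sketch of the scoring I implemented, not my exact code):

```python
import math
from collections import Counter

def tfidf_scores(tokenized_tweets):
    """Score each word by its summed tf-idf over the tweet collection."""
    n_docs = len(tokenized_tweets)
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized_tweets:
        term_freq = Counter(tokens)
        for word, count in term_freq.items():
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += (count / len(tokens)) * idf
    return scores

# train_tweets as in the first sketch: (text, label) pairs.
tokenized = [text.lower().split() for text, label in train_tweets]
best_words = {w for w, score in tfidf_scores(tokenized).most_common(1000)}
# Then only keep features for words in best_words when building the feature dicts.
```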
I then applied the technique described in the following blog post: Text Classification for Sentiment Analysis – Eliminating Low Information Features. The results were very similar to those obtained with the tf-idf scoring, which led me to examine my classifier's "most informative features" list in more detail. That's when I realized I have a bigger problem:
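That technique boils down to scoring each word with a chi-square test and keeping only the top-scoring ones. Roughly what I did, adapted to my three labels (again a sketch, not my exact code):

```python
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# train_tweets as in the first sketch: (text, label) pairs.
for text, label in train_tweets:
    for word in text.lower().split():
        word_fd[word] += 1
        label_word_fd[label][word] += 1

total_word_count = word_fd.N()

word_scores = {}
for word, freq in word_fd.items():
    # Sum the chi-square score of the word against each label.
    score = 0
    for label in label_word_fd.conditions():
        score += BigramAssocMeasures.chi_sq(
            label_word_fd[label][word],
            (freq, label_word_fd[label].N()),
            total_word_count)
    word_scores[word] = score

best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:1000])
```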
Tweets and "real" language do not use the same grammar and wording. In a normal text, many articles and verbs would be singled out by tf-idf or a stop word list. However, in a tweet corpus, some extremely uninformative words, such as "the", "and", "is", etc., occur in much the same way as words that are crucial for categorizing the text correctly. I can't simply remove all words shorter than 3 letters, because some uninformative features are longer than that, and some informative ones are shorter.
If I could avoid it, I would rather not use a stop word list, because of the need to update it frequently. However, if that is my only option, I guess I will have to go with it.
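For what it's worth, if I do end up going that route, I would probably start from NLTK's built-in list rather than maintaining my own (just an option I'm considering, not something I've settled on):

```python
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

english_stops = set(stopwords.words('english'))

def extract_features(tweet):
    # Same bag-of-words features as before, minus English stop words.
    words = [w.lower() for w in tweet.split()]
    return {"contains(%s)" % w: True for w in words if w not in english_stops}
```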
So, to summarize my question: does anyone know how to truly get the most informative words from a specific kind of source, namely tweets?
EDIT: I am classifying into three groups: positive, negative, and neutral. Also, I was wondering: for TF-IDF, should I only cut off the words with low scores, or also some of the words with higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?
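To make the second part concrete, this is what I mean by cutting both ends (the 50% / 1% cutoffs below are arbitrary numbers just to illustrate the question, not values I'm proposing):

```python
# word_scores maps each word to its tf-idf (or chi-square) score.
ranked = sorted(word_scores, key=word_scores.get)

low_cut = int(len(ranked) * 0.50)   # drop the lowest-scoring 50% (arbitrary)
high_cut = int(len(ranked) * 0.99)  # and the top-scoring 1% (arbitrary)

best_words = set(ranked[low_cut:high_cut])
```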