Which classifier to choose in NLTK

I want to classify text messages into several categories, for example “building relationships”, “coordination”, “information sharing”, “knowledge sharing” and “conflict resolution”. I am using the NLTK library to process the data. Which NLTK classifier is best suited to this kind of multi-class classification task?

I plan to use Naive Bayes classification; is that advisable?

+6
2 answers

Naive Bayes is the simplest and easiest classifier to understand, and for that reason it is a good one to start with. Decision trees with a beam search to find the best classification are not significantly harder to understand, and are usually a bit better. MaxEnt and SVM tend to be more complex, and SVM requires some tuning to get right.

Most important is the choice of features plus the amount and quality of the data you provide!

For your problem, I would focus first on building a good training/testing data set and on choosing good features. Since you are asking this question, you probably have not had much experience with machine learning for NLP, so I would say start with Naive Bayes: it is easy because it does not need complex features, you can simply tokenize and count word occurrences.

EDIT: The question "How do you find the topic of a sentence?" and my answer to it are also worth a look.
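As a rough illustration of that starting point, here is a minimal sketch of a bag-of-words Naive Bayes classifier in NLTK. The training messages, the `word_features` helper, and the held-out example are all made up for illustration; a real project would need far more labeled data and a proper train/test split.

```python
import nltk

# Hypothetical labeled messages (illustration only, far too few for real use).
TRAIN = [
    ("thanks so much great working with you", "building relationships"),
    ("hope you had a nice weekend", "building relationships"),
    ("can we meet at 3pm tomorrow", "coordination"),
    ("i will take the first task and you take the second", "coordination"),
    ("the report is attached to this email", "information sharing"),
    ("the server goes down for maintenance tonight", "information sharing"),
]

def word_features(message):
    """Bag-of-words features: one boolean feature per lowercase token."""
    return {f"contains({word})": True for word in message.lower().split()}

# NLTK expects a list of (feature_dict, label) pairs.
train_set = [(word_features(text), label) for text, label in TRAIN]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new message.
predicted = classifier.classify(word_features("can we meet tomorrow"))
print(predicted)

# Accuracy on a (hypothetical, tiny) held-out set.
test_set = [(word_features("shall we meet tomorrow"), "coordination")]
print(nltk.classify.accuracy(classifier, test_set))
```

`nltk.NaiveBayesClassifier` accepts any number of labels, so one classifier can cover all of the categories at once.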

+8

Yes, training a Naive Bayes classifier for each category and then labeling each message with the class whose classifier gives the highest score is a standard first approach to problems like this. There are more sophisticated classification algorithms that you could substitute for Naive Bayes if you find performance inadequate, such as a Support Vector Machine (which I believe is available in NLTK via a Weka plugin, but I am not positive). Unless you can think of something specific in this problem domain that would make Naive Bayes especially unsuitable, it is a common first try for a lot of projects.

The other NLTK classifier I would consider trying is MaxEnt, as I believe it natively handles multiclass classification (though the multiple binary classifier approach is very standard and common as well). In any case, the most important thing is to collect a very large corpus of properly labeled text messages.

If by “text messages” you mean actual cell phone text messages, these tend to be very short and the language is very informal and varied, so I think feature selection may end up being a larger factor in determining accuracy than classifier choice for you. For example, using a stemmer or lemmatizer that understands the common abbreviations and idioms in use, tagging parts of speech or chunking, extracting entities, and possibly extracting relationships between terms may pay off more than using a more complex classifier.

This paper discusses classifying Facebook status messages by sentiment, which has some of the same issues, and may provide some insight. The link goes to the Google cache because I am having problems with the original site:
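The one-classifier-per-category approach could look roughly like the sketch below: one binary Naive Bayes model per category, with the message assigned to the category whose model gives the highest probability for "belongs to this category". The training data and the helper names (`word_features`, `train_one_vs_rest`, `classify`) are assumptions made for the example, not part of NLTK.

```python
import nltk

# Hypothetical labeled messages (illustration only).
TRAIN = [
    ("thanks so much great working with you", "building relationships"),
    ("hope you had a nice weekend", "building relationships"),
    ("can we meet at 3pm tomorrow", "coordination"),
    ("i will take the first task and you take the second", "coordination"),
    ("the report is attached to this email", "information sharing"),
    ("the server goes down for maintenance tonight", "information sharing"),
]
CATEGORIES = sorted({label for _, label in TRAIN})

def word_features(message):
    """Bag-of-words features: one boolean feature per lowercase token."""
    return {f"contains({word})": True for word in message.lower().split()}

def train_one_vs_rest(examples):
    """Train one binary classifier (category vs. everything else) per category."""
    classifiers = {}
    for category in CATEGORIES:
        binary_set = [(word_features(text), label == category)
                      for text, label in examples]
        classifiers[category] = nltk.NaiveBayesClassifier.train(binary_set)
    return classifiers

def classify(classifiers, message):
    """Return the best-scoring category and the per-category scores."""
    feats = word_features(message)
    scores = {category: clf.prob_classify(feats).prob(True)
              for category, clf in classifiers.items()}
    return max(scores, key=scores.get), scores

classifiers = train_one_vs_rest(TRAIN)
best, scores = classify(classifiers, "can we meet tomorrow")
print(best)
```

Since NLTK's `NaiveBayesClassifier` also handles more than two labels directly, this split is mainly useful when you want a separate score per category or plan to swap in a strictly binary classifier later.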
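To make the feature-engineering point concrete, here is a small sketch of normalizing informal messages before building features. The `ABBREVIATIONS` table is a made-up, hand-written example (in practice it would have to be built for your own data), and Porter stemming is only one possible choice of stemmer.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical abbreviation table; a real one must be built from your own data.
ABBREVIATIONS = {
    "u": "you",
    "2morrow": "tomorrow",
    "thx": "thanks",
    "pls": "please",
    "mtg": "meeting",
}

_stemmer = PorterStemmer()

def normalize(message):
    """Lowercase, tokenize, expand known abbreviations, then stem each token."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return [_stemmer.stem(token) for token in expanded]

def message_features(message):
    """Boolean bag-of-words features over the normalized tokens."""
    return {f"contains({token})": True for token in normalize(message)}

tokens = normalize("Thx, c u at the mtg 2morrow")
print(tokens)
print(message_features("Meetings moved, pls confirm"))
```

With this normalization, "mtg", "meeting" and "meetings" all end up as the same feature, which matters when the training set is small and the messages are short.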

http://docs.google.com/viewer?a=v&q=cache:_AeBYp6i1ooJ:nlp.stanford.edu/courses/cs224n/2010/reports/ssoriajr-kanej.pdf+maxent+classifier+multiple+classes&hl=en&gl=us&pid=bl&srcid=ADGEESi-eZHTZCQPo7AlcnaFdUws9nSN1P6X0BVmHjtlpKYGQnj7dtyHmXLSONa9Q9ziAQjliJnR8yD1Z-0WIpOjcmYbWO2zcB6z4RzkIhYI_Dfzx2WqU4jy2Le4wrEQv0yZp_QZyHQN&sig=AHIEtbQN4J_XciVhVI60oyrPb4164u681w&pli=1

+2

Source: https://habr.com/ru/post/892055/

