NLP text tagging

Question

I'm new to NLP and working with it for the first time. I am trying to solve the following problem.

My problem is that I have some documents that are manually marked as:

doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
...
docN - categoryX

Here I have a fixed set of categories, and any document can have any number of tags associated with it. I want to train a classifier on this input so that the tagging process can be automated.

thanks

nlp

user1168811 Jan 25 '12 at 9:33

3 answers

John lehmann · Answer 1 · 2012-01-25T16:36:40+0000

What you are trying to do is called multi-label text categorization (or classification). Knowing the right question to ask is half the problem.
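To make the term concrete: when each document can carry several tags, the training target is usually a 0/1 matrix with one row per document and one column per category. A minimal sketch, assuming scikit-learn is installed and using the example tags from the question as stand-in data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Tag sets copied from the question's doc1..doc3 examples (illustrative only).
doc_tags = [
    ["categoryA", "categoryB"],               # doc1
    ["categoryA", "categoryC"],               # doc2
    ["categoryE", "categoryF", "categoryG"],  # doc3
]

binarizer = MultiLabelBinarizer()
# Y has one row per document and one column per category;
# a 1 means the document carries that tag.
Y = binarizer.fit_transform(doc_tags)

print(list(binarizer.classes_))
print(Y)
```

The column order follows `binarizer.classes_`, so the same object can later map predicted rows back to category names with `inverse_transform`.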

As for how to do this, here are two links:

user123 · Answer 2 · 2015-03-23T06:06:44+0000

Most classifiers work on a bag-of-words model. There are several approaches you can try to get the result you want.

Try the most common choice, a multinomial naive Bayes classifier, varying its input parameters and checking the test results.
Try the other naive Bayes variants in scikit-learn ( http://scikit-learn.org/0.11/modules/naive_bayes.html )
You can also try a classifier that takes sentence structure into account. Using n-grams, you can try 2-, 3-, 4-, and 5-gram models and see how the result changes. CountVectorizer supports n-grams; see this link for an example - http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
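The n-gram experiment could look roughly like this. It is only a sketch: the two-document corpus and its labels are made up, and with real data you would compare scores on a held-out test set rather than just counting features. It uses scikit-learn's `CountVectorizer` with its `ngram_range` parameter and `MultinomialNB`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with one label per document, for illustration only.
docs = ["the cat sat on the mat", "the dog sat on the log"]
labels = ["cat", "dog"]

feature_counts = {}
for max_n in (1, 2, 3):
    # ngram_range=(1, max_n) keeps every n-gram from unigrams up to max_n-grams.
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, max_n)),
        MultinomialNB(alpha=1.0),  # alpha (smoothing) is one parameter worth varying
    )
    model.fit(docs, labels)
    vectorizer = model.named_steps["countvectorizer"]
    feature_counts[max_n] = len(vectorizer.vocabulary_)
    print(f"n-grams up to {max_n}: {feature_counts[max_n]} features")
```

Longer n-grams capture more word order but the feature space grows quickly, so larger values of `max_n` generally need more training data to pay off.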

Depending on the features of your data set, no single classifier is guaranteed to be the best; you need to test different approaches and see which one suits you best.

The simplest initial approach is to start with a basic classifier using scikit-learn.

Treat each category as a training class and train the classifier on these classes.
For any input docX, run the classifier with the trained model.
You will get a probability for each category.
Now apply a probability threshold to the three highest-scoring categories: if a category's probability meets the threshold, take that category as a result for this input.
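The steps above can be sketched as follows. This is one possible reading, not a definitive implementation: the training texts, the tag names, the helper `tag_document`, and the default `threshold` and `top_k` values are all made up for illustration. It assumes scikit-learn and uses `OneVsRestClassifier` so that each category gets its own naive Bayes model and its own probability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: each document has one or more tags.
train_docs = [
    "the team won the football match",
    "the election results surprised the government",
    "the government funds the football team",
    "new phone released with faster processor",
    "government regulates phone companies",
    "the match was streamed on a new phone app",
]
train_tags = [
    ["sports"],
    ["politics"],
    ["sports", "politics"],
    ["tech"],
    ["politics", "tech"],
    ["sports", "tech"],
]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(train_tags)  # one 0/1 column per category

# One binary naive Bayes model per category, on bag-of-words counts.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(train_docs, Y)

def tag_document(text, threshold=0.5, top_k=3):
    """Return (category, probability) pairs among the top_k that meet threshold."""
    probs = model.predict_proba([text])[0]  # one probability per category
    ranked = sorted(zip(binarizer.classes_, probs), key=lambda pair: pair[1], reverse=True)
    return [(str(label), float(p)) for label, p in ranked[:top_k] if p >= threshold]

print(tag_document("the government banned the football match"))
```

The threshold is worth tuning on held-out documents: a lower value assigns more tags per document at the cost of more false positives.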

ryder1211212 · Answer 3 · 2017-02-01T20:12:29+0000

It is not clear what you have tried or which programming language you use, but as most answers suggest, try text classification with document vectors built from a bag of words (this works if the documents contain words that help with the classification).
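In case the term is unfamiliar, a bag-of-words document vector is just a list of word counts over a shared vocabulary. A small sketch in plain Python (the two sentences are made-up examples, and real tools also handle punctuation, stop words, and so on):

```python
from collections import Counter

# Hypothetical documents, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Shared vocabulary: every distinct token, in sorted order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded, which is why the model is called a "bag"; the resulting vectors are what a classifier is trained on.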

Here are some simple tools to get you started.

Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
