NLP text tagging

I'm new to NLP and am working with it for the first time. I am trying to solve the following problem.

My problem is that I have some documents that are manually tagged like this:

doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
...
docN - categoryX

I have a fixed set of categories, and any document can have any number of tags associated with it. I want to train a classifier on this input so that the tagging process can be automated.

thanks

3 answers

What you are trying to do is called multi-label text categorization (or classification). Knowing the right term to search for is half the battle.
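To make the setup concrete: when each document can carry several categories at once, the targets are usually encoded as one 0/1 column per category. The snippet below is only a sketch of that encoding using the document names from the question; the helper name `to_indicator` is made up for illustration.

```python
# Tags per document, as in the question (doc1..doc3 only).
doc_tags = {
    "doc1": ["categoryA", "categoryB"],
    "doc2": ["categoryA", "categoryC"],
    "doc3": ["categoryE", "categoryF", "categoryG"],
}

# The fixed category set, in a stable (sorted) order.
categories = sorted({tag for tags in doc_tags.values() for tag in tags})


def to_indicator(tags, categories):
    """Return one 0/1 entry per category (hypothetical helper)."""
    tag_set = set(tags)
    return [1 if category in tag_set else 0 for category in categories]


# One row per document, one column per category.
indicator_rows = {doc: to_indicator(tags, categories) for doc, tags in doc_tags.items()}

for doc, row in indicator_rows.items():
    print(doc, row)
```

Each column can then be treated as its own yes/no classification problem, which is what lets a document end up with any number of tags.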

As for how to do this, here are two links:


Most classifiers work on a bag-of-words model. There are several ways to get the result you expect.
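In case the bag-of-words idea is unfamiliar: each document is reduced to word counts over a shared vocabulary, and word order is ignored. Here is a minimal standard-library sketch; the two sample sentences, the tokenizer regex, and the function name `bag_of_words` are assumptions made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Two made-up sample documents.
docs = ["The cat sat on the mat", "The dog chased the cat"]
bags = [bag_of_words(doc) for doc in docs]

# Shared vocabulary across all documents, in a stable order.
vocabulary = sorted(set().union(*bags))

# One count vector per document; a Counter returns 0 for missing words.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

These count vectors are what the classifier is actually trained on.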

Depending on the features of your data set, no single classifier is guaranteed to be the best for you; you need to try several approaches and see which one suits your data.

A good first approach is to start with a simple classifier using scikit-learn.

  • Treat each category as a training class and train the classifier on these classes.

  • For any input docX, run the classifier with the trained model.

  • You will get a probability for each category.
  • Now pick a probability threshold and check the three highest-scoring categories against it; if a category meets the threshold, consider it a result for this input document.
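
The steps above could look roughly like this in scikit-learn. This is a sketch under assumptions, not a tuned solution: the toy documents and tags are invented, one-vs-rest logistic regression on TF-IDF features is just one reasonable choice of "simple classifier", and `predict_tags`, `threshold`, and `top_k` are hypothetical names and defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data, invented for illustration only.
train_docs = [
    "the striker scored a late goal in the cup final",
    "parliament passed the budget after a long debate",
    "the central bank raised interest rates again",
    "the minister announced new election rules",
    "the club sold its star striker for a record fee",
    "fans protested outside parliament before the match",
]
train_tags = [
    ["sports"],
    ["politics", "finance"],
    ["finance"],
    ["politics"],
    ["sports", "finance"],
    ["sports", "politics"],
]

# Step 1: turn the tag lists into one 0/1 column per category and
# train one binary classifier per category (one-vs-rest).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_tags)

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(train_docs, Y)


def predict_tags(doc, threshold=0.5, top_k=3):
    """Return up to top_k (tag, probability) pairs that meet the threshold."""
    # Steps 2 and 3: one probability per category for the input document.
    probs = model.predict_proba([doc])[0]
    ranked = sorted(zip(mlb.classes_, probs), key=lambda pair: pair[1], reverse=True)
    # Step 4: keep only the highest-ranked categories above the threshold.
    return [(str(tag), float(p)) for tag, p in ranked[:top_k] if p >= threshold]


print(predict_tags("the striker scored in the final match", threshold=0.3))
```

The threshold is something you would tune on held-out documents; with a data set this small the probabilities themselves mean very little.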

It is not clear what you have tried or which programming language you use, but, as most answers suggest, try text classification with document vectors built from a bag of words (provided the documents contain words that can help with the classification).

Here are some simple tools to get you started.

  • Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
  • NLTK http://www.nltk.org (Python)
  • Mallet http://mallet.cs.umass.edu/ (command line & Java)
  • NUML http://numl.net/ (C#)

Source: https://habr.com/ru/post/1392882/