Best way to classify tagged sentences from a set of documents

I have a classification problem and need to work out which approach best solves it. I have a set of training documents in which some sentences and/or paragraphs are tagged; not all sentences/paragraphs are tagged, and a sentence or paragraph may carry more than one tag. What I want is to build a model that, given a new document, suggests tags for each sentence/paragraph inside it. Ideally, it would return the most likely tags together with their probabilities.

If I use something like nltk's NaiveBayesClassifier, it gives poor results, I think because it does not take into account the "untagged" sentences from the training documents, which contain many of the same words and phrases as the tagged sentences. The documents are legal/financial in nature and full of legal/financial jargon, most of which should be discounted by the classification model.

Is there a Naive Bayes classification algorithm better suited to this, or is there some way to feed unlabeled data into Naive Bayes alongside the tagged data from the training set?

+6
2 answers

Here is how I would slightly modify the existing approach: train a separate binary classifier for each possible tag, operating on sentences. Include every sentence that does not bear a given tag as a negative example for that tag (this implicitly makes use of the unlabeled examples). For a new test sentence, run all n classifiers and keep the tags that score above some threshold as the labels for the new sentence.
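A minimal scikit-learn sketch of this scheme (the sentences, tags, and threshold value are made up for illustration; MultinomialNB stands in for the question's Naive Bayes, and an empty tag list marks an unlabeled sentence):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training sentences; the unlabeled one automatically
# becomes a negative example for every tag.
sentences = [
    "The lessee shall indemnify the lessor against all claims",
    "Payment is due within thirty days of the invoice date",
    "This agreement is governed by the laws of New York",
    "The parties met on several occasions",  # unlabeled
]
tags = [["indemnity"], ["payment-terms"], ["governing-law"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary column per tag

# One binary classifier per tag (one-vs-rest), as described above.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(MultinomialNB()),
)
model.fit(sentences, Y)

# Keep every tag whose probability clears the chosen threshold.
probs = model.predict_proba(["The tenant shall indemnify the owner"])[0]
threshold = 0.3
print([tag for tag, p in zip(mlb.classes_, probs) if p >= threshold])
```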

I would probably use something other than Naive Bayes, though. Logistic regression (MaxEnt) is the obvious choice if you want something probabilistic; SVMs are very strong if you don't care about probabilities (and I don't think you do here).
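Swapping the estimator in the sketch above is a one-line change; a hedged illustration of the two options, assuming scikit-learn (note that LinearSVC exposes decision scores, not probabilities):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Probabilistic: predict_proba works, so the cutoff stays a probability.
probabilistic = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Margin-based: no predict_proba here; threshold the decision_function
# scores (e.g. at 0) instead.
margin_based = OneVsRestClassifier(LinearSVC())
```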

This is really a sequence labeling task, and ideally you would fold in predictions for neighboring sentences too... but as far as I know, there is no off-the-shelf extension of CRFs / StructSVM or other sequence methods that lets instances carry multiple labels.

+2

is there some way to feed unlabeled data into Naive Bayes

There is no difference between "labeled" and "unlabeled" data here; Naive Bayes just builds simple conditional probabilities, in particular P(label|features) and P(no label|features), so the behavior depends largely on the processing pipeline used, but I strongly doubt that it actually ignores the unlabeled parts. If for some reason it does and you do not want to change the code, you can also introduce an artificial "no label" class for all remaining text segments.
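A sketch of that workaround with nltk's NaiveBayesClassifier (the feature extractor, the sentences, and the NO_LABEL name are all hypothetical):

```python
import nltk

def features(sentence):
    # Plain bag-of-words features, the form NaiveBayesClassifier expects.
    return {word: True for word in sentence.lower().split()}

labeled = [
    ("The lessee shall indemnify the lessor", "indemnity"),
    ("Payment is due within thirty days", "payment-terms"),
]
# Every remaining, untagged segment gets an explicit artificial class.
unlabeled = [
    "The parties met on several occasions",
    "Counsel reviewed the file before the meeting",
]
train = [(features(s), tag) for s, tag in labeled]
train += [(features(s), "NO_LABEL") for s in unlabeled]

classifier = nltk.NaiveBayesClassifier.train(train)
dist = classifier.prob_classify(features("The tenant shall indemnify the owner"))
for tag in dist.samples():
    print(tag, dist.prob(tag))
```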

Is there a better classification algorithm than Naive Bayes

Yes, NB is just about the most basic model there is, and there are dozens of better (stronger, more general) models that achieve better results in text labeling, including:

  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs), sketched below this list
  • probabilistic graphical models (PGMs) in general
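For the CRF option, a sketch using the third-party sklearn-crfsuite package, treating each document as a sequence of sentences (the features, data, and "O" no-tag label are made up; note this assigns one label per sentence, not the multi-label case flagged in the first answer):

```python
import sklearn_crfsuite

def sent_features(sentence):
    # Crude per-sentence features; real features would be richer.
    words = sentence.lower().split()
    feats = {"first_word": words[0], "n_words": len(words)}
    feats.update({"has_" + w: True for w in words})
    return feats

# Hypothetical training data: one document = one sequence of sentences.
docs = [[
    "The lessee shall indemnify the lessor",
    "Payment is due within thirty days",
    "The parties met on several occasions",
]]
X_train = [[sent_features(s) for s in doc] for doc in docs]
y_train = [["indemnity", "payment-terms", "O"]]  # "O" = no tag

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict([[sent_features("The tenant shall indemnify the owner")]]))
```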
+1

Source: https://habr.com/ru/post/954200/
