I have a classification problem and am trying to work out the best approach. I have a set of training documents in which some sentences and/or paragraphs are tagged. Not all sentences/paragraphs are tagged, and a sentence or paragraph may carry more than one tag. Given a new document, I want a model that suggests tags for each sentence/paragraph inside it, ideally ranking the most likely options with a probability.
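To make the data concrete, here is a toy example of the shape of my training data (the sentences and tag names are made up):

```python
# Each training document is a list of (sentence, tags) pairs.
# An empty tag list means the sentence was left untagged,
# not that it belongs to a "none" class.
training_document = [
    ("The Indemnifying Party shall hold harmless ...", ["indemnity"]),
    ("Notwithstanding the foregoing ...", []),                        # untagged
    ("This Agreement may be terminated for cause ...", ["termination", "cause"]),
    ("Subject to the terms hereof ...", []),                          # untagged
]
```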
When I use something like nltk's NaiveBayesClassifier, it gives poor results, I suspect because it takes no account of the untagged sentences in the training documents, which contain many of the same words and phrases as the tagged ones. The documents are legal/financial in nature and full of legal/financial jargon, most of which the classification model should discount.
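For reference, this is roughly how I am training at the moment (a minimal sketch using the toy `training_document` above; I flatten a multi-tag sentence into one training pair per tag):

```python
import nltk
from nltk import word_tokenize  # requires the "punkt" tokenizer data

def features(sentence):
    # Plain bag-of-words presence features, no jargon filtering
    return {f"contains({w.lower()})": True for w in word_tokenize(sentence)}

# Only tagged sentences are used; untagged ones are simply dropped
train_set = [
    (features(sent), tag)
    for sent, tags in training_document
    for tag in tags
]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Ranked tag suggestions with probabilities for a new sentence
dist = classifier.prob_classify(features("Either party may terminate ..."))
for tag in sorted(dist.samples(), key=dist.prob, reverse=True):
    print(tag, dist.prob(tag))
```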
Is there a better Naive Bayes classification algorithm for this, or some way to feed the untagged data into Naive Bayes in addition to the tagged data from the training set?
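To make that second question concrete: is the right direction something like self-training, e.g. scikit-learn's SelfTrainingClassifier wrapping MultinomialNB as sketched below (made-up data; the classifier expects unlabeled samples to be marked with -1, so I use integer stand-ins for my tags), or is there a Naive Bayes variant that uses unlabeled data more directly, such as the EM-based semi-supervised approach of Nigam et al.?

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

# Made-up stand-ins for my sentences
sentences = [
    "The Indemnifying Party shall hold harmless ...",   # 0 = indemnity
    "This Agreement may be terminated for cause ...",   # 1 = termination
    "Notwithstanding the foregoing ...",                # untagged
    "Subject to the terms hereof ...",                  # untagged
]
labels = [0, 1, -1, -1]  # -1 marks an untagged sentence

model = make_pipeline(
    CountVectorizer(stop_words="english"),
    SelfTrainingClassifier(MultinomialNB(), threshold=0.8),
)
model.fit(sentences, labels)

# Probability per tag for a new sentence
print(model.predict_proba(["Either party may terminate this Agreement ..."]))
```

Since a sentence can carry more than one tag, I assume I would need one such binary model per tag (one-vs-rest) rather than a single multiclass model; corrections on that point are welcome too.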