Natural Language Processing - Features for Text Classification

I'm trying to classify texts with an SVM in Weka. So far, the feature vectors used for SVM training consist of TF-IDF statistics for the unigrams and bigrams that appear in the training texts, but the results I get when testing the trained SVM model are not accurate at all, so could anyone give me feedback on my procedure? I take the following steps to classify texts:

  • Build a dictionary of the unigrams and bigrams extracted from the training texts.
  • Count how many times each unigram / bigram appears in each training text, as well as how many training texts each unigram / bigram appears in.
  • Use the data from step 2 to calculate the TF-IDF score for each unigram / bigram.
  • For each document, build a feature vector whose length equals the size of the dictionary, and store the corresponding TF-IDF statistic in each element of the vector (for example, the first element in the feature vector for document 1 will hold the TF-IDF of the first dictionary entry with respect to document 1).
  • Add a class label to each feature vector to indicate which author the text belongs to.
  • Train the SVM on these feature vectors.
  • Build feature vectors for the test texts in the same way as for the training texts, and classify them with the SVM.

Also, maybe I need to train the SVM with more features? If so, which features are the most effective in this case? Any help would be greatly appreciated, thanks.
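
For concreteness, the steps above map roughly onto the following Weka pipeline (a minimal sketch rather than my exact code: the train.arff file name, the 2000-word dictionary cap, and the use of the built-in StringToWordVector and SMO classes are just placeholders for the general idea):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TfIdfSvmSketch {
    public static void main(String[] args) throws Exception {
        // Raw texts: one string attribute plus a nominal author (class) attribute.
        Instances raw = new DataSource("train.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        // Unigrams and bigrams extracted from the training texts form the dictionary.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(2);

        // StringToWordVector counts terms per document and converts them to TF-IDF.
        StringToWordVector tfidf = new StringToWordVector();
        tfidf.setTokenizer(tokenizer);
        tfidf.setLowerCaseTokens(true);
        tfidf.setTFTransform(true);
        tfidf.setIDFTransform(true);
        tfidf.setWordsToKeep(2000);          // roughly cap the dictionary size
        tfidf.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, tfidf);

        // Train the SVM and estimate accuracy with 10-fold cross-validation.
        SMO svm = new SMO();
        Evaluation eval = new Evaluation(vectors);
        eval.crossValidateModel(svm, vectors, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

When separate test texts are classified, the dictionary and IDF weights learned from the training texts have to be reused unchanged; in Weka the usual way to guarantee that is to wrap the filter and the SVM in a FilteredClassifier.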

2 answers

Natural language documents usually contain many words that appear only once, also known as hapax legomena. For example, 44% of the distinct words in Moby-Dick appear only once, and 17% appear twice.

Therefore, including every word from the corpus usually leads to an excessive number of features. To reduce the size of the feature space, NLP systems typically apply one or more of the following:

  • Removing stop words - for author classification, these are typically short, common words such as is, the, at, which, and so on.
  • Stemming - popular stemmers (such as the Porter stemmer) use a set of rules to normalize the inflection of a word, so that, for example, walking, walks, and walked are all mapped to the same stem.
  • Correlation / significance threshold - compute the Pearson correlation coefficient or the p-value of each feature with respect to the class label, then set a threshold and remove every feature that scores below it (a Weka sketch of this kind of supervised feature selection follows this list).
  • Coverage threshold - similarly, remove all features that do not appear in at least t documents, where t is very small (< 0.05%) relative to the size of the whole corpus.
  • Filtering based on part of speech - for example, only considering verbs, or removing nouns.
  • Filtering based on the type of system - for example, an NLP system for clinical text might only consider words found in a medical dictionary.
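
To illustrate the threshold-based ideas above in the same toolkit the question uses, here is a minimal Weka sketch that ranks the TF-IDF features against the class label by information gain (standing in for Pearson correlation) and keeps only the top-scoring ones; the vectorized.arff file and the cutoff of 500 features are placeholders:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class FeatureSelectionSketch {
    public static void main(String[] args) throws Exception {
        // TF-IDF feature vectors with the class label as the last attribute.
        Instances vectors = new DataSource("vectorized.arff").getDataSet();
        vectors.setClassIndex(vectors.numAttributes() - 1);

        // Score each feature against the class label and keep the 500 best.
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(500);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval());
        selection.setSearch(ranker);
        selection.setInputFormat(vectors);

        Instances reduced = Filter.useFilter(vectors, selection);
        System.out.println("Features kept: " + (reduced.numAttributes() - 1));
    }
}
```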

For tokenization, stop word removal, indexing the corpus, and computing TF-IDF or document similarity, I would recommend using Lucene. Google "Lucene in 5 Minutes" for a quick and easy tutorial on how to use Lucene.
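
A minimal indexing sketch along those lines (assuming a reasonably recent Lucene release, since the package layout and analyzer defaults differ between versions; the index path, field names, and sample text are placeholders):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexSketch {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer tokenizes, lower-cases, removes English stop words and stems.
        Analyzer analyzer = new EnglishAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("index"));
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Call me Ishmael. Some years ago ...", Field.Store.YES));
            doc.add(new TextField("author", "Melville", Field.Store.YES));
            writer.addDocument(doc);
        }
        // Document frequencies for IDF can then be read back via IndexReader.docFreq(...).
    }
}
```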


For this kind of classification it is important that your vectors are not too long, because they will contain many zeros, and this can hurt the results: such vectors end up too close to one another and are hard to separate correctly. I would also recommend that you do not use every bigram; choose the ones with the highest frequency (in the text) in order to reduce the size of your vectors while still keeping enough information. Some explanation of why this is recommended: http://en.wikipedia.org/wiki/Curse_of_dimensionality. Last but not least, it also depends on how much data you have: the larger your vectors, the more training data you need.
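
As a rough illustration of that suggestion, the sketch below counts bigram frequencies in plain Java and keeps only the n most frequent ones as dictionary entries; the tokenization is deliberately naive and the cutoff is arbitrary. In Weka the same effect can be approximated with StringToWordVector's wordsToKeep option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopBigramsSketch {
    /** Keep only the n most frequent bigrams from the given texts. */
    static List<String> topBigrams(List<String> texts, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            String[] tokens = text.toLowerCase().split("\\W+"); // naive tokenization
            for (int i = 0; i + 1 < tokens.length; i++) {
                String bigram = tokens[i] + " " + tokens[i + 1];
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        List<String> bigrams = new ArrayList<>(counts.keySet());
        bigrams.sort((a, b) -> counts.get(b) - counts.get(a)); // most frequent first
        return bigrams.subList(0, Math.min(n, bigrams.size()));
    }

    public static void main(String[] args) {
        List<String> texts = List.of(
                "the cat sat on the mat",
                "the cat sat on the sofa");
        System.out.println(topBigrams(texts, 3)); // e.g. [the cat, cat sat, sat on]
    }
}
```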



