Natural-language documents usually contain many words that appear only once, known as hapax legomena. For example, 44% of the distinct words in Moby-Dick appear only once, and a further 17% appear exactly twice.
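If you want to check this on your own corpus, counting hapaxes takes only a few lines. A minimal Python sketch (the regex tokenization and the `moby_dick.txt` filename are illustrative assumptions, not anything the answer prescribes):

```python
from collections import Counter
import re

def hapax_ratio(text: str) -> float:
    """Fraction of distinct words that occur exactly once."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# e.g. with a local plain-text copy of Moby-Dick:
# with open("moby_dick.txt") as f:
#     print(f"{hapax_ratio(f.read()):.0%} of distinct words are hapaxes")
```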
Including every word from the corpus as a feature therefore usually leads to an excessively large feature space. To reduce its size, NLP systems typically use one or more of the following techniques:
- Removing stop words - for most classification tasks these are short, general words such as is, the, at, which, etc. (see the first sketch after this list).
- Stemming - popular stemmers (such as the Porter stemmer) use a set of rules to normalize a word's inflections. For example, walking, walks, and walked are all reduced to the stem walk (also in the first sketch).
- Correlation/significance threshold - compute the Pearson correlation coefficient or the p-value of each feature with respect to the class label, then set a threshold and drop every feature that scores below it (second sketch).
- Coverage threshold - similar to the threshold above: drop every feature that does not appear in at least t documents, where t is very small (< 0.05%) relative to the corpus size (also in the second sketch).
- Filtering based on part of speech - for example, examining only verbs, or removing nouns (third sketch).
- Filtering based on the type of system - for example, an NLP system for clinical text might consider only words found in a medical dictionary (also in the third sketch).
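For the first two items, a minimal sketch using NLTK's English stop-word list and Porter stemmer (the toy token list is made up for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time resource download

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = ["walking", "walks", "walked", "is", "the", "at", "which", "whale"]

# Drop stop words, then normalize the survivors to their stems.
filtered = [t for t in tokens if t not in stop_words]
print([stemmer.stem(t) for t in filtered])  # ['walk', 'walk', 'walk', 'whale']
```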
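For the two threshold items, a sketch using scikit-learn: `min_df` implements the coverage threshold, and `SelectKBest` with a chi-squared score stands in for the Pearson-correlation/p-value cutoff described above (the documents and labels are fabricated; swap in your own scorer if you need Pearson specifically):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "the whale surfaced",
    "the whale dived",
    "stock prices rose",
    "stock prices fell",
]
labels = [0, 0, 1, 1]  # hypothetical class labels

# Coverage threshold: min_df=2 drops any term seen in fewer than 2 documents.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)

# Significance threshold: keep the k features most associated with the label.
selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
print(vectorizer.get_feature_names_out()[selector.get_support()])
```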
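And for the two filtering items, a sketch using NLTK's POS tagger plus a plain set-membership test against a hypothetical medical word list (the lexicon here is a three-word placeholder, not a real dictionary):

```python
import nltk

# Resource names vary slightly across NLTK versions
# (newer releases use "punkt_tab" / "averaged_perceptron_tagger_eng").
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The harpooner hurled the lance and struck the whale")
tagged = nltk.pos_tag(tokens)

# Part-of-speech filter: keep only verbs (Penn Treebank tags starting with VB).
verbs = [word for word, tag in tagged if tag.startswith("VB")]
print(verbs)  # e.g. ['hurled', 'struck']

# Domain filter: keep only tokens found in a (hypothetical) medical word list.
medical_lexicon = {"lance", "laceration", "suture"}
print([t for t in tokens if t.lower() in medical_lexicon])  # ['lance']
```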
For removing stop words, indexing the corpus, and computing tf-idf or document similarity, I would recommend using Lucene. Google "Lucene in 5 Minutes" for a quick and easy introduction to using Lucene.
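If you would rather stay in Python than set up Lucene, scikit-learn covers the same tf-idf and document-similarity steps. A minimal sketch (toy documents again):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the whale surfaced near the ship",
    "the crew sighted a whale from the ship",
    "stock prices fell sharply overnight",
]

# tf-idf matrix over the corpus, with English stop words removed.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Pairwise cosine similarity between all documents.
print(cosine_similarity(X).round(2))  # docs 0 and 1 score much closer to each other than to doc 2
```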