What is the meaning of “cutoff” and “iteration” for training in OpenNLP?

What is the significance of cutoff and iterations for training in OpenNLP, or, for that matter, in natural language processing in general? I just need to understand these terms. As far as I understand, iterations is the number of times the algorithm is repeated, and the cutoff is a threshold such that if a text scores above it for a certain category, the text is assigned to that category. Am I right?

2 answers

Correct: the term iterations refers to the general concept of iterative algorithms, which solve a problem by successively producing (hopefully more and more accurate) approximations of some "ideal" solution. Generally speaking, the more iterations, the more accurate ("better") the result, but of course more computation has to be performed.
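For concreteness, here is a minimal sketch of how the iteration count is handed to an OpenNLP trainer as an ordinary training parameter. It assumes a recent OpenNLP 1.x API (the TrainingParameters class and its ITERATIONS_PARAM constant); exact names may differ between versions.

 import opennlp.tools.util.TrainingParameters;

 public class IterationsParamDemo {
     public static void main(String[] args) {
         // Start from the library defaults and raise the number of optimisation
         // passes the iterative trainer (e.g. maxent) is allowed to run.
         TrainingParameters params = TrainingParameters.defaultParams();
         params.put(TrainingParameters.ITERATIONS_PARAM, "300");
         // More iterations usually give a closer fit to the training data,
         // at the cost of longer training time.
         System.out.println(params.getSettings());
     }
 }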

The term cutoff (a.k.a. cutoff frequency) refers to a technique for reducing the size of n-gram language models (as used by OpenNLP, for example, in its part-of-speech tagger). Consider the following example:

 Sentence 1 = "The cat likes mice."
 Sentence 2 = "The cat likes fish."
 Bigram model = {"the cat" : 2, "cat likes" : 2, "likes mice" : 1, "likes fish" : 1}

If you apply a cutoff of 1 to this example, i.e. discard every n-gram that occurs only once, the n-gram model is reduced to

 Bigram model = {"the cat" : 2, "cat likes" : 2} 

That is, the cutoff removes from the language model those n-grams that occur only rarely in the training data. Reducing the size of n-gram language models is sometimes necessary, because even the number of bigrams (not to mention trigrams, 4-grams, etc.) explodes for large corpora. The remaining information (the n-gram counts) is then used to statistically estimate the probability of a word (or its POS tag) given the (N-1) preceding words (or POS tags).
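To make the pruning step concrete, here is a small, self-contained Java illustration (plain Java, independent of OpenNLP) that counts the bigrams of the two sentences above and then drops the ones whose frequency does not exceed the cutoff:

 import java.util.HashMap;
 import java.util.Map;

 public class BigramCutoffDemo {
     public static void main(String[] args) {
         String[] sentences = {"the cat likes mice", "the cat likes fish"};
         int cutoff = 1; // as in the example above: bigrams seen only once are dropped

         // Count all bigrams in the training sentences.
         Map<String, Integer> bigramCounts = new HashMap<>();
         for (String sentence : sentences) {
             String[] tokens = sentence.split(" ");
             for (int i = 0; i + 1 < tokens.length; i++) {
                 bigramCounts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
             }
         }
         System.out.println("Full model:   " + bigramCounts);

         // Apply the cutoff: keep only bigrams seen more than `cutoff` times.
         bigramCounts.values().removeIf(count -> count <= cutoff);
         System.out.println("After cutoff: " + bigramCounts);
     }
 }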


In the context of the Apache OpenNLP library, consider in particular the example of document categorization of comments such as these:

 positive I love this. I like this. I really love this product. We like this.
 negative I hate this. I dislike this. We absolutely hate this. I really hate this product.

The cutoff value is used to exclude, as features, words whose count is below the cutoff. If the cutoff were greater than 2, the word "love" (which occurs only twice in the positive examples) could not be used as a feature, and we could get wrong results. In general, the cutoff value is useful for avoiding the creation of unnecessary features for words that occur only rarely. A detailed example with further explanation can be found in this article.
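As a rough sketch of how this cutoff would be set when training a document categorizer on the comments above: this assumes a recent OpenNLP 1.x API, and the class and method names used here (PlainTextByLineStream, DocumentSampleStream, DoccatFactory, DocumentCategorizerME.train, TrainingParameters.CUTOFF_PARAM) may differ in other versions, so treat it as an illustration rather than a definitive recipe.

 import java.io.ByteArrayInputStream;
 import java.nio.charset.StandardCharsets;
 import opennlp.tools.doccat.DoccatFactory;
 import opennlp.tools.doccat.DoccatModel;
 import opennlp.tools.doccat.DocumentCategorizerME;
 import opennlp.tools.doccat.DocumentSample;
 import opennlp.tools.doccat.DocumentSampleStream;
 import opennlp.tools.util.ObjectStream;
 import opennlp.tools.util.PlainTextByLineStream;
 import opennlp.tools.util.TrainingParameters;

 public class DoccatCutoffDemo {
     public static void main(String[] args) throws Exception {
         // One "category text..." sample per line, as expected by DocumentSampleStream.
         String trainingData =
             "positive I love this.\n" +
             "positive I like this.\n" +
             "positive I really love this product.\n" +
             "positive We like this.\n" +
             "negative I hate this.\n" +
             "negative I dislike this.\n" +
             "negative We absolutely hate this.\n" +
             "negative I really hate this product.\n";

         ObjectStream<String> lines = new PlainTextByLineStream(
                 () -> new ByteArrayInputStream(trainingData.getBytes(StandardCharsets.UTF_8)),
                 StandardCharsets.UTF_8);
         ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

         TrainingParameters params = TrainingParameters.defaultParams();
         // Keep every token seen at least once as a feature; a cutoff above 2
         // would discard words such as "love" (seen only twice here).
         params.put(TrainingParameters.CUTOFF_PARAM, "1");
         params.put(TrainingParameters.ITERATIONS_PARAM, "100");

         DoccatModel model = DocumentCategorizerME.train("en", samples, params, new DoccatFactory());

         DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
         double[] outcomes = categorizer.categorize(new String[] {"I", "really", "like", "this"});
         System.out.println("Best category: " + categorizer.getBestCategory(outcomes));
     }
 }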

