The process by which Lucene tokenizes text

This could be seen as a general Java question, but to make it easier to follow I will use Lucene as an example.

You can use various tokenizers in Lucene to tokenize text. There is the main abstract Tokenizer class, and then many different classes that extend it. The same goes for TokenFilter.

Now it seems that every time you want to index a document, a new tokenizer is created. My question is: Tokenizer looks like just a utility class, so why not make it static? For example, a Tokenizer that converts all letters to lowercase could have a static method that does just that for every input it receives. What is the point of creating a new object for each piece of text that we want to index?
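
To make the idea concrete, here is a rough sketch of the kind of static utility I have in mind. The class and method names are made up for illustration; this is not Lucene's API, just the "stateless helper" alternative I am asking about:

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical "static utility" tokenizer, NOT how Lucene actually works.
public final class StaticLowerCaseTokenizer {
    private StaticLowerCaseTokenizer() {}

    // Splits on whitespace and lower-cases each token in one stateless call.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}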

One note: Tokenizer has a private field that holds the input it receives to tokenize. I just don't understand why we need to store it this way, given that the object is discarded as soon as tokenization finishes and the tokenized text is returned. The only thing I can think of is multi-threaded access, maybe?

Thanks!

+3
2 answers

Now it seems that every time you want to index a document, a new tokenizer is created

Not necessarily: look at Analyzer.reusableTokenStream, which hands back the same Tokenizer (and the TokenFilters wrapped around it) for document after document instead of creating new ones. See http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String, java.io.Reader)
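
As a minimal sketch of what a consumer of this API might look like, assuming the Lucene 3.0 classes (WhitespaceAnalyzer, TermAttribute); the field name "body", the sample strings and the class name ReuseDemo are just placeholders:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ReuseDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new WhitespaceAnalyzer();
        String[] docs = { "Hello World", "Another Document" };

        for (String doc : docs) {
            // The analyzer hands back the same Tokenizer instance each time,
            // re-pointed at the new Reader, instead of allocating a new one.
            TokenStream ts = analyzer.reusableTokenStream("body", new StringReader(doc));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.term());
            }
            ts.end();
            ts.close();
        }
        analyzer.close();
    }
}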

One note: Tokenizer has a private field that holds the input it receives to tokenize. I just don't understand why we need to store it this way, given that the object is discarded as soon as tokenization finishes and the tokenized text is returned. The only thing I can think of is multi-threaded access, maybe?

It is not about thread safety; it is about reuse. Creating TokenStream objects (that is, chains of Tokenizers and TokenFilters) is not free, because each one has to set up its attributes, so Lucene keeps them around and reuses them: clearAttributes() and reset are called to put the existing chain back into a clean state instead of building a new one.

For example, a Whitespace tokenizer adds a TermAttribute in its ctor, and when it is wrapped by a LowerCaseFilter, that filter also asks for a TermAttribute in its ctor. These TermAttributes are backed by char[] buffers. With Analyzer.reusableTokenStream, the whole TokenStream chain from the previous document (the Whitespace tokenizer wrapped by the LowerCaseFilter) is kept around, so processing the next document is just a matter of calling reset(Reader) with the new input. After reset() the stream is in a clean state again, and it is consumed by calling incrementToken() until it returns false.
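
Here is a hand-rolled sketch of that reuse pattern, roughly what reusableTokenStream does for you behind the scenes. It assumes the Lucene 3.0 classes named above (WhitespaceTokenizer, LowerCaseFilter, TermAttribute); the class name ChainReuseDemo and the sample strings are made up:

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ChainReuseDemo {
    public static void main(String[] args) throws Exception {
        // Build the chain once: the TermAttribute (and its char[] buffer)
        // is created here, in the ctors, and shared by both stages.
        Tokenizer source = new WhitespaceTokenizer(new StringReader(""));
        TokenStream chain = new LowerCaseFilter(source);
        TermAttribute term = chain.addAttribute(TermAttribute.class);

        String[] docs = { "First DOC", "Second DOC" };
        for (String doc : docs) {
            // Re-point the existing Tokenizer at the new input instead of
            // constructing a fresh WhitespaceTokenizer + LowerCaseFilter.
            source.reset(new StringReader(doc));
            chain.reset();
            while (chain.incrementToken()) {
                System.out.println(term.term());
            }
            chain.end();
        }
        chain.close();
    }
}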

+6

This probably isn't specific to Lucene, and I don't work much w/Lucene, so take this with a grain of salt. In general, though, a tokenizer is not a pure utility class: it is stateful, keeping track of the stream it is reading and how far through it it has got, which is why it is modelled as an object rather than a static method.
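
To illustrate what I mean by stateful, here is a toy tokenizer (made-up code, nothing to do with Lucene's actual classes) that has to remember the stream it is reading between calls:

import java.io.IOException;
import java.io.Reader;

// Hypothetical, simplified tokenizer illustrating why tokenizers are usually
// stateful objects: the Reader it holds remembers how far we have read, so
// each call to next() resumes where the previous one stopped.
public class SimpleWhitespaceTokenizer {
    private final Reader input;   // state: the stream being consumed

    public SimpleWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    // Returns the next token, or null when the input is exhausted.
    public String next() throws IOException {
        int c = input.read();
        while (c != -1 && Character.isWhitespace((char) c)) {
            c = input.read();                 // skip leading whitespace
        }
        if (c == -1) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && !Character.isWhitespace((char) c)) {
            sb.append((char) c);
            c = input.read();                 // advance through the token
        }
        return sb.toString();
    }
}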

0

Source: https://habr.com/ru/post/1785278/

