How to use ASCIIFoldingFilter in my Lucene application?

I have a standard Lucene application that searches an index. The index contains many French terms, and I would like to use ASCIIFoldingFilter.

I have searched many times and still have no idea how to use it. The constructor takes a TokenStream object. Do I call a method on the analyzer that returns a TokenStream when given a field? And what should I do after that? Can someone give me an example of using a TokenFilter? Thanks.

2 answers

Token filters such as ASCIIFoldingFilter are themselves TokenStreams, so they are what an Analyzer returns, mainly through the following method:

public abstract TokenStream tokenStream(String fieldName, Reader reader);

As you noticed, filters take a TokenStream as input. They act as wrappers or, more precisely, as decorators for their input: each filter extends the behavior of the TokenStream it wraps, doing its own work on top of the work of its input.
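To make the decorator idea concrete, here is a minimal sketch in plain Java that does not use Lucene at all. Every class name in it (SketchStream, SketchTokenizer, SketchLowerCaseFilter, SketchStopFilter, DecoratorSketch) is hypothetical and only mirrors the roles of TokenStream, Tokenizer and TokenFilter; the real Lucene classes have a different, attribute-based API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
abstract class SketchStream {
    // Returns the next token, or null when the stream is exhausted.
    abstract String next();
}

// Plays the role of a tokenizer: the source of tokens, split on whitespace.
class SketchTokenizer extends SketchStream {
    private final String[] parts;
    private int pos = 0;

    SketchTokenizer(String text) {
        String trimmed = text.trim();
        parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    String next() {
        return pos < parts.length ? parts[pos++] : null;
    }
}

// Plays the role of a filter: wraps another stream and lowercases its tokens.
class SketchLowerCaseFilter extends SketchStream {
    private final SketchStream input;

    SketchLowerCaseFilter(SketchStream input) {
        this.input = input;
    }

    @Override
    String next() {
        String token = input.next();
        return token == null ? null : token.toLowerCase(Locale.ROOT);
    }
}

// Another filter: wraps a stream and skips tokens found in a stop-word set.
class SketchStopFilter extends SketchStream {
    private final SketchStream input;
    private final Set<String> stopWords;

    SketchStopFilter(SketchStream input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    @Override
    String next() {
        String token;
        while ((token = input.next()) != null) {
            if (!stopWords.contains(token)) {
                return token;
            }
        }
        return null;
    }
}

class DecoratorSketch {
    // Builds the chain the same way an analyzer would: source first, then filters.
    static List<String> analyze(String text, Set<String> stopWords) {
        SketchStream stream = new SketchTokenizer(text);
        stream = new SketchLowerCaseFilter(stream);
        stream = new SketchStopFilter(stream, stopWords);

        List<String> tokens = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("le", "et"));
        System.out.println(analyze("Le Chat ET le chien", stopWords)); // prints [chat, chien]
    }
}
```

Each filter only knows about the stream it wraps, so the order of wrapping is the order of processing: here the stop filter sees tokens that are already lowercased, which is why "ET" is removed.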

You can find an explanation here. It is not directly about ASCIIFoldingFilter, but the same principle applies. Basically, you create a custom analyzer containing something like this (stripped-down example):

public class CustomAnalyzer extends Analyzer {
  // other content omitted
  // ...
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    // etc etc ...
    result = new StopFilter(result, yourSetOfStopWords);
    result = new ASCIIFoldingFilter(result);
    return result;
  }
  // ...
}

Both TokenFilter and Tokenizer are subclasses of TokenStream.


In newer versions of Analyzer, the tokenStream method is final (as of v4.9.0), so you override createComponents instead:

// Accent insensitive analyzer
public class AccentInsensitiveAnalyzer extends StopwordAnalyzerBase {
    public AccentInsensitiveAnalyzer(Version matchVersion){
        super(matchVersion, StandardAnalyzer.STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new StandardTokenizer(matchVersion, reader);

        TokenStream tokenStream = source;
        tokenStream = new StandardFilter(matchVersion, tokenStream);
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new StopFilter(matchVersion, tokenStream, getStopwordSet());
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        return new TokenStreamComponents(source, tokenStream);
    }
}
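
To show what the folding step at the end of the chain does to French terms, here is a rough approximation using only the JDK's java.text.Normalizer. This is not how ASCIIFoldingFilter is implemented, and the class and method names (FoldingSketch, fold) are hypothetical. It only strips combining accents, whereas ASCIIFoldingFilter also maps characters with no accent decomposition, for example the ligature œ to "oe".

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of Lucene.
class FoldingSketch {

    // Decomposes accented characters (NFD), then removes the combining marks.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source-file encoding.
        System.out.println(fold("\u00e9l\u00e8ve"));   // "élève"    -> prints eleve
        System.out.println(fold("fran\u00e7ais"));     // "français" -> prints francais
    }
}
```

Because the index then stores the folded form, queries are normally run through the same analyzer at search time, so that a search for "eleve" and a search for "élève" both produce the same term.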

Source: https://habr.com/ru/post/1756989/