Java Lucene: custom analyzer produces wrong term vector offsets?

I have a problem with Lucene term vector offsets: when I analyze a field with my custom analyzer, it gives wrong offsets for the term vectors, but the offsets are fine with the standard analyzer. Here is my analyzer code:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.snowball.SnowballFilter;

    public class AttachmentNameAnalyzer extends Analyzer {
        private boolean stemmTokens;
        private String name;

        public AttachmentNameAnalyzer(boolean stemmTokens, String name) {
            super();
            this.stemmTokens = stemmTokens;
            this.name = name;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            return stream;
        }

        @Override
        public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
            TokenStream stream = (TokenStream) getPreviousTokenStream();
            if (stream == null) {
                stream = new AttachmentNameTokenizer(reader);
                if (stemmTokens)
                    stream = new SnowballFilter(stream, name);
                setPreviousTokenStream(stream);
            } else if (stream instanceof Tokenizer) {
                ((Tokenizer) stream).reset(reader);
            }
            return stream;
        }
    }
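For reference, a minimal sketch for printing the tokens and offsets this analyzer emits across two consecutive values, the way indexing two field values in a row would consume it (the field name, the sample strings and the "English" stemmer argument are made-up placeholders; Lucene 3.x attribute API):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class OffsetDump {
        public static void main(String[] args) throws Exception {
            AttachmentNameAnalyzer analyzer = new AttachmentNameAnalyzer(true, "English");
            for (String text : new String[] { "annual_report.pdf", "summary-2011.doc" }) {
                TokenStream ts = analyzer.reusableTokenStream("attachment_name", new StringReader(text));
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Every value should start again at offset 0.
                    System.out.println(term + " [" + offset.startOffset() + ".." + offset.endOffset() + "]");
                }
                ts.end();
            }
        }
    }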

What is going wrong here? Help needed.

2 answers

The problem was with the analyzer, in the code I posted earlier: the token stream in fact has to be reset for each new text entry that is to be tokenized.

    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();
        if (stream == null) {
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            setPreviousTokenStream(stream); // ---------------> problem was here
        } else if (stream instanceof Tokenizer) {
            ((Tokenizer) stream).reset(reader);
        }
        return stream;
    }

Every time I set the previous token stream, the next text field, which has to be tokenized separately, would start at the final offset of the last token stream, which made the term vector offsets wrong for the new stream. It works fine now:

    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();
        if (stream == null) {
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            // no setPreviousTokenStream(stream) here any more
        } else if (stream instanceof Tokenizer) {
            ((Tokenizer) stream).reset(reader);
        }
        return stream;
    }
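Note that dropping setPreviousTokenStream() also gives up stream reuse entirely: getPreviousTokenStream() now always returns null and a fresh tokenizer is built on every call. Presumably the stored stream was the SnowballFilter whenever stemming was on, so the stream instanceof Tokenizer branch never fired and the tokenizer underneath was never reset. A sketch of the SavedStreams pattern that Lucene's own 3.x analyzers use, which keeps reuse and still resets the tokenizer (goes inside AttachmentNameAnalyzer; assumes AttachmentNameTokenizer extends Tokenizer):

    private static final class SavedStreams {
        Tokenizer source;
        TokenStream result;
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            streams = new SavedStreams();
            streams.source = new AttachmentNameTokenizer(reader);
            streams.result = stemmTokens ? new SnowballFilter(streams.source, name) : streams.source;
            setPreviousTokenStream(streams);
        } else {
            // Rewind the tokenizer's internal offset to 0 for the new reader;
            // the filter chain stored on top of it is reused as-is.
            streams.source.reset(reader);
        }
        return streams.result;
    }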

What version of Lucene are you using? I am looking at the 3.x branch, and the behavior changes with each version.

You can check the code of public final boolean incrementToken(), where the offset is calculated.
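The detail that matters for this question is that a CharTokenizer keeps a cumulative offset counter that only reset(Reader) rewinds. A small sketch with a stock CharTokenizer subclass to see that behavior (assumes Lucene 3.1):

    import java.io.StringReader;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class ResetDemo {
        public static void main(String[] args) throws Exception {
            WhitespaceTokenizer tok = new WhitespaceTokenizer(Version.LUCENE_31, new StringReader("aa bb"));
            OffsetAttribute off = tok.addAttribute(OffsetAttribute.class);
            while (tok.incrementToken()) {
                System.out.println(off.startOffset() + "-" + off.endOffset()); // 0-2, then 3-5
            }
            // reset(Reader) zeroes the internal offset; a tokenizer reused
            // without it keeps counting from where the last stream ended.
            tok.reset(new StringReader("cc dd"));
            while (tok.incrementToken()) {
                System.out.println(off.startOffset() + "-" + off.endOffset()); // 0-2, then 3-5 again
            }
        }
    }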

I also see this:

    /**
     * <p>
     * As of Lucene 3.1 the char based API ({@link #isTokenChar(char)} and
     * {@link #normalize(char)}) has been deprecated in favor of a Unicode 4.0
     * compatible int based API to support codepoints instead of UTF-16 code
     * units. Subclasses of {@link CharTokenizer} must not override the char based
     * methods if a {@link Version} >= 3.1 is passed to the constructor.
     * </p>
     * <p>
     * NOTE: This method will be marked <i>abstract</i> in Lucene 4.0.
     * </p>
     */

BTW, you can rewrite it with a switch statement, for example:

    @Override
    protected boolean isTokenChar(int c) {
        switch (c) {
            case ',':
            case '.':
            case '-':
            case '_':
            case ' ':
                return false;
            default:
                return true;
        }
    }
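Putting the two hints together, a hypothetical reconstruction of AttachmentNameTokenizer as a CharTokenizer subclass on the int based API that the javadoc above recommends (the Version constant and the separator set are my assumptions):

    import java.io.Reader;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.util.Version;

    public final class AttachmentNameTokenizer extends CharTokenizer {
        public AttachmentNameTokenizer(Reader reader) {
            // A Version >= 3.1 selects the int/codepoint API, so only
            // isTokenChar(int) may be overridden, per the javadoc above.
            super(Version.LUCENE_31, reader);
        }

        @Override
        protected boolean isTokenChar(int c) {
            switch (c) {
                case ',': case '.': case '-': case '_': case ' ':
                    return false;
                default:
                    return true;
            }
        }
    }

With a Version below 3.1 you would override isTokenChar(char) instead.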

Source: https://habr.com/ru/post/890142/

