Lucene.NET: Camel Tokenizer?

Question

I started playing with Lucene.NET today and wrote a simple test method for indexing and searching source code files. The problem is that the standard analyzers and tokenizers treat a whole camel-case identifier as a single token.

I'm looking for a way to split a camel-case identifier, for example MaxWidth, into three tokens: MaxWidth, max and width. I looked for such a tokenizer but could not find one. Before I write my own: does anything like this already exist? Or is there a better approach than writing a tokenizer from scratch?

UPDATE: In the end I decided to get my hands dirty and wrote my own CamelCaseTokenFilter. I will write a post about it on my blog and update the question.

+3

tokenize lucene lucene.net

Igor Brejc Sep 10 '10 at 17:57

3 answers

The link below may be useful for writing a custom tokenizer:

http://karticles.com/NoSql/lucene_custom_tokenizer.html

+1

vrluckyin Feb 27 '12 at 16:10

Here is my implementation:

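package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Rewrites each term so that its camel-case parts are separated by spaces,
 * e.g. "MaxWidth" becomes "Max Width". The result is still a single token.
 */
// Declared final because Lucene expects TokenStream subclasses
// (or at least their incrementToken method) to be final.
public final class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    // The constructor name must match the class name.
    public CamelCaseFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Read the current term, split it, and write it back in place.
        String splitString = splitCamelCase(_termAtt.toString());
        _termAtt.setEmpty().append(splitString);
        return true;
    }

    // Inserts a space at every camel-case boundary using zero-width lookarounds.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",    // end of an acronym: "HTML|Parser"
                "(?<=[^A-Z])(?=[A-Z])",        // non-uppercase before uppercase: "max|Width"
                "(?<=[A-Za-z])(?=[^A-Za-z])"   // letter before non-letter: "width|2"
            ),
            " "
        );
    }
}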
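
To see what the regular expression in splitCamelCase does, it can be tried on its own without Lucene. The following is only a minimal sketch using the JDK; the class name CamelSplitDemo and the sample inputs are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CamelSplitDemo {

    // Same three lookaround alternatives as in the filter above.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            "(?<=[A-Z])(?=[A-Z][a-z])"        // end of an acronym
            + "|(?<=[^A-Z])(?=[A-Z])"         // non-uppercase before uppercase
            + "|(?<=[A-Za-z])(?=[^A-Za-z])",  // letter before non-letter
            " ");
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("MaxWidth", "HTMLParser", "maxWidth2");
        for (String s : samples) {
            // Prints, for example: MaxWidth -> Max Width
            System.out.println(s + " -> " + splitCamelCase(s));
        }
    }
}
```

Note that the filter writes the spaced string back into one term, so "MaxWidth" comes out as the single token "Max Width". Getting separate tokens would need an additional step that emits each part on its own.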

+1

Adir katz Mar 19 '12 at 16:48

Yuval F · Accepted Answer · 2010-09-10T21:23:17+0000

Solr has a WordDelimiterFilterFactory that creates a token filter similar to what you need. Perhaps you can translate its source code into C#.
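
The output the question asks for (the original identifier plus its lowercased parts) can be sketched in plain Java, independent of Lucene and Solr. This is not the Solr implementation: the class CamelCaseExpander and its expand method are hypothetical names, and the sketch only splits where a non-uppercase character is followed by an uppercase one, so acronyms such as "HTMLParser" are left whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CamelCaseExpander {

    // Returns the identifier itself, followed by its lowercased parts
    // when it contains at least one case-change boundary.
    static List<String> expand(String identifier) {
        List<String> tokens = new ArrayList<>();
        if (identifier.isEmpty()) {
            return tokens;
        }
        tokens.add(identifier); // always keep the original form

        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < identifier.length(); i++) {
            char prev = identifier.charAt(i - 1);
            char cur = identifier.charAt(i);
            // Boundary: non-uppercase character followed by an uppercase one.
            if (!Character.isUpperCase(prev) && Character.isUpperCase(cur)) {
                parts.add(identifier.substring(start, i));
                start = i;
            }
        }
        parts.add(identifier.substring(start));

        // Only add parts when the identifier was actually split.
        if (parts.size() > 1) {
            for (String part : parts) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(expand("MaxWidth")); // [MaxWidth, max, width]
    }
}
```

In a real token filter the extra parts would normally be emitted as additional tokens at the same position as the original, which is something a port to C# would have to handle through the token stream's attributes.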