Lucene.NET: Camel case tokenizer?

I started playing with Lucene.NET today and wrote a simple test method for indexing and searching source code files. The problem is that the standard analyzers/tokenizers treat a whole camel case identifier as a single token.

I'm looking for a way to turn a camel case identifier, for example MaxWidth, into three tokens: MaxWidth, max and width. I searched for such a tokenizer but could not find one. Before I write my own: does anything like this already exist? Or is there a better approach than writing a tokenizer from scratch?
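To make the desired behaviour concrete, here is a minimal plain-Java sketch (no Lucene involved) of the expansion described above: keep the original identifier and add its lowercased parts. The class name CamelCaseTokens, the method tokens and the boundary regex are illustrative assumptions, not part of Lucene or Lucene.NET.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CamelCaseTokens {

    // Zero-width split points (assumed rules for this sketch):
    //  - between a lowercase letter or digit and an uppercase letter: "Max|Width"
    //  - between an acronym and a following capitalized word: "XML|Parser"
    private static final String BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Returns the original identifier followed by its lowercased parts.
    // An identifier without any boundary yields just itself.
    static List<String> tokens(String identifier) {
        List<String> out = new ArrayList<>();
        out.add(identifier);
        String[] parts = identifier.split(BOUNDARY);
        if (parts.length > 1) {
            for (String part : parts) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("MaxWidth"));   // [MaxWidth, max, width]
        System.out.println(tokens("XMLParser"));  // [XMLParser, xml, parser]
        System.out.println(tokens("width"));      // [width]
    }
}
```

In a real analyzer the extra parts would normally be emitted by a token filter at the same position as the original term, so that both exact and partial queries match.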

UPDATE: In the end, I decided to get my hands dirty and wrote a CamelCaseTokenFilter myself. I will write a post about it on my blog and update the question.

3 answers

Solr has a WordDelimiterFilterFactory that creates a token filter similar to what you need. Perhaps you can port its source code to C#.


The link below may be useful for writing a custom tokenizer:

http://karticles.com/NoSql/lucene_custom_tokenizer.html


Here is my implementation:

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Rewrites each term so that camel case word boundaries are marked with a
 * space ("MaxWidth" becomes "Max Width"). Note that the result is still a
 * single token; the parts are not emitted as separate tokens.
 */
public final class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute termAtt;

    // The constructor name must match the class name.
    public CamelCaseFilter(TokenStream input) {
        super(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Replace the current term text with its space-separated form.
        String splitString = splitCamelCase(termAtt.toString());
        termAtt.setEmpty();
        termAtt.append(splitString);
        return true;
    }

    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",   // acronym followed by a word: "XML|Parser"
                "(?<=[^A-Z])(?=[A-Z])",       // non-uppercase followed by uppercase: "max|Width"
                "(?<=[A-Za-z])(?=[^A-Za-z])"  // letter followed by a non-letter: "width|2"
            ),
            " "
        );
    }
}
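
To see what the regular expression in splitCamelCase does on its own, here is a small standalone sketch that copies the method into a hypothetical demo class (CamelCaseSplitDemo is not part of the answer or of Lucene) and runs it on two sample identifiers.

```java
public class CamelCaseSplitDemo {

    // Same three zero-width boundary patterns as in the filter above.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",
                "(?<=[^A-Z])(?=[A-Z])",
                "(?<=[A-Za-z])(?=[^A-Za-z])"
            ),
            " "
        );
    }

    public static void main(String[] args) {
        System.out.println(splitCamelCase("MaxWidth"));           // Max Width
        System.out.println(splitCamelCase("parseHTTPResponse2")); // parse HTTP Response 2
    }
}
```

Because the filter writes this space-separated string back into a single term, getting separate, lowercased tokens such as max and width would still need a further step, for example a filter that emits each part as its own token.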

Source: https://habr.com/ru/post/1764200/