I would like to build my own; what I am not sure about is whether what I need (from Lucene's point of view) is a Tokenizer or my own Analyzer. I have already written code that tokenizes my documents into words (as a List<String> or a List<Word>, where Word is just a container class with three public String fields: form, pos and lemma, pos being the part-of-speech tag).
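For reference, that container is nothing fancier than this (a minimal sketch; the field names are simply the ones I use below):

// Minimal sketch of the Word container described above (just a plain data holder).
public class Word {
    public String form;   // the surface form as it appears in the document
    public String pos;    // part-of-speech tag
    public String lemma;  // lemmatized / dictionary form
}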
I'm not sure yet what I will actually index, maybe just Word.lemma or something like Word.lemma + '#' + Word.pos, and I may also do some stop-word filtering based on the part of speech.
By the way, here is my confusion: I am not sure where I should hook into the Lucene API.
Should I wrap my own tokenizer inside a new Tokenizer? Should I rewrite TokenStream? Should I assume this is the job of the Analyzer rather than the Tokenizer? Or should I bypass all of that and build my index directly, adding my words to it with IndexWriter, Fieldable and so on? (If you know of any documentation on how to create your own index from scratch while bypassing the analysis process, please point me to it.)
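To make that last option concrete, this is roughly what I imagine (a sketch against the Lucene 3.x API; the class name, the field name "contents" and the pre-built stream are just placeholders):

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class DirectIndexer {
    // Sketch: hand a pre-analyzed TokenStream straight to a field, so the
    // IndexWriter's Analyzer never touches this field's content.
    void indexPreTokenized(IndexWriter writer, TokenStream myTokens) throws IOException {
        Document doc = new Document();
        doc.add(new Field("contents", myTokens)); // Field(String, TokenStream): indexed, not stored
        writer.addDocument(doc);
    }
}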
Best wishes
EDIT: maybe the easiest way would be to org.apache.commons.lang.StringUtils.join my Words with a space in the output of my own tokenizer/analyzer and rely on WhitespaceTokenizer to feed Lucene (along with the other classic filters)?
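Roughly what I have in mind for that workaround (a sketch; "contents" is an arbitrary field name, and the IndexWriter would be configured with a WhitespaceAnalyzer):

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class JoinAndReTokenize {
    // Sketch: flatten my own tokens into a whitespace-separated string and let
    // a WhitespaceAnalyzer (configured on the IndexWriter) split it back up.
    Document buildDocument(List<Word> words) {
        List<String> tokens = new ArrayList<String>();
        for (Word w : words) {
            tokens.add(w.lemma + "#" + w.pos);         // or just w.lemma
        }
        String joined = StringUtils.join(tokens, ' '); // commons-lang join with a space
        Document doc = new Document();
        doc.add(new Field("contents", joined, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}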
EDIT: so I have read EnglishLemmaTokenizer, pointed to by Larsmans... but what still confuses me is that my own parsing/tokenization process already produces a complete List<Word> (the Word class wrapping .form / .pos / .lemma). That process relies on an external binary that I wrapped in Java (I have to do it this way / cannot do otherwise; it cannot be consumed token by token, I get the complete list as the result), and I still do not see how I should wrap it again in order to get back into the normal Lucene analysis flow.
I will also use the TermVector feature with TF.IDF-like scoring (maybe redefining my own counting), and I may also be interested in proximity searching, so discarding some words based on their part of speech before handing them to Lucene's built-in tokenizer or an internal analyzer seems like a bad idea. And I have trouble seeing the "right" way to wrap Word.form / Word.pos / Word.lemma (or any other Word.anyOtherInterestingAttribute) into the Lucene pipeline.
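For the TermVector part, what I would like to end up with is roughly this (a sketch against the Lucene 3.x API; "contents" is again my placeholder field name, and the field would have been indexed with term vectors enabled):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

class TermCountReader {
    // Sketch: with the field indexed using Field.TermVector.YES (or WITH_POSITIONS_OFFSETS),
    // the raw per-document counts can be read back for TF.IDF-style weighting.
    void printTermCounts(IndexReader reader, int docId) throws IOException {
        TermFreqVector vector = reader.getTermFreqVector(docId, "contents");
        String[] terms = vector.getTerms();          // e.g. "lemma#POS" strings
        int[] freqs = vector.getTermFrequencies();   // raw frequency of each term in this doc
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " -> " + freqs[i]);
        }
    }
}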
EDIT: by the way, here is a piece of code that I wrote, inspired by the one from @Larsmans:
import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

class MyLuceneTokenizer extends TokenStream {
    private PositionIncrementAttribute posIncrement;
    private CharTermAttribute termAttribute;
    private List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary) {
        super();
        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class);
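The constructor above is cut off; here is a minimal sketch of how I intend to finish it and implement incrementToken() (MyTaggerWrapper, its tag() method and the TaggedWord field names are placeholders for my own wrapper around the external binary):

        // run the whole external tagging/lemmatization pass up front (placeholder call)
        this.tagged = MyTaggerWrapper.tag(input, language, pathToExternalBinary);
        this.position = 0;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (position >= tagged.size()) {
            return false;                                // no more tokens to emit
        }
        clearAttributes();
        TaggedWord word = tagged.get(position++);        // field names of TaggedWord assumed
        termAttribute.setEmpty().append(word.lemma);     // or word.lemma + "#" + word.pos
        posIncrement.setPositionIncrement(1);
        return true;
    }
}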