Tokenization and indexing with Lucene: how to handle external tokenization and part-of-speech tags?

I would like to build my own tokenizer (or analyzer; from Lucene's point of view I am not sure which one it is). I already have code that tokenizes my documents into words (as a List<String> or List<Word>, where Word is just a container class with three public String fields: word, pos and lemma; pos stands for the part-of-speech tag).

I'm not sure yet what I'm going to index: maybe just Word.lemma, or something like Word.lemma + '#' + Word.pos, and maybe I will do some stop-word filtering based on the part of speech.
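For reference, here is a minimal sketch of what I mean by that Word container (the class and field names are just my own convention, nothing Lucene-specific):

import java.io.Serializable;

// Plain container produced by my external tokenizer / POS tagger.
public class Word implements Serializable {
    public String word;   // surface form as it appears in the text
    public String pos;    // part-of-speech tag
    public String lemma;  // lemmatized form

    public Word(String word, String pos, String lemma) {
        this.word = word;
        this.pos = pos;
        this.lemma = lemma;
    }
}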

By the way, here is my confusion: I'm not sure where I should hook into the Lucene API.

Should I wrap my tokenizer in a new Tokenizer? Should I subclass TokenStream? Should I assume this is the job of the Analyzer rather than the Tokenizer? Or should I bypass all of that and build my index directly, adding my words to it with IndexWriter, Fieldable, etc.? (If so, do you know of any documentation on how to build an index from scratch while bypassing the analysis process?)

Best wishes

EDIT: maybe the easiest way would be to use org.apache.commons.lang.StringUtils.join to glue the output of my own tokenizer/analyzer back together with spaces, and then rely on WhitespaceTokenizer (and the usual filter chain) to feed Lucene?
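Something like this is what I have in mind there (just a sketch; it assumes lemmas never contain whitespace, and it reuses the Word container from above):

import java.util.List;

import org.apache.commons.lang.StringUtils;

public class PreTokenizedText {
    // Joins the lemmas of an already-tagged text with single spaces so that
    // Lucene's WhitespaceTokenizer / WhitespaceAnalyzer can simply re-split
    // them at indexing time.
    public static String joinLemmas(List<Word> tagged) {
        String[] lemmas = new String[tagged.size()];
        for (int i = 0; i < tagged.size(); i++) {
            lemmas[i] = tagged.get(i).lemma;
        }
        return StringUtils.join(lemmas, ' ');
    }
}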

EDIT: so I read the EnglishLemmaTokenizer that Larsmans pointed to... but what still confuses me is that my own parsing/tokenization step produces a complete List<Word> (the Word class wrapping .form / .pos / .lemma), and that step is based on an external binary that I wrapped in Java (I have to do it this way, I cannot do otherwise; it is not consumed token by token, I get the complete list back as a result), so I still do not see how I should wrap it again in order to get back into the normal Lucene analysis process.

I will also use the TermVector feature with TF.IDF-like functions (maybe redefining my own), and I may also be interested in proximity searching, so discarding some words based on their part of speech before handing them to a Lucene built-in tokenizer or internal analyzer may seem like a bad idea. And I have difficulty seeing the "right" way to wrap Word.form / Word.pos / Word.lemma (or any other Word.anyOtherInterestingAttribute) into the Lucene way of doing things.
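For the term-vector side, here is roughly what I picture when adding the field (a sketch against the Lucene 3.x Field API; the field name "contents" is just an example of mine):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TermVectorFieldSketch {
    // Builds a document whose "contents" field stores term vectors with
    // positions and offsets, so per-document TF counts and proximity
    // information are available after indexing.
    public static Document build(String rawText) {
        Document doc = new Document();
        doc.add(new Field("contents", rawText,
                          Field.Store.NO, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        return doc;
    }
}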

EDIT: By the way, here is a piece of code I wrote, inspired by the one from @Larsmans:

import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import com.google.common.io.CharStreams;

class MyLuceneTokenizer extends TokenStream {
    private PositionIncrementAttribute posIncrement;
    private CharTermAttribute termAttribute; // TermAttribute is deprecated, so CharTermAttribute is used instead
    private List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary) {
        super();
        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class);
        String text;
        try {
            // see http://stackoverflow.com/questions/309424/ for reading a Reader into a String
            text = CharStreams.toString(input);
        } catch (IOException e) {
            throw new RuntimeException(e); // Analyzer.tokenStream() does not declare IOException
        }
        tagged = MyTaggerWrapper.doTagging(text, language, pathToExternalBinary);
        position = 0;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        while (position < tagged.size()) {
            TaggedWord current = tagged.get(position);
            position++;

            String form = current.word;
            String pos = current.pos;
            String lemma = current.lemma;
            // POS-based filtering logic should go here...
            // BTW this breaks the idea behind Lucene's nested filters and analyzers!
            String kept = lemma;
            if (kept != null) {
                int increment = 1; // will probably change later, depending on POS filtering or insertions at the same position...
                clearAttributes();
                posIncrement.setPositionIncrement(increment);
                char[] asCharArray = kept.toCharArray();
                termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
                // termAttribute.setTermBuffer(kept); // old TermAttribute API
                return true;
            }
            // the lemma was filtered out: skip it and try the next tagged word
        }
        return false;
    }
}

class MyLuceneAnalyzer extends Analyzer {
    private String language;
    private String pathToExternalBinary;

    public MyLuceneAnalyzer(String language, String pathToExternalBinary) {
        this.language = language;
        this.pathToExternalBinary = pathToExternalBinary;
    }

    @Override
    public TokenStream tokenStream(String fieldname, Reader input) {
        return new MyLuceneTokenizer(input, language, pathToExternalBinary);
    }
}
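And here is roughly how I intend to wire that analyzer into an IndexWriter (again only a sketch, written against the Lucene 3.6 API; the language code and tagger path are placeholders, and it reuses the term-vector field sketch from above):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MyIndexingSketch {
    public static void main(String[] args) throws Exception {
        // "en" and the tagger path are placeholders for my external binary.
        MyLuceneAnalyzer analyzer = new MyLuceneAnalyzer("en", "/path/to/tagger");
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));
        // index one document; the field is analyzed by MyLuceneAnalyzer
        writer.addDocument(TermVectorFieldSketch.build("Some raw text to tag, lemmatize and index."));
        writer.close();
    }
}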
2 answers

There are various options here, but when I tried to wrap a POS tagger in Lucene, I found that the easiest option was to implement a new TokenStream and wrap it inside a new Analyzer. In any case, working with IndexWriter directly seems like a bad idea. You can find my code on my github.


If you want to use UIMA, Salmon Run has an example. There is also an effort within the Lucene contrib modules to allow UIMA workflows; see here.



