Custom Lucene 3.5 Payloads

I am working with a Lucene index, and I have a standard document format that looks something like this:

    Name: John Doe
    Job: Plumber
    Hobby: Fishing

My goal is to add a payload to the Job field that carries additional information about plumbing, for example a link to a Wikipedia article on plumbing. I do not want to put the payload anywhere else. I initially found an example that covered what I wanted to do, but it used Lucene 2.2 and has not been updated to reflect the changes in the TokenStream API. After some research, I came up with this little monster to create a custom token stream for that field:

    public static TokenStream tokenStream(final String fieldName, Reader reader,
                                          Analyzer analyzer, final String item) {
        final TokenStream ts = analyzer.tokenStream(fieldName, reader);
        TokenStream res = new TokenStream() {
            CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

            public boolean incrementToken() throws IOException {
                while (true) {
                    boolean hasNext = ts.incrementToken();
                    if (hasNext) {
                        termAtt.append("test");
                        payAtt.setPayload(new Payload(item.getBytes()));
                    }
                    return hasNext;
                }
            }
        };
        return res;
    }

When I take the token stream and iterate over all the results before adding it to the field, I can see that it successfully pairs each term with the payload. After calling reset() on the stream, I add it to a document field and index the document. However, when I print the document and inspect the index with Luke, my custom token didn't make the cut. The field name shows up correctly, but the term value from the token stream is not displayed, and there is nothing to indicate that the payload was attached successfully.

This leads me to two questions. First, did I use the TokenStream correctly, and if so, why doesn't it tokenize when I add it to the field? Second, if I did not use the stream correctly, do I need to write my own analyzer? This example was put together using the standard Lucene analyzer to generate the token stream and build the document. I would like to avoid writing my own analyzer if possible, because I only want to add the payload to a single field!

Edit:

Calling code:

    TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
    CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
    PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
    while (ts.incrementToken()) {
        System.out.println("Term = " + cta.toString());
        System.out.println("Payload = " + new String(payload.getPayload().getData()));
    }
    ts.reset();
+4
2 answers

It is hard to tell why the payload is not being saved; the cause may lie in the code that consumes the method you posted.

The most convenient way to set payloads is in a TokenFilter. I think this approach will give you much cleaner code and, in turn, make your scenario work correctly. It is most instructive to look at one of the filters of this kind in the Lucene source, e.g. TokenOffsetPayloadTokenFilter. You can find an example of how it should be used in the test for that class.
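To make that concrete, here is a minimal sketch of such a filter, assuming the Lucene 3.5 API. The class name LinkPayloadFilter and the choice to attach the same payload bytes to every token are illustrative, not taken from TokenOffsetPayloadTokenFilter itself:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.index.Payload;

    // Hypothetical filter: tags every token from the wrapped stream with the
    // same payload (e.g. a link's bytes), leaving the terms themselves alone.
    public final class LinkPayloadFilter extends TokenFilter {
        private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
        private final Payload payload;

        public LinkPayloadFilter(TokenStream input, byte[] payloadBytes) {
            super(input);
            this.payload = new Payload(payloadBytes);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false; // wrapped stream is exhausted
            }
            payAtt.setPayload(payload); // attach the payload to the current token
            return true;
        }
    }

A filter like this composes with the existing analyzer, e.g. new LinkPayloadFilter(analyzer.tokenStream(fieldName, reader), item.getBytes()), so no separate analyzer has to be written just to set the payload.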

Also consider whether there is a better place to store these hyperlinks than in payloads. Payloads have a very specific application, e.g. boosting certain terms depending on their position or formatting in the original document, their part of speech, and so on. Their main purpose is to influence how search is performed, so they are usually numeric values, packed efficiently to keep the index size down.
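For context, this is roughly how payloads typically enter scoring in Lucene 3.5, via the payload query classes and Similarity.scorePayload; the field and term below are just placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.payloads.AveragePayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;

    public final class PayloadQueryExample {
        // Matches "plumber" in the "job" field; each occurrence's payload is
        // turned into a score factor by Similarity.scorePayload(...), and the
        // per-occurrence factors are combined by AveragePayloadFunction.
        public static Query plumberQuery() {
            return new PayloadTermQuery(
                    new Term("job", "plumber"),
                    new AveragePayloadFunction());
        }
    }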

+1

I could be missing something, but... you do not need a custom tokenizer to associate additional information with a Lucene document. Just store it as a non-analyzed field.

    doc.Add(new Field("fname", "Joe", Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("job", "Plumber", Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("link", "http://www.example.com", Field.Store.YES, Field.Index.NO));

Then you can retrieve the link field just like any other stored field.
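A quick sketch of that retrieval side, assuming a plain Lucene 3.5 search (the searcher and hit here stand in for your own search code):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;

    public final class LinkLookup {
        // Pulls the stored (but not indexed) "link" value back out of a hit.
        public static String linkFor(IndexSearcher searcher, ScoreDoc hit) throws IOException {
            Document doc = searcher.doc(hit.doc);
            return doc.get("link");
        }
    }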

Also, if you do need a custom tokenizer, you will definitely need a custom analyzer to wrap it, both when building the index and when searching.

0
