When working with a Lucene index, I have a standard document format that looks something like this:

Name: John Doe
Job: Plumber
Hobby: Fishing
My goal is to add a payload to the Job field, which will contain additional information about plumbing, for example a link to a Wikipedia article on plumbing. I do not want the payload on any other field. I initially found an example that covered what I want to do, but it used Lucene 2.2 and has never been updated to reflect the changes in the TokenStream API. After some research, I came up with this little monster to create a custom token stream for that field:
```java
public static TokenStream tokenStream(final String fieldName, Reader reader,
                                      Analyzer analyzer, final String item) {
    final TokenStream ts = analyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
        CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

        public boolean incrementToken() throws IOException {
            boolean hasNext = ts.incrementToken();
            if (hasNext) {
                termAtt.append("test");
                payAtt.setPayload(new Payload(item.getBytes()));
            }
            return hasNext;
        }
    };
    return res;
}
```
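For comparison, the same idea is usually expressed as a `TokenFilter` subclass rather than an anonymous `TokenStream`. A minimal sketch, assuming the same pre-4.0 `Payload` API used above (the class and constructor names here are my own, not from the original example):

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Hypothetical filter: attaches the same payload to every token
// produced by the wrapped stream, leaving the terms themselves alone.
public final class PayloadAppendingFilter extends TokenFilter {
    private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
    private final byte[] payloadBytes;

    public PayloadAppendingFilter(TokenStream input, String item) {
        super(input);
        this.payloadBytes = item.getBytes();
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        payAtt.setPayload(new Payload(payloadBytes));
        return true;
    }
}
```

One advantage of extending `TokenFilter` is that `reset()`, `end()`, and `close()` are delegated to the wrapped stream for you, which an anonymous `TokenStream` wrapper does not do.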
When I take the token stream and iterate over all the results before adding it to the field, I can see that it has successfully attached the term and the payload. After calling reset() on the stream, I add it to the document field and index the document. However, when I print the document and inspect the index with Luke, my custom token didn't make the cut. The field name is displayed correctly, but the term value from the token stream is not shown, and there is no indication that the payload was attached.
This leads me to two questions. First, did I use the TokenStream correctly, and if so, why doesn't the tokenization happen when I add it to the field? Second, if I did not use the stream correctly, do I need to write my own analyzer? This example was built with the Lucene StandardAnalyzer to produce the token stream and write the document. I would like to avoid writing my own analyzer if possible, because I only want to add the payload to one field!
Edit:
Calling code:
```java
TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
while (ts.incrementToken()) {
    System.out.println("Term = " + cta.toString());
    System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
ts.reset();
```
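One observation about this calling sequence (an assumption about the cause, based on the code shown, not a confirmed diagnosis): the anonymous TokenStream never overrides reset() to delegate to the wrapped stream, so the trailing reset() call does not rewind anything, and the stream handed to the field has already been exhausted by the debug loop. A sketch of a safer sequence that prints from one stream and indexes a fresh one (`doc` is a hypothetical Document instance):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.document.Field;

// Debug pass: consume one stream purely for printing...
TokenStream debug = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = debug.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = debug.getAttribute(PayloadAttribute.class);
while (debug.incrementToken()) {
    System.out.println("Term = " + cta.toString());
    System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
debug.close();

// ...then give the document a fresh, unconsumed stream to index.
TokenStream forIndex = tokenStream("field", new StringReader("value"), a, docValue);
doc.add(new Field("field", forIndex));
```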