Performance issues with Lucene 4.x

Over the past few weeks I have been working on upgrading an application from Lucene 3.x to Lucene 4.x in the hope of improving performance. Unfortunately, after going through the full migration and tuning process and trying all the tricks I found online and in the documentation, Lucene 4 runs significantly slower than Lucene 3 (~50%). At this point I am pretty much out of ideas, and I was wondering if anyone has suggestions on how to bring it back up to speed. I'm not even looking for an improvement over 3.x; I would be happy just to match it and stay on a current release of Lucene.

<Edit>

To confirm that none of the standard migration changes had a negative impact on performance, I ported my Lucene 4.x version back to Lucene 3.6.2, keeping the newer API rather than the old ParallelMultiSearcher and other deprecated methods/classes.

Performance on 3.6.2 is even faster than before:

  • Old application (Lucene 3.6.0) - ~5700 requests/min
  • Updated application with the new API and some minor optimizations (Lucene 4.4.0) - ~2900 requests/min
  • New version of the application ported back, keeping the optimizations and the new IndexSearcher API etc. (Lucene 3.6.2) - ~6200 requests/min

Since the optimizations and the new Lucene API actually improved performance on 3.6.2, the problem doesn't appear to be anything other than Lucene itself. I just don't know what else I can change in my program to fix it.

</Edit>

Application info

  • We have one index split into 20 shards - this gave better performance in both Lucene 3.x and Lucene 4.x

  • The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized, so there are many duplicate tokens. Only one field (an identifier) is stored - the others are not retrievable.

  • We have a fixed set of relatively simple queries that are populated from user input and then executed - they consist of several BooleanQueries, TermQueries and TermRangeQueries. Some of them are nested, but only one level deep (a sketch of this query shape follows this list).

  • We do nothing with the results - we just fetch the scores and the stored identifier field

  • We use MMapDirectories pointing at index files on tmpfs. We experimented with the useUnmap "hack", since we don't open new directories very often, and got a good boost from it

  • We use one IndexSearcher for all queries.

  • Our test machines have 94 GB of RAM and 64 logical cores.
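For context, the query shape described above looks roughly like the following sketch using the Lucene 4.x API. The field names and values are placeholders for illustration, not our real schema:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TermRangeQuery;

    // One level of nesting: a BooleanQuery of TermQueries inside the top-level BooleanQuery
    BooleanQuery nested = new BooleanQuery();
    nested.add(new TermQuery(new Term("type", "a")), BooleanClause.Occur.SHOULD);
    nested.add(new TermQuery(new Term("type", "b")), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("status", "active")), BooleanClause.Occur.MUST);
    query.add(TermRangeQuery.newStringRange("date", "20130101", "20131231", true, true),
              BooleanClause.Occur.MUST);
    query.add(nested, BooleanClause.Occur.MUST);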

General processing

1) A request is received on the listening socket

2) Up to 4 query objects are generated and populated with normalized user input (all input required for a query must be present, otherwise that query is not executed)

3) The queries are executed in parallel using the Fork/Join framework (see the sketch after this list)

  • Sub-queries for each shard are executed in parallel by the IndexSearcher with an ExecutorService

4) Aggregation and simple post-processing
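As a rough illustration of step 3, a query task looks something like the sketch below. The class name, top-N value and the surrounding `searcher`/`queries` variables are placeholders, not our actual code:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    class QueryTask extends RecursiveTask<TopDocs> {
        private final IndexSearcher searcher;
        private final Query query;

        QueryTask(IndexSearcher searcher, Query query) {
            this.searcher = searcher;
            this.query = query;
        }

        @Override
        protected TopDocs compute() {
            try {
                // The searcher itself fans out across shards via its ExecutorService
                return searcher.search(query, 100); // top-N is arbitrary here
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    // Up to four query objects per request are submitted to a shared pool and joined
    ForkJoinPool pool = new ForkJoinPool();
    List<QueryTask> tasks = new ArrayList<>();
    for (Query q : queries) {               // 'queries' built in step 2
        QueryTask task = new QueryTask(searcher, q);
        tasks.add(task);
        pool.execute(task);
    }
    for (QueryTask task : tasks) {
        TopDocs hits = task.join();         // aggregation / post-processing follows (step 4)
    }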

Other relevant information

  • The indexes were rebuilt for the 4.x system, but the data is the same. We tried the standard Lucene42 codec, as well as an extended one that skipped compression (based on a suggestion found online).

  • In 3.x we used a modified version of ParallelMultiSearcher; in 4.x we use an IndexSearcher with an ExecutorService and combine all our readers in a MultiReader (a rough sketch of this setup follows this list)

  • In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests)
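To make the 4.x wiring concrete, here is a rough sketch of how the directories, readers and searcher fit together. The tmpfs path, pool size and method name are illustrative assumptions, not our exact values:

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.MMapDirectory;

    static IndexSearcher openSearcher() throws IOException {
        ExecutorService executor = Executors.newFixedThreadPool(32); // pool size is illustrative

        List<IndexReader> readers = new ArrayList<>();
        for (int shard = 0; shard < 20; shard++) {
            // Each shard's index lives on tmpfs; the path here is a placeholder
            MMapDirectory dir = new MMapDirectory(new File("/mnt/tmpfs/index/shard-" + shard));
            dir.setUseUnmap(true); // the unmap "hack" mentioned above
            readers.add(DirectoryReader.open(dir));
        }

        MultiReader multiReader = new MultiReader(readers.toArray(new IndexReader[readers.size()]));
        return new IndexSearcher(multiReader, executor); // one searcher shared by all queries
    }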

4.x Hotspots

Method | Self time (%) | Self time (ms) | Self time (CPU, ms)

java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0 <- this is just the TCP threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886


If there is any other information that might help, let me know.

1 answer

For those who care, or who are trying to do something similar (managed parallelism within a single query), the problem was that the IndexSearcher was creating one task per segment for each shard, not one task per shard - I had misread the Javadoc.

I solved the problem by calling forceMerge(1) on my shards to limit the number of extra threads. In my use case this isn't a big deal, since I'm not currently using NRT search, but it still adds unnecessary complexity to the update and slave-synchronization process, so I'm looking for ways to avoid forceMerge.
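The workaround itself is just a forceMerge on each shard's writer, roughly like the sketch below, so that one-task-per-segment collapses into one-task-per-shard. Paths, the method name and the analyzer choice are placeholders (the analyzer is irrelevant for a merge-only writer; the config simply requires one):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    static void mergeShard(File shardPath) throws IOException {
        Directory dir = FSDirectory.open(shardPath);
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
        IndexWriter writer = new IndexWriter(dir, config);
        writer.forceMerge(1); // collapse the shard to a single segment
        writer.close();
        dir.close();
    }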

As a quick fix, I'll probably just extend IndexSearcher and have it create one thread per reader instead of one thread per segment, but the idea of a "virtual segment" was raised on the Lucene mailing list. That would be a much better long-term solution.

If you want more information, you can follow the thread on the Lucene mailing list: http://www.mail-archive.com/java-user@lucene.apache.org/msg42961.html


Source: https://habr.com/ru/post/1502913/
