Slow performance when querying and sorting the top 1000 documents in a 17-million-document Lucene.NET index

I want to find the top 1000 documents in a Lucene.NET index that match a type and a tag, sorted by number of views. My index contains 17 million documents. I am searching for type 'entity' and tag 'business'. Almost every document in the index right now has the 'entity' type and the 'business' tag. Both are string fields that are indexed but not analyzed, and they have no term vector. Right now it takes about 15-20 seconds to get the results of this query.

Here is my code:

    string subType = "entity";
    string tag = "business";

    BooleanQuery filterQuery = new BooleanQuery();
    filterQuery.Add(new BooleanClause(new TermQuery(new Term("SubType", subType)), BooleanClause.Occur.MUST));
    filterQuery.Add(new BooleanClause(new TermQuery(new Term("Tag", tag)), BooleanClause.Occur.MUST));

    Sort sort = new Sort(new SortField("Views", global::Lucene.Net.Search.SortField.INT, true));
    Filter queryFilter = new QueryWrapperFilter(filterQuery);
    TopDocs docs = searcher.Search(new MatchAllDocsQuery(), queryFilter, 1000, sort);

Any suggestions for improving performance are welcome. I have spent about 8 hours tuning and tinkering with things. Right now I am caching the results for 15 minutes so that later searches can simply return the cached results, but the initial search is just painfully slow.
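For readers who want to see what the 15-minute caching mentioned above might look like, here is a minimal sketch. It is not the asker's actual code: the class name TimedCache, the GetOrAdd method, and the injectable clock are all hypothetical, and it assumes the search results can be keyed by something like the type and tag.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET: keeps a computed value for a fixed
// time-to-live and recomputes it once the entry has expired.
public sealed class TimedCache<TKey, TValue>
{
    private readonly TimeSpan _ttl;
    private readonly Func<DateTime> _clock;
    private readonly Dictionary<TKey, KeyValuePair<DateTime, TValue>> _entries =
        new Dictionary<TKey, KeyValuePair<DateTime, TValue>>();
    private readonly object _gate = new object();

    // The clock parameter exists only so expiry can be exercised without waiting.
    public TimedCache(TimeSpan ttl, Func<DateTime> clock = null)
    {
        _ttl = ttl;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> compute)
    {
        // The lock is held while computing, so concurrent callers wait for one
        // slow search instead of each starting their own.
        lock (_gate)
        {
            DateTime now = _clock();
            KeyValuePair<DateTime, TValue> entry;
            if (_entries.TryGetValue(key, out entry) && now < entry.Key)
            {
                return entry.Value; // still fresh: entry.Key holds the expiry time
            }

            TValue value = compute(key);
            _entries[key] = new KeyValuePair<DateTime, TValue>(now + _ttl, value);
            return value;
        }
    }
}
```

With this sketch, a cache created as new TimedCache<string, TopDocs>(TimeSpan.FromMinutes(15)) would run the slow search once per key and serve later calls from memory until the entry expires. It does nothing for the first, slow search, which is the actual problem in the question.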

It seems that the bit sets for the terms "entity" and "business" should compress down to a few bytes, assuming that Lucene does any run-length encoding ...

+4
2 answers

My solution was to use SQL for sorting when the category frequency is at or above a certain threshold (10% or more of documents) and to use Lucene for any category that makes up less than 10% of documents. This greatly improved performance, because SQL sorts well when the frequency is high (probably because it can walk the Views index and the Category index together to get the first 5000 quickly). SQL is slow when only a small percentage of items match a category, probably because it has to walk almost the entire Views index. Lucene is fast in that case because it can filter down to only the items in a category very quickly and then sort that smaller set.
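The routing rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: SortBackend, SortRouter, and ChooseBackend are hypothetical names, and the matching and total document counts are assumed to come from wherever you already track category frequency.

```csharp
using System;

// Hypothetical names for illustration only; not from the answer's real code.
public enum SortBackend { Sql, Lucene }

public static class SortRouter
{
    // Cutoff taken from the answer: 10% or more of all documents goes to SQL.
    public const int ThresholdPercent = 10;

    public static SortBackend ChooseBackend(long matchingDocs, long totalDocs)
    {
        if (totalDocs <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(totalDocs));
        }

        // Integer comparison of matchingDocs / totalDocs >= ThresholdPercent / 100,
        // written without division to avoid floating-point rounding at the boundary.
        bool highFrequency = matchingDocs * 100 >= totalDocs * ThresholdPercent;
        return highFrequency ? SortBackend.Sql : SortBackend.Lucene;
    }
}
```

For the numbers in this thread, SortRouter.ChooseBackend(16830000, 17000000) picks SortBackend.Sql, since nearly every document matches, while a category with 50,000 matching documents out of 17 million would be routed to Lucene.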

+1

Reality check: according to your specifications, you are sorting 16,830,000 documents (16.83 million) to pull out the top 1000.

That means enough of each of them has to be loaded to sort them.

Yes. It will be slow.

CUT THE NUMBERS DOWN. Anything that returns more than 10,000 items is definitely not worth sorting.

+5

Source: https://habr.com/ru/post/1379168/

