Lucene: Filtering for Documents That Do Not Contain a Term

I have an index whose documents have two fields (in fact there are more than 800, but the other fields do not matter here):

  • The contents field holds the analyzed/tokenized text of the document. The query string is run against this field.
  • The category field holds the identifier of a category the document belongs to. There are about 2500 different categories, and a document can belong to several of them (that is, a document can have several category entries). Results are filtered by this field.

The index contains about 20 million documents and has a size of 5 GB.

The index is queried with a user-supplied query string plus an additional set of a few categories that the user is not interested in. The question: how can I filter out those documents that match the query string but also belong to one of the unwanted categories?

I could use a BooleanQuery with MUST_NOT clauses, i.e. something like this:

    BooleanQuery q = new BooleanQuery();
    // the user's query must match
    q.add(contentQuery, BooleanClause.Occur.MUST);
    // none of the unwanted categories may match
    for (String unwanted : unwantedCategories) {
        q.add(new TermQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
    }

Is there a way to do this with Lucene filters? Performance is an issue here, and there will only be a few recurring choices for unwantedCategories, so a CachingWrapperFilter would probably help a lot. In addition, because the Lucene queries are generated in an existing code base, it is difficult to fit the extra clauses in there, whereas an additional Filter can easily be plugged in.

In other words, how do I create a Filter based on which terms should _not_ appear in a document?

2 answers

A one-word answer: BooleanFilter. I found it a few minutes after formulating the question:

    BooleanFilter f = new BooleanFilter();
    for (String unwanted : unwantedCategories) {
        // one MUST_NOT clause per unwanted category
        TermsFilter tf = new TermsFilter(new Term("category", unwanted));
        f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
    }
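
To make the idea of a filter built only from MUST_NOT clauses more concrete, here is a minimal plain-Java sketch of the bit-set logic. It deliberately uses no Lucene classes; the class name ExclusionBitsSketch, the method allowedDocs and the in-memory postings map are all hypothetical and only stand in for what the index would provide.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: which documents survive a filter made purely of exclusions. */
final class ExclusionBitsSketch {

    /**
     * @param maxDoc   number of documents in the (hypothetical) index
     * @param postings category -> bit set of the documents carrying that category
     * @param unwanted categories the user does not want to see
     * @return bit set of the documents that carry none of the unwanted categories
     */
    static BitSet allowedDocs(int maxDoc, Map<String, BitSet> postings, List<String> unwanted) {
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(0, maxDoc);        // no positive clause, so start from every document
        for (String category : unwanted) {
            BitSet docs = postings.get(category);
            if (docs != null) {
                allowed.andNot(docs);  // exclusion: clear every document in this category
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        BitSet sports = new BitSet();
        sports.set(0);
        sports.set(2);
        Map<String, BitSet> postings = new HashMap<>();
        postings.put("sports", sports);
        // Documents 0 and 2 are in "sports", so 1, 3 and 4 remain.
        System.out.println(allowedDocs(5, postings, Arrays.asList("sports"))); // prints {1, 3, 4}
    }
}
```

The resulting bit set depends only on the set of unwanted categories, not on the user's query string, which is why a filter like this is a good candidate for caching.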

You can use QueryWrapperFilter to turn an arbitrary query into a filter. And you can use CachingWrapperFilter to cache any filter. So something like:

    BooleanQuery bq = new BooleanQuery();
    // set up bq
    Filter myFilter = new CachingWrapperFilter(new QueryWrapperFilter(bq));
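
One thing to keep in mind: a caching wrapper can only pay off if the same filter instance is reused across searches, since the cached result lives inside that instance. A rough sketch of how the few recurring sets of unwanted categories could each be mapped to one shared instance follows. FilterRegistry and filterFor are hypothetical names, and the type parameter F stands in for whatever filter type you use, so this is plain Java rather than Lucene API.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: hands out one filter instance per distinct set of unwanted categories. */
final class FilterRegistry<F> {
    private final Map<SortedSet<String>, F> filters = new ConcurrentHashMap<>();
    private final Function<SortedSet<String>, F> factory;

    /** @param factory builds the filter for a given set of categories; called once per distinct set */
    FilterRegistry(Function<SortedSet<String>, F> factory) {
        this.factory = factory;
    }

    F filterFor(Collection<String> unwantedCategories) {
        // Sort and de-duplicate so that ["b", "a"] and ["a", "b", "a"] share one entry.
        SortedSet<String> key =
                Collections.unmodifiableSortedSet(new TreeSet<>(unwantedCategories));
        return filters.computeIfAbsent(key, factory);
    }
}
```

In this setup the factory would be the place where the cached filter for a given category set is built, so each distinct set is constructed once and then reused for every later search.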

Source: https://habr.com/ru/post/1332745/