I have an index whose documents have two fields (there are actually more than 800 fields, but the others are not relevant here):
- The contents field contains the analyzed / tokenized text of the document. The query string is run against this field.
- The category field contains the identifier of a category the document belongs to. There are about 2500 different categories, and a document can belong to several of them (that is, a document can have several category entries). Results are filtered by this field.
The index contains about 20 million documents and has a size of 5 GB.
The index is queried with a user-supplied query string plus an additional set of categories that the user is not interested in. Question: how can I exclude the documents that match the query string but also belong to one of the unwanted categories?
I could use a BooleanQuery with MUST_NOT clauses, i.e. something like this:

    BooleanQuery q = new BooleanQuery();
    q.add(contentQuery, BooleanClause.Occur.MUST);
    for (String unwanted : unwantedCategories) {
        // one MUST_NOT clause per unwanted category
        q.add(new TermsQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
    }
Is there a way to do this with Lucene filters? Performance is a concern here, and there will only be a few recurring sets of unwantedCategories, so a CachingWrapperFilter would probably help a lot. In addition, because the Lucene queries are generated in an existing code base, it is hard to change them there, whereas an additional Filter can be plugged in easily.
In other words: how do I create a Filter based on terms that must _not_ appear in a document?
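To make the desired behaviour concrete, here is a minimal plain-Java model of a cached exclusion filter. It is only a sketch of the idea and deliberately uses java.util.BitSet instead of any Lucene class; the class ExclusionFilterSketch and its methods addDocument, allowedDocs and search are hypothetical names made up for this illustration, and the sketch assumes the index does not change once a filter has been cached.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of a cached "must not be in these categories" filter (not Lucene API). */
final class ExclusionFilterSketch {
    private final int maxDoc;
    // category id -> bits of the documents that carry this category
    private final Map<String, BitSet> docsByCategory = new HashMap<>();
    // set of unwanted categories -> bits of the documents that are still allowed
    private final Map<Set<String>, BitSet> cache = new HashMap<>();

    ExclusionFilterSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    /** Registers a document; a document may have several category entries. */
    void addDocument(int docId, List<String> categories) {
        for (String category : categories) {
            docsByCategory.computeIfAbsent(category, k -> new BitSet(maxDoc)).set(docId);
        }
    }

    /** Returns (and caches) the documents that are in none of the unwanted categories. */
    BitSet allowedDocs(Set<String> unwanted) {
        // Copy the key so later changes to the caller's set cannot corrupt the cache.
        return cache.computeIfAbsent(new TreeSet<>(unwanted), key -> {
            BitSet allowed = new BitSet(maxDoc);
            allowed.set(0, maxDoc);              // start with "every document is allowed"
            for (String category : key) {
                BitSet excluded = docsByCategory.get(category);
                if (excluded != null) {
                    allowed.andNot(excluded);    // clear the documents of an unwanted category
                }
            }
            return allowed;
        });
    }

    /** Intersects the hits of the content query with the cached exclusion filter. */
    BitSet search(BitSet queryHits, Set<String> unwanted) {
        BitSet result = (BitSet) queryHits.clone(); // leave the caller's hits untouched
        result.and(allowedDocs(unwanted));
        return result;
    }

    public static void main(String[] args) {
        ExclusionFilterSketch index = new ExclusionFilterSketch(3);
        index.addDocument(0, List.of("sports"));
        index.addDocument(1, List.of("news", "politics"));
        index.addDocument(2, List.of("news"));
        BitSet hits = new BitSet(3);
        hits.set(0, 3); // pretend the content query matched every document
        System.out.println(index.search(hits, Set.of("politics")));
    }
}
```

The point of the sketch is that the negated set is computed once per distinct set of unwanted categories and then reused for every query, which is the effect hoped for from wrapping a "must not" filter in a CachingWrapperFilter.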