Lucene: Fastest way to return the number of documents containing a phrase?

I am trying to use Lucene (actually PyLucene!) to find out how many documents contain an exact phrase. My code currently looks like the following, but it runs quite slowly. Does anyone know a faster way to return the document count?

phraseList = ["some phrase 1", "some phrase 2"] #etc, a list of phrases...

countsearcher = IndexSearcher(SimpleFSDirectory(File(STORE_DIR)), True)
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

for phrase in phraseList:
     query = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse("\"" + phrase + "\"")
     scoreDocs = countsearcher.search(query, 200).scoreDocs
     print "count is: " + str(len(scoreDocs))
1 answer

Typically, writing a custom hit collector is the fastest way to count the number of hits, for example by recording them in a bit set as shown in the javadoc for Collector.
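To make the collector idea concrete, here is a minimal sketch in Python. It is a plain class that only mirrors the callback names from the Lucene 3.x Collector javadoc (setScorer, acceptsDocsOutOfOrder, setNextReader, collect); it is not wired into PyLucene. To use it for real, it would presumably have to subclass PyLucene's own collector wrapper class, whose exact callback signatures are not reproduced here. The name HitCounter and the use of a Python set in place of a BitSet are my own choices for illustration.

```python
class HitCounter(object):
    """Hypothetical hit counter mirroring Lucene 3.x Collector callbacks.

    Assumption: this is a stand-alone illustration, not a PyLucene subclass.
    """

    def __init__(self):
        self.doc_ids = set()  # stands in for the BitSet used in the javadoc
        self.doc_base = 0     # offset of the index segment being searched

    def setScorer(self, scorer):
        # Scores are irrelevant for counting, so the scorer is ignored.
        pass

    def acceptsDocsOutOfOrder(self):
        # Order does not matter when all we do is record document ids.
        return True

    def setNextReader(self, reader, doc_base):
        # Called once per segment; doc ids passed to collect() are
        # relative to this base.
        self.doc_base = doc_base

    def collect(self, doc):
        # Record the index-wide document id of the hit.
        self.doc_ids.add(self.doc_base + doc)

    @property
    def count(self):
        return len(self.doc_ids)
```

If only the count is needed and not the matching ids, a plain integer incremented in collect would do the same job with less memory.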

Another approach is to request TopDocs with the number of results set to one.

TopDocs topDocs = searcher.search(query, filter, 1);

topDocs.totalHits then gives the total number of matching documents, not just the one that was returned.

These examples are for Java. You will need to find the equivalent calls in Python.
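As a rough Python counterpart to the TopDocs approach, the sketch below asks for a single result and reads totalHits instead of counting the returned scoreDocs. The helper name count_phrase_hits is hypothetical; searcher and parser are assumed to be the IndexSearcher and a QueryParser built as in the question. It also assumes the Lucene 3.x behaviour where totalHits is a plain integer; in much later Lucene releases it is an object, so the last line would need adjusting there.

```python
def count_phrase_hits(searcher, parser, phrase):
    """Return how many documents match `phrase` as an exact phrase.

    Assumptions: `searcher.search(query, n)` returns an object with an
    integer `totalHits` attribute, and `phrase` contains no double quotes.
    """
    # Wrap the phrase in quotes so the parser builds a phrase query.
    query = parser.parse('"' + phrase + '"')
    # Request one hit only; totalHits still reports every match.
    top_docs = searcher.search(query, 1)
    return top_docs.totalHits


# Possible usage with the objects from the question (not executed here):
# parser = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer)
# for phrase in phraseList:
#     print("count is: " + str(count_phrase_hits(countsearcher, parser, phrase)))
```

Unlike len(scoreDocs) with a limit of 200, this count is not capped by the number of results requested, and the parser is created once rather than on every loop iteration.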


Source: https://habr.com/ru/post/1744655/

