Lucene: The fastest way to return the number of documents containing a phrase?

Question

I am trying to use Lucene (actually PyLucene!) to find out how many documents contain my exact phrase. Currently my code looks like this, but it runs quite slowly. Does anyone know a faster way to return the document count?

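import lucene
from lucene import (File, IndexSearcher, QueryParser, SimpleFSDirectory,
                    StandardAnalyzer, Version)

lucene.initVM()  # start the JVM before touching any Lucene class

phraseList = ["some phrase 1", "some phrase 2"] #etc, a list of phrases...

# STORE_DIR is the path to the index directory (defined elsewhere)
countsearcher = IndexSearcher(SimpleFSDirectory(File(STORE_DIR)), True)
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

for phrase in phraseList:
    query = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse("\"" + phrase + "\"")
    # Fetches up to 200 hits, so the printed count can never exceed 200
    scoreDocs = countsearcher.search(query, 200).scoreDocs
    print "count is: " + str(len(scoreDocs))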

python search lucene

Michael manoochehri May 09 '10 at 5:00

1 answer

Shashikant Kore · Accepted Answer · 2010-05-09T07:49:13+0000

Typically, writing a custom hit collector is the fastest way to count the number of hits, using a bit set as illustrated in the javadoc of Collector.

Another method is to get TopDocs with the number of requested results set to one.

TopDocs topDocs = searcher.search(query, filter, 1);

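topDocs.totalHits then gives the total number of matching documents.

These solutions are for Java; you will have to find the equivalent approach in Python.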
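A rough PyLucene equivalent of the TopDocs approach might look like the sketch below. The helper name count_phrase_docs is made up for illustration. It assumes that searcher behaves like the question's IndexSearcher, that parser is a QueryParser built once outside the loop, and that the phrase itself contains no double quotes.

```python
def count_phrase_docs(searcher, parser, phrase):
    """Return how many indexed documents match `phrase` as an exact phrase.

    `searcher` is assumed to behave like Lucene's IndexSearcher and `parser`
    like a QueryParser; both are passed in so they are created only once.
    """
    # Wrapping the text in double quotes makes the parser build a phrase query.
    query = parser.parse('"%s"' % phrase)
    # Request a single hit: totalHits still reports every matching document,
    # so the count is no longer capped by the number of results requested.
    return searcher.search(query, 1).totalHits
```

With the objects from the question, this would be called as count_phrase_docs(countsearcher, parser, phrase) inside the loop, where parser is QueryParser(Version.LUCENE_CURRENT, "contents", analyzer) created once before the loop. Whether this is faster than a custom collector is not established here, since this search still scores the hits.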
