Multiple Keyword Search in Java

I have a Java-based application and a set of keywords stored in a MySQL database (about 3 million keywords in total, each of which may consist of more than one word, for example "memory", "old house", "European Union Law", etc.).

The user interacts with the application by uploading a document with arbitrary text (several pages in most cases). I want to find out whether, and where, any of the 3 million keywords appears in the document.

I tried looping over the keywords and searching the document for each one, but this is not efficient at all. I am wondering whether there is a library that can do this search more efficiently.

I would really appreciate any help.

+6
3 answers

The Apache Lucene project may be helpful.

Apache Lucene™ is a high-performance, full-featured text search library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

You can find helpful tips here.

+5

You can try using a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter. Build the filter from the keywords, then check each word of the document against it to find positive matches. Remember that there may be false positives. Therefore, if the Bloom filter reports positives, you can run an SQL query, for example "select keyword from keywords where keyword in (positives from Bloom filter)", to determine exactly which keywords are present in the user's document.

A Java implementation of the Bloom filter is available in the Guava library: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html
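As a rough illustration of this idea, here is a minimal sketch that uses only the JDK. The class KeywordBloomSketch, the nested SimpleBloom filter, and the findKeywords method are all hypothetical names invented for this example. SimpleBloom is a hand-rolled stand-in for a library filter such as Guava's, and the in-memory Set stands in for the confirming SQL query. It also assumes keywords are matched case-insensitively as runs of up to maxWords consecutive words, which may not match how your keywords are actually stored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; every name in this class is hypothetical.
class KeywordBloomSketch {

    // Tiny Bloom filter: may report false positives, never false negatives.
    static final class SimpleBloom {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        SimpleBloom(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        // Double hashing: derive the i-th bit index from two base hashes.
        private int index(String s, int i) {
            int h1 = s.hashCode();
            int h2 = (h1 >>> 16) ^ (h1 * 0x9E3779B9);
            return Math.floorMod(h1 + i * h2, size);
        }

        void put(String s) {
            for (int i = 0; i < hashes; i++) bits.set(index(s, i));
        }

        boolean mightContain(String s) {
            for (int i = 0; i < hashes; i++) {
                if (!bits.get(index(s, i))) return false;
            }
            return true;
        }
    }

    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Returns keyword -> character offsets where it starts in the text.
    // 'exact' plays the role of the confirming database lookup.
    static Map<String, List<Integer>> findKeywords(
            String text, SimpleBloom bloom, Set<String> exact, int maxWords) {
        List<String> words = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
            starts.add(m.start());
        }
        Map<String, List<Integer>> hits = new TreeMap<>();
        for (int i = 0; i < words.size(); i++) {
            StringBuilder phrase = new StringBuilder();
            // Try phrases of 1..maxWords words starting at word i.
            for (int n = 0; n < maxWords && i + n < words.size(); n++) {
                if (n > 0) phrase.append(' ');
                phrase.append(words.get(i + n));
                String candidate = phrase.toString();
                // Cheap Bloom check first; exact check removes false positives.
                if (bloom.mightContain(candidate) && exact.contains(candidate)) {
                    hits.computeIfAbsent(candidate, k -> new ArrayList<>())
                        .add(starts.get(i));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(
                Arrays.asList("memory", "old house", "european union law"));
        SimpleBloom bloom = new SimpleBloom(1 << 16, 3);
        for (String k : keywords) bloom.put(k);
        String text = "The old house kept a memory of European Union law.";
        System.out.println(findKeywords(text, bloom, keywords, 3));
    }
}
```

The point of the design is that the document is scanned once, and the number of lookups depends on the document length rather than on the 3 million keywords. In a real setup the exact check would only be issued for the candidates the filter lets through.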

+1

You can use the Lemur Project, which is also available on SourceForge:

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine and the ClueWeb09 dataset.

And as Taher recommended, Apache Lucene is a good tool too. I have used both of them and they are great.

+1

Source: https://habr.com/ru/post/981907/

