How do I create an index for substring search?

I want to do general substring search over billions of strings. The requirement differs slightly from ordinary full-text search, because I want a query for ubst to also match substr.

Can Lucene or Sphinx do this? If not, how would you suggest doing it?

+6
3 answers

The best index structure for this case is a suffix tree. Lucene does not implement this type of index, so substring search is slow. But Lucene does have a prefix-tree index, which means searches are fast when you look up terms by their prefix.
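To illustrate why a suffix-based index helps: every substring of a string is a prefix of one of its suffixes, so once all suffixes are sorted, a substring query becomes a prefix lookup by binary search. Below is a minimal Python sketch of that idea using a sorted suffix list (a simpler relative of the suffix tree). The names build_suffix_index and search are made up for the illustration. It stores full suffix copies, which would not scale to billions of strings as written; a real implementation would store offsets instead.

```python
from bisect import bisect_left


def build_suffix_index(strings):
    """Return a sorted list of (suffix, string_id) pairs for every suffix of every string."""
    entries = []
    for string_id, text in enumerate(strings):
        for start in range(len(text)):
            entries.append((text[start:], string_id))
    entries.sort()
    return entries


def search(index, query):
    """Return the sorted ids of the strings that contain query as a substring."""
    if not query:
        return []
    # Every occurrence of query is a prefix of some suffix, so all matches
    # form one contiguous run in the sorted list; binary search finds its start.
    pos = bisect_left(index, (query,))
    hits = set()
    while pos < len(index) and index[pos][0].startswith(query):
        hits.add(index[pos][1])
        pos += 1
    return sorted(hits)


strings = ["substr", "string", "subway"]
index = build_suffix_index(strings)
print(search(index, "ubst"))  # ids of strings containing "ubst"
print(search(index, "str"))   # ids of strings containing "str"
```

The lookup costs a binary search plus the number of matching suffixes, independent of how many strings do not match.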

+4

Lucene is one of the best options available. It supports substring search, so a query for ubst will return substr.

Check out http://wiki.apache.org/lucene-java/LuceneImplementations for an implementation in a suitable language.

0

Sphinx has supported efficient substring search since version 2.0.1-beta (April 22, 2011). Unfortunately, to date this support is available only in the beta releases, as mentioned here.

I tried beta 2.1.1, and it seems to work correctly. See the manual entry for the dict option (dictionary type) and read about the keywords type.

When I tried version 2.0.6, it fell back to the inefficient crc dictionary and printed the following warning when indexing:

 WARNING: min_infix_len is not supported yet with dict=keywords; using dict=crc 

My minimal configuration file:

    source sour
    {
        type = xmlpipe2
        xmlpipe_command = type C:\Temp\1\sphinx\input.xml
    }

    index inde
    {
        source = sour
        path = testpa
        enable_star = 1
        dict = keywords
        charset_type = utf-8
        min_infix_len = 1
    }

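As a rough picture of what infix indexing buys you, here is a small Python sketch that indexes every infix of each keyword down to a minimum length, so that a query like ubst finds a document containing substr. This is only a conceptual illustration and not how Sphinx stores its dictionary. The function names are made up, and the min_infix_len parameter is borrowed from the config option purely by analogy.

```python
from collections import defaultdict


def infixes(word, min_infix_len):
    """Yield every distinct substring of word with at least min_infix_len characters."""
    seen = set()
    for start in range(len(word)):
        for end in range(start + min_infix_len, len(word) + 1):
            piece = word[start:end]
            if piece not in seen:
                seen.add(piece)
                yield piece


def build_infix_index(documents, min_infix_len=1):
    """Map each infix to the set of document ids whose keywords contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            for piece in infixes(word, min_infix_len):
                index[piece].add(doc_id)
    return index


def search_infix(index, query):
    """Return the sorted ids of documents with a keyword containing query."""
    # .get avoids creating empty entries for queries that were never indexed.
    return sorted(index.get(query.lower(), set()))


documents = {1: "substr search", 2: "full text search", 3: "suffix tree"}
infix_index = build_infix_index(documents, min_infix_len=2)
print(search_infix(infix_index, "ubst"))  # documents with a keyword containing "ubst"
print(search_infix(infix_index, "ear"))   # documents with a keyword containing "ear"
```

Lookups are a single dictionary access, but the number of index entries grows roughly with the square of the keyword length, which is why a small minimum infix length makes the index much larger.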
0

Source: https://habr.com/ru/post/893654/
