SQLite: efficient substring search in a large table

I am developing an Android application that should search for a substring in a large table (about 500,000 entries with street and place names, so only a few words per entry).

CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB) 

Note that only 20% of all records have a value in the "name" column.

The following query takes almost 2 minutes to complete:

 SELECT elementID, name FROM Elements WHERE name LIKE '%foo%'

I then tried FTS3 to speed up the query. It was fairly successful: the query time dropped to 1 minute (and, surprisingly, the database file grew by only 5%, which also suits my purpose).

The problem is that FTS3 does not seem to support substring search, i.e. if I want to find the “bar” in “foo bar” and “foobar”, I get only “foo bar”, although I need both results.
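The token-versus-substring behaviour described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module purely for illustration and assumes the underlying SQLite build has FTS3 enabled; the table name `names` and both helper functions are hypothetical.

```python
import sqlite3

# In-memory database with a tiny FTS3 table (assumes FTS3 is compiled in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE names USING fts3(name)")
conn.executemany("INSERT INTO names(name) VALUES (?)",
                 [("foo bar",), ("foobar",)])

def match(query):
    """Full-text search: matches whole tokens (or token prefixes with *)."""
    rows = conn.execute("SELECT name FROM names WHERE name MATCH ?", (query,))
    return sorted(row[0] for row in rows)

def like(needle):
    """Plain substring search with LIKE: scans every row."""
    rows = conn.execute("SELECT name FROM names WHERE name LIKE ?",
                        ("%" + needle + "%",))
    return sorted(row[0] for row in rows)

print(match("bar"))   # token match: only the row containing the word "bar"
print(match("foo*"))  # prefix match: tokens starting with "foo"
print(like("bar"))    # substring match: both rows
```

FTS3 offers token and prefix matching out of the box, while LIKE finds true substrings at the cost of scanning every row.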

So, I have two questions:

  • Is further query acceleration possible? My goal is 30 seconds per query, but I don't know if this is realistic...

  • How can I get a real substring search using FTS3?

+6
4 answers

Solution 1: If you can store each character in your database as a separate word, you can use phrase queries to search for a substring.

For example, suppose that my_table contains a single column, person:

 person
 ------
 John Doe
 Jane Doe

you can change it to

 person ------ J ohn D oe J ane D oe 

To search for the substring "ohn" use the phrase query:

 SELECT * FROM my_table WHERE person MATCH '"ohn"' 

Beware that “JohnD” matches “John Doe,” which may be undesirable. To fix this, change the space character in the source line to something else.

For example, you can replace the space character "$":

 person ------ J ohn $ D oe J ane $ D oe 
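A minimal sketch of Solution 1, written with Python's built-in sqlite3 module for illustration and assuming an SQLite build with FTS3 enabled. The helpers `explode`, `substring_query` and `search` are hypothetical names. One caveat: as far as I can tell from the FTS3 documentation, the default "simple" tokenizer discards ASCII punctuation such as "$", so this sketch uses the non-ASCII character "§" as the space marker instead, which that tokenizer should keep as a token.

```python
import sqlite3

# Hypothetical marker that stands in for a real space. It is non-ASCII on
# purpose, on the assumption that the "simple" tokenizer keeps such characters.
SPACE_MARKER = "§"

def explode(text):
    """Turn 'John Doe' into 'J o h n § D o e' (one token per character)."""
    return " ".join(SPACE_MARKER if ch.isspace() else ch for ch in text)

def substring_query(needle):
    """Build an FTS phrase query that matches `needle` as a substring."""
    # Double quotes would break the phrase syntax, so drop them first.
    return '"' + explode(needle.replace('"', "")) + '"'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE my_table USING fts3(person)")
conn.executemany("INSERT INTO my_table(person) VALUES (?)",
                 [(explode(name),) for name in ("John Doe", "Jane Doe")])

def search(needle):
    """Return the original (un-exploded) names containing `needle`."""
    rows = conn.execute("SELECT person FROM my_table WHERE person MATCH ?",
                        (substring_query(needle),))
    # Undo explode(): remove the separators, then restore real spaces.
    return [row[0].replace(" ", "").replace(SPACE_MARKER, " ") for row in rows]

print(search("ohn"))    # substring inside a word
print(search("JohnD"))  # no match, because the marker sits between "n" and "D"
print(search("oe"))     # substring shared by both rows
```

The stored text roughly doubles in length, which is the size cost that Solution 2 tries to avoid.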

Solution 2: Following the idea of solution 1, you can make each character a separate word using a custom tokenizer and use phrase queries to query substrings.

The advantage over solution 1 is that you do not need to add spaces to your data, which can unnecessarily increase the size of the database.

The downside is that you have to implement a custom tokenizer. Fortunately, I have one for you. The code is in C, so you will need to figure out how to integrate it with your Java code.

+9

You should add an index to the name column in your database, which should speed up the query significantly.
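Whether such an index actually helps a given query can be checked with EXPLAIN QUERY PLAN. The sketch below uses Python's built-in sqlite3 module for illustration; the index name `idx_elements_name` and the `plan` helper are hypothetical. On the SQLite versions I am aware of, an exact lookup is reported as an index search, while a pattern with a leading wildcard is still reported as a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Elements "
             "(elementID INTEGER, type INTEGER, name TEXT, data BLOB)")
# Hypothetical index name, created as the answer suggests.
conn.execute("CREATE INDEX idx_elements_name ON Elements(name)")

def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]

exact = plan("SELECT elementID, name FROM Elements WHERE name = 'foo'")
substring = plan("SELECT elementID, name FROM Elements WHERE name LIKE '%foo%'")

print(exact)      # plan for the exact lookup
print(substring)  # plan for the substring pattern
```

The exact wording of the plan differs between SQLite versions, so it is worth running this on the target device before relying on the index.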

I believe SQLite3 supports substring matching like this:

 SELECT * FROM Elements WHERE name MATCH '*foo*'; 

http://www.sqlite.org/fts3.html#section_3

+3

I'm not sure how to speed it up, since you are using SQLite, but to do the substring search I have done things like

 SET @foo_bar = 'foo bar'
 SELECT * FROM table WHERE name LIKE '%' + REPLACE(@foo_bar, ' ', '%') + '%'

Of course, this only returns records that have the word "foo" before the word "bar".
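The snippet above is written in a T-SQL style. A rough SQLite equivalent, sketched here with Python's built-in sqlite3 module, would use the `||` concatenation operator and a bound parameter instead of a variable. The sample rows and the helper name `words_in_order` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Elements "
             "(elementID INTEGER, type INTEGER, name TEXT, data BLOB)")
conn.executemany("INSERT INTO Elements (elementID, name) VALUES (?, ?)",
                 [(1, "foo bar"), (2, "foo and bar"),
                  (3, "bar foo"), (4, "foobar")])

def words_in_order(phrase):
    """Find names containing the words of `phrase` in the given order."""
    # 'foo bar' becomes the LIKE pattern '%foo%bar%'.
    sql = ("SELECT elementID, name FROM Elements "
           "WHERE name LIKE '%' || REPLACE(?, ' ', '%') || '%' "
           "ORDER BY elementID")
    return conn.execute(sql, (phrase,)).fetchall()

print(words_in_order("foo bar"))
```

Because "%" also matches an empty string, "foobar" is returned as well, while "bar foo" is not. This is still a full scan, so it addresses the matching question but not the speed question.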

-1

I am facing a problem similar to yours. My suggestion is to try creating a translation table that maps every word to a number, and then search for the numbers instead of the words.

Please let me know if this helps.

-1

Source: https://habr.com/ru/post/919701/

