How do full-text indexes (or caches) work?

I wonder how full-text search systems are implemented so that they can query millions of records very quickly. Please note: I'm not talking about systems that tokenize the content by splitting it at whitespace, but about systems that can match even parts from the middle of tokens (which is the real challenge).

Background information
I experimented with a home-grown string cacher (written in Java) that can look up strings given a substring as the query. The substring is not required to be at the beginning of the strings that are retrieved.

It works on a huge array of strings. The cache itself is a TreeMap<Character,TreeSet<String>>.

Adding a record
For each unique character in the string to be added:
get the set for this character and add the string to it.

Example: "test" is first split into its unique characters "t", "e", "s".
Then we fetch the sets for these three keys and add "test" to each of them.
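
The insertion step could be sketched roughly as follows. This is a minimal illustration, not the actual cacher: the class name CharIndex and the field name sets are made up here, and only the TreeMap<Character,TreeSet<String>> layout comes from the description above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical name; only the map layout is taken from the question.
class CharIndex {
    // character -> all stored strings that contain that character
    final Map<Character, TreeSet<String>> sets = new TreeMap<>();

    void add(String s) {
        for (int i = 0; i < s.length(); i++) {
            // A TreeSet ignores duplicates, so a repeated character
            // (the second 't' in "test") adds nothing new.
            sets.computeIfAbsent(s.charAt(i), c -> new TreeSet<>()).add(s);
        }
    }

    public static void main(String[] args) {
        CharIndex index = new CharIndex();
        index.add("test");
        index.add("set");
        // prints {e=[set, test], s=[set, test], t=[set, test]}
        System.out.println(index.sets);
    }
}
```

Note that every string is stored once per distinct character it contains, so memory use grows with the alphabet coverage of the data.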

Querying
A query is performed by splitting the query into its unique characters, fetching the Set<String> for each character, building the intersection of all those sets, and finally filtering the intersection with contains() to make sure the characters appear in the order given by the query.
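
A self-contained sketch of the whole scheme, insertion plus query, might look like the following. The class name SubstringCache and the method names are assumptions for illustration, and returning an empty result for an empty query is a choice made here, not something stated in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical name; a sketch of the described scheme, not the original code.
class SubstringCache {
    private final Map<Character, TreeSet<String>> sets = new TreeMap<>();

    void add(String s) {
        for (char c : s.toCharArray()) {
            sets.computeIfAbsent(c, k -> new TreeSet<>()).add(s);
        }
    }

    List<String> query(String q) {
        TreeSet<String> candidates = null;
        for (char c : q.toCharArray()) {
            TreeSet<String> set = sets.get(c);
            if (set == null) {
                return new ArrayList<>(); // character never indexed: no match
            }
            if (candidates == null) {
                candidates = new TreeSet<>(set); // copy, so the index stays intact
            } else {
                candidates.retainAll(set);       // intersect with the next set
            }
        }
        List<String> hits = new ArrayList<>();
        if (candidates == null) {
            return hits; // empty query: treated as "no match" in this sketch
        }
        // The intersection only proves that all characters occur somewhere;
        // contains() checks that they occur contiguously and in order.
        for (String s : candidates) {
            if (s.contains(q)) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SubstringCache cache = new SubstringCache();
        cache.add("test");
        cache.add("street");
        cache.add("tets");
        System.out.println(cache.query("es")); // prints [test]
        System.out.println(cache.query("st")); // prints [street, test]
    }
}
```

All three sample strings survive the intersection for "es", which shows why the final contains() pass is needed and why it can dominate the running time when the query consists of very common characters.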

Benchmark
3GHz 2'000'000 10, .
100. : : 0,4 , : 0,5 , : 0,6 .
1,5 .

+3
3

- ( ).

, , . , 32- ints, 4 .

ps: , , Burrows-Wheeler (1 character per character), .

+1

​​, -, n-gram, , 3 . n-, , "" "hel", "lo". n- , . ( trie , ). n- , , n-, . n-. , . , , $.

+1

. , . , , . , .

( ) , . , , , , - .

0

Source: https://habr.com/ru/post/1710721/

