I wonder how full-text search systems are implemented so that they can query millions of records very quickly. Please note: I'm not talking about systems that tokenize content by splitting it on whitespace, but about systems that can match even parts from the middle of tokens (which is a real challenge).
Background information
I experimented with a home-grown string cacher (using Java) that can search for strings given a substring as the query. The substring is not required to be at the beginning of the strings that are potentially retrieved.
It works with a huge array of strings. Caching is done with
TreeMap<Character,TreeSet<String>>.
Adding a record
For each unique character in the string to be added:
get the set for that character and add the string to it.
Example: "test" is first split into its unique characters "t", "e", "s".
Then we retrieve the sets for these three keys and add "test" to each of them.
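The insertion step above could be sketched roughly as follows. This is only an illustration, not my actual code: the class name CharIndexSketch and the method name add are made up for the example, and only the TreeMap<Character, TreeSet<String>> layout comes from the description.

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name; only the map layout is taken from the post.
class CharIndexSketch {
    final TreeMap<Character, TreeSet<String>> index = new TreeMap<>();

    // Registers the string under every distinct character it contains.
    // A repeated character (the second 't' in "test") simply re-adds the
    // string to the same set, which has no effect.
    void add(String s) {
        for (int i = 0; i < s.length(); i++) {
            index.computeIfAbsent(s.charAt(i), k -> new TreeSet<>()).add(s);
        }
    }

    public static void main(String[] args) {
        CharIndexSketch cache = new CharIndexSketch();
        cache.add("test");
        System.out.println(cache.index.keySet()); // prints [e, s, t]
    }
}
```

Every string ends up in as many sets as it has distinct characters, so the sets for common characters hold a large share of all strings.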
Querying
A query is performed by splitting the query into its unique characters, retrieving the Set<String> for each character, building the intersection of all these sets, and finally scanning the intersection with contains() to make sure the characters appear in the same order as in the query.
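A minimal sketch of this query path, again under assumptions: the names SubstringSearchSketch, add and query are hypothetical, and returning an empty list for an empty query is a choice made only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name; the map layout follows the description above.
class SubstringSearchSketch {
    final TreeMap<Character, TreeSet<String>> index = new TreeMap<>();

    // Registers the string under every distinct character it contains.
    void add(String s) {
        for (int i = 0; i < s.length(); i++) {
            index.computeIfAbsent(s.charAt(i), k -> new TreeSet<>()).add(s);
        }
    }

    // Returns all indexed strings that contain q as a substring.
    List<String> query(String q) {
        TreeSet<String> candidates = null;
        for (int i = 0; i < q.length(); i++) {
            TreeSet<String> bucket = index.get(q.charAt(i));
            if (bucket == null) {
                return Collections.emptyList(); // character never indexed
            }
            if (candidates == null) {
                candidates = new TreeSet<>(bucket); // copy, keep the index intact
            } else {
                candidates.retainAll(bucket); // set intersection
            }
        }
        List<String> hits = new ArrayList<>();
        if (candidates == null) {
            return hits; // empty query: assumed to match nothing
        }
        // The intersection only guarantees that all characters occur;
        // contains() checks that they occur contiguously and in order.
        for (String c : candidates) {
            if (c.contains(q)) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SubstringSearchSketch cache = new SubstringSearchSketch();
        for (String s : new String[] {"test", "street", "taste", "best"}) {
            cache.add(s);
        }
        System.out.println(cache.query("es")); // prints [best, test]
    }
}
```

In this example all four strings survive the intersection for "es", and only the final contains() pass removes "street" and "taste". That final scan is where the cost goes when the query consists of frequent characters.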
Benchmark
3 GHz, 2'000'000 […], 10 […].
100 […]: […]: 0.4 […], […]: 0.5 […], […]: 0.6 […].
1.5 […].