Can I use a full-text index to find the closest matching words? What does Statistical Semantics do in full-text indexing?

I am looking at SQL Server 2016 full-text indexes, and they are great for searching for strings that contain multiple words.

When I try to create a full-text index, it shows Statistical Semantics as a checkbox. What does Statistical Semantics do?

Also, I want to implement "did you mean" queries.

For example, let's say I have an entry like house, and the user enters hause.

Can I use the full-text index to return house as the closest match for hause, and efficiently show the user "did you mean house"? Thanks.

I tried SOUNDEX, but the results it generates are terrible. It returns far too many unrelated words.

Since there are so many records in my database and I need very fast results, I need something that SQL Server supports natively.

Any ideas? Any way to achieve such a thing using indexes?

I know that there are several algorithms, but they are not efficient enough for online use. I mean calculating the edit distance against every record. That can work for offline projects, but I need this to be efficient in an online dictionary, where there will always be thousands of queries.

I already have a plan: store the terms that were not found in the database, calculate their closest matches offline, and use those results as a cache. However, I wonder whether there is any possible online solution. Assume there will be over 100 million nvarchar entries.
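The plan described above can be sketched roughly as follows. This is only an in-memory illustration in Python, not SQL Server code: the class name `SuggestionCache`, its methods, and the `find_closest` callback are all hypothetical, and in practice the two containers would be database tables.

```python
class SuggestionCache:
    """Hypothetical in-memory stand-in for the 'cache of misses' tables."""

    def __init__(self):
        self.suggestions = {}  # misspelled term -> closest known word
        self.pending = set()   # terms not found yet, waiting for the offline job

    def lookup(self, term):
        """Return a cached suggestion, or None after queueing the term."""
        if term in self.suggestions:
            return self.suggestions[term]
        self.pending.add(term)
        return None

    def run_offline_batch(self, find_closest):
        """Resolve every queued term with a slow matcher supplied by the caller."""
        for term in list(self.pending):
            self.suggestions[term] = find_closest(term)
        self.pending.clear()
```

With this shape, the first request for an unknown term gets no suggestion, and later requests are answered from the cache once the batch job has run.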

1 answer

The short answer is no: Full-Text Search cannot find words that are similar but different.

Full-text search uses stemmers and thesaurus files:

A stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").

A full-text search thesaurus defines a set of synonyms for a specific language.

Both the stemmers and the thesaurus are configurable, and you can easily make Full-Text Search match house for hause, but only if you add hause as a synonym for house. That is obviously not a solution, since it would require you to add every possible typo as a synonym...

Semantic search is a different topic: it lets you search for documents that are semantically close to a given example. That is what the Statistical Semantics option enables, and it does not correct spelling.

You want to find entries that have a short Levenshtein distance from a given word (so-called "fuzzy" search). I do not know of any technique for creating an index that can answer Levenshtein searches. If you are willing to scan the entire table for each term, there are T-SQL and CLR implementations of Levenshtein distance.
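To make the cost of the scan approach concrete, here is a minimal sketch of Levenshtein distance and a full scan over a word list. It is written in Python for illustration only and is not one of the T-SQL or CLR versions mentioned above; the function names and the `max_distance` parameter are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def closest_matches(term, words, max_distance=1):
    """Full scan: score every word, keep those within max_distance."""
    scored = ((levenshtein(term, w), w) for w in words)
    return sorted((d, w) for d, w in scored if d <= max_distance)
```

Every lookup compares the term against every stored word, which is why this does not scale to 100 million rows under heavy query load.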


Source: https://habr.com/ru/post/1265775/

