SQL - similarity between two strings of variable length

I have a SQL Server product table, and each product has a description that is shown on our website. I want to prevent, or at least warn, our users when a description is too similar to another product's description. The length of each product description can vary greatly.

I would like to query for products whose descriptions share duplicate or similar paragraphs / blocks of text, i.e. string A has a lot of unique content but also contains a paragraph that is similar or identical to one in string B. However, I'm not sure which similarity algorithm is best to use:

Fuzzy hashing sounds like what I'm looking for, but I'm not only looking for duplicate content with subtle differences. I'm also looking for duplicate content with subtle differences nested inside an otherwise unique block of text. I also have no idea how to implement fuzzy hashing in SQL. SOUNDEX() and DIFFERENCE() seem to use fuzzy hashing, but they are too inaccurate for my use.
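To make the "similar paragraph nested in unique text" requirement concrete, here is a minimal sketch in Python (not T-SQL) that compares descriptions paragraph by paragraph using word shingles and Jaccard similarity. This swaps in shingling for fuzzy hashing; it is one possible approach, not a recommendation from the thread. The function names, the shingle size k=3, and the assumption that paragraphs are separated by blank lines are all hypothetical choices for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (lowercased) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short text: treat the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_paragraphs(description):
    """Split on blank lines (an assumption about how descriptions are stored)."""
    return [p for p in re.split(r"\n\s*\n", description) if p.strip()]


def best_paragraph_match(desc_a, desc_b, k=3):
    """Highest similarity between any paragraph of desc_a and any of desc_b."""
    best = 0.0
    for pa in split_paragraphs(desc_a):
        sa = shingles(pa, k)
        for pb in split_paragraphs(desc_b):
            best = max(best, jaccard(sa, shingles(pb, k)))
    return best
```

Because the score is the best match over paragraph pairs, a single shared paragraph yields a high score even when the rest of both descriptions is unique. A threshold for "too similar" would have to be tuned on real data.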

Ideally, the SQL similarity function would be fast, but I could store cached values in another table and schedule a job to update them periodically.

What is the best algorithm / SQL (or CLR) implementation to accomplish this?

1 answer

I recently had to join on group names using fuzzy string matching.
I tried about 40 different algorithms, but none of them was good enough, even though the group names differed only by a few spelling errors, missing spaces, and the occasional addition of _mLF at the end.

So, if you are trying to do something like this, I strongly recommend that you stop right now and send the data (in my case an Excel file) back to the users to be fixed at the source, where that work belongs.

If you are really interested in comparing strings, this link may be exactly what you need:
http://anastasiosyal.com/POST/2009/01/11/18.ASPX

I found that the Jaro-Winkler function gave the best results in my case, but you should check it against your own data.
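For reference, a minimal Python sketch of the standard Jaro-Winkler similarity follows. It is not the implementation from the linked article; the function names are hypothetical, and the prefix scale p=0.1 and maximum prefix length of 4 are the commonly used defaults, assumed here.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

Note that the prefix boost rewards agreement at the start of the strings, so a difference at the end of a name is penalized less than a difference at the beginning. The measure is designed for short strings such as names, so it is unlikely to suit long descriptions with shared paragraphs.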


Source: https://habr.com/ru/post/953201/

