I have a SQL Server product table, and each product has a description that is shown on our website. I want to prevent, or at least warn, our users when a description is too similar to another product's description. The length of the descriptions varies greatly.
I would like to query for products whose descriptions share duplicate or similar paragraphs / blocks of text, i.e. row A has a lot of unique content but also contains a paragraph that is similar or identical to one in row B. However, I'm not sure which similarity algorithm is best to use.
Fuzzy hashing sounds like what I'm looking for, but I'm not only looking for duplicate content with subtle differences. I am also looking for duplicate content with subtle differences that is nested inside an otherwise unique block of text. I also have no idea how to implement fuzzy hashes in SQL. SOUNDEX() and DIFFERENCE() seem to use a form of fuzzy hashing, but they are far too inaccurate for my use.
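To make the requirement concrete, here is a rough sketch of the kind of comparison I have in mind, written in Python rather than SQL. It assumes paragraphs are separated by blank lines and compares them using 3-word shingles and Jaccard similarity. The function names, the shingle size, and the sample descriptions are all made up for illustration, and I am not claiming this is the right algorithm.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def paragraphs(description):
    """Split a description into non-empty paragraphs on blank lines (an assumption)."""
    return [p.strip() for p in re.split(r"\n\s*\n", description) if p.strip()]

def best_paragraph_match(desc_a, desc_b, k=3):
    """Highest paragraph-to-paragraph similarity between two descriptions."""
    best = 0.0
    for pa in paragraphs(desc_a):
        sa = shingles(pa, k)
        for pb in paragraphs(desc_b):
            best = max(best, jaccard(sa, shingles(pb, k)))
    return best

if __name__ == "__main__":
    # Hypothetical descriptions: unique first paragraph, near-duplicate second one.
    a = ("Hand-made oak table with a natural oil finish.\n\n"
         "Ships within five business days from our warehouse in Ohio.")
    b = ("Lightweight aluminium chair for outdoor use.\n\n"
         "Ships within 5 business days from our warehouse in Ohio.")
    print("whole description:", jaccard(shingles(a), shingles(b)))
    print("best paragraph:   ", best_paragraph_match(a, b))
```

Comparing whole descriptions dilutes the score because of the unique text around the shared paragraph, while the paragraph-level score stays high. That is the behaviour I am after, ideally without comparing every pair of rows at query time.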
Ideally, the SQL similarity function would be fast, but I could also store cached values in another table and schedule a job that updates them periodically.
What is the best algorithm / SQL (or CLR) implementation to accomplish this?