Similarity Between Rows - SQL Server 2005

Question

I am looking for an easy way (a UDF?) to establish the similarity between strings in different rows. The SOUNDEX and DIFFERENCE functions do not seem to do the job.

The similarity should be based on the number of characters the strings have in common (order matters).

For instance:

Spiruroidea sp. AM-2008

and

Spiruroidea gen. sp AM-2008

should be recognized as similar.

Any pointers would be much appreciated.

Thanks.

Christian

+4

sql sql-server-2005 user-defined-functions

cs0815 Apr 12 '10 at 11:46

2 answers

These things are not trivial, and you should provide more examples.

As Daniel already mentioned, Levenshtein distance is the way to go, but for your example you can pre-process the strings if you know that certain words can safely be dropped. For instance, your example suggests that the word "gen." can be dropped.

Levenshtein distance will treat any four-letter word in place of "gen." as being just as close as "gen." itself, which may not be what you want.
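A minimal sketch of that pre-processing step in T-SQL, for illustration only. The function name dbo.NormalizeTaxonName is hypothetical, and the list of tokens to strip ("gen." and stray periods) is an assumption taken from the single example in the question; a real list would have to come from your own data.

```sql
-- Hypothetical helper: strips tokens assumed to be noise before comparison.
-- The token list (' gen.' and '.') is an assumption based on the question's example.
CREATE FUNCTION dbo.NormalizeTaxonName (@name NVARCHAR(255))
RETURNS NVARCHAR(255)
AS
BEGIN
    -- Remove the token ' gen.' first, then any remaining periods.
    RETURN REPLACE(REPLACE(@name, N' gen.', N''), N'.', N'');
END
GO

SELECT dbo.NormalizeTaxonName(N'Spiruroidea sp. AM-2008');      -- Spiruroidea sp AM-2008
SELECT dbo.NormalizeTaxonName(N'Spiruroidea gen. sp AM-2008');  -- Spiruroidea sp AM-2008
```

After normalization the two example strings are identical, so any distance measure applied afterwards would report them as an exact match.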

In addition, if your dataset comes from different data sources, you might consider building a synonym dictionary and looking into existing standard taxonomies for your domain. Perhaps this one, for example?

+1

Unreason Apr 12 '10 at 12:15

Daniel Vassallo · Accepted Answer · 2010-04-12T11:50:23+0000

You might want to consider implementing the Levenshtein distance algorithm as a UDF that returns the number of operations that must be performed on string A to turn it into string B. This is often called the edit distance.

You can then compare the result of the Levenshtein distance function against a fixed threshold, or against a percentage of the length of string A or string B.

You would simply use it as follows:

 WHERE dbo.LEVENSHTEIN(Field_A, Field_B) < 4;
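The percentage-of-length variant could look roughly like the query below. This is a sketch under assumptions: a scalar UDF named dbo.LEVENSHTEIN is taken to exist already, dbo.Taxa(Id, Name) is a hypothetical table, and the 0.25 cut-off is an arbitrary illustrative value, not a recommendation.

```sql
-- Assumptions: dbo.LEVENSHTEIN(a, b) already exists as a scalar UDF;
-- dbo.Taxa(Id, Name) is a hypothetical table; 0.25 is an arbitrary cut-off.
SELECT a.Name AS NameA,
       b.Name AS NameB,
       dbo.LEVENSHTEIN(a.Name, b.Name) AS Distance
FROM dbo.Taxa AS a
JOIN dbo.Taxa AS b
  ON a.Id < b.Id                      -- visit each unordered pair once
WHERE dbo.LEVENSHTEIN(a.Name, b.Name) * 1.0   -- * 1.0 forces decimal division
      / NULLIF(CASE WHEN LEN(a.Name) > LEN(b.Name)
                    THEN LEN(a.Name) ELSE LEN(b.Name) END, 0)
      <= 0.25;                        -- distance as a share of the longer string
```

For the two names in the question the edit distance is 6 and the longer string has 27 characters, so the ratio is about 0.22. A relative cut-off of 0.25 would accept the pair, whereas a fixed threshold of "< 4" would reject it. Note that the self-join compares every pair of rows, which grows quadratically with table size.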

You might want to check out the following Levenshtein distance implementation for SQL Server:

Levenshtein distance algorithm: TSQL implementation
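In case the link goes stale, here is a rough sketch of what such a UDF might look like. It is not the linked implementation, just the textbook dynamic-programming algorithm written so that it should run on SQL Server 2005 (no DECLARE initializers). The name dbo.LEVENSHTEIN and the NVARCHAR(255) parameter size are assumptions. Characters are compared using the database collation, so the comparison is case-insensitive under most default collations, and LEN ignores trailing spaces.

```sql
-- Sketch only: dbo.LEVENSHTEIN and the 255-character limit are assumptions.
CREATE FUNCTION dbo.LEVENSHTEIN (@s NVARCHAR(255), @t NVARCHAR(255))
RETURNS INT
AS
BEGIN
    DECLARE @sLen INT, @tLen INT, @i INT, @j INT;
    DECLARE @diag INT, @above INT, @left INT, @cur INT;
    -- One row of the dynamic-programming matrix: d is the distance for column j.
    DECLARE @row TABLE (j INT PRIMARY KEY, d INT NOT NULL);

    IF @s IS NULL OR @t IS NULL RETURN NULL;

    SET @sLen = LEN(@s);
    SET @tLen = LEN(@t);
    IF @sLen = 0 RETURN @tLen;
    IF @tLen = 0 RETURN @sLen;

    -- Row 0: building the first j characters of @t from nothing costs j inserts.
    SET @j = 0;
    WHILE @j <= @tLen
    BEGIN
        INSERT INTO @row (j, d) VALUES (@j, @j);
        SET @j = @j + 1;
    END

    SET @i = 1;
    WHILE @i <= @sLen
    BEGIN
        SET @diag = @i - 1;   -- d[i-1][0]
        SET @left = @i;       -- d[i][0]
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            SELECT @above = d FROM @row WHERE j = @j;   -- d[i-1][j]
            -- Substitution (free when the two characters match).
            SET @cur = @diag + CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                                    THEN 0 ELSE 1 END;
            IF @above + 1 < @cur SET @cur = @above + 1; -- deletion from @s
            IF @left + 1 < @cur SET @cur = @left + 1;   -- insertion into @s
            UPDATE @row SET d = @cur WHERE j = @j;      -- store d[i][j]
            SET @diag = @above;
            SET @left = @cur;
            SET @j = @j + 1;
        END
        SET @i = @i + 1;
    END

    SELECT @cur = d FROM @row WHERE j = @tLen;
    RETURN @cur;
END
GO

SELECT dbo.LEVENSHTEIN(N'Spiruroidea sp. AM-2008', N'Spiruroidea gen. sp AM-2008');  -- 6
```

The table variable keeps the code readable but makes it slow, since every cell costs a SELECT and an UPDATE. For large tables a string-packed row or a CLR function (which SQL Server 2005 supports) would likely perform much better.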
