Word Matching in SQL Server

I have a requirement to suggest matches between the data in two database tables. The basic requirement: a “match” should be suggested for the highest number of matching words (regardless of order) between the two columns in question.

For example, given the data;

    Table A                            Table B
    1,'What other text in here'        5,'Other text in here'
    2,'What am I doing here'           6,'I am doing what here'
    3,'I need to find another job'     7,'Purple unicorns'
    4,'Other text in here'             8,'What are you doing in here'

Ideally, my desired matches would look as follows:

    1 -> 8 (3 words matched)
    2 -> 6 (5 words matched)
    3 -> Nothing
    4 -> 5 (4 words matched)

I found some word-counting functions that look promising, but I can't work out how to use them in a SQL statement that gives me the desired matches. Also, the function in question is not quite what I need, since it uses charindex, which I think looks for a word within a word (i.e., "in" will match "bin").
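The substring concern can be checked with a small sketch. The sentence and word below are made up for illustration, and the space-padding trick is only one possible way to approximate whole-word matching; it assumes words are separated by single spaces and does nothing about punctuation.

```sql
-- Illustrative sentence and word only; they are not taken from the tables above.
DECLARE @sentence VARCHAR(100) = 'The bin is full';
DECLARE @word     VARCHAR(20)  = 'in';

SELECT
    -- Plain substring search: non-zero, because 'in' occurs inside 'bin'.
    CHARINDEX(@word, @sentence) AS substring_hit,
    -- Space-padded search: 0, because 'in' never appears as a standalone word.
    CHARINDEX(' ' + @word + ' ', ' ' + @sentence + ' ') AS whole_word_hit;
```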

Can anyone help me with this?

Thanks.

1 answer

I used sys.dm_fts_parser below to split the sentences into words. There are plenty of T-SQL split functions around if you are not on SQL Server 2008 or find that this is not suitable for some reason.
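As one example of such an alternative, the sketch below uses the built-in STRING_SPLIT function. This is an assumption on top of the answer: it requires SQL Server 2016 or later (database compatibility level 130 or higher), and unlike the full-text parser it only splits on a single separator character, so punctuation stays attached to the words.

```sql
-- Assumes SQL Server 2016+ (compatibility level 130 or higher).
-- Only two sample rows are used here to keep the sketch short.
;WITH A(Id, sentence) AS (
    SELECT 1, 'What other text in here' UNION ALL
    SELECT 2, 'What am I doing here'
)
SELECT A.Id           AS A_Id,
       LOWER(s.value) AS word      -- LOWER is redundant under a case-insensitive collation
FROM A
CROSS APPLY STRING_SPLIT(A.sentence, ' ') AS s
WHERE s.value <> '';               -- skip empty tokens produced by repeated spaces
```

The output has one row per word, which is the same shape as the display_term column returned by sys.dm_fts_parser.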

As for the requirement that each A.id can only be paired with a B.id that has not been used before, and vice versa, I could not think of an efficient set-based solution, so the pairing is done in a loop.
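To see why a plain per-row ranking does not satisfy that requirement, here is a minimal sketch. The match counts are worked out by hand from the sample data in the question, and the CTE names are hypothetical.

```sql
-- Hand-computed word-match counts for A_Id 1 and 4 against B_Id 5 and 8.
;WITH Joined(A_Id, B_Id, Cnt) AS (
    SELECT 1, 5, 4 UNION ALL   -- other, text, in, here
    SELECT 1, 8, 3 UNION ALL   -- what, in, here
    SELECT 4, 5, 4 UNION ALL   -- other, text, in, here
    SELECT 4, 8, 2             -- in, here
),
Ranked AS (
    SELECT A_Id, B_Id, Cnt,
           ROW_NUMBER() OVER (PARTITION BY A_Id ORDER BY Cnt DESC, B_Id) AS rn
    FROM Joined
)
-- Best B per A, ignoring whether that B has already been taken.
SELECT A_Id, B_Id, Cnt
FROM Ranked
WHERE rn = 1;
```

Both A_Id 1 and A_Id 4 come back paired with B_Id 5, whereas the desired result gives B_Id 5 to A_Id 4 and leaves A_Id 1 with B_Id 8. Each pick has to remove its A and B from the remaining candidates.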

    ;WITH A(Id, sentence) AS
    (
        SELECT 1, 'What other text in here' UNION ALL
        SELECT 2, 'What am I doing here' UNION ALL
        SELECT 3, 'I need to find another job' UNION ALL
        SELECT 4, 'Other text in here'
    ),
    B(Id, sentence) AS
    (
        SELECT 5, 'Other text in here' UNION ALL
        SELECT 6, 'I am doing what here' UNION ALL
        SELECT 7, 'Purple unicorns' UNION ALL
        SELECT 8, 'What are you doing in here'
    ),
    A_Split AS
    (
        SELECT Id AS A_Id,
               display_term,
               COUNT(*) OVER (PARTITION BY Id) AS A_Cnt
        FROM A
        CROSS APPLY sys.dm_fts_parser('"' + REPLACE(sentence, '"', '""') + '"', 1033, 0, 0)
    ),
    B_Split AS
    (
        SELECT Id AS B_Id,
               display_term,
               COUNT(*) OVER (PARTITION BY Id) AS B_Cnt
        FROM B
        CROSS APPLY sys.dm_fts_parser('"' + REPLACE(sentence, '"', '""') + '"', 1033, 0, 0)
    ),
    Joined AS
    (
        SELECT A_Id,
               B_Id,
               B_Cnt,
               Cnt = COUNT(*),
               CAST(COUNT(*) AS FLOAT) / B_Cnt AS PctMatchBToA,
               CAST(COUNT(*) AS FLOAT) / A_Cnt AS PctMatchAToB
        FROM A_Split A
        JOIN B_Split B ON A.display_term = B.display_term
        GROUP BY A_Id, B_Id, B_Cnt, A_Cnt
    )
    SELECT IDENTITY(int, 1, 1) AS id, *
    INTO #IntermediateResults
    FROM Joined
    ORDER BY PctMatchBToA DESC, PctMatchAToB DESC

    DECLARE @A_Id INT, @B_Id INT, @Cnt INT

    DECLARE @Results TABLE
    (
        A_Id INT,
        B_Id INT,
        Cnt  INT
    )

    SELECT TOP(1) @A_Id = A_Id, @B_Id = B_Id, @Cnt = Cnt
    FROM #IntermediateResults
    ORDER BY id

    WHILE ( @@ROWCOUNT > 0 )
    BEGIN
        INSERT INTO @Results
        SELECT @A_Id, @B_Id, @Cnt

        DELETE FROM #IntermediateResults
        WHERE A_Id = @A_Id OR B_Id = @B_Id

        SELECT TOP(1) @A_Id = A_Id, @B_Id = B_Id, @Cnt = Cnt
        FROM #IntermediateResults
        ORDER BY id
    END

    DROP TABLE #IntermediateResults

    SELECT *
    FROM @Results
    ORDER BY A_Id

Returns

    A_Id        B_Id        Cnt
    ----------- ----------- -----------
    1           8           3
    2           6           5
    4           5           4

Source: https://habr.com/ru/post/891452/

