T-SQL query on huge table is slow depending on join conditions

We have a huge table of companies (17 million rows) in which we want to find duplicates according to a search criterion (here, the phone number). The query is very slow (more than 5 minutes).

Here is a simplified version of the query; the problem is the same:

SELECT C1.*
FROM dbo.Company AS C1 WITH(NOLOCK)
INNER JOIN dbo.Company AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company 
                         AND (C1.iId_third_party_id IS NULL OR 
                              C2.iId_third_party_id IS NULL)

Column Explanation:

  • iId_company: primary key, integer auto-increment
  • sTelephone: company phone number, varchar with a non-clustered index on it
  • iId_third_party_id: third-party ID, integer, also with a non-clustered index on it. It may be NULL when users insert new companies themselves (these are the rows we want to check for duplicates).

, (), , ( , .

, :

  • C1.iId_third_party_id IS NULL , 5 .
  • , (1 +), , , .

UNION, ( ), , strong > .

+3
7

, , - . . , SQL Server, , .

+2

, - ( 17mio. row ), :

  • ( SELECT C1. *!!)
  • " " .

SQL Server 2008, - (Common Table Expression - CTE). , (,!) , (, , !).

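WITH PhoneDuplicates AS
(SELECT c.sTelephone, COUNT(*) AS PhoneCount
   FROM dbo.Company AS c
   GROUP BY c.sTelephone      -- one row per phone number
   HAVING COUNT(*) > 1        -- keep only numbers used by more than one company
)
SELECT
  c.iId_company, c.sTelephone, c.iId_third_party_id  -- list only the columns you need
FROM
  dbo.Company AS c
INNER JOIN
  PhoneDuplicates AS PD ON PD.sTelephone = c.sTelephone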
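
The CTE above finds every shared phone number but does not apply the question's NULL condition on iId_third_party_id. A possible variant, not taken from the answer itself: COUNT(column) skips NULLs, so comparing it with COUNT(*) keeps only the groups that contain at least one row without a third-party ID. Treat it as a sketch to check against your own data rather than a drop-in replacement.

```sql
WITH PhoneDuplicates AS
(
    SELECT c.sTelephone
    FROM dbo.Company AS c
    GROUP BY c.sTelephone
    HAVING COUNT(*) > 1                                -- phone number shared by several companies
       AND COUNT(*) > COUNT(c.iId_third_party_id)      -- at least one row in the group has a NULL third-party ID
)
SELECT c.iId_company, c.sTelephone, c.iId_third_party_id
FROM dbo.Company AS c
INNER JOIN PhoneDuplicates AS PD ON PD.sTelephone = c.sTelephone;
```

In such a group every row has at least one partner for which "C1 is NULL or C2 is NULL" holds, so this should return the same companies as the original self-join, each of them once.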

+2

, ?

,

C1.iId_third_party_id IS NULL 

, SQL ( , ), .

(... OR C2.iId_third_party_id IS NULL)

, SQL , , .

, / ? , - marc_s ( ), .

, - . , , , .

+1

, , , . , 2 , .

, row_number:

;with cteDupes(RN, DupeID, DupeTelephone) as
(
SELECT  row_number() over(partition by sTelephone order by iId_company, sTelephone) RN,
        iId_company, sTelephone
FROM    dbo.Company 
WHERE   iId_third_party_id IS NULL
)
select * from cteDupes
where RN > 1

. , .

+1

-, ,

SELECT C1.*
-- A table hint cannot be attached to a derived table, so NOLOCK goes inside the subquery
FROM (select * from dbo.Company WITH(NOLOCK) where iId_third_party_id IS NULL) AS C1
INNER JOIN (select * from dbo.Company where iId_third_party_id IS NULL) AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company 

.

0

, (sTelephone, iId_third_party_id). ?

.

Off the top of my head, without seeing the plan, I would think about adding iId_third_party_id to the non-clustered index on sTelephone and, if the primary key is not clustered, adding iId_company to that index as well.
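
A minimal sketch of such an index. The index name is made up, and putting iId_company in INCLUDE rather than in the key is my assumption; the answer only says to add the columns.

```sql
-- Hypothetical index name; adjust to your naming convention.
CREATE NONCLUSTERED INDEX IX_Company_sTelephone_ThirdParty
    ON dbo.Company (sTelephone, iId_third_party_id)   -- join column first, then the NULL-check column
    INCLUDE (iId_company);                            -- only needed if the primary key is not the clustered index
```

If iId_company is the clustering key, SQL Server already carries it in every non-clustered index, so the INCLUDE clause can be dropped.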

Please note that the self-join also multiplies rows when more than two companies share a phone number: each company comes back once for every matching partner.
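
One way to avoid that multiplication, offered as a sketch rather than as part of the original answer, is to replace the join with EXISTS so that each company is returned at most once however many partners it has:

```sql
SELECT C1.iId_company, C1.sTelephone, C1.iId_third_party_id
FROM dbo.Company AS C1
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Company AS C2
    WHERE C2.sTelephone = C1.sTelephone
      AND C2.iId_company <> C1.iId_company               -- a different company with the same phone
      AND (C1.iId_third_party_id IS NULL
           OR C2.iId_third_party_id IS NULL)             -- same NULL condition as in the question
);
```

Whether this is faster than the join depends on the plan the optimizer picks, so it is worth comparing both on the real table.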

0
With Temp as
(Select *
FROM dbo.Company as c
Where c.iId_third_party_id is NULL)

Select C1.*
From Temp as C1 With (NoLock)
INNER JOIN Temp AS C2 
ON C2.sTelephone = C1.sTelephone AND C1.iId_company != C2.iId_company

Something like this might work.

0

Source: https://habr.com/ru/post/1720090/

