T-SQL query on huge table is slow depending on join conditions

Question

We have a huge table of companies (17 million records) in which we want to find duplicates based on a search criterion (the phone number). The query is very slow (more than 5 minutes).

Here is a simplified version of the query; the problem is the same:

SELECT C1.*
FROM dbo.Company AS C1 WITH(NOLOCK)
INNER JOIN dbo.Company AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company 
                         AND (C1.iId_third_party_id IS NULL OR 
                              C2.iId_third_party_id IS NULL)

Column Explanation:

iId_company: primary key, integer auto-increment
sTelephone: company phone number, varchar with a non-clustered index on it
iId_third_party_id: third-party ID, an integer with a non-clustered index on it too. It may be NULL when users insert new companies themselves; those are the rows for which we want to find duplicates.

, (), , ( , .

, :

C1.iId_third_party_id IS NULL , 5 .
, (1 +), , , .

UNION, ( ), , strong > .

sql sql-server tsql sql-server-2008

MaxiWheat · 14 Oct '09 16:10

Cătălin Pitiș · Answer 1 · 2009-10-14T16:20:38+0000

, , - . . , SQL Server, , .

marc_s · Answer 2 · 2009-10-14T16:27:28+0000

With this many rows (17 million), limit the columns you select (do not use SELECT C1.*!).
" " .

On SQL Server 2008, you could use a Common Table Expression (CTE):

WITH PhoneDuplicates AS
(SELECT c.sTelephone, COUNT(*) AS PhoneCount
   FROM dbo.Company AS c 
   GROUP BY c.sTelephone
   HAVING COUNT(*) > 1
)
SELECT 
  c.iId_company, c.sTelephone, c.iId_third_party_id  -- list only the fields you need from the company table
FROM
  dbo.Company AS c
INNER JOIN
  PhoneDuplicates AS PD ON PD.sTelephone = c.sTelephone
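
The grouping query above does not look at iId_third_party_id. As a sketch only, assuming the column names from the question, the "at least one side has no third-party ID" rule could be folded into the HAVING clause: COUNT(column) skips NULLs, so a group where COUNT(*) is larger than COUNT(iId_third_party_id) contains at least one NULL. In such a group with more than one row, every row has a partner that satisfies the original OR condition. The CTE name PhoneDuplicates is reused from the answer; nothing here has been measured against the 17-million-row table.

```sql
WITH PhoneDuplicates AS
(
    SELECT c.sTelephone
    FROM dbo.Company AS c
    WHERE c.sTelephone IS NOT NULL            -- NULL phones never match in the original join
    GROUP BY c.sTelephone
    HAVING COUNT(*) > 1                       -- phone number is shared
       AND COUNT(*) > COUNT(c.iId_third_party_id)  -- at least one row has a NULL third-party ID
)
SELECT c.iId_company, c.sTelephone, c.iId_third_party_id
FROM dbo.Company AS c
INNER JOIN PhoneDuplicates AS PD
        ON PD.sTelephone = c.sTelephone;
```

Unlike the self-join, this returns each company once, no matter how many other rows share its phone number.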

Philip Kelley · Answer 3 · 2009-10-14T16:44:45+0000

, ?

,

C1.iId_third_party_id IS NULL

, SQL ( , ), .

(... OR C2.iId_third_party_id IS NULL)

, SQL , , .

, / ? , - marc_s ( ), .

, - . , , , .
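
The question mentions trying UNION. One way that idea could look, as a hedged sketch rather than a tested fix, is to split the OR into two branches that each carry a simple predicate. The branches are disjoint (C1 either has a NULL third-party ID or it does not), so UNION ALL returns the same rows as the original OR join without a deduplication step. Column names are taken from the question; whether the optimizer produces a better plan depends on the indexes and statistics in place.

```sql
-- Branch 1: C1 has no third-party ID; any other row with the same phone is a match.
SELECT C1.iId_company, C1.sTelephone, C1.iId_third_party_id
FROM dbo.Company AS C1
INNER JOIN dbo.Company AS C2
        ON C2.sTelephone = C1.sTelephone
       AND C2.iId_company <> C1.iId_company
WHERE C1.iId_third_party_id IS NULL

UNION ALL

-- Branch 2: C1 has a third-party ID, so the match must come from a C2 without one.
SELECT C1.iId_company, C1.sTelephone, C1.iId_third_party_id
FROM dbo.Company AS C1
INNER JOIN dbo.Company AS C2
        ON C2.sTelephone = C1.sTelephone
       AND C2.iId_company <> C1.iId_company
WHERE C1.iId_third_party_id IS NOT NULL
  AND C2.iId_third_party_id IS NULL;
```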

Mladen Prajdic · Answer 4 · 2009-10-14T16:46:49+0000

, , , . , 2 , .

One option is row_number:

;with cteDupes(RN, DupeID, DupeTelephone) as
(
SELECT  row_number() over(partition by sTelephone order by iId_company, sTelephone) RN,
        iId_company, sTelephone
FROM    dbo.Company 
WHERE   iId_third_party_id IS NULL
)
select * from cteDupes
where RN > 1

. , .

Antony Koch · Answer 5 · 2009-10-14T16:29:38+0000

-, ,

SELECT C1.*
FROM (select * from dbo.Company WITH(NOLOCK) where iId_third_party_id IS NULL) AS C1
INNER JOIN (select * from dbo.Company WITH(NOLOCK) where iId_third_party_id IS NULL) AS C2 ON C2.sTelephone = C1.sTelephone 
                         AND C1.iId_company != C2.iId_company

.

Cade Roux · Answer 6 · 2009-10-14T16:32:35+0000

, (sTelephone, iId_third_party_id). ?

.

Off the top of my head, without seeing the plan, I would think about adding iId_third_party_id to the non-clustered index on sTelephone, and, if the primary key is not clustered, adding iId_company to the index as well.
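
A minimal sketch of such an index, assuming the column names from the question. The index name IX_Company_sTelephone_ThirdParty is hypothetical, and the effect on the actual plan has not been verified.

```sql
-- Composite index: seek on the phone number, then read the third-party ID
-- from the same index entry instead of looking up the base row.
CREATE NONCLUSTERED INDEX IX_Company_sTelephone_ThirdParty
    ON dbo.Company (sTelephone, iId_third_party_id)
    INCLUDE (iId_company);  -- only needed if the primary key is NOT the clustered index;
                            -- a clustered key is carried in every non-clustered index anyway
```

With all three columns available in the index, the join predicate itself can be evaluated without touching the base table, although SELECT C1.* would still need lookups for the remaining columns.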

Please note that the self-join also multiplies the results when more than two rows share a phone number: each row is returned once for every matching partner.
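
If that multiplication is unwanted, one possible rewrite (a swapped-in technique, not part of the answer) is a semi-join with EXISTS, which returns each company at most once regardless of how many partners it has. Column names are assumed from the question.

```sql
SELECT C1.iId_company, C1.sTelephone, C1.iId_third_party_id
FROM dbo.Company AS C1
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Company AS C2
    WHERE C2.sTelephone = C1.sTelephone
      AND C2.iId_company <> C1.iId_company
      AND (C1.iId_third_party_id IS NULL OR C2.iId_third_party_id IS NULL)
);
```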

RisingCascade · Answer 7 · 2009-10-14T16:33:38+0000

With Temp as
(Select *
FROM dbo.Company as c
Where c.iId_third_party_id is NULL)

Select C1.*
From Temp as C1 With (NoLock)
INNER JOIN Temp AS C2 
ON C2.sTelephone = C1.sTelephone AND C1.iId_company != C2.iId_company

Something like this might work.
