Wrapping a query in a conditional increases its time by more than 2400%

Update: I will post the query plan as soon as I can.

We had a poorly performing query that took 4 minutes for one specific organization. The usual fixes, recompiling the stored procedure and updating statistics, did not help, so we rewrote the IF EXISTS (...) check as a SELECT COUNT(*) into a variable, and the stored procedure went from 4 minutes to 70 milliseconds. What is it about the conditional that makes a 70 ms query take 4 minutes? See the examples below.

Both of these take 4 minutes:

    IF (SELECT COUNT(*)
        FROM ObservationOrganism omo
        JOIN Observation om         ON om.ObservationID = omo.ObservationMicID
        JOIN Organism o             ON o.OrganismID = omo.OrganismID
        JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
        JOIN SIRN srn               ON srn.SIRNID = omd.SIRNID
        JOIN OrganismDrug od        ON od.OrganismDrugID = omd.OrganismDrugID
        WHERE om.StatusCode IN ('F', 'C')
          AND o.OrganismGroupID <> -1
          AND od.OrganismDrugGroupID <> -1
          AND (om.LabType <> 'screen' OR om.LabType IS NULL)) > 0
        PRINT 'records';

-

    IF EXISTS (SELECT *
        FROM ObservationOrganism omo
        JOIN Observation om         ON om.ObservationID = omo.ObservationMicID
        JOIN Organism o             ON o.OrganismID = omo.OrganismID
        JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
        JOIN SIRN srn               ON srn.SIRNID = omd.SIRNID
        JOIN OrganismDrug od        ON od.OrganismDrugID = omd.OrganismDrugID
        WHERE om.StatusCode IN ('F', 'C')
          AND o.OrganismGroupID <> -1
          AND od.OrganismDrugGroupID <> -1
          AND (om.LabType <> 'screen' OR om.LabType IS NULL))
        PRINT 'records';

While this takes 70 milliseconds:

    DECLARE @recordCount INT;

    SELECT @recordCount = COUNT(*)
    FROM ObservationOrganism omo
    JOIN Observation om         ON om.ObservationID = omo.ObservationMicID
    JOIN Organism o             ON o.OrganismID = omo.OrganismID
    JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
    JOIN SIRN srn               ON srn.SIRNID = omd.SIRNID
    JOIN OrganismDrug od        ON od.OrganismDrugID = omd.OrganismDrugID
    WHERE om.StatusCode IN ('F', 'C')
      AND o.OrganismGroupID <> -1
      AND od.OrganismDrugGroupID <> -1
      AND (om.LabType <> 'screen' OR om.LabType IS NULL);

    IF (@recordCount > 0)
        PRINT 'records';

It doesn't make sense to me why moving the same COUNT(*) query into an IF statement causes such degradation, or why EXISTS is slower than COUNT. I even tried wrapping the EXISTS() in a SELECT CASE WHEN EXISTS(), and that also took 4+ minutes.

1 answer

Since my previous answer was mentioned, I will try to explain this again, because these things are admittedly quite subtle. So yes, I believe you are seeing the same problem as in that other question, namely a problem with row goals.

To explain what causes this, I will start with the three join types the engine has at its disposal (broadly speaking): loop joins, merge joins, and hash joins. Loop joins are what they sound like: nested loops over both data sets. Merge joins take two sorted lists and walk through them in lockstep. Hash joins dump everything from the smaller set into a hash table (the filing cabinet), then look up each item from the larger set once the cabinet is filled.

Performance-wise, loop joins require virtually no setup, and if you only need a small amount of data they are truly optimal. Merge joins are the best performers of the bunch for any data size, but they require the inputs to already be sorted (which is rare). Hash joins need a fair amount of setup, but they let you join large data sets quickly.
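Assuming SQL Server, you can observe all three strategies on the same query by forcing each one with a query-level join hint (a sketch for experimentation only; dbo.Small and dbo.Big are hypothetical tables, not from the question):

```sql
-- Force each physical join type in turn and compare the actual plans.
SELECT s.ID
FROM dbo.Small s
JOIN dbo.Big b ON b.SmallID = s.ID
OPTION (LOOP JOIN);   -- nested loops: near-zero startup cost, best for a few rows

SELECT s.ID
FROM dbo.Small s
JOIN dbo.Big b ON b.SmallID = s.ID
OPTION (MERGE JOIN);  -- lockstep walk over two sorted inputs

SELECT s.ID
FROM dbo.Small s
JOIN dbo.Big b ON b.SmallID = s.ID
OPTION (HASH JOIN);   -- build a hash table on the smaller input, probe with the larger
```

Forcing a join type like this is a diagnostic tool, not a fix: it applies to every join in the statement and overrides the optimizer's costing.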

Now we get to your query and the difference between COUNT(*) and EXISTS/TOP 1. The behavior you are seeing is that the optimizer thinks matching rows for this query are very plentiful (you can confirm this by taking the estimated plan of the query without the condition and seeing how many rows it estimates at the final operator). In particular, it probably believes that for some table in this query, every record in that table will show up in the output.
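To check that estimate yourself without running the 4-minute query, you can request the estimated plan (standard SQL Server tooling; run in SSMS or sqlcmd against the same database). With SHOWPLAN_XML on, the statement is compiled but not executed, and the returned plan XML carries an EstimateRows attribute on each operator:

```sql
SET SHOWPLAN_XML ON;
GO
-- The COUNT(*) query from the question; compiled only, not executed.
SELECT COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om         ON om.ObservationID = omo.ObservationMicID
JOIN Organism o             ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn               ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od        ON od.OrganismDrugID = omd.OrganismDrugID
WHERE om.StatusCode IN ('F', 'C')
  AND o.OrganismGroupID <> -1
  AND od.OrganismDrugGroupID <> -1
  AND (om.LabType <> 'screen' OR om.LabType IS NULL);
GO
SET SHOWPLAN_XML OFF;
GO
```

Comparing EstimateRows against the actual row counts from the 70 ms variant will show where the optimizer's guess diverges from reality.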

"Eureka!" he says: "If every row in this table ends with an output to find if it exists, I can make a really cheap start-up loop connection, because although it is slow for large datasets, I only need one row." But then he does not find this line. And does not find him again. And now it iterates through a huge dataset using the least efficient means for weeding large amounts of data.

By comparison, when you ask for the full count, it has to find every record by definition. It sees a huge data set and picks the strategy that is best for iterating over that entire data set, not just a tiny sliver of it.

If, on the other hand, the estimate had actually been right and matching records really were plentiful, it would have found your record with the smallest possible expenditure of server resources and maximized its overall throughput.
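If the row-goal misestimate described above really is the culprit, one possible workaround (besides the variable trick the asker already found) is to tell the optimizer not to apply a row goal at all. This is a sketch, not a guaranteed fix: the DISABLE_OPTIMIZER_ROWGOAL hint requires SQL Server 2016 SP1 or later (trace flag 4138 is the older equivalent), the hint must sit at statement level so the EXISTS is wrapped in an assignable CASE, and whether it actually helps depends on the plan:

```sql
DECLARE @found bit;

-- The CASE WHEN EXISTS form alone was still slow for the asker; the USE HINT
-- is the added ingredient, asking the optimizer to cost the plan as if all
-- rows were needed instead of optimizing for the first row.
SELECT @found = CASE WHEN EXISTS (
    SELECT *
    FROM ObservationOrganism omo
    JOIN Observation om         ON om.ObservationID = omo.ObservationMicID
    JOIN Organism o             ON o.OrganismID = omo.OrganismID
    JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
    JOIN SIRN srn               ON srn.SIRNID = omd.SIRNID
    JOIN OrganismDrug od        ON od.OrganismDrugID = omd.OrganismDrugID
    WHERE om.StatusCode IN ('F', 'C')
      AND o.OrganismGroupID <> -1
      AND od.OrganismDrugGroupID <> -1
      AND (om.LabType <> 'screen' OR om.LabType IS NULL)
) THEN 1 ELSE 0 END
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));

IF (@found = 1)
    PRINT 'records';
```

The longer-term fix is better statistics or indexing so the estimate stops being wrong in the first place.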


Source: https://habr.com/ru/post/1240110/

