SQL inner join with subquery

I am working on the following queries:

 Query 1: SELECT * FROM TabA INNER JOIN TabB ON TabA.Id = TabB.Id
 Query 2: SELECT * FROM TabA WHERE Id IN (SELECT Id FROM TabB)
 Query 3: SELECT TabA.* FROM TabA INNER JOIN TabB ON TabA.Id = TabB.Id

I profiled these queries with SQL Server Profiler and found something interesting:

  • Query 1 takes 2.312 seconds
  • Query 2 takes 0.811 seconds
  • Query 3 takes 0.944 seconds

TabA has 48716 rows.

TabB has 62719 rows.

Basically, my question is why Query 1 takes so much longer than Query 3. I also understood that a subquery is slower than an inner join, yet here Query 2 is the fastest. Why?

4 answers

If I had to guess, I would say it is because Query 1 pulls data from both tables. Queries 2 and 3 (which take about the same time) pull data only from TabA.
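To make the difference in returned data concrete, here is a rough sketch of what each `*` expands to. `SomeColumn` is a hypothetical column name used only for illustration; the question does not list the actual columns of either table.

```sql
-- Query 1: the bare * expands to every column of BOTH tables,
-- so each result row carries TabA's columns plus TabB's columns.
SELECT TabA.*, TabB.*
FROM TabA
INNER JOIN TabB ON TabA.Id = TabB.Id;

-- Query 3: same join, but only TabA's columns are returned.
SELECT TabA.*
FROM TabA
INNER JOIN TabB ON TabA.Id = TabB.Id;

-- A possible middle ground (SomeColumn is hypothetical):
-- name only the TabB columns that are actually needed.
SELECT TabA.*, TabB.SomeColumn
FROM TabA
INNER JOIN TabB ON TabA.Id = TabB.Id;
```

If TabB has wide columns, the first form has to read and send noticeably more data than the other two, which would be consistent with the timings in the question.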

One way to verify this is to run the following:

 SET STATISTICS TIME ON
 SET STATISTICS IO ON

When I ran

 SELECT * FROM sys.objects 

I saw the following results.

 SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 104 ms.

 (242 row(s) affected)
 Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 Table 'sysschobjs'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 Table 'syssingleobjrefs'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 Table 'syspalnames'. Scan count 1, logical reads 2, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 866 ms.

You can see the number of scans, logical reads, and physical reads for each table the query touches. Physical reads take much longer, because the pages have to be read from disk into the cache. If all your reads are logical reads, the table is already entirely in the cache.

I would bet that, if you look, you will see many more logical reads on TabB for Query 1 than for Queries 2 and 3.
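The comparison suggested above could be run roughly like this. It is only a sketch: it assumes the three queries from the question are executed in the same session, and that each one is run more than once so that a cold cache (physical reads) does not distort the first measurement.

```sql
-- Report timing and per-table I/O for every statement in this session.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Query 1: all columns from both tables
SELECT * FROM TabA INNER JOIN TabB ON TabA.Id = TabB.Id;

-- Query 2: TabA rows filtered by a subquery on TabB
SELECT * FROM TabA WHERE Id IN (SELECT Id FROM TabB);

-- Query 3: same join as Query 1, but only TabA's columns
SELECT TabA.* FROM TabA INNER JOIN TabB ON TabA.Id = TabB.Id;

-- Switch the reporting off again when done.
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

In the Messages output, compare the `logical reads` figure reported for `'TabB'` under each query.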

EDIT:

Just out of curiosity, I did some tests and posted the results here.


Query 1:
This query returns all columns from both TabA and TabB, so a covering index on either table would have to include every column of that table. To find out exactly what is happening, you need to look at the query plan.

Query 2 and Query 3:
You are returning data only from TabA, so for TabB an index on the Id column is all that is needed. I assume the difference between the two has something to do with table statistics, but (again) we would need to see the query plan to know exactly what is happening.
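As a sketch of both suggestions, the following creates an index on `TabB.Id` and then captures the actual execution plan for Query 3. The index name `IX_TabB_Id` is made up for illustration, and the index is only worth creating if `TabB.Id` is not already indexed (for example as the primary key), which the question does not say.

```sql
-- Hypothetical index name. An index on Id alone covers TabB's part
-- of Queries 2 and 3, because they read nothing else from TabB.
CREATE NONCLUSTERED INDEX IX_TabB_Id ON TabB (Id);

-- Return the actual execution plan as XML next to the result set.
SET STATISTICS XML ON;

SELECT TabA.*
FROM TabA
INNER JOIN TabB ON TabA.Id = TabB.Id;

SET STATISTICS XML OFF;
```

Running Query 1 the same way and comparing the two plans should show whether TabB is read through the narrow index or through a scan of the full table.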


This is simply because SQL Server does not need to perform a JOIN. You are just running two queries, and only one of them has a WHERE clause.
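For comparison, Query 2 can also be written with `EXISTS`, which makes its intent explicit: each TabA row is only checked for a match and is never combined with TabB rows. This is a sketch of an equivalent form, not a claim about which plan the optimizer will choose.

```sql
-- Same result as Query 2: a TabA row is returned at most once,
-- no matter how many TabB rows share its Id.
SELECT a.*
FROM TabA AS a
WHERE EXISTS (SELECT 1 FROM TabB AS b WHERE b.Id = a.Id);
```

Note that this differs from Query 3 whenever `TabB.Id` contains duplicates: the inner join returns one TabA row per matching TabB row, while the `IN` and `EXISTS` forms return each TabA row once.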

I must admit that I did not expect such a big difference.


If the relationship is one-to-many, the time may be going into duplicated data. Instead, you can return the set of related rows as a JSON array. Check out "Use Case 1" at https://blogs.msdn.microsoft.com/sqlserverstorageengine/2015/10/09/returning-child-rows-formatted-as-json-in-sql-server-queries/
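A minimal sketch of that idea, assuming SQL Server 2016 or later (where `FOR JSON` is available) and assuming TabB is the "many" side of the relationship. The column alias `RelatedRows` is made up for illustration.

```sql
-- One result row per TabA row. The matching TabB rows are folded
-- into a single JSON array instead of repeating TabA's columns.
SELECT a.*,
       (SELECT b.*
        FROM TabB AS b
        WHERE b.Id = a.Id
        FOR JSON PATH) AS RelatedRows  -- NULL when there is no match
FROM TabA AS a;
```

Unlike the inner join, this also returns TabA rows that have no match in TabB, so a `WHERE` filter on `RelatedRows IS NOT NULL` would be needed to mimic Query 1 exactly.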


Source: https://habr.com/ru/post/944125/

