How does the IN predicate work in SQL?

Question

How does the IN predicate work in SQL?

After preparing the answer to this question, I found that I could not confirm my answer.

In my first programming task, I was told that the subquery inside an IN() predicate is executed once for every row of the parent query, and that IN should therefore be avoided.

For example, given the query:

 SELECT count(*) FROM Table1 WHERE Table1Id NOT IN ( SELECT Table1Id FROM Table2 WHERE id_user = 1)

 Table1 Rows | # of "IN" executions
 ----------------------------------
          10 | 10
         100 | 100
        1000 | 1000
       10000 | 10000

Is this correct? How does the IN predicate work?

+20

performance optimization sql

Gavin Miller Apr 17 '09 at 16:31


7 answers

It will completely depend on the database used and the exact query.

Query optimizers are very smart. For your example query, I would expect the better databases to use the same kind of techniques they use when joining. More naive databases may simply execute the same subquery many times.

+8

Jon Skeet Apr 17 '09 at 16:34


It depends on the RDBMS.

See detailed analysis here:

In short:

- MySQL will optimize the query to this:
```
 SELECT COUNT(*) FROM Table1 t1 WHERE NOT EXISTS ( SELECT 1 FROM Table2 t2 WHERE t2.id_user = 1 AND t2.Table1ID = t1.Table1ID ) 
```
and run the inner subquery in a loop, using an index lookup each time.
- SQL Server will use MERGE ANTI JOIN.
The inner subquery will not be “executed” in the usual sense of the word; instead, the results of both the query and the subquery will be fetched concurrently.
See the link above for more details.
- Oracle will use HASH ANTI JOIN.
The inner subquery will be executed once, and a hash table will be built from its result set.
Values from the outer query will then be looked up in the hash table.
- PostgreSQL will use NOT (HASHED SUBPLAN).
Much like Oracle.

Note that rewriting the query like this:

 SELECT ( SELECT COUNT(*) FROM Table1 ) - ( SELECT COUNT(*) FROM Table2 t2 WHERE (t2.id_user, t2.Table1ID) IN ( SELECT 1, Table1ID FROM Table1 ) )

will significantly improve performance in all four systems.
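As a rough check that the three formulations discussed here agree, the sketch below runs them against a small in-memory SQLite database through Python's `sqlite3` module. The schema and sample rows are assumptions made for illustration (no NULLs, and each `(id_user, Table1Id)` pair unique in Table2, which the subtraction form relies on). SQLite is not one of the four systems compared, so this says nothing about their performance, only that the results match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, chosen only to exercise the three queries.
conn.executescript("""
    CREATE TABLE Table1 (Table1Id INTEGER PRIMARY KEY);
    CREATE TABLE Table2 (
        id_user  INTEGER NOT NULL,
        Table1Id INTEGER NOT NULL,
        UNIQUE (id_user, Table1Id)
    );
    INSERT INTO Table1 VALUES (1), (2), (3), (4), (5);
    -- (1, 99) points at an id that does not exist in Table1.
    INSERT INTO Table2 VALUES (1, 2), (1, 4), (2, 5), (1, 99);
""")

queries = {
    "not_in": """
        SELECT COUNT(*) FROM Table1
        WHERE Table1Id NOT IN (SELECT Table1Id FROM Table2 WHERE id_user = 1)
    """,
    "not_exists": """
        SELECT COUNT(*) FROM Table1 t1
        WHERE NOT EXISTS (SELECT 1 FROM Table2 t2
                          WHERE t2.id_user = 1 AND t2.Table1Id = t1.Table1Id)
    """,
    # Row-value IN needs SQLite 3.15 or newer.
    "subtraction": """
        SELECT (SELECT COUNT(*) FROM Table1)
             - (SELECT COUNT(*) FROM Table2 t2
                WHERE (t2.id_user, t2.Table1Id) IN (SELECT 1, Table1Id FROM Table1))
    """,
}

# Run each formulation and collect its single COUNT(*) value.
counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
print(counts)
```

With this sample data, rows 1, 3 and 5 of Table1 have no match for user 1, so all three forms should report the same count.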

+5

Quassnoi Apr 17 '09 at 16:37


Depends on the optimizer. Check the exact query plan for each particular query to see how the RDBMS will actually execute it.

In Oracle, that would be:

 EXPLAIN PLAN FOR «your query»

In MySQL or PostgreSQL:

 EXPLAIN «your query»
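For anyone who wants to try this without a database server, here is a minimal sketch using SQLite through Python's `sqlite3` module. SQLite's counterpart of the commands above is `EXPLAIN QUERY PLAN`; the tables and the index name are assumptions made up for the example, and the plan text differs between SQLite versions, so no particular output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables matching the names used in the question.
conn.executescript("""
    CREATE TABLE Table1 (Table1Id INTEGER PRIMARY KEY);
    CREATE TABLE Table2 (id_user INTEGER, Table1Id INTEGER);
    CREATE INDEX ix_table2_user ON Table2 (id_user, Table1Id);
""")

query = """
    SELECT COUNT(*) FROM Table1
    WHERE Table1Id NOT IN (SELECT Table1Id FROM Table2 WHERE id_user = 1)
"""

# Prefixing the statement asks SQLite to describe its plan instead of running it.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for row in plan_rows:
    # The last column of each row is the human-readable plan step.
    print(row[-1])
```

The printed steps indicate, among other things, whether the subquery is evaluated once or per outer row in that particular engine and version.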

+4

vartec Apr 17 '09 at 16:34


Most SQL systems these days will almost always create the same execution plan for LEFT JOIN, NOT IN and NOT EXISTS.

I would say look at your execution plan and find out :-)

Also, if the subquery returns any NULL values for the Table1Id column, NOT IN will not return any rows.
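The NULL caveat is easy to reproduce. The sketch below uses an in-memory SQLite database through Python's `sqlite3` module, with made-up sample rows, to show NOT IN returning nothing once the subquery yields a NULL, while NOT EXISTS and a NULL-filtered NOT IN still count the unmatched rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: one Table2 row for user 1 has a NULL Table1Id.
conn.executescript("""
    CREATE TABLE Table1 (Table1Id INTEGER);
    CREATE TABLE Table2 (id_user INTEGER, Table1Id INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (1, 2), (1, NULL);
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# "x NOT IN (2, NULL)" is never true: it is false for 2 and NULL otherwise.
not_in_count = scalar("""
    SELECT COUNT(*) FROM Table1
    WHERE Table1Id NOT IN (SELECT Table1Id FROM Table2 WHERE id_user = 1)
""")

# NOT EXISTS only asks whether a matching row exists, so NULLs do not poison it.
not_exists_count = scalar("""
    SELECT COUNT(*) FROM Table1 t1
    WHERE NOT EXISTS (SELECT 1 FROM Table2 t2
                      WHERE t2.id_user = 1 AND t2.Table1Id = t1.Table1Id)
""")

# Filtering NULLs out of the subquery restores the expected NOT IN result.
filtered_not_in_count = scalar("""
    SELECT COUNT(*) FROM Table1
    WHERE Table1Id NOT IN (SELECT Table1Id FROM Table2
                           WHERE id_user = 1 AND Table1Id IS NOT NULL)
""")

print(not_in_count, not_exists_count, filtered_not_in_count)
```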

+3

SQLMenace Apr 17 '09 at 16:34


Not really. But it's better to write such queries using JOIN.

0

Konstantin Tarkus Apr 17 '09 at 16:33


Yes, but execution stops as soon as the query processor "finds" the value you are looking for. So if, for example, the first row in the outer select has Table1Id = 32, and Table2 has a row with TableId = 32, then as soon as the subquery finds that row in Table2, it stops.

0

Charles Bretana Apr 17 '09 at 16:38


Bill Karwin · Accepted Answer · 2009-04-17 17:35

The warning you received about subqueries being executed for each row is true for correlated subqueries.

 SELECT COUNT(*) FROM Table1 a WHERE a.Table1id NOT IN ( SELECT b.Table1Id FROM Table2 b WHERE b.id_user = a.id_user );

Note that the subquery refers to the id_user column of the outer query. The id_user value may be different for each row of Table1, so the result of the subquery will probably differ depending on the current row in the outer query. The RDBMS must execute the subquery many times, once for each row in the outer query.

The example you showed is a non-correlated subquery. Most modern RDBMS optimizers worth their salt should be able to tell when the result of the subquery does not depend on the values in each row of the outer query. In that case, the RDBMS runs the subquery a single time, caches its result, and reuses it for the predicate in the outer query.
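To make the difference concrete, here is a toy model in plain Python. It is not how any particular RDBMS is implemented; the in-memory "tables", the `run_subquery` helper and the `stats` counter are all hypothetical, and a real optimizer may well rewrite a correlated subquery into a join instead of re-running it. The sketch only counts how often the subquery logic runs under each naive strategy.

```python
# Toy "tables" as lists of dicts; the contents are made up for illustration.
table1 = [{"Table1Id": i, "id_user": i % 2 + 1} for i in range(1, 11)]
table2 = [
    {"Table1Id": 2, "id_user": 1},
    {"Table1Id": 4, "id_user": 1},
    {"Table1Id": 5, "id_user": 2},
]

def run_subquery(id_user, stats):
    """SELECT Table1Id FROM Table2 WHERE id_user = ?, counting each run."""
    stats["subquery_runs"] += 1
    return {row["Table1Id"] for row in table2 if row["id_user"] == id_user}

def count_non_correlated(stats):
    # The subquery does not depend on the outer row: run once, cache, reuse.
    cached = run_subquery(1, stats)
    return sum(1 for row in table1 if row["Table1Id"] not in cached)

def count_correlated(stats):
    # The subquery depends on the outer row's id_user: re-run for every row.
    return sum(
        1 for row in table1
        if row["Table1Id"] not in run_subquery(row["id_user"], stats)
    )

non_corr_stats = {"subquery_runs": 0}
non_corr_count = count_non_correlated(non_corr_stats)

corr_stats = {"subquery_runs": 0}
corr_count = count_correlated(corr_stats)

print(non_corr_count, non_corr_stats["subquery_runs"])  # 8 matching rows, 1 run
print(corr_count, corr_stats["subquery_runs"])          # 7 matching rows, 10 runs
```

With 10 rows in the outer table, the cached strategy runs the subquery once, while the naive correlated strategy runs it 10 times, which is the pattern in the table from the question.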

PS: In SQL, IN() is called a "predicate", not a statement. A predicate is a part of the language that evaluates to either true or false, but cannot necessarily be executed on its own as a statement. That is, you cannot just run this as an SQL query: "2 IN (1,2,3);" Although it is a valid predicate, it is not a valid statement.
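A quick way to see this is to try both forms against SQLite from Python's `sqlite3` module. Other systems may word the error differently or need a FROM clause for the SELECT form; this is only a sketch of the distinction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A bare predicate is not a statement, so the parser rejects it.
try:
    conn.execute("2 IN (1,2,3);")
    bare_predicate_error = None
except sqlite3.OperationalError as exc:
    bare_predicate_error = str(exc)

# Wrapped in a SELECT, the same predicate is evaluated and returns a value.
predicate_value = conn.execute("SELECT 2 IN (1,2,3);").fetchone()[0]

print(bare_predicate_error)  # a syntax error message
print(predicate_value)       # 1, SQLite's representation of true
```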
