Optimization: WHERE x IN (1, 2, ..., 100,000) vs. INNER JOIN tmp_table USING (x)?

Recently I attended an interesting interview, where I was asked about optimizing a query whose WHERE..IN clause contains a long list of scalars (thousands of values). This question is NOT about subqueries inside the IN clause, but about a plain list of scalar constants.

I immediately replied that this can be optimized with an INNER JOIN against another table (possibly a temporary one) containing only those scalars. My answer was accepted, and the interviewer added a remark: "no database engine can optimize a long WHERE..IN condition efficiently enough." I nodded.
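A minimal sketch of the approach I described, using SQLite via Python (the schema and names here are my own illustration, not anything from the interview):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (x INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1, 1001)])

# The long scalar list we would otherwise inline into WHERE x IN (...)
wanted = list(range(10, 500, 7))

# Load the scalars into an indexed temporary table and join against it
conn.execute("CREATE TEMP TABLE tmp_ids (x INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tmp_ids VALUES (?)", [(v,) for v in wanted])

rows = conn.execute(
    "SELECT items.x, items.name FROM items "
    "INNER JOIN tmp_ids USING (x) ORDER BY items.x"
).fetchall()
print(len(rows))
```

The temporary table carries its own primary-key index, so the join can be driven by index lookups rather than a scan.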

But when I left, I had some doubts. The construct seemed too trivial and too widely used for modern RDBMSs not to optimize it. So, I started digging.

PostgreSQL:

PostgreSQL seems to parse an IN() list of scalars into a ScalarArrayOpExpr structure, whose constants are sorted. This structure is later used during index scans to find matching rows. EXPLAIN ANALYZE for such queries shows a single loop; no join is performed. I would therefore expect such a query to be even faster than an INNER JOIN. I tried a few queries on my existing database, and my tests supported this. But I did not care much about test purity, and Postgres was running under Vagrant, so I could be wrong.

MSSQL Server:

MSSQL Server builds a hash structure from the list of constant expressions, and then performs a hash join against the source table. As far as I can tell, no sorting is involved. I have not run any tests, since I have no experience with this RDBMS.
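As a rough illustration of that strategy (my own toy model, not SQL Server's actual implementation): build a hash table from the constants once, then probe it per scanned row, with no sorting anywhere:

```python
# Toy model of a hash semijoin: build a hash set from the constant
# IN (...) list, then probe it once per row of the scanned table.
constants = {4, 8, 15, 16, 23, 42}           # the IN (...) list
table = [(i, f"row{i}") for i in range(50)]  # the source table

# Each membership probe is O(1) on average, so the whole pass is O(rows)
matches = [row for row in table if row[0] in constants]
print(matches)
```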

MySQL Server:

Slide 13 of these slides says that prior to 5.0, MySQL did have this problem in some cases. Other than that, I found no reports of poor IN() handling. Unfortunately, I also did not find evidence to the contrary. If you have any, please point me to it.

SQLite:

The documentation page hints at some problems, but I am inclined to believe the issues described are only conceptual. I found no other information.
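It is easy to check what SQLite actually does with an IN list via EXPLAIN QUERY PLAN (the exact wording of the plan output varies between SQLite versions, so treat it as illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, str(i)) for i in range(1000)])

# Ask SQLite how it would execute an IN-list lookup on an indexed column
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT val FROM t WHERE id IN (1, 2, 3, 500, 999)"
).fetchall()
for row in plan:
    print(row)
```

On the versions I know of, the plan reports a SEARCH using the integer primary key rather than a full-table SCAN, i.e. each constant becomes an index probe.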

So, I'm starting to think that I either misunderstood my interviewer or misused Google ;) Or maybe it's because we set no constraints and our conversation became a bit vague (we did not specify any particular DBMS or other conditions; it was just an abstract conversation).

It seems the days when databases rewrote IN() into a chain of OR comparisons (which could produce surprising results with NULL values in the list) are behind us. Or are they?

Of course, when the list of scalars is longer than the database protocol's maximum packet size allows, an INNER JOIN may be the only available solution.
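One workaround short of a temp table is batching: split the list into chunks that fit the packet or bound-parameter limits and combine the results client-side. The chunk size of 500 below is an arbitrary illustration (SQLite, for instance, historically capped bound parameters at 999):

```python
import sqlite3

def select_in_chunks(conn, ids, chunk_size=500):
    """Run WHERE id IN (...) in batches to stay under parameter limits."""
    out = []
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        out.extend(conn.execute(
            f"SELECT id FROM t WHERE id IN ({placeholders})", chunk
        ).fetchall())
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(2000)])

rows = select_in_chunks(conn, list(range(1500)))
print(len(rows))
```

Note the trade-off: batching loses any cross-batch ordering or LIMIT semantics, which the temp-table join preserves.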

I also think that in some cases, query parsing time alone (if the statement was not prepared) can kill performance.

A database may also be unable to prepare an IN(?) query with a variable number of placeholders, forcing it to re-parse the statement each time (which may hurt performance). I have never actually measured this, but I suspect that even then, parsing and planning are cheap compared to query execution.
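The re-parsing point follows from the statement text itself: every distinct list length yields a distinct SQL string, so a plan cache keyed on the text never gets a hit across arities. A sketch, with a Python dict standing in for a server-side plan cache:

```python
def in_query(n):
    """Build a parameterized IN query for n values (hypothetical table t)."""
    return "SELECT * FROM t WHERE x IN (" + ",".join("?" * n) + ")"

plan_cache = {}  # stand-in for a plan cache keyed by SQL text
for n in (1, 2, 3, 2, 1):
    sql = in_query(n)
    plan_cache.setdefault(sql, object())  # "hard parse" on first sight only

# Three distinct arities => three distinct cached statements,
# even though logically it is always the same query.
print(len(plan_cache))
```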

But apart from this, I do not see other problems. Well, except for avoiding the problem altogether: if your queries contain thousands of identifiers, something is probably wrong with your architecture.

What about you?

+5
2 answers

Your answer is correct only if you create an index (preferably the primary key index) on the table holding the list, and even then only if the list is not small.

Any description of optimization, of course, depends on the specific database. However, MySQL describes in some detail how it optimizes IN:

Returns 1 if expr is equal to any of the values in the IN list; otherwise, returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item is then done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
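That strategy is easy to model: sort the constants once at parse time, then answer each row's membership test with a binary search. Python's bisect is used here purely as an illustration of the idea:

```python
from bisect import bisect_left

def in_list(value, sorted_constants):
    """Membership test via binary search, as the MySQL docs describe."""
    i = bisect_left(sorted_constants, value)
    return i < len(sorted_constants) and sorted_constants[i] == value

constants = sorted([42, 7, 19, 3, 88])  # evaluated and sorted once
print(in_list(19, constants), in_list(20, constants))
```

For N constants and R scanned rows this costs one O(N log N) sort plus O(R log N) probes, with no join machinery at all.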

There will definitely be cases where using IN is faster than using another table, and probably faster than another table with a primary key index.

I think SQL Server replaces IN with a list of ORs, which are then implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search if some elements are much more common than others and appear first in the list.

+1

I think this is a bad application design. The values used with the IN operator are most likely not hardcoded but dynamic. In that case, we should always use prepared statements, if only as a reliable mechanism against SQL injection. But here that leads to dynamically formatting the prepared statement (since the number of placeholders is also dynamic), and to excessive hard parsing (as many unique statements as there are distinct IN arities: IN (?), IN (?,?), ...). I would either load these values into a table and use a join, as you mentioned (unless the load is too expensive), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.). If the number of values is large, I would consider using EXISTS (select 1 from mytable m where m.key = x.key) or EXISTS (select x from foo(params)) instead of IN. In such cases, EXISTS can provide better performance than IN.
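A sketch of the EXISTS variant against a loaded table, done in SQLite for want of an Oracle instance (the foo/params pipelined function from the answer is Oracle-specific, so a plain temp table stands in for it here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (key INTEGER PRIMARY KEY, payload TEXT);
    CREATE TEMP TABLE params (key INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(i, f"p{i}") for i in range(100)])
conn.executemany("INSERT INTO params VALUES (?)", [(k,) for k in (5, 50, 95)])

# Correlated EXISTS instead of a long inline IN (...) list
rows = conn.execute("""
    SELECT x.key, x.payload FROM mytable x
    WHERE EXISTS (SELECT 1 FROM params p WHERE p.key = x.key)
    ORDER BY x.key
""").fetchall()
print(rows)
```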

-1
source

Source: https://habr.com/ru/post/1237259/
