Avoiding SQL NOT IN with a REPLACE and Length Check

I have a situation where I have to build my SQL strings dynamically, and I try to use parameters and sp_executesql where possible so that I can reuse query plans. From a lot of online reading and personal experience I have found that NOT IN and INNER / LEFT JOIN can be slow and expensive when the main (leftmost) table is large (1.5M rows with 50 columns). I have also read that you should avoid using any functions, since they slow queries down, so I wonder: which is worse?

I have used the following workaround in the past, although I'm not sure it is the best way to avoid NOT IN with a list of items, when, for example, I pass a list of 3-character strings in a single parameter with a pipe separator (only between items):

LEN(@param1) = LEN(REPLACE(@param1, [col], '')) 

instead of:

 [col] NOT IN('ABD', 'RDF', 'TRM', 'HYP', 'UOE') 

... bear in mind that the list of strings can be anywhere from 1 to about 80 values long, and this method does not lend itself to sargability.
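
To make the two forms concrete, here is a minimal, self-contained sketch; the temp table #t and its rows are invented purely for illustration and are not from the original post:

 -- Throwaway sample data (illustrative only).
 CREATE TABLE #t (col CHAR(3));
 INSERT INTO #t (col) VALUES ('ABD');
 INSERT INTO #t (col) VALUES ('XYZ');
 INSERT INTO #t (col) VALUES ('TRM');
 INSERT INTO #t (col) VALUES ('QQQ');

 DECLARE @param1 VARCHAR(400);
 SET @param1 = 'ABD|RDF|TRM|HYP|UOE';

 -- Traditional form: returns XYZ and QQQ.
 SELECT col FROM #t WHERE col NOT IN ('ABD', 'RDF', 'TRM', 'HYP', 'UOE');

 -- Workaround: if REPLACE removed nothing, col was not in the list.
 SELECT col FROM #t WHERE LEN(@param1) = LEN(REPLACE(@param1, col, ''));

 DROP TABLE #t;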

In this example, '=' gives me the NOT IN behavior; for an IN I would use the traditional list technique, or '!=' on the length check if that were faster, though I doubt it. Is this workaround actually faster than using NOT IN?

As a possible third alternative: if I knew all the other possibilities (the IN possibilities, which could potentially number 80-95+), I could pass those instead. That work would be done in the business layer of the application to take load off SQL Server. It's not a great opportunity for query plan reuse, but if it saves a second or two on a big nasty query, why the hell not.

I'm also comfortable creating SQL CLR functions, so would the string manipulation above be better done as a CLR function?

Thoughts?

Thanks in advance for any help / advice, etc.

+2
3 answers

As Donald Knuth is often (mis)quoted, "premature optimization is the root of all evil."
So, first of all: are you sure that if you write your code in the clearest and simplest way (both to write and to read), it actually runs slowly? If you don't know, measure it before you start applying clever optimization tricks.

If the code really is slow, check the query plans. In most cases, executing a query takes much longer than compiling it, so you usually don't need to worry about query plan reuse. Building the right indexes and/or table structures therefore usually yields significantly better results than fine-tuning how the query text is constructed.
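
For the "measure it" step, one quick, low-ceremony option is to turn on timing and I/O statistics around the query in question (the table and list here are illustrative; output appears in the client's Messages tab):

 SET STATISTICS TIME ON;
 SET STATISTICS IO ON;

 SELECT *
 FROM dbo.Table1
 WHERE col NOT IN ('ABD', 'RDF', 'TRM', 'HYP', 'UOE');

 SET STATISTICS TIME OFF;
 SET STATISTICS IO OFF;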

For example, I seriously doubt that your query with LEN and REPLACE performs better than NOT IN: either way, every row has to be examined and tested for a match. For a long list, the MSSQL optimizer will automatically build an internal worktable to optimize the equality comparisons.
Moreover, such tricks tend to introduce bugs: for instance, your example will not work correctly if [col] = 'AB', because REPLACE strips 'AB' out of 'ABD' even though 'AB' is not in the list.
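
A minimal repro of that failure mode, with everything invented for illustration:

 DECLARE @param1 VARCHAR(400);
 SET @param1 = 'ABD|RDF|TRM|HYP|UOE';

 -- 'AB' is not in the list, yet REPLACE strips it out of 'ABD', the
 -- lengths differ, and the check wrongly concludes 'AB' is in the list.
 SELECT CASE WHEN LEN(@param1) = LEN(REPLACE(@param1, 'AB', ''))
             THEN 'treated as not in the list'   -- the intended result
             ELSE 'treated as in the list (bug)' -- what actually happens
        END AS verdict;

 -- A common guard is to compare with the delimiters included, assuming
 -- values can never contain the delimiter:
 -- CHARINDEX('|' + [col] + '|', '|' + @param1 + '|') = 0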

IN is often faster than NOT IN, because for an IN query it is enough to examine only part of the rows: a row can be accepted as soon as a match is found. How effective this is depends on whether you can obtain the complementary IN list cheaply enough.

Speaking of passing a variable-length list to the server, there has been plenty of discussion on SO and elsewhere. Typically, your options are:

  • table-valued parameters (MSSQL 2008+ only; see the sketch after this list),
  • dynamically built SQL (error-prone and/or unsafe),
  • temp tables (useful for long lists; possibly too much write overhead and setup time for short ones),
  • delimited strings (useful for short lists of well-behaved values, such as a handful of integers),
  • XML parameters (somewhat involved, but they work well if you use a good XML library and don't build the XML text by hand).
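
A rough sketch of the first option, the table-valued parameter (the type and procedure names are invented for illustration; requires SQL Server 2008+):

 CREATE TYPE dbo.CodeList AS TABLE (code CHAR(3) PRIMARY KEY);
 GO
 CREATE PROCEDURE dbo.GetRowsExcluding
     @codes dbo.CodeList READONLY
 AS
     -- Anti-join against the passed-in list instead of a literal NOT IN.
     SELECT t.*
     FROM dbo.Table1 AS t
     WHERE NOT EXISTS (SELECT 1 FROM @codes AS c WHERE c.code = t.col);
 GO

 -- Caller side: the list length can vary freely without changing the SQL.
 DECLARE @codes dbo.CodeList;
 INSERT INTO @codes (code) VALUES ('ABD'), ('RDF');
 EXEC dbo.GetRowsExcluding @codes = @codes;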

Here is an article with a good overview of these methods and a few more.

+2

I found that NOT IN and INNER / LEFT JOIN can be slow and expensive when the main (leftmost) table is large

This should not be slow if you have indexed the table correctly. One thing that can make a query slow is a correlated subquery, that is, a subquery that must be re-evaluated for each row of the outer table because it references values from the outer query.

I also read that you should avoid using any type of function as it slows down queries

It depends. SELECT function(x) FROM ... probably won't make much difference to performance. Problems arise when you apply a function to a column elsewhere in the query, such as in a JOIN clause, the WHERE clause, or the ORDER BY clause, because this can prevent an index from being used. A function applied to a constant value is not a problem, however.
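
A small sketch of that distinction (the index name, UPPER, and the table are illustrative choices):

 -- Assume an index exists on col:
 -- CREATE INDEX IX_Table1_col ON dbo.Table1 (col);

 -- Function wrapped around the column: it must run for every row,
 -- so an index seek on col is generally not possible.
 SELECT * FROM dbo.Table1 WHERE UPPER(col) = 'ABD';

 -- Function applied to a constant: it runs once, and the index can be seeked.
 SELECT * FROM dbo.Table1 WHERE col = UPPER('abd');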

As for your query, I would first try plain [col] NOT IN ('ABD', 'RDF', 'TRM', 'HYP', 'UOE'). If that turns out to be slow, make sure the table is indexed correctly.

+2

First, since you are only filtering out a small percentage of the records, most likely an index on col would not be used at all, so sargability is moot.

So this really boils down to query plan reuse.

  • If you are on SQL Server 2008, replace @param1 with a table-valued parameter and have your application pass that instead of the delimited list. This solves your problem completely.

  • If you are on SQL Server 2005, I don't think it matters much. You could split the delimited list into a table and use NOT IN / NOT EXISTS against it, but what's the point if you can't get an index seek on col?

Can anyone speak to that last point? Would splitting the list into a table variable and then anti-joining to it save enough CPU cycles to make up for the setup cost?
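
For reference, a hedged sketch of what that 2005-era split-then-anti-join could look like; the loop-based splitter and all names are illustrative (STRING_SPLIT did not exist until SQL Server 2016):

 DECLARE @param1 VARCHAR(400), @pos INT, @next INT;
 SET @param1 = 'ABD|RDF|TRM|HYP|UOE';

 DECLARE @codes TABLE (code CHAR(3) PRIMARY KEY);

 -- Walk the pipe-delimited string and insert each value.
 SET @pos = 1;
 WHILE @pos <= LEN(@param1)
 BEGIN
     SET @next = CHARINDEX('|', @param1, @pos);
     IF @next = 0 SET @next = LEN(@param1) + 1;
     INSERT INTO @codes (code) VALUES (SUBSTRING(@param1, @pos, @next - @pos));
     SET @pos = @next + 1;
 END;

 -- Anti-join in place of NOT IN.
 SELECT t.*
 FROM dbo.Table1 AS t
 WHERE NOT EXISTS (SELECT 1 FROM @codes AS c WHERE c.code = t.col);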

EDIT: a third method, for SQL Server 2005, using XML, based on the link from OMG Ponies:

 DECLARE @not_in_xml XML
 SET @not_in_xml = N'<values><value>ABD</value><value>RDF</value></values>'

 SELECT *
 FROM Table1
 WHERE @not_in_xml.exist('/values/value[text()=sql:column("col")]') = 0

I have no idea how well this performs compared to a delimited list or a TVP.

0

Source: https://habr.com/ru/post/1334357/

