Comparing T-SQL Patterns with Exceptions

Question

Here is a problem that I have run into repeatedly when playing with the Stack Exchange Data Explorer, which is based on T-SQL:

How do I search for a string, except when it occurs as a substring of some other string?

For example, how can I select all the rows in the MyTable table where the MyCol column contains the string foo, while ignoring any foo that is part of the string foobar?

A quick and dirty attempt would look something like this:

 SELECT * FROM MyTable WHERE MyCol LIKE '%foo%' AND MyCol NOT LIKE '%foobar%'

but obviously this will fail to match, for example, MyCol = 'not all foos are foobars', which I do want to match.

One solution that I came up with is to replace all occurrences of foobar with some dummy marker (which is not a substring of foo), and then check for any remaining foos, as in:
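To make the failure concrete, here is a minimal sketch using a table variable. The name @MyTable and the sample rows are made up for illustration; only the MyCol column and the filter come from the question.

```sql
-- Hypothetical sample data; only MyCol and the WHERE clause come from the question.
DECLARE @MyTable TABLE (MyCol varchar(100));
INSERT INTO @MyTable (MyCol) VALUES
    ('just foo'),                   -- wanted
    ('only foobar'),                -- not wanted
    ('not all foos are foobars'),   -- wanted, but it also contains foobar
    ('nothing relevant');           -- not wanted

-- The quick and dirty filter returns only 'just foo':
-- the third row is rejected because it also contains 'foobar'.
SELECT MyCol
FROM @MyTable
WHERE MyCol LIKE '%foo%'
  AND MyCol NOT LIKE '%foobar%';
```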

 SELECT * FROM MyTable WHERE REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

This works, but I suspect that it is not very efficient, since it has to run REPLACE() on every row in the table. (For SEDE, this will usually be the Posts table, which currently has about 30 million rows.) Are there any better ways to do this?

(FWIW, the real use case that prompted this question was looking for SO posts with image URLs that use the http:// scheme prefix but do not point to the i.stack.imgur.com host.)
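As a rough sketch of how that use case maps onto the REPLACE approach, the batch below runs against a small table variable. The @Posts table, its sample rows, and the exact src="..." patterns are assumptions made for illustration; on SEDE the query would run against the real Posts table, and the HTML in its Body column may be formatted differently.

```sql
-- Hypothetical stand-in for the SEDE Posts table; rows and patterns are assumptions.
DECLARE @Posts TABLE (Id int, Body nvarchar(max));
INSERT INTO @Posts (Id, Body) VALUES
    (1, N'<p><img src="http://i.stack.imgur.com/abc.png" alt=""></p>'),
    (2, N'<p><img src="http://example.com/pic.png" alt=""></p>'),
    (3, N'<p><img src="http://i.stack.imgur.com/abc.png" alt=""> <img src="http://example.com/pic.png" alt=""></p>'),
    (4, N'<p>No images here.</p>');

-- Mask the allowed host first, then look for any remaining http:// image source.
-- Rows 2 and 3 still contain one after masking; rows 1 and 4 do not.
SELECT Id
FROM @Posts
WHERE REPLACE(Body, N'src="http://i.stack.imgur.com/', N'X') LIKE N'%src="http://%';
```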

+5

sql-server tsql dataexplorer

Ilmari Karonen Feb 01 '16 at 11:47

4 answers

A three-stage filter should work:

collect all rows matching '%foo%';
replace all instances of 'foobar' with a non-matching string (for example, 'X');
recheck the match against '%foo%'.

This way REPLACE is meant to run only on the candidate rows, not on all rows. If you expect only a small percentage of rows to match, this should be much more efficient.

SQL will look like this:

 ;with data as (
   select * from MyTable where MyCol like '%foo%'
 )
 select * from data
 where replace(MyCol, 'foobar', 'X') like '%foo%'

Note that the separate query level is required because SQL has no short-circuit evaluation of expressions; within a single query level the engine is free to reorder the logical terms as it sees fit for efficient processing.

+1

Pieter Geerkens Feb 01 '16 at 11:58

This will be faster than your current query:

 SELECT * FROM MyTable WHERE MyCol like '%foo%' AND REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

REPLACE is evaluated only after the MyCol LIKE '%foo%' filter has been applied, so this is faster than simply:

 REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

+1

t-clausen.dk Feb 01 '16 at 13:28

Assuming you are only interested in finding instances of foo that are surrounded by spaces:

  SELECT * FROM MyTable WHERE MyCol LIKE 'foo %' OR MyCol LIKE '% foo %' OR MyCol LIKE '% foo'

0

Paul Hunt Feb 01 '16 at 11:58

Martin Smith · Accepted Answer · 2016-02-01T20:28:56+0000

None of the approaches given so far is guaranteed to work as advertised, that is, to perform the REPLACE on only a subset of the rows.

SQL Server does not guarantee short-circuit evaluation of predicates, and it can move compute scalars into the underlying query for views and CTEs.

The only thing that is (mostly) guaranteed to short-circuit is a CASE expression. Below I use IIF, which is syntactic sugar that expands to CASE:

 SELECT * FROM MyTable WHERE 1 = IIF(MyCol LIKE '%foo%', IIF(REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%', 1, 0), 0);
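For reference, the same filter can be spelled out with explicit CASE expressions, which is what IIF expands to; this form also works on versions older than SQL Server 2012, where IIF is not available. The sketch below runs it against a table variable; @MyTable and its sample rows are made up for illustration.

```sql
-- Hypothetical sample data; only MyCol and the filter logic come from the answer.
DECLARE @MyTable TABLE (MyCol varchar(100));
INSERT INTO @MyTable (MyCol) VALUES
    ('just foo'),
    ('only foobar'),
    ('not all foos are foobars'),
    ('nothing relevant');

-- The outer CASE tests the cheap LIKE first; the REPLACE sits in the THEN branch,
-- so it is only needed for rows that already contain 'foo'.
SELECT MyCol
FROM @MyTable
WHERE 1 = CASE
              WHEN MyCol LIKE '%foo%' THEN
                  CASE WHEN REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%' THEN 1 ELSE 0 END
              ELSE 0
          END;
```

With this sample data the query returns 'just foo' and 'not all foos are foobars', and skips the other two rows.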
