How to define a regex with multiple OR operators, where each term includes a space prefix and suffix?

I am preparing for a data extraction task. I need to remove a set of terms; none, some, or all of them may be present on each line of the source records. There are over 100,000 target records. I want to avoid running a separate match/replace action for each term, since (a) the list of terms to be removed is likely to grow, and (b) the time it takes to run the match/replace action one term at a time is unacceptable.

My question is: how do I change the regex so that it handles every term in the OR list?

REGULAR EXPRESSION

' and | and or | ao | company | co | co | dba | dba ' 

DESIRED BEHAVIOR

Replace each term found (including prefix and suffix spaces) with a single space.

ACTUAL BEHAVIOR

Only every other term found is replaced (including its prefix and suffix spaces) with a single space.

Example

Source string

 ' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or ao company co co dba dba COPYRIGHT ' 

Result String (desired behavior)

 ' MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT ' 

Result String (Actual Behavior)

 ' MASHABLE LTD THE INFORMATION EXPERTS and or company codba COPYRIGHT ' 

ENVIRONMENT

SQL Server 2005

Custom function regexReplace based on VBScript.RegExp (the code is at the end of this post)

CODE

 set nocount on
 declare @source [varchar](800)
 declare @regexp [varchar](400)
 declare @replace [char](1)
 declare @globalReplace [bit]
 declare @ignoreCase [bit]
 declare @result [varchar](800)
 set @globalReplace = 1
 set @ignoreCase = 1
 set @source = ' MASHABLE LTD DBA THE INFORMATION EXPERTS and and or ao company co co dba dba COPYRIGHT '
 set @regexp = ' and | and or | ao | company | co | co | dba | dba '
 set @replace = ' '
 select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)
 print @result

... result:

  MASHABLE LTD THE INFORMATION EXPERTS and or company codba COPYRIGHT 

* dbo.regexReplace user-defined function definition *

 CREATE FUNCTION [dbo].[regexReplace]
 (
     @source varchar(5000),
     @regexp varchar(1000),
     @replace varchar(1000),
     @globalReplace bit = 0,
     @ignoreCase bit = 0
 )
 RETURNS varchar(1000) AS
 BEGIN
     DECLARE @hr integer
     DECLARE @objRegExp integer
     DECLARE @result varchar(5000)
     EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
     IF @hr <> 0 BEGIN
         EXEC @hr = sp_OADestroy @objRegExp
         RETURN NULL
     END
     EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
     IF @hr <> 0 BEGIN
         EXEC @hr = sp_OADestroy @objRegExp
         RETURN NULL
     END
     EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
     IF @hr <> 0 BEGIN
         EXEC @hr = sp_OADestroy @objRegExp
         RETURN NULL
     END
     EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
     IF @hr <> 0 BEGIN
         EXEC @hr = sp_OADestroy @objRegExp
         RETURN NULL
     END
     EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
     IF @hr <> 0 BEGIN
         EXEC @hr = sp_OADestroy @objRegExp
         RETURN NULL
     END
     EXECUTE @hr = sp_OADestroy @objRegExp
     IF @hr <> 0 BEGIN
         RETURN NULL
     END
     RETURN @result
 END
+4
3 answers

Try the following:

 /(?: (?:and or|and|ao|company|co|co|dba|dba))+(?!\S)/i 

Like @mathematical.coffee, I started by factoring out the leading space and replacing the trailing space with a lookahead, in this case a negative lookahead for a non-whitespace character. That way it works even if the term is the last thing on the line and is not followed by a space. But the most important change is that it matches two or more consecutive terms at once whenever possible.
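To make the behavior of this pattern concrete, here is a minimal sketch in Python. It rests on two assumptions that are not part of the answer: Python's `re` module is used as a stand-in for VBScript.RegExp (both support non-capturing groups and negative lookahead), and each match is replaced with the empty string, because the pattern no longer consumes the trailing space. The helper name `strip_terms` is made up for the example.

```python
import re

# Pattern from the answer, without the /.../i delimiters;
# re.IGNORECASE plays the role of the IgnoreCase property.
PATTERN = re.compile(
    r"(?: (?:and or|and|ao|company|co|co|dba|dba))+(?!\S)",
    re.IGNORECASE,
)

SOURCE = (" MASHABLE LTD DBA THE INFORMATION EXPERTS "
          "and and or ao company co co dba dba COPYRIGHT ")


def strip_terms(line: str) -> str:
    """Remove every run of listed terms together with its leading spaces.

    Assumption: the replacement is "" (not " "), since the space that
    follows a run is only looked at, never consumed.
    """
    return PATTERN.sub("", line)


print(repr(strip_terms(SOURCE)))
# ' MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT '

# A term at the very end of the line is still removed.
print(repr(strip_terms(" ACME CO")))
# ' ACME'
```

Note that `COPYRIGHT` survives even though it starts with `CO`: the `(?!\S)` lookahead rejects a term that is immediately followed by a non-whitespace character.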

+2

This is not a SQL Server issue. It is a general regex problem, not one specific to the VBScript engine that you access through COM. The problem is that the matches overlap: the suffix space of one term is also the prefix space of the next.

I tried your example at http://www.regextester.com/ and it behaves the same way.

The " and or " that is not replaced was never matched in the first place: its leading space had already been consumed as the trailing space of the first " and " match, so the scan resumes at the remaining text, "and or ...", which has no space in front of it.

I would look at replacing whole words instead; see: Regex match and replace word delimited by certain characters
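The overlap described in this answer can be reproduced outside SQL Server. The sketch below uses Python's `re` module as a stand-in for VBScript.RegExp (an assumption; the thread itself only uses the COM engine and an online tester) and a shortened sample string, so the matches are easy to follow.

```python
import re

# The original pattern from the question: every term carries its own
# leading and trailing space.
ORIGINAL = re.compile(
    r" and | and or | ao | company | co | co | dba | dba ",
    re.IGNORECASE,
)

# Shortened sample (an assumption made for readability).
SAMPLE = " EXPERTS and and or ao company COPYRIGHT "

# Each match consumes its trailing space, so the next term has no
# leading space left to match against and is skipped.
consumed = [m.group(0) for m in ORIGINAL.finditer(SAMPLE)]
print(consumed)
# [' and ', ' ao ']

result = ORIGINAL.sub(" ", SAMPLE)
print(repr(result))
# ' EXPERTS and or company COPYRIGHT '
```

Only every other term is matched, which is the behavior reported in the question.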

0

I would recommend this regex:

 ( (and(?: or)?|ao|company|c ?o|d ?b ?a)(?= )) 

First of all, I moved the prefix and suffix spaces outside your OR group (for efficiency):

 ( (and(?: or)?|ao|company|c ?o|d ?b ?a) ) 

However, when you use this regular expression, your matches overlap. For example, in " and and or " it first matches " and ", but what remains is "and or ", which no longer has a leading space.

So, to get around this, I changed the trailing space to a positive lookahead. It says "make sure this pattern is followed by a space" but does not consume the space itself.

Therefore, going through " and and or ", it matches " and" and leaves " and or ", which also matches the pattern. This more or less eliminates the overlapping-match problem. It will not match one of your words if it appears at the very end of the line, but your original regular expression did not either.

You can see it in action on the regexr website. Note: if you replace each match with a space, you will get too many spaces:

 MASHABLE LTD  THE INFORMATION EXPERTS         COPYRIGHT 

But you would have had the same problem with your original regex. If you remove the matches completely instead, you get:

 MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT 
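The two outcomes described above can be checked with a short sketch. It uses Python's `re` module as a stand-in for VBScript.RegExp, which is an assumption: the answer itself was tested on regexr, and both engines support the positive lookahead used here.

```python
import re

# Pattern from this answer; re.IGNORECASE stands in for IgnoreCase = 1.
PATTERN = re.compile(
    r"( (and(?: or)?|ao|company|c ?o|d ?b ?a)(?= ))",
    re.IGNORECASE,
)

SOURCE = (" MASHABLE LTD DBA THE INFORMATION EXPERTS "
          "and and or ao company co co dba dba COPYRIGHT ")

# The lookahead leaves the trailing space in place, so consecutive
# terms are all matched, each with its own leading space.
matched = [m.group(0) for m in PATTERN.finditer(SOURCE)]
print(matched)
# [' DBA', ' and', ' and or', ' ao', ' company', ' co', ' co', ' dba', ' dba']

# Replacing each match with a space leaves one extra space per term.
with_space = PATTERN.sub(" ", SOURCE)
print(repr(with_space))

# Removing each match gives the desired string.
removed = PATTERN.sub("", SOURCE)
print(repr(removed))
# ' MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT '
```

If a single space per removed run is required, one option is to remove the matches as shown and leave the untouched space after the run to act as the separator.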
0

Source: https://habr.com/ru/post/1391915/

