How to write a regular expression that "intersects" two regular expressions and can match anywhere in a string

Given two regular expressions, can we write a regular expression that represents their “intersection” in each of the following two different senses, for matching anywhere in a string?

  • For two regular expressions expr1 and expr2 , can you write a regular expression that is their intersection in the set sense (it matches exactly the strings that both of them match) and that can be used to match anywhere in a string?

    For example, expr1 is (123|12345|abc|abcde) and expr2 is (345|12345|abc|de) . I would like to find a regex that represents (12345|abc) .
    Applying that regex to blah12345blahabcdeblah should match 12345 and abc , whereas applying it to blah123blahabcblah should match only abc .

    (?=^expr1$)(?=^expr2$).* relies on the ^ and $ anchors, which prevent it from finding matches in the middle of a string.

  • Given two regular expressions expr1 and expr2 , how can we write a regular expression that is their "intersection" in the sense that it consists of those strings, each of which

    • is matched by at least one of the two regular expressions and
    • has a prefix matched by the other regular expression,

    and that can be used to match anywhere in a string?

    For example, expr1 is (123|abcde) and expr2 is (12345|abc) . I would like to find a regex that represents (12345|abcde) .
    Applying that regex to blah12345blahabcdeblah should match 12345 and abcde , while applying it to blah123blahabcblah should produce no matches ( 123 and abc do not match).

The definition of “intersection” in part 2 is more natural than the one in part 1 when the regex is used to match in the middle of a string:

In the example above, whenever 12345 matches, 123 is also present, so 12345 is covered by both expr1 and expr2 and should be in their “intersection”. If 123 matches, 12345 does not necessarily match, for example in blah123blahabcblah , so 123 is not considered part of the “intersection”. The same reasoning puts abcde in the “intersection” and leaves abc out.
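To make the two senses concrete, here is a minimal Python sketch that only covers the special case where both expressions are finite lists of literal strings, as in the examples above. The helper names (alternation, set_intersection, prefix_intersection) are hypothetical, and a string is treated as a prefix of itself. The sketch computes each "intersection" as a set, builds a longest-first alternation from it, and also shows that the anchored lookahead attempt from part 1 only works on a whole string.

```python
import re

def alternation(strings):
    """Build an unanchored, longest-first alternation of literal strings."""
    ordered = sorted(strings, key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

def set_intersection(a, b):
    """Sense 1: strings that appear in both lists."""
    return set(a) & set(b)

def prefix_intersection(a, b):
    """Sense 2: strings from one list that start with a string from the other."""
    def has_prefix_in(s, others):
        return any(s.startswith(o) for o in others)
    return ({s for s in a if has_prefix_in(s, b)}
            | {s for s in b if has_prefix_in(s, a)})

# Sense 1, with the lists from the first example.
sense1 = alternation(set_intersection(
    ["123", "12345", "abc", "abcde"], ["345", "12345", "abc", "de"]))

# Sense 2, with the lists from the second example.
sense2 = alternation(prefix_intersection(
    ["123", "abcde"], ["12345", "abc"]))

# The anchored attempt from part 1: fine on a whole string, useless mid-string.
anchored = re.compile(
    r"(?=^(?:123|12345|abc|abcde)$)(?=^(?:345|12345|abc|de)$).*")

if __name__ == "__main__":
    for text in ("blah12345blahabcdeblah", "blah123blahabcblah"):
        print(text, re.findall(sense1, text), re.findall(sense2, text))
    print(anchored.search("12345"), anchored.search("blah12345blah"))
```

This does not answer the question for arbitrary regular expressions; it only pins down what the expected matches are in the two examples.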

Thanks!

2 answers

Part 1

I haven't found a solution yet; I'll look into it some more.

Answer to part 2

Here is a solution that works in regex engines that allow backreferences inside lookbehinds, such as .NET and Matthew Barnett's excellent regex module for Python.

In your example:

    (?x)
    (?=(12345|abc))(?=(123|abcde))   # AND the expressions
    (?:                              # take the longest match
        \1(?<=\2.*)                  # 12345: \1 is 12345, \2 is 123
      |
        \2(?<=\1.*)                  # abcde: \2 is abcde, \1 is abc
    )

Generally:

    (?x)
    (?=(expr1))(?=(expr2))   # AND the expressions
    (?:                      # take the longest match
        \1(?<=\2.*)
      |
        \2(?<=\1.*)
    )

I think this works, but there may be an edge case that I haven't thought of.

Here is some tested Python code.

    import regex  # third-party module, not the built-in re

    pattern = r'''(?x)
    (?=(12345|abc))(?=(123|abcde))   # AND the expressions
    (?:                              # take the longest match
        \1(?<=\2.*)                  # 12345: \1 is 12345, \2 is 123
      |
        \2(?<=\1.*)                  # abcde: \2 is abcde, \1 is abc
    )
    '''
    myregex = regex.compile(pattern)

    subjects = [
        "blah12345blahabcdeblah",
        "blah123blahabcblah",
        "blah12345blahabcdeblah12378",
    ]
    for subject in subjects:
        # Print a header naming the exact string being searched.
        print("---", subject, "---")
        for match in myregex.finditer(subject):
            print("Overall match: ", match.group(0))
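For engines without variable-length lookbehind, such as Python's built-in re module, the same "AND the expressions, then take the longest match" idea can be approximated by moving the length comparison out of the pattern and into code. This is only a sketch of a swapped-in technique, not the pattern above: find_prefix_intersection and the group names first and second are hypothetical, and the sketch assumes expr1 and expr2 do not themselves define groups with those names.

```python
import re

def find_prefix_intersection(expr1, expr2, text):
    """Return the longer of the two captures wherever both expressions
    match starting at the same position."""
    # Two zero-width lookaheads: both expressions must match at one position.
    probe = re.compile(rf"(?=(?P<first>{expr1}))(?=(?P<second>{expr2}))")
    matches = []
    pos = 0
    while pos <= len(text):
        m = probe.search(text, pos)
        if m is None:
            break
        # Both captures start at m.start(), so the shorter is a prefix of the longer.
        longest = max(m.group("first"), m.group("second"), key=len)
        matches.append(longest)
        # Skip past the reported match; always advance by at least one character.
        pos = m.start() + max(len(longest), 1)
    return matches

if __name__ == "__main__":
    for subject in ("blah12345blahabcdeblah", "blah123blahabcblah"):
        print(subject, find_prefix_intersection("123|abcde", "12345|abc", subject))
```

One limitation to keep in mind: each lookahead captures only the first alternative the engine accepts, so an expression such as 123|12345 would capture 123 even where 12345 is present.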

These should do it:

  • /(?=expr1(.*)$)expr2(?=\1$)/
  • /(?=expr1)(?=expr2)/
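A quick way to try both patterns is Python's built-in re module, substituting the example expressions from the question. This is only a sketch; the variable names are made up for illustration. In this run the first pattern finds abc but not 12345: the lookahead settles on the first alternative of expr1 that works (123) and is not retried, so the captured remainder never lines up with what follows 12345. The second pattern is zero-width, so it reports where both expressions match but consumes no text.

```python
import re

# Part 1 example expressions, wrapped in non-capturing groups so that \1
# refers to the (.*) inside the lookahead.
expr1 = r"(?:123|12345|abc|abcde)"
expr2 = r"(?:345|12345|abc|de)"
same_end = re.compile(rf"(?={expr1}(.*)$){expr2}(?=\1$)")
found = [m.group(0) for m in same_end.finditer("blah12345blahabcdeblah")]

# Part 2 example expressions with the second pattern: positions only.
both_here = re.compile(r"(?=(?:123|abcde))(?=(?:12345|abc))")
starts = [m.start() for m in both_here.finditer("blah12345blahabcdeblah")]

if __name__ == "__main__":
    print(found)   # matched substrings from the first pattern
    print(starts)  # start offsets where both lookaheads succeed
```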

Source: https://habr.com/ru/post/970521/

