Regex logical OR

This is a purely academic exercise in regular expressions, meant to test my understanding of grouping multiple patterns. I have the following example string:

<xContext id="ABC">
<xData id="DEF">
<xData id="GHI">
<ID>JKL</ID>
<str>MNO</str>
<str>PQR</str>
<str>
<order id="STU">
<str>VWX</str>
</order>
<order id="YZA">
<str>BCD</str>
</order>
</str>
</xContext>

Using C# Regex, I am trying to extract the groups of 3 capital letters.

At the moment, if I use the pattern >.+?</ , I get

 Found 5 matches: >JKL</ >MNO</ >PQR</ >VWX</ >BCD</ 

If I then use id=".+?"> , I get

 Found 5 matches: id="ABC"> id="DEF"> id="GHI"> id="STU"> id="YZA"> 

Now I am trying to combine them by using the logical OR | between the terms on both sides: id="|>.+?">|</

However, this does not give me the combined results of both patterns.

My questions:

  • Can someone explain why this is not working properly?

  • How can I fix the pattern so that the two results shown above are combined, in the order shown?

  • How can I further improve the combined pattern so that it outputs just the letters? I assume this still involves ?<= and ?= , but I just want to check.

Thanks

4 answers

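The regex engine does not know where you intend the alternatives separated by | to start and stop. Alternation has the lowest precedence, so your pattern is read as three separate alternatives: id=" or >.+?"> or </ . You need to put the alternatives in subpatterns: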

 (id="|>).+?(">|</) 
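The difference can be checked with a short C# sketch. The shortened, one-element-per-line sample and the variable names are assumptions made for illustration; the line breaks matter because . does not match a newline by default, which the match counts reported in the question appear to rely on.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened sample, one element per line (assumed layout, see above).
string input = string.Join("\n",
    "<xContext id=\"ABC\">",
    "<ID>JKL</ID>",
    "<order id=\"STU\">",
    "<str>VWX</str>",
    "</order>",
    "</xContext>");

// Ungrouped: read as three alternatives:  id="  |  >.+?">  |  </
string[] ungrouped = Regex.Matches(input, @"id=""|>.+?"">|</")
    .Cast<Match>().Select(m => m.Value).ToArray();

// Grouped: (id=" or >), then the text, then ("> or </)
string[] grouped = Regex.Matches(input, @"(id=""|>).+?("">|</)")
    .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(" ", ungrouped)); // only bare id=" and </ fragments
Console.WriteLine(string.Join(" ", grouped));   // id="ABC"> >JKL</ id="STU"> >VWX</
```

With the ungrouped pattern, every match is just one of the bare delimiters, which is why the combined results never appear.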

However, regex is not the right tool for XML parsing.

Note that these parentheses also create capturing subpatterns, so you can retrieve each part of the match individually. For example:

 (id="|>)(.+?)(">|</) 

will return the entire match at index 0, the leading delimiter at index 1, the actual text you want at index 2, and the trailing delimiter at index 3. In most regular expression engines, you can also do this:

 (?:id="|>)(.+?)(?:">|</) 

to avoid capturing the delimiters. Now index 0 will hold the entire match, and index 1 will hold only the 3 letters. Unfortunately, I cannot tell you how to retrieve them in C#.
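For the C# side that this answer leaves open, here is a minimal sketch using Match.Groups. The shortened multi-line sample and the names input and letters are assumptions for illustration, not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Shortened sample, one element per line (assumed layout: '.' does not
// match '\n' by default, so each match stays within a single line).
string input = string.Join("\n",
    "<xContext id=\"ABC\">",
    "<ID>JKL</ID>",
    "<order id=\"STU\">",
    "<str>VWX</str>",
    "</order>",
    "</xContext>");

var letters = new List<string>();
foreach (Match m in Regex.Matches(input, @"(?:id=""|>)(.+?)(?:"">|</)"))
{
    // Groups[0] is the whole match (delimiters included);
    // Groups[1] is the single capturing group, i.e. the text in between.
    letters.Add(m.Groups[1].Value);
}

Console.WriteLine(string.Join(",", letters)); // ABC,JKL,STU,VWX
```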


You need to group the alternatives together:

 (?:id="|>).+?(?:">|</) 

And to get only the letters, use positive lookbehind and lookahead assertions:

 (?<=id="|>).+?(?=">|</) 

See here at Regexr

Groups starting with ?<= and ?= are zero-width assertions, which means they do not consume any characters (what they match is not part of the result); they just "look" behind or ahead.
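In C#, a sketch of the lookaround version might look like the following. The shortened multi-line sample and the variable names are assumptions for illustration. The lookbehind here has alternatives of different lengths, which the .NET engine accepts but some other regex engines reject.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened sample, one element per line (assumed layout: '.' does not
// match '\n' by default, so each match stays within a single line).
string input = string.Join("\n",
    "<xContext id=\"ABC\">",
    "<ID>JKL</ID>",
    "<order id=\"STU\">",
    "<str>VWX</str>",
    "</order>",
    "</xContext>");

// The delimiters are only asserted, not consumed, so Match.Value
// already contains just the text between them.
string[] values = Regex.Matches(input, @"(?<=id=""|>).+?(?="">|</)")
    .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(",", values)); // ABC,JKL,STU,VWX
```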


I would suggest using the regex pattern (?:(?<=id=")|(?<=>)).+?(?=">|</)

Check here at RegExr.


Capture groups FTW!

 @">(?<content>.+?)<|id=""(?<content>.+?)""" 

Named capture groups in particular, because the .NET regex flavor lets you use the same group name as many times as you want within a single regular expression. Calling Groups["content"] on the Match object returns the content regardless of where it was found (i.e., between two tags or in the id attribute).
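A minimal sketch of how this pattern could be used follows. The shortened multi-line sample and the names input and contents are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Shortened sample, one element per line (assumed layout: '.' does not
// match '\n' by default, so each match stays within a single line).
string input = string.Join("\n",
    "<xContext id=\"ABC\">",
    "<ID>JKL</ID>",
    "<order id=\"STU\">",
    "<str>VWX</str>",
    "</order>",
    "</xContext>");

// Both alternatives use the same group name, which .NET allows.
var pattern = @">(?<content>.+?)<|id=""(?<content>.+?)""";

var contents = new List<string>();
foreach (Match m in Regex.Matches(input, pattern))
{
    // Whichever alternative matched, the text ends up in "content".
    contents.Add(m.Groups["content"].Value);
}

Console.WriteLine(string.Join(",", contents)); // ABC,JKL,STU,VWX
```

Reusing a group name this way is specific to some regex flavors, so the pattern may need adjusting outside .NET.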


Source: https://habr.com/ru/post/1437460/

