Regex alternation order

I created a complex regular expression to extract data from a page of text. For some reason, the alternation order is not what I expect. A simple example:

((13th|(Executive |Residential)|((\w+) ){1,3})Floor)

Simply put, I try to match either the floor number or a known named floor and, as a fallback, I capture 1-3 unknown words followed by Floor so that I can review them later. (I actually use group names to tell these cases apart, but left them out so as not to confuse the problem.)

The problem is that the line

on the 13th Floor

does not give me 13th Floor. Instead I get on the 13th Floor, which seems to indicate that the third alternative matched. I would expect the match to be 13th Floor. I ordered the alternatives this way on purpose (or so I thought) to prioritize the match types and fall back to the undefined one only when the others are missing. I guess they were not joking when they said that regex is greedy, but I don't understand how to keep it greedy and still have it behave the way I want.

+4
2 answers

Well, an automaton is worth 1000 words:

Regular expression visualization

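The problem is that the \w+ sub-regex can start matching earlier in the string than the alternatives you wanted to prioritize. As @rigderunner says, an NFA engine reports the match that starts earliest: beginning at on, \w+ consumes on, the and 13th, then Floor matches, so the 13th, Executive and Residential alternatives never get their turn.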
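One way to confirm this is to ask the engine where the match starts and which group took part. The following is a minimal sketch using Python's re module with the pattern copied from the question; the variable names are only illustrative.

```python
import re

# Pattern copied from the question.
pattern = re.compile(r"((13th|(Executive |Residential)|((\w+) ){1,3})Floor)")

m = pattern.search("on the 13th Floor")

print(m.start())    # index where the overall match begins
print(m.group(0))   # the full matched text
# Group 4 is the "1 to 3 unknown words" alternative. It is None
# whenever one of the specific alternatives produced the match.
print(m.group(4) is not None)
```

On this input the match begins at index 0 and group 4 is set, so the generic alternative did the work.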

To see this more clearly, take a string with four words before Floor:

xxxx yyyy zzz tttt Floor

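Here the match is yyyy zzz tttt Floor: the engine first tries at xxxx, where no alternative can reach Floor within three words, so it gives up on that position and moves forward until an attempt succeeds.

In other words, the order of the alternatives only decides between matches that start at the same position. An earlier start position always wins, whichever alternative produces it.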
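The scan over start positions can be imitated by hand. This is only a rough sketch of what an unanchored search does, not the engine's real implementation; it uses the question's pattern and `Pattern.match(string, pos)` to attempt a match at exactly one position at a time.

```python
import re

# Pattern copied from the question.
pattern = re.compile(r"((13th|(Executive |Residential)|((\w+) ){1,3})Floor)")

text = "xxxx yyyy zzz tttt Floor"

first = None
for pos in range(len(text)):
    # Try to match starting exactly at pos; alternatives are tried in order.
    m = pattern.match(text, pos)
    if m:
        first = (pos, m.group(0))
        break  # the earliest successful start position wins

print(first)
print(pattern.search(text).start())  # search() stops at the same position
```

Positions 0 to 4 all fail, and the first success is at index 5, the start of yyyy.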

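If you want that priority, use two regexes and apply them in order, the specific one first and the generic one only if the first finds nothing: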

((13th|Executive|Residential) +Floor)

((\w+ +){1,3}Floor)
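A minimal sketch of this two-step approach in Python follows. The names `SPECIFIC`, `FALLBACK` and `find_floor` are hypothetical, and the quantifier is written as {1,3}, which is the syntax Python's re module expects.

```python
import re

# Specific pattern: tried first, over the whole string.
SPECIFIC = re.compile(r"((13th|Executive|Residential) +Floor)")
# Generic pattern: used only when the specific one finds nothing.
FALLBACK = re.compile(r"((\w+ +){1,3}Floor)")

def find_floor(text):
    """Return (kind, matched_text), preferring the specific pattern."""
    m = SPECIFIC.search(text)
    if m:
        return "known", m.group(1)
    m = FALLBACK.search(text)
    if m:
        return "unknown", m.group(1)
    return None

print(find_floor("on the 13th Floor"))
print(find_floor("on the Green Floor"))
print(find_floor("no match here"))
```

Because the specific pattern scans the entire string before the fallback is consulted, a known floor is found even when generic words precede it.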

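N.B.: the automaton above is a simplified picture. A backtracking NFA engine tries the alternatives in order at each start position, and only moves the start position forward once all of them have failed.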

+3

First, here is your regex laid out in free-spacing mode with comments:

import re

tidied = re.compile(r"""
    (                   # $1: ...
      (                 # $2: One ... from 3 alternatives.
        13th            # Either a1of3.
      | (               # Or a2of3 $3: One ... from 2 alternatives.
          Executive[ ]  # Either a1of2.
        | Residential   # Or a2of2.
        )               # End $3: One ... from 2 alternatives.
      | (               # Or a3of3 $4: Last match from 1 to 3 ...
          (\w+)         # $5: ...
          [ ]           #
        ){1,3}          # End $4: Last match from 1 to 3 ...
      )                 # End $2: One ... from 3 alternatives.
      Floor             #
    )                   # End $1: ...
    """, re.VERBOSE)

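Laid out like this, it is clear that the nested alternation group adds nothing. It can be flattened into a single group of four alternatives: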

tidied = re.compile(r"""
    (               # $1: One ... from 4 alternatives.
      13th          # Either a1of4.
    | Executive[ ]  # Or a2of4.
    | Residential   # Or a3of4.
    | (             # Or a4of4 $2: Last match from 1 to 3 ...
        (\w+)       # $3: ...
        [ ]         #
      ){1,3}        # End $2: Last match from 1 to 3 ...
    )               # End $1: One ... from 4 alternatives.
    Floor           #
    """, re.VERBOSE)

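Now the cause is easier to see: every alternative must be followed by Floor. The regex is not anchored, so a match may start anywhere. An NFA engine works from left to right and attempts a match at each position in turn. At the first position, on, the first three alternatives fail, but the fourth succeeds (up to three words, then Floor). Once Floor is found the engine declares a match, and later positions are never tried.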

Also note that, as written, the 13th and Residential alternatives have no trailing space, so they only match ResidentialFloor and 13thFloor.
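The effect of the missing spaces can be checked directly. In this sketch `used_fallback` is a hypothetical helper that reports whether the generic word alternative (group 4 of the question's pattern) produced the match.

```python
import re

# Pattern copied from the question.
pattern = re.compile(r"((13th|(Executive |Residential)|((\w+) ){1,3})Floor)")

def used_fallback(text):
    """Return True if the generic word alternative (group 4) matched."""
    m = pattern.fullmatch(text)
    if m is None:
        raise ValueError("no match: " + text)
    return m.group(4) is not None

samples = ["13thFloor", "ResidentialFloor", "13th Floor",
           "Residential Floor", "Executive Floor"]
for s in samples:
    print(s, used_fallback(s))
```

Only Executive Floor is matched by a specific alternative when a space is present. 13th Floor and Residential Floor fall through to the generic one, which suggests adding the space to those two alternatives as well.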

+3

Source: https://habr.com/ru/post/1541420/

