Perl Regular Expression Weighted Disjunction?

I am fairly comfortable with regular expressions, but I am having trouble with my current application, which involves disjunction (alternation).

My situation is this: I need to split an address into its component parts by matching regular expressions against the "identifier elements" of the address. Comparable English examples would be words like "state", "road", or "boulevard", if we wrote them out in our addresses. Imagine an address like the following, where (and this would never happen in English) the type of identifier is given after each name:

United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER

(Where words in CAPS are what I called "identifiers").

We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER

Well, this is probably far-fetched for English, but here's the catch: I work with Chinese data, where in fact this style of identifier specification happens all the time. Example below:

云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley

Matching this is simple enough: a lazy match of a candidate name, followed by a disjunctive list of the possible identifiers.

For China, the provincial-level entities are the following:

省 (Province) , 自治区 (Autonomous Region) , 市 (Municipality)

So my regex looks like this:

(.+?(?:(?:省)|(?:自治区)|(?:市)))

I have a series of these to account for the different parts of the address. The next level, corresponding to cities, looks like this:

(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

So, to match the provincial-level entity followed by the city-level entity:

(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

With the named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))

For the above, this gives:
$+{Province} = 云南省
$+{City} = 丽江市
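
For anyone who wants to reproduce this, here is a minimal sketch of the named-capture pattern above as a runnable script. The helper name split_province_city and the leading ^ anchor are my own additions, and the redundant inner (?:...) groups are dropped; it assumes Perl 5.10 or later (for named captures) and a UTF-8 encoded source file.

```perl
use strict;
use warnings;
use utf8;                              # the source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';    # so the Chinese text prints correctly

# Hypothetical helper: split off the province-level and city-level parts.
sub split_province_city {
    my ($address) = @_;
    if ($address =~ /^(?<Province>.+?(?:省|自治区|市))(?<City>.+?(?:地区|自治州|市|盟))/) {
        # Copy the captures out of %+ before returning them.
        my ($province, $city) = ($+{Province}, $+{City});
        return ($province, $city);
    }
    return;    # empty list when the address does not match
}

my ($province, $city) = split_province_city('云南省丽江市古城区西安街杨春巷');
print "Province = $province\n";    # 云南省
print "City     = $city\n";        # 丽江市
```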

This is all well and good, and it gets me pretty far. The problem, however, is that I am trying to account for identifiers that can be substrings of other identifiers. For example, one street-level entity is 村委会, which means a village organizing committee. In the set of addresses that I want to split, not every address uses the full form. In practice, I also find 村委 and just 村.

Problem? If I have a pure disjunction of these elements, we have the following:

(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))

What happens, however, if you have an entity such as 保定-村委会 (organizing committee of Baoding Village)? The lazy regular expression stops at 村 and calls it a day, orphaning our poor 委会, because 村 is one of the potential disjunctive elements.

Imagine the English equivalent as follows:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))

We have two input lines:
1. "Crap catelephant crap city", where we want "Crap catelephant" and "crap city"
2. "Shitty cat elephant city", where we want "Shitty cat" and "elephant city"

The solution, you say, is to make the capture before the identifier greedy. But! There are entities that share the same identifier without being at the same level.

Take 市, for example. It simply means "city". But in China there are county-level, provincial-level, and municipal-level cities. If this character occurs twice in a string, especially in two adjacent entities, a greedy search would incorrectly swallow both of them as the first entity. As in the following:

广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District

(Note that, as above, this was split by hand. The raw data is just a string of concatenated characters.)

The match for a greedy search would be 江门市开平市

This is incorrect, since the two adjacent entities must be split into their component parts. One is a provincial-level city; the other is a county-level city.
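
To make the difference concrete, here is a small sketch that runs the city-level pattern over the raw string twice, once with a greedy name and once with a lazy one. The helper city_part and the variable names are hypothetical, and the pattern is anchored with ^ for illustration.

```perl
use strict;
use warnings;
use utf8;                              # the source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

my $raw = '广东省江门市开平市三埠区石海管区';

# Province part, matched lazily as in the question (no capture group).
my $province = qr/.+?(?:省|自治区|市)/;

# Hypothetical helper: return the city-level capture for a given name pattern.
sub city_part {
    my ($address, $name) = @_;
    return $address =~ /^$province($name(?:地区|自治州|市|盟))/ ? $1 : undef;
}

print "greedy: ", city_part($raw, qr/.+/),  "\n";   # 江门市开平市
print "lazy:   ", city_part($raw, qr/.+?/), "\n";   # 江门市
```

The greedy name runs to the end of the string and backtracks only as far as the last 市, which is why both cities end up in one capture.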

Back to the original question (and thank you for reading this far): is there a way to put a weighting on the disjunctive elements? I would like the regular expression to find the highest-"weighted" identifier: 村委会 instead of a plain 村, for example, or "catelephant" instead of "cat". In preliminary experiments, the regex engine seems to work through the disjunctive list from left to right. Is that a correct guess? Should I put the most common identifiers first in the disjunctive list?

If I lost anyone with the parts related to China, I apologize and can clarify if necessary. The example does not really have to be Chinese. More generally, I think this is a question about the mechanics of disjunctive matching in regular expressions: in what order does the engine prefer the disjunctive elements, and how does it decide when to call it a day in the context of a lazy search?

In a sense, is there some kind of middle ground between lazy and greedy searches? Match the smallest name you can, up to the longest or highest-weighted disjunctive element? Be lazy, but make a little extra effort where you can, for the sake of thoroughness? (Incidentally, that was my work philosophy in college.)

1 answer

How alternation is handled depends on the particular regular expression engine. In almost all engines (including Perl's), alternation is eager: the engine tries the leftmost alternative first and only tries another alternative if that one fails. For example, with /(cat|catelephant)/ and nothing after the group, the catelephant alternative will never match, because cat always succeeds first. The solution is to reorder the alternatives so that the most specific one comes first.
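
As a rough sketch of that reordering (the helper names longest_first and street_part are mine, not part of any library), you can sort the identifiers by length before joining them, so that a longer identifier is always tried before any identifier it contains. This assumes "most specific" can be approximated by "longest", which holds for 村委会, 村委 and 村.

```perl
use strict;
use warnings;
use utf8;                              # the source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: build an alternation that tries longer identifiers first.
sub longest_first {
    my @ids = sort { length($b) <=> length($a) } @_;
    my $alt = join '|', map { quotemeta } @ids;
    return qr/(?:$alt)/;
}

# Hypothetical helper: lazy name followed by one of the given identifiers.
sub street_part {
    my ($address, $ids) = @_;
    return $address =~ /^(.+?$ids)/ ? $1 : undef;
}

my $naive  = qr/(?:村|村委|村委会)/;                 # shortest first: 村 always wins
my $sorted = longest_first('村', '村委', '村委会');   # tries 村委会, then 村委, then 村

print "naive:  ", street_part('保定村委会', $naive),  "\n";   # 保定村
print "sorted: ", street_part('保定村委会', $sorted), "\n";   # 保定村委会
```

The name part stays lazy, so the match still ends at the first position where any identifier fits; the ordering only decides which identifier is taken at that position.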


Source: https://habr.com/ru/post/1337527/