Why does this regex take 99.89% fewer steps in PCRE than in Python?

I just wrote this expression in the regex101 editor but forgot to switch it to the Python flavor. I am not familiar with the differences between the flavors and assumed they would be minor. They are not.

Perl/PCRE takes 99.89% fewer steps than Python (6,565 vs 6,377,715 steps).

https://regex101.com/r/PRwtJY/3

Regexp:

^(\d{1,3}) +((?:[a-zA-Z0-9\(\)\-≠,]+ )+) +£ *((?:[\d]  {1,4}|\d)+)∑([ \d]+)?

Any help would be appreciated! Thanks.

EDIT

The data source is multi-line text extracted from a PDF, so the result is less than perfect (you can see the original PDF here).

I am trying to extract the box number, the title and any number that is present (filled in) for specific lines. If you check the link above, you will see a complete sample. For instance:

Below is a screenshot of regex101 showing the positive matches. The top match shows the box number (155), the title (Trading profits) and the number (5561).

Constraints:

  • Ideally the values are extracted as they appear in the PCRE flavor, with little or no extra whitespace before or after the match: only the box number, title and value.
  • A line should match only if the number/value is filled in (for example, 5561 in the sample above belongs to box 155; the line immediately after it, box 160, should not match, but box 165 should).

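You could use the newer regex module, which supports possessive quantifiers and atomic groups. That takes roughly 50% fewer steps than PCRE (see the demo on regex101.com):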

^
(\d{1,3})\s++
((?>[^£\n]+))£\s++
([ \d]+)(?>[^∑\n]+)∑\s++
([ \d]+)
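
For anyone unfamiliar with the syntax: `++` is a possessive quantifier and `(?>...)` is an atomic group. Both tell the engine not to backtrack into what they have already matched, which is presumably where most of the saved steps come from. Below is a minimal sketch of that behaviour using only the standard re module. Since re accepts this syntax only from Python 3.11 on, the atomic group is emulated with a lookahead plus backreference (a stand-in for illustration, not part of the pattern above), and the sample string is made up.

```python
import re

digits = "12345"  # made-up sample input

# Ordinary greedy quantifier: \d+ first takes all five digits, then
# backtracks one step so that the trailing \d can still match.
greedy = re.fullmatch(r"\d+\d", digits)

# Emulated atomic group: the lookahead captures the longest run of digits,
# \1 consumes exactly that run, and the engine cannot go back into the
# lookahead to shorten it. Comparable in spirit to (?>\d+)\d or \d++\d.
atomic = re.fullmatch(r"(?=(\d+))\1\d", digits)

print(greedy is not None)  # True
print(atomic is not None)  # False
```

The greedy version finds a match by giving back one digit. The atomic version refuses to give anything back, so it fails straight away instead of trying shorter alternatives. On lines that cannot match anyway, that early failure is what keeps the step count down.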


In Python, the expression would be used like this:
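# third-party "regex" module (pip install regex), not the standard re module
import regex as re

# data is expected to hold the multi-line text extracted from the PDF
rx = re.compile(r'''
    ^
    (\d{1,3})\s++
    ((?>[^£\n]+))£\s++
    ([ \d]+)(?>[^∑\n]+)∑\s++
    ([ \d]+)''', re.M | re.X)

# strip the whitespace around every captured group
matches = [[group.strip() for group in m.groups()] for m in rx.finditer(data)]
print(matches)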

This yields:

[['145', 'Total turnover from trade', '5    2    0  0  0', '0  0'], ['155', 'Trading profits', '5  5  6  1', '0  0'], ['165', 'Net trading profits ≠ box 155 minus box 160', '5    5  6  1', '0  0'], ['235', 'P rofits before other deductions and reliefs ≠ net sum of', '5  5  6  1', '0  0'], ['300', 'Profits before qualifying donations and group relief ≠', '5  5    6  1', '0     0'], ['315', 'Profits chargeable to Corporation Tax ≠', '5  5    6  1', '0     0'], ['475', 'Net Corporation Tax liability ≠ box 440 minus box 470', '1  0  5  6', '5  9'], ['510', 'Tax chargeable ≠ total of boxes 475, 480, 500 and 505', '1  0  5  6', '5  9'], ['525', 'Self-assessment of tax payable ≠ box 510 minus box 515', '1  0  5  6', '5  9'], ['600', 'Tax outstanding ≠', '1  0  5  6', '5  9']]
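
The captured numbers still carry the spacing from the PDF extraction ('5  5  6  1' rather than 5561). If the compact form is wanted, a small post-processing step could look like the sketch below. It uses only the standard re module, the two rows are copied from the output above, and the names `squeeze_digits`, `value` and `tail` are hypothetical, chosen just for this example.

```python
import re

# Two rows copied from the output above: [box, title, value, tail],
# where tail is the group captured after the ∑ sign.
rows = [
    ['155', 'Trading profits', '5  5  6  1', '0  0'],
    ['475', 'Net Corporation Tax liability ≠ box 440 minus box 470',
     '1  0  5  6', '5  9'],
]


def squeeze_digits(text):
    """Remove all whitespace, so '5  5  6  1' becomes '5561'."""
    return re.sub(r'\s+', '', text)


# Leave box number and title alone; compact only the two numeric groups.
cleaned = [[box, title, squeeze_digits(value), squeeze_digits(tail)]
           for box, title, value, tail in rows]
print(cleaned)
```

Doing this after matching keeps the pattern itself simple; the regex only has to locate the groups, not normalise them.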

Source: https://habr.com/ru/post/1695589/

