Improving regular expression performance

I'm trying to find the end of an English sentence (only approximately) by looking for "!", "?" or ".", but in the case of "." only when it is not preceded by a common abbreviation such as Mr. or Dr.

Is there a way to make the following regular expression even slightly more efficient? Perhaps by sorting the negative lookbehind alternatives by descending length, or even alphabetically?

Here is the regex that I have:

((?<!St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)(\.)|(!)|(\?))(\s*$|\s+([_$#]|[A-Z][^.]))

Problem:

The site at http://regex.powertoy.org/ reports "7 matches of 21044 probes (completed)" even on a simple paragraph ... The outrageous probe count of 21044 seems closely tied to the number of alternatives in the negative lookbehind.

I want to reduce the computational work for the regex engine, since I have several GB of data to run through it.

Is there any way to tighten this up? Is a negative lookbehind the best or only way to accomplish this? Is there some other way to do it? Is regex the wrong tool for this task?

EDIT: I can use the ActionScript engine or PHP RegEx.

EDIT: I can't count on the number of spaces between sentences. Indeed!? Sigh.

Please do not answer unless you understand the internal workings of the regex engine as they relate to optimization.

Thanks in advance.

+3
3 answers

Match the "." first, and only then apply the negative look-behind to the titles followed by ".":

(?x:  # Allow spacing and comments
    (   
        (\.)    # First match "."
        (?<!    # Then negative-look-behind for titles followed by "."
            (?: St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)
            \.
        )
      |  (!)  
      |  (\?)
    )
    ( \s* $  |  \s+ ( [_$#] | [A-Z] [^.] ))
)

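On powertoy.org this brings the probe count down from about 70000 to about 2500. (On powertoy the "x" flag needs care, so strip the whitespace and comments before testing there.)

Going further, factor out the common prefixes of the alternatives: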

(?x:  # Allow spacing and comments
    (
        (\.)    # First match "."
        (?<!    # Then negative-look-behind for titles followed by "."
            (?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
            \.
        )
      |  (!)  
      |  (\?)
    )
    ( \s* $  |  \s+ ( [_$#] | [A-Z] [^.] ))
)

With this, the probe count drops to about 2000.
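A quick way to gain confidence in the prefix factoring is to check that the factored alternation accepts exactly the same abbreviations as the flat list. The following is a minimal Python sketch; the names `FLAT`, `FACTORED` and `same_language` are made up for illustration, and it only compares the two alternations, not the probe counts.

```python
import re

# The flat alternation from the question and the factored one from this answer.
FLAT = ("St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|"
        "Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt")
FACTORED = ("Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|"
            "L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z]")

flat_re = re.compile(f"(?:{FLAT})")
factored_re = re.compile(f"(?:{FACTORED})")


def same_language(words):
    """Return True if both alternations accept or reject every word identically."""
    return all(
        bool(flat_re.fullmatch(w)) == bool(factored_re.fullmatch(w))
        for w in words
    )


# Every literal abbreviation, a single capital letter, and some non-members.
abbreviations = [a for a in FLAT.split("|") if a != "[A-Z]"]
candidates = abbreviations + ["X", "Prof", "Mrz", "Co", "st"]
print(same_language(candidates))
```

Factoring changes only how the engine walks the alternatives, so any word where the two patterns disagree would point to a transcription mistake.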

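EDIT:
Finally, the look-behind can first check for a word boundary followed by a capital letter before trying the alternatives (the word boundary was suggested by @Swiss):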

        (?<!   # Then negative-look-behind for titles followed by "."
               \b (?= [A-Z] )  # But first ensure we have a capital letter before going on
               (?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
            \.
        )
+4

Instead of this:

(?<!St|Sgt|Rev|Ltd|Inc|...|Capt)\.

... use this:

\.(?<!(?:St|Sgt|Rev|Ltd|Inc|...|Capt)\.)

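This way the lookbehind is attempted only after a dot has been matched, instead of at every position in the text. In my test the probe count dropped from 28,423 to 1,945. (These counts are from Powertoy.)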
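The two forms find the same dots; only the amount of work differs. Here is a small Python sketch of that equivalence. It uses three abbreviations of equal length because Python's `re` module accepts only fixed-width lookbehinds; the sample text and the helper `dot_positions` are assumptions made for illustration.

```python
import re

TEXT = "Sgt. Pepper met Rev. Green. They left. Then Col. Mustard arrived."

# Lookbehind first: the engine evaluates the lookbehind at every position.
lookbehind_first = re.compile(r"(?<!Sgt|Rev|Col)\.")
# Dot first: the lookbehind runs only where a literal dot was found.
dot_first = re.compile(r"\.(?<!(?:Sgt|Rev|Col)\.)")


def dot_positions(pattern, text):
    """Return the index of every dot accepted by the pattern."""
    return [m.start() for m in pattern.finditer(text)]


print(dot_positions(lookbehind_first, TEXT) == dot_positions(dot_first, TEXT))
```

Note that the dot-first lookbehind has to include the dot itself, because by the time it runs the dot has already been consumed.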

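Next, the alternation (!)|(\?) can be replaced with a character class, i.e. ([!?]). With that, the count drops to 1,344. Also, unless you actually need the captures, use non-capturing groups (?:...) instead of (...).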
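As a sanity check, the alternation and the character class select the same characters. A minimal Python sketch with a made-up sample string:

```python
import re

TEXT = "Really? Yes! Are you sure?! Fine."

# Two capturing branches: the engine tries them one after the other.
alternation = re.compile(r"(!)|(\?)")
# One character class: a single membership test per position, no captures.
char_class = re.compile(r"[!?]")

alt_hits = [m.start() for m in alternation.finditer(TEXT)]
class_hits = [m.start() for m in char_class.finditer(TEXT)]
print(alt_hits == class_hits)
```

The only observable difference is that the character class no longer reports which of the two marks matched through separate groups; if that matters, inspect the matched character instead.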

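One caveat concerns the [A-Z] alternative. It rejects a dot after any capital letter, including one at the end of a longer word, so it trumps most of the other alternatives. If you match with /i, wrap the expression in (?-i:...) so that [A-Z] stays case-sensitive. Adding a word boundary \b (before the alternatives), as @Swiss suggested, restricts it to single-letter initials: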

(?-i:\.(?<!\b(?:St|Sgt|Rev|Ltd|Inc|...|[A-Z]|Assn|Capt)\.))

... which, combined with [!?], finds 6 matches in 1404 probes in Regex Powertoy.

+4

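Careful!

You are targeting PHP or ActionScript, but the probe counts come from Regex Powertoy, which is built on Java's regex engine, and Java handles lookbehinds differently.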

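Lookbehind is not implemented the same way everywhere; each flavor has its own restrictions. For example, in Perl and Python a lookbehind must have a single fixed length: (?<!St) is fine, and so is (?<!Sgt|Rev), but (?<!St|Sgt) is not. Java is more flexible; a lookbehind only needs an obvious maximum length, so (?<!St|Sgt) works, as does (?<!\w{3,12}), but (?<!\w+) does not.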
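The fixed-length rule is easy to observe in Python, whose `re` module rejects variable-width lookbehinds at compile time. A small sketch (the helper `compiles` is a made-up name):

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


patterns = [
    r"(?<!St)\.",       # one alternative, width 2
    r"(?<!Sgt|Rev)\.",  # two alternatives, both width 3
    r"(?<!St|Sgt)\.",   # widths 2 and 3: variable width
    r"(?<!\w+)\.",      # unbounded width
]
results = {p: compiles(p) for p in patterns}
for pattern, ok in results.items():
    print(f"{pattern!r}: {'accepted' if ok else 'rejected'}")
```

This says nothing about Java or PCRE, which apply their own rules; it only shows the strictest of the behaviors described above.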

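PHP and ActionScript both use PCRE, which is slightly more flexible than Perl and Python about lookbehinds: the alternatives in a lookbehind may have different lengths, as long as each one is fixed length on its own. The alternation has to be at the top level of the lookbehind, though, so it cannot be wrapped in a group.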

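With that in mind, the ., ! and ? can be matched by a single character class, [.!?], with the \b and \. distributed into every alternative of the lookbehind, like this: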

/([.!?])(?<!\bSt\.|\bSgt\.|\bRev\.|\bLtd\.|\bInc\.|\bLt\.|\bJr\.|\bSr\.|\bEsq\.|\bInst\.|\bHon\.|\bGen\.|\bCpl\.|\bComdr\.|\bCol\.|\bCorp\.|\bMr\.|\bDr\.|\bGov\.|\bMrs\.|\bMs\.|\b[A-Z]\.|\bAssn\.|\bCapt\.)(?:\s*$|\s+(?:[_$#]|[A-Z][^.]))/

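Regex Powertoy reports 2208 probes for this regex. With several GB of data to process, it is worth measuring in the engine you will actually use.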

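EDIT: Here is another variant. It splits the abbreviations into separate lookbehinds grouped by length (so each lookbehind is fixed length), factors out common prefixes, and checks for a capital letter at a word boundary before trying each group: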

/(?:
   \.
     (?<!\bComdr\.)
     (?<!(?=\b[A-Z])(?:Assn|C(?:apt|orp)|Inst)\.)
     (?<!(?=\b[A-Z])(?:C(?:ol|pl)|Esq|G(?:en|ov)|Hon|Inc|Ltd|Mrs|Rev|Sgt)\.)
     (?<!(?=\b[A-Z])(?:Dr|Jr|Lt|M[rs]|S[rt])\.)
     (?<!\b[A-Z]\.)
   |
   [!?]
 )
 (?:\s*$|\s+(?:[_$#]|[A-Z][^.]))
/x

This has been tested with PHP on ideone.com.
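For readers who want to try the grouped-lookbehind idea outside PHP, here is a rough Python port. It is a sketch under assumptions, not the answer's exact regex: the (?=\b[A-Z]) guard is replaced by a plain \b, the common prefixes are written out, and `split_sentences` plus the sample text are invented for illustration.

```python
import re

# One fixed-width lookbehind per abbreviation length, as in the regex above.
NOT_ABBREVIATION = (
    r"(?<!\bComdr\.)"
    r"(?<!\b(?:Assn|Capt|Corp|Inst)\.)"
    r"(?<!\b(?:Col|Cpl|Esq|Gen|Gov|Hon|Inc|Ltd|Mrs|Rev|Sgt)\.)"
    r"(?<!\b(?:Dr|Jr|Lt|M[rs]|S[rt])\.)"
    r"(?<!\b[A-Z]\.)"
)
SENTENCE_END = re.compile(
    r"(?:\." + NOT_ABBREVIATION + r"|[!?])"
    r"(?:\s*$|\s+(?:[_$#]|[A-Z][^.]))"
)


def split_sentences(text):
    """Split text after each punctuation mark accepted by SENTENCE_END."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        end = match.start() + 1  # keep the punctuation with its sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. Smith met Mr. Jones at St. Mary's. They talked! Did Sgt. Pepper join? Yes."
for sentence in split_sentences(sample):
    print(sentence)
```

Like the original, this pattern consumes the first two characters of the following sentence, so a very short next sentence may be missed; turning the trailing group into a lookahead would avoid that at the cost of departing from the regex shown above.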

+2

Source: https://habr.com/ru/post/1770236/

