Regex, find the pattern only in the middle of the line

Question

I am using Python 2.6 and trying to find runs of a repeated character in a string, say runs of n, for example nnnnnnnABCnnnnnnnnnDEF . The runs can occur anywhere in the string, and n can be a variable.

If I create a regex like this:

re.findall(r'^(((?i)n)\2{2,})', s) ,

I can find case-insensitive runs of n only at the beginning of the line, which is good. If I do it like this:

re.findall(r'(((?i)n)\2{2,}$)', s) ,

I can find them only at the end of the string. But what about in the middle?

At first I thought of using re.findall(r'(((?i)n)\2{2,})', s) together with the two previous regexes, checking the length of the returned list and whether n appears at the beginning or the end of the line, and then doing logical tests. That turned into an ugly if-else mess very quickly.

Then I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s) , which seems to exclude the beginning just fine, but (?!$) or (?!\z) at the end of the regular expression excludes only the last n in ABCnnnn . Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s) , which seems to work in some cases but gives weird results in others. It seems I need a lookahead or a lookbehind, but I can't wrap my head around it.

+5

python regex

Dima1982 Feb 25 '16 at 9:17

3 answers

Since "n" is a character (not a subpattern), you can simply use:

 re.findall(r'(?<=[^n])nn+(?=[^n])(?i)', s)

or better:

 re.findall(r'n(?<=[^n]n)n+(?=[^n])(?i)', s)
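A minimal sketch of the first pattern, for readers on a newer Python 3: recent releases reject an inline (?i) that is not at the very start of the pattern, so this version passes re.I as an argument instead. The name inner_runs and the sample strings are illustrative assumptions, not part of the answer.

```python
import re

# Same lookaround idea as the first pattern above; re.I replaces the
# trailing inline (?i), which newer Python 3 releases do not accept.
INNER_N = re.compile(r'(?<=[^n])nn+(?=[^n])', re.I)

def inner_runs(s):
    """Return runs of two or more n/N with a non-n character on both sides."""
    return INNER_N.findall(s)

print(inner_runs("nnnABCnnnnDEFnnn"))  # ['nnnn']
print(inner_runs("nnABCnNnnDEFnn"))    # ['nNnn']
```

Both lookarounds need an actual character next to the run, which is why runs touching the start or the end of the string are skipped.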

+2

Casimir et Hippolyte Feb 25 '16 at 9:39

NOTE: This solution assumes that n may be a sequence of several characters. For more efficient alternatives when n is just one character, see the other answers here.

You can use

 (?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)

See the regex demo.

The regex matches runs of consecutive n (case can be ignored with the re.I flag) that are not at the beginning ( (?<!^) ) or the end ( (?!$) ) of the string, and that are not before ( (?!n) ) or after ( (?<!n) ) another n .

(?<!^)(?<!n) is a sequence of two lookbehinds: (?<!^) means the following pattern must not match if it is preceded by the start of the string, and the negative lookbehind (?<!n) means the following pattern must not match if it is preceded by n . The negative lookaheads (?!$) and (?!n) work the same way in the other direction: (?!$) fails the match if the string ends at the current position, and (?!n) fails the match if n occurs right after the current position (that is, immediately after all the consecutive n s have been matched). All the lookaround conditions must be met, which is why we get only the innermost matches.

See the IDEONE demo:

 import re
 p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
 s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
 print([x.group() for x in p.finditer(s)])
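Since the question says n can be a variable, the same lookarounds can be built around an escaped unit. This is only a sketch under assumptions: find_inner_runs and its min_repeats parameter are hypothetical names, and a non-capturing group with a counted repeat stands in for the backreference used above.

```python
import re

def find_inner_runs(unit, s, min_repeats=3):
    """Find runs of `unit` (repeated at least min_repeats times, ignoring case)
    that touch neither the start nor the end of s."""
    u = re.escape(unit)  # treat the unit as literal text, not as a pattern
    pattern = re.compile(
        r'(?<!^)(?<!{u})(?:{u}){{{m},}}(?!$)(?!{u})'.format(u=u, m=min_repeats),
        re.IGNORECASE,
    )
    return [m.group() for m in pattern.finditer(s)]

print(find_inner_runs('n', "nnnnnnnABCnnnnnNnnnnDEFnNn"))  # ['nnnnnNnnnn']
print(find_inner_runs('ab', "abababXabababYababab"))       # ['ababab']
```

Because re.escape(unit) has a fixed length, it is safe to place inside the lookbehind, which Python's re module requires to be fixed-width.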

+1

Wiktor Stribiżew Feb 25 '16 at 9:33

Kasramvd · Accepted Answer · 2016-02-25T09:32:41+0000

Instead of using a complex regex to avoid matching the leading and trailing n characters, a more Pythonic approach is to strip() them from your string first, then find all the remaining runs of n with re.findall() and a simple regular expression:

 >>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn"
 >>> import re
 >>> re.findall(r'n{2,}', s.strip('n'), re.I)
 ['nnnn', 'nnnnn']

Note: re.I is the ignore-case flag, which makes the regex engine match both uppercase and lowercase characters.
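One caveat worth sketching: str.strip('n') is case-sensitive, so when the outer runs contain uppercase N the re.I flag alone does not exclude them. Passing both cases to strip() is one possible adjustment. The sample string below is an illustrative assumption, not taken from the answer.

```python
import re

s = "NnnABCnnnnDEFnnNN"

# strip('n') removes only lowercase n from the ends; here both ends are
# guarded by an uppercase N, so nothing is stripped and the outer runs match.
case_sensitive = re.findall(r'n{2,}', s.strip('n'), re.I)

# strip('nN') removes both cases before the search runs.
both_cases = re.findall(r'n{2,}', s.strip('nN'), re.I)

print(case_sensitive)  # ['Nnn', 'nnnn', 'nnNN']
print(both_cases)      # ['nnnn']
```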
