Python regex that should not match inside http://

Question

I'm having trouble matching and replacing certain words only when they are not part of an http:// URL.

My current regex:

http://.*?\s+

This matches URLs such as http://www.egg1.com and http://www.egg2.com.

I need a regular expression that matches certain words only when they appear outside an http:// URL.

Example:

 "This is a sample. http://www.egg1.com and http://egg2.com. This regex will only match this egg1 and egg2 and not the others contained inside http:// "

Match: egg1 egg2

Replaced: replaced1 replaced2

Final result:

  "This is a sample. http://www.egg1.com and http://egg2.com. This regex will only match this replaced1 and replaced2 and not the others contained inside http:// "

Question: how can I match specific patterns (egg1 and egg2 in the example) only when they are not part of an http:// URL? egg1 and egg2 must not be matched when they appear inside one.

+6

python regex regex-negation

thinkcool Jul 28 '11 at 13:31

4 answers

This pattern matches http://... URLs without capturing them, so only an egg1 outside a URL ends up in the capture group:

 (?:http://.*?\s+)|(egg1)
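A minimal sketch of how this pattern behaves with re.findall(), using made-up sample strings that are not from the question. URL matches leave the capture group empty, so they have to be filtered out. Note also that http://.*?\s+ only consumes a URL that is followed by whitespace, so a URL at the very end of the string is not skipped.

```python
import re

pattern = re.compile(r'(?:http://.*?\s+)|(egg1)')

# Hypothetical sample text, not taken from the question.
text = "http://www.egg1.com egg1 here"

# findall() returns the content of group 1 for every match:
# an empty string where a URL was consumed, 'egg1' elsewhere.
raw = pattern.findall(text)
outside = [m for m in raw if m]
print(outside)

# A URL at the end of the string has no trailing whitespace, so the
# first alternative fails there and the egg1 inside the URL is found.
at_end = [m for m in pattern.findall("see http://www.egg1.com") if m]
print(at_end)
```

Replacing .*?\s+ with \S+ would avoid the end-of-string case, since \S+ does not need whitespace after the URL.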

+2

Karolis Jul 28 '11 at 14:13

You need to precede your pattern with a negative lookbehind assertion:

 (?<!http://)egg[0-9]

With this regular expression, every time the regex engine finds text matching egg[0-9], it looks back to check whether http:// immediately precedes it. A negative lookbehind assertion begins with (?<! and ends with ). Whatever appears between these delimiters must not precede the following pattern, and it is not included in the result.
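One caveat worth checking: the lookbehind only inspects the characters directly in front of egg[0-9]. The sketch below uses a made-up sample string (not from the question) in which one URL has www. between the scheme and the word, so that word is still matched.

```python
import re

regex = re.compile(r'(?<!http://)egg[0-9]')

# Hypothetical sample: the first URL has "www." between "http://" and egg1.
sample = "http://www.egg1.com http://egg2.com egg3"

# egg1 is preceded by "www.", not by "http://", so it is matched;
# egg2 directly follows "http://" and is rejected.
matches = regex.findall(sample)
print(matches)
```

So this approach fits only when the word directly follows http://, which is not the case for the http://www.egg1.com URL in the question.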

How to use it in your case:

 >>> import re
 >>> regex = re.compile(r'(?<!http://)egg[0-9]')
 >>> a = "Example: http://egg1.com egg2 http://egg3.com egg4foo"
 >>> regex.findall(a)
 ['egg2', 'egg4']

+1

brandizzi Jul 28 '11 at 13:42

Extending brandizzi's answer, I would just change the regular expression:

 (?<!http://[\w\._-]*)(egg1|egg2)
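As far as I know, Python's standard re module only accepts fixed-width patterns inside a lookbehind, so the variable-length [\w\._-]* is rejected when the pattern is compiled. A small sketch to check this (the variable names are made up for illustration):

```python
import re

try:
    re.compile(r'(?<!http://[\w\._-]*)(egg1|egg2)')
    compiled = True
except re.error as exc:
    # The re module refuses lookbehinds whose width is not fixed.
    compiled = False
    print("re.error:", exc)
```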

-2

Mike Jul 28 '11 at 13:58

Ferdinand Beyer · Accepted Answer · 2011-07-28T13:47:19+0000

One solution I can think of is to create a combined pattern for the HTTP URLs and your pattern, and then filter the matches accordingly:

 import re

 t = "http://www.egg1.com http://egg2.com egg3 egg4"
 # Group 1 consumes whole URLs; group 2 captures eggs outside them.
 p = re.compile(r'(http://\S+)|(egg\d)')
 for url, egg in p.findall(t):
     if egg:
         print(egg)

prints:

 egg3
 egg4

UPDATE: To use this idiom with re.sub(), pass a replacement function:

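 p = re.compile(r'(http://\S+)|(egg(\d+))')

 def repl(match):
     # Group 2 is set only for an egg found outside a URL.
     if match.group(2):
         return 'spam{0}'.format(match.group(3))
     # Otherwise a URL was matched; return it unchanged.
     return match.group(0)

 print(p.sub(repl, t))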

prints:

  http://www.egg1.com http://egg2.com spam3 spam4
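To apply this to the exact replacements asked for in the question (egg1 to replaced1, egg2 to replaced2), the same idiom can be driven by a dictionary. This is only a sketch: the names REPLACEMENTS and replace_outside_urls are made up for illustration, and like the pattern above it assumes a URL is any run of non-whitespace characters starting with http://.

```python
import re

# Hypothetical mapping taken from the example in the question.
REPLACEMENTS = {"egg1": "replaced1", "egg2": "replaced2"}

# Group 1 consumes whole URLs; group 2 captures any word from the mapping.
PATTERN = re.compile(
    r"(http://\S+)|(" + "|".join(re.escape(w) for w in REPLACEMENTS) + r")"
)


def replace_outside_urls(text):
    """Replace mapped words unless they sit inside an http:// URL."""
    def repl(match):
        word = match.group(2)
        if word:
            return REPLACEMENTS[word]
        # A URL was matched; leave it untouched.
        return match.group(0)

    return PATTERN.sub(repl, text)


sample = ("This is a sample. http://www.egg1.com and http://egg2.com. "
          "This regex will only match this egg1 and egg2 and not the "
          "others contained inside http:// ")
print(replace_outside_urls(sample))
```

Because the URL alternative comes first and consumes the whole URL, any egg1 or egg2 inside it is never offered to the second alternative.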
