How do you translate this regular expression idiom from Perl to Python?

Question


I switched from Perl to Python about a year ago and have not looked back. There is only one idiom I have found that is easier to write in Perl than in Python:

    if ($var =~ /foo(.+)/) {
        # do something with $1
    } elsif ($var =~ /bar(.+)/) {
        # do something with $1
    } elsif ($var =~ /baz(.+)/) {
        # do something with $1
    }

The corresponding Python code is less elegant, because the if statements keep nesting:

    m = re.search(r'foo(.+)', var)
    if m:
        pass  # do something with m.group(1)
    else:
        m = re.search(r'bar(.+)', var)
        if m:
            pass  # do something with m.group(1)
        else:
            m = re.search(r'baz(.+)', var)
            if m:
                pass  # do something with m.group(1)

Does anyone have an elegant way to reproduce this pattern in Python? I saw anonymous function dispatch tables, but they seem cumbersome to me for a small number of regular expressions ...

+45

python regex perl

Dan Lenski Sep 23 '08 at 16:55


14 answers

Using named groups and a dispatch table:

    import re

    r = re.compile(r'(?P<cmd>foo|bar|baz)(?P<data>.+)')

    def do_foo(data):
        ...

    def do_bar(data):
        ...

    def do_baz(data):
        ...

    dispatch = {
        'foo': do_foo,
        'bar': do_bar,
        'baz': do_baz,
    }

    m = r.match(var)
    if m:
        dispatch[m.group('cmd')](m.group('data'))

With a little introspection, you can generate the regular expression and the dispatch table automatically.
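The introspection idea could look roughly like the sketch below. It is only an illustration under assumptions: the `Commands` class, the `do_` naming convention, and `build_dispatcher` are hypothetical names, not part of the answer above.

```python
import re


class Commands:
    """Hypothetical container: each do_<name> method handles '<name>...'."""

    def do_foo(self, data):
        return "foo got " + data

    def do_bar(self, data):
        return "bar got " + data

    def do_baz(self, data):
        return "baz got " + data


def build_dispatcher(handlers):
    # Collect every callable attribute whose name starts with "do_".
    table = {
        name[3:]: getattr(handlers, name)
        for name in dir(handlers)
        if name.startswith("do_") and callable(getattr(handlers, name))
    }
    # Build one alternation from the command names, longest first, so a
    # command that is a prefix of another cannot shadow it.
    names = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?P<cmd>%s)(?P<data>.+)" % "|".join(re.escape(n) for n in names)
    )

    def dispatch(text):
        m = pattern.match(text)
        if m is None:
            return None  # no command recognised
        return table[m.group("cmd")](m.group("data"))

    return dispatch


dispatch = build_dispatcher(Commands())
print(dispatch("barstuff"))
```

Adding a new command then only requires adding another `do_` method; the regex and the table follow from it.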

+18

Thomas Wouters Sep 23 '08 at 16:58


Yes, this is a little annoying. Perhaps this will work for your case.

    import re

    class ReCheck(object):
        def __init__(self):
            self.result = None

        def check(self, pattern, text):
            # Remember the last result so it can be read after the test.
            self.result = re.search(pattern, text)
            return self.result

    var = 'bar stuff'
    m = ReCheck()

    if m.check(r'foo(.+)', var):
        print(m.result.group(1))
    elif m.check(r'bar(.+)', var):
        print(m.result.group(1))
    elif m.check(r'baz(.+)', var):
        print(m.result.group(1))

EDIT: Brian correctly pointed out that my first attempt did not work. Unfortunately, this attempt is longer.

+10

Pat Notz Sep 23 '08 at 19:04


    r"""
    This is an extension of the re module. It stores the last successful
    match object and lets you access its methods and attributes via
    this module.

    This module exports the following additional functions:
        expand  Return the string obtained by doing backslash substitution
                on a template string.
        group   Returns one or more subgroups of the match.
        groups  Return a tuple containing all the subgroups of the match.
        start   Return the indices of the start of the substring matched
                by group.
        end     Return the indices of the end of the substring matched
                by group.
        span    Returns a 2-tuple of (start(), end()) of the substring
                matched by group.

    This module defines the following additional public attributes:
        pos         The value of pos which was passed to the search() or
                    match() method.
        endpos      The value of endpos which was passed to the search()
                    or match() method.
        lastindex   The integer index of the last matched capturing group.
        lastgroup   The name of the last matched capturing group.
        re          The regular expression object which was passed to
                    search() or match().
        string      The string passed to match() or search().
    """

    import re as re_
    from re import *
    from functools import wraps

    __all__ = re_.__all__ + [
        "expand", "group", "groups", "start", "end", "span",
        "last_match", "pos", "endpos", "lastindex", "lastgroup",
        "re", "string",
    ]

    last_match = pos = endpos = lastindex = lastgroup = re = string = None

    def _set_match(match=None):
        global last_match, pos, endpos, lastindex, lastgroup, re, string
        if match is not None:
            last_match = match
            pos = match.pos
            endpos = match.endpos
            lastindex = match.lastindex
            lastgroup = match.lastgroup
            re = match.re
            string = match.string
        return match

    @wraps(re_.match)
    def match(pattern, string, flags=0):
        return _set_match(re_.match(pattern, string, flags))

    @wraps(re_.search)
    def search(pattern, string, flags=0):
        return _set_match(re_.search(pattern, string, flags))

    @wraps(re_.findall)
    def findall(pattern, string, flags=0):
        # findall() returns strings (or tuples), not match objects, so
        # record the last match via finditer() before returning the result.
        for m in re_.finditer(pattern, string, flags):
            _set_match(m)
        return re_.findall(pattern, string, flags)

    @wraps(re_.finditer)
    def finditer(pattern, string, flags=0):
        for match in re_.finditer(pattern, string, flags):
            yield _set_match(match)

    def expand(template):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.expand(template)

    def group(*indices):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.group(*indices)

    def groups(default=None):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.groups(default)

    def groupdict(default=None):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.groupdict(default)

    def start(group=0):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.start(group)

    def end(group=0):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.end(group)

    def span(group=0):
        if last_match is None:
            raise TypeError("No successful match yet.")
        return last_match.span(group)

    del wraps  # Not needed past module compilation

For example (assuming the module above is saved as gre.py and imported with import gre):

    if gre.match("foo(.+)", var):
        pass  # do something with gre.group(1)
    elif gre.match("bar(.+)", var):
        pass  # do something with gre.group(1)
    elif gre.match("baz(.+)", var):
        pass  # do something with gre.group(1)

+10

Markus Jarderot Sep 25 '08 at 20:10


I would suggest this, since it uses a single regex to achieve your goal. It is still functional code, and no worse than your old Perl.

    import re

    var = "barbazfoo"

    # Group 1 is the keyword that matched, group 2 is the text after it.
    m = re.search(r'(foo|bar|baz)(.+)', var)
    if m is None:
        pass  # none of the keywords was found
    elif m.group(1) == 'foo':
        print(m.group(2))  # do something with m.group(2)
    elif m.group(1) == 'bar':
        print(m.group(2))  # do something with m.group(2)
    elif m.group(1) == 'baz':
        print(m.group(2))  # do something with m.group(2)

+9

Jack M. Sep 23 '08 at 21:50


Alternatively, something that doesn't use regular expressions at all:

    prefix, data = var[:3], var[3:]
    if prefix == 'foo':
        pass  # do something with data
    elif prefix == 'bar':
        pass  # do something with data
    elif prefix == 'baz':
        pass  # do something with data
    else:
        pass  # do something with var

Which approach is appropriate depends on your real problem. Remember that in Python regular expressions are not the Swiss Army knife they are in Perl; Python has other constructs for string manipulation.
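As one illustration of those other constructs, here is a minimal sketch using str.startswith, which also copes with prefixes of different lengths. The helper name `split_prefix` and its default prefixes are made up for this example. Like the slicing version, it only looks at the start of the string, whereas re.search would find the keyword anywhere.

```python
def split_prefix(var, prefixes=("foo", "bar", "baz")):
    """Return (prefix, rest) for the first prefix var starts with,
    or (None, var) when none of them applies."""
    for prefix in prefixes:
        if var.startswith(prefix):
            return prefix, var[len(prefix):]
    return None, var


prefix, data = split_prefix("barstuff")
if prefix == "foo":
    print("foo:", data)
elif prefix == "bar":
    print("bar:", data)
elif prefix == "baz":
    print("baz:", data)
else:
    print("no known prefix in", data)
```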

+4

Thomas Wouters Sep 23 '08 at 17:07


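    import re

    def find_first_match(string, *regexes):
        # Try each (regex, handler) pair in order; call the first handler
        # whose regex matches and stop.
        for regex, handler in regexes:
            m = re.search(regex, string)
            if m:
                handler(m)
                return
        else:
            raise ValueError

    find_first_match(
        var,  # the string to test
        (r'foo(.+)', handle_foo),
        (r'bar(.+)', handle_bar),
        (r'baz(.+)', handle_baz))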

To speed it up, you could combine all the regular expressions into a single one internally and create the dispatcher on the fly. Ideally, this would then be turned into a class.
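A rough sketch of that class idea follows. It is an assumption-laden illustration, not the answer's own code: `CombinedDispatcher` is a hypothetical name, handlers receive the captured groups rather than a match object, and the patterns must not use numbered backreferences or named groups that clash. Note one behavioural difference: a combined alternation picks the match that starts earliest in the text, and rule order only breaks ties, whereas the loop above gives strict priority to earlier rules.

```python
import re


class CombinedDispatcher:
    """Sketch: fold several (pattern, handler) rules into one compiled regex."""

    def __init__(self, *rules):
        self._rules = {}   # wrapper group number -> (handler, inner group count)
        parts = []
        index = 1          # group number the next wrapper group will get
        for pattern, handler in rules:
            inner = re.compile(pattern).groups
            parts.append("(%s)" % pattern)
            self._rules[index] = (handler, inner)
            index += 1 + inner   # skip the wrapper and the pattern's own groups
        self._regex = re.compile("|".join(parts))

    def __call__(self, text):
        m = self._regex.search(text)
        if m is None:
            raise ValueError("no pattern matched %r" % (text,))
        # With nested groups, lastindex is the outer group that closed last,
        # which is the wrapper of the alternative that matched.
        handler, inner = self._rules[m.lastindex]
        captured = m.groups()[m.lastindex:m.lastindex + inner]
        return handler(*captured)


dispatch = CombinedDispatcher(
    (r"foo(.+)", lambda rest: ("foo", rest)),
    (r"bar(.+)", lambda rest: ("bar", rest)),
    (r"baz(.+)", lambda rest: ("baz", rest)),
)
print(dispatch("barstuff"))
```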

+4

Torsten Marek Sep 23 '08 at 17:11


Here is how I solved this problem:

    matched = False

    m = re.match("regex1", var)
    if not matched and m:
        # do something
        matched = True

    m = re.match("regex2", var)
    if not matched and m:
        # do something else
        matched = True

    m = re.match("regex3", var)
    if not matched and m:
        # do yet something else
        matched = True

Not as clean as the original pattern. However, it is simple and straightforward, and it requires neither additional modules nor changes to the original regular expressions.

+3

Daniel Bingham Jan 07


How about using a dictionary?

    match_objects = {}

    if match_objects.setdefault('mo_foo', re_foo.search(text)):
        pass  # do something with match_objects['mo_foo']
    elif match_objects.setdefault('mo_bar', re_bar.search(text)):
        pass  # do something with match_objects['mo_bar']
    elif match_objects.setdefault('mo_baz', re_baz.search(text)):
        pass  # do something with match_objects['mo_baz']
    ...

However, you must make sure that there are no duplicate keys in the match_objects dictionary (mo_foo, mo_bar, ...). It is best to give each regular expression its own name and to name the corresponding match_objects key accordingly; otherwise match_objects.setdefault() returns the existing entry instead of storing the new match object produced by re_xxx.search(text).
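The pitfall is easy to reproduce. The short sketch below (the sample strings are made up) reuses one key for two searches: the second setdefault() call returns the stale entry, even though the second search itself succeeded. The same thing happens if the dictionary is kept around and reused for the next line of input.

```python
import re

re_foo = re.compile(r"foo(.+)")
match_objects = {}

# First text: no match, so None is stored under the key.
first = match_objects.setdefault("mo_foo", re_foo.search("nothing here"))

# Second text does match, but the key already exists, so setdefault()
# returns the stored None and discards the new match object.
second = match_objects.setdefault("mo_foo", re_foo.search("foobar"))

# Plain assignment always stores the newest result.
match_objects["mo_foo"] = re_foo.search("foobar")
third = match_objects["mo_foo"]

print(first, second, third.group(1))
```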

+1

Matus Nov 16 2018-10-16


Expanding on Pat Notz's solution a bit, I found it even more elegant to:
- name the methods the same as in re (e.g. search() instead of check()) and
- implement the necessary methods, such as group(), on the holder object itself:

    import re

    class Re(object):
        def __init__(self):
            self.result = None

        def search(self, pattern, text):
            self.result = re.search(pattern, text)
            return self.result

        def group(self, index):
            return self.result.group(index)

Example

Instead of, for example, this:

    m = re.search(r'set ([^ ]+) to ([^ ]+)', line)
    if m:
        vars[m.group(1)] = m.group(2)
    else:
        m = re.search(r'print ([^ ]+)', line)
        if m:
            print(vars[m.group(1)])
        else:
            m = re.search(r'add ([^ ]+) to ([^ ]+)', line)
            if m:
                vars[m.group(2)] += vars[m.group(1)]

you can write just this:

    m = Re()
    ...
    if m.search(r'set ([^ ]+) to ([^ ]+)', line):
        vars[m.group(1)] = m.group(2)
    elif m.search(r'print ([^ ]+)', line):
        print(vars[m.group(1)])
    elif m.search(r'add ([^ ]+) to ([^ ]+)', line):
        vars[m.group(2)] += vars[m.group(1)]

It looks very natural in the end, does not need many code changes when moving from Perl, and avoids the problems with global state that some other solutions have.

+1

Yirkha Aug 09 '16 at 11:11


Minimalist DataHolder:

    class Holder(object):
        def __call__(self, *x):
            # Called with an argument: store it. Called without: return it.
            if x:
                self.x = x[0]
            return self.x

    data = Holder()

    if data(re.search(r'foo (\d+)', string)):
        print(data().group(1))

or as a singleton function:

    def data(*x):
        if x:
            data.x = x[0]
        return data.x

+1

Mike Robins Jun 30 '17 at 0:58


My solution:

    import re

    class Found(Exception):
        pass

    try:
        for m in re.finditer('bar(.+)', var):
            # Do something
            raise Found
        for m in re.finditer('foo(.+)', var):
            # Do something else
            raise Found
    except Found:
        pass

0

Mike Robins Jun 12 '15 at 9:33


Here is a RegexDispatcher class that dispatches to its subclass methods by regular expression.

Each dispatch method is annotated with a regular expression, for example:

 def plus(self, regex: r"\+", **kwargs): ...

In this case the annotation is attached to a parameter named regex, and its value is the regular expression to match, r"\+", which matches a literal + sign. These annotated methods go in subclasses, not in the base class.

When the dispatch(...) method is called on a string, the class finds the method whose regular expression annotation matches the string and calls it. Here is the class:

    import inspect
    import re


    class RegexMethod:
        def __init__(self, method, annotation):
            self.method = method
            self.name = self.method.__name__
            self.order = inspect.getsourcelines(self.method)[1]  # The line in the source file
            self.regex = self.method.__annotations__[annotation]

        def match(self, s):
            return re.match(self.regex, s)

        # Make it callable
        def __call__(self, *args, **kwargs):
            return self.method(*args, **kwargs)

        def __str__(self):
            return "Line: %s, method name: %s, regex: %s" % (self.order, self.name, self.regex)


    class RegexDispatcher:
        def __init__(self, annotation="regex"):
            self.annotation = annotation
            # Collect all the methods that have an annotation that matches self.annotation
            # For example, methods that have the annotation "regex", which is the default
            self.dispatchMethods = [
                RegexMethod(m[1], self.annotation)
                for m in inspect.getmembers(self, predicate=inspect.ismethod)
                if self.annotation in m[1].__annotations__
            ]
            # Be sure to process the dispatch methods in the order they appear in the class!
            # This is because the order in which you test regexes is important.
            # The most specific patterns must always be tested BEFORE more general ones
            # otherwise they will never match.
            self.dispatchMethods.sort(key=lambda m: m.order)

        # Finds the FIRST match of s against a RegexMethod in dispatchMethods,
        # calls the RegexMethod and returns
        def dispatch(self, s, **kwargs):
            for m in self.dispatchMethods:
                if m.match(s):
                    return m(self.annotation, **kwargs)
            return None

To use this class, subclass it to create a class with annotated methods. As an example, here is a simple RPNCalculator that inherits from RegexDispatcher. The methods that get dispatched to are (of course) the ones with the regex annotation. The parent dispatch() method is called in __call__.

    from RegexDispatcher import *
    import math


    class RPNCalculator(RegexDispatcher):
        def __init__(self):
            RegexDispatcher.__init__(self)
            self.stack = []

        def __str__(self):
            return str(self.stack)

        # Make RPNCalculator objects callable
        def __call__(self, expression):
            # Calculate the value of expression
            for t in expression.split():
                self.dispatch(t, token=t)
            return self.top()  # return the top of the stack

        # Stack management
        def top(self):
            return self.stack[-1] if len(self.stack) > 0 else []

        def push(self, x):
            return self.stack.append(float(x))

        def pop(self, n=1):
            return self.stack.pop() if n == 1 else [self.stack.pop() for _ in range(n)]

        # Handle numbers
        def number(self, regex: r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", **kwargs):
            self.stack.append(float(kwargs['token']))

        # Binary operators
        def plus(self, regex: r"\+", **kwargs):
            a, b = self.pop(2)
            self.push(b + a)

        def minus(self, regex: r"\-", **kwargs):
            a, b = self.pop(2)
            self.push(b - a)

        def multiply(self, regex: r"\*", **kwargs):
            a, b = self.pop(2)
            self.push(b * a)

        def divide(self, regex: r"\/", **kwargs):
            a, b = self.pop(2)
            self.push(b / a)

        def pow(self, regex: r"exp", **kwargs):
            a, b = self.pop(2)
            self.push(a ** b)

        def logN(self, regex: r"logN", **kwargs):
            a, b = self.pop(2)
            self.push(math.log(a, b))

        # Unary operators
        def neg(self, regex: r"neg", **kwargs):
            self.push(-self.pop())

        def sqrt(self, regex: r"sqrt", **kwargs):
            self.push(math.sqrt(self.pop()))

        def log2(self, regex: r"log2", **kwargs):
            self.push(math.log2(self.pop()))

        def log10(self, regex: r"log10", **kwargs):
            self.push(math.log10(self.pop()))

        def pi(self, regex: r"pi", **kwargs):
            self.push(math.pi)

        def e(self, regex: r"e", **kwargs):
            self.push(math.e)

        def deg(self, regex: r"deg", **kwargs):
            self.push(math.degrees(self.pop()))

        def rad(self, regex: r"rad", **kwargs):
            self.push(math.radians(self.pop()))

        # Whole stack operators
        def cls(self, regex: r"c", **kwargs):
            self.stack = []

        def sum(self, regex: r"sum", **kwargs):
            self.stack = [math.fsum(self.stack)]


    if __name__ == '__main__':
        calc = RPNCalculator()
        print(calc('2 2 exp 3 + neg'))
        print(calc('c 1 2 3 4 5 sum 2 * 2 / pi'))
        print(calc('pi 2 * deg'))
        print(calc('2 2 logN'))

I like this solution because there are no separate lookup tables. The regular expression to match is embedded, as an annotation, in the method that will be called. This is as it should be for me. It would be nice if Python allowed more flexible annotations, because I would prefer to attach the regex annotation to the method itself rather than embedding it in the method's parameter list. However, this is currently not possible.
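One way to keep the pattern next to the method without touching the parameter list is a decorator that stores the compiled regex as a function attribute. The sketch below is only an illustration and not part of the RegexDispatcher class above; the names `regex`, `Dispatcher`, and `Greeter` are hypothetical. It assumes Python 3.6 or later, where a class body keeps its definition order, and it only looks at methods defined directly on the subclass.

```python
import re


def regex(pattern):
    """Hypothetical decorator: attach a compiled pattern to a function."""
    def decorate(func):
        func.regex = re.compile(pattern)
        return func
    return decorate


class Dispatcher:
    def dispatch(self, s):
        # vars() of the class preserves definition order, so more specific
        # patterns listed first are tried first.
        for attr in vars(type(self)).values():
            pattern = getattr(attr, "regex", None)
            if pattern is not None:
                m = pattern.match(s)
                if m:
                    return attr(self, m)
        return None


class Greeter(Dispatcher):
    @regex(r"hello (\w+)")
    def hello(self, m):
        return "hi " + m.group(1)

    @regex(r"bye")
    def bye(self, m):
        return "see you"


greeter = Greeter()
print(greeter.dispatch("hello world"))
```

Here each handler receives the match object, so it can read its own capture groups.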

For fun, take a look at the Wolfram Language, in which functions are polymorphic on arbitrary patterns, not just on argument types. A function that is polymorphic on a regular expression is a very powerful idea, but we cannot get it in Python. The RegexDispatcher class is the best I could do.

0

Jim Arlow Aug 21 '17 at 13:30


Starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), we can capture the result of re.search(pattern, text) in a match variable inside the condition, test whether it is None, and then reuse it in the body of that branch:

    if match := re.search(r'foo(.+)', text):
        pass  # do something with match.group(1)
    elif match := re.search(r'bar(.+)', text):
        pass  # do something with match.group(1)
    elif match := re.search(r'baz(.+)', text):
        pass  # do something with match.group(1)

0

Xavier Guihot Apr 28 '19 at 6:06


Craig McQueen · Accepted Answer · 2009-11-27 01:05

Thanks to this other SO question:

    import re

    class DataHolder:
        def __init__(self, value=None, attr_name='value'):
            self._attr_name = attr_name
            self.set(value)

        def __call__(self, value):
            return self.set(value)

        def set(self, value):
            setattr(self, self._attr_name, value)
            return value

        def get(self):
            return getattr(self, self._attr_name)

    string = u'test bar 123'
    save_match = DataHolder(attr_name='match')

    if save_match(re.search(r'foo (\d+)', string)):
        print("Foo")
        print(save_match.match.group(1))
    elif save_match(re.search(r'bar (\d+)', string)):
        print("Bar")
        print(save_match.match.group(1))
    elif save_match(re.search(r'baz (\d+)', string)):
        print("Baz")
        print(save_match.match.group(1))
