Regular Expression Combinatorial Product

Question

I am trying to generate variations of a string by applying substitutions optionally.

For example, one substitution pattern removes any sequence of whitespace characters. Instead of replacing all occurrences, as in

    >>> re.sub(r'\s+', '', 'a b c')
    'abc'

I need two options for each occurrence: the substitution is performed in one, but not in the other. For the string 'a b c', I want the variations

    ['a b c', 'a bc', 'ab c', 'abc']

i.e., the cross product of all binary decisions (the result obviously includes the original string).

For this simple case, the variations can be produced with re.finditer and itertools.product:

    import itertools
    import re

    def vary(target, pattern, subst):
        # Spans of all matches in the original string.
        occurrences = [m.span() for m in pattern.finditer(target)]
        # One True/False decision per occurrence.
        for path in itertools.product((True, False), repeat=len(occurrences)):
            variant = ''
            anchor = 0
            for (start, end), apply_this in zip(occurrences, path):
                if apply_this:
                    variant += target[anchor:start] + subst
                    anchor = end
            variant += target[anchor:]
            yield variant

For the example above, this gives the desired result:

    >>> list(vary('a b c', re.compile(r'\s+'), ''))
    ['abc', 'ab c', 'a bc', 'a b c']
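As a small illustration of why the variants come out in this order, the sketch below lists the decision tuples that itertools.product generates for two occurrences. The name paths is only for illustration. The first tuple applies both substitutions and the last one applies none, which matches 'abc' coming first and the original string coming last.

```python
import itertools

# One (True, False) choice per occurrence; two occurrences give four paths.
paths = list(itertools.product((True, False), repeat=2))
for path in paths:
    print(path)
```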

However, this solution only works for fixed-string replacements. Advanced features of re.sub, such as group references, are not supported. An example is the following substitution, which inserts a space after a sequence of digits inside a word:

    re.sub(r'\B(\d+)\B', r'\1 ', 'abc123def')

How can I extend or change the approach to accept any valid re.sub argument (without writing a parser to interpret group references)?

+5

python regex

lenz Jan 6 '16 at 16:28

2 answers

How about this:

    def vary(target, pattern, subst):
        numOccurences = len(pattern.findall(target))
        for path in itertools.product((True, False), repeat=numOccurences):
            variant = ''
            remainingStr = target
            for currentFlag in path:
                if currentFlag:
                    remainingStr = pattern.sub(subst, remainingStr, 1)
                else:
                    currentMatch = pattern.search(remainingStr)
                    variant += remainingStr[:currentMatch.end()]
                    remainingStr = remainingStr[currentMatch.end():]
            variant += remainingStr
            yield variant

For each match, we either let re.sub() do its job (a count of 1 makes it stop after one substitution), or we move the unchanged part of the string, up to the end of the match, over to the variant.
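To see the effect of the count argument in isolation, here is a minimal sketch. The names first_only and all_of_them are made up for illustration.

```python
import re

pattern = re.compile(r'\s+')

# With count=1, only the first match is replaced.
first_only = pattern.sub('', 'a b c', 1)

# Without a count, every match is replaced.
all_of_them = pattern.sub('', 'a b c')

print(repr(first_only), repr(all_of_them))
```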

Trying it with your examples:

    target = 'a b c'
    pattern = re.compile(r'\s+')
    subst = ''
    print(list(vary(target, pattern, subst)))

    target = 'abc123def'
    pattern = re.compile(r'\B(\d+)\B')
    subst = r'\1 '
    print(list(vary(target, pattern, subst)))

I get

    ['abc', 'ab c', 'a bc', 'a b c']
    ['abc123 def', 'abc123def']

+1

Thorngardso Jan 6 '16 at 18:37

B98 · Accepted Answer · 2016-01-06T19:14:36+0000

Thinking about making subst a callable that gets access to the match data finally made me learn about MatchObject.expand. So, as an approximation, with subst remaining a raw (r'...') template string:

    def vary(target, pattern, subst):
        matches = [m for m in pattern.finditer(target)]
        occurrences = [m.span() for m in matches]
        for path in itertools.product((True, False), repeat=len(occurrences)):
            variant = ''
            anchor = 0
            for match, (start, end), apply_this in zip(matches, occurrences, path):
                if apply_this:
                    variant += target[anchor:start] + match.expand(subst)
                    anchor = end
            variant += target[anchor:]
            yield variant

I am not sure, however, that this covers all the flexibility needed when referring to the subject string, since each expansion is bound to its own match. An indexed power set of the split string also came to mind, but I suspect that is not far from the parser you mentioned.
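Since re.sub also accepts a function as its replacement argument, the same idea might be stretched to cover callables as well. The following is only a sketch under that assumption: vary_any is a hypothetical name, it assumes str (not bytes) input, and it does not claim to reproduce every corner case of re.sub, such as its handling of empty matches.

```python
import itertools
import re

def vary_any(target, pattern, subst):
    """Yield each variant of target with every match either replaced or kept.

    Hypothetical helper: subst may be a template string or a callable
    that takes a match object and returns the replacement text.
    """
    matches = list(pattern.finditer(target))
    # Compute the replacement for each match once, against the original string.
    replacements = [
        subst(m) if callable(subst) else m.expand(subst) for m in matches
    ]
    for path in itertools.product((True, False), repeat=len(matches)):
        pieces = []
        anchor = 0
        for match, replacement, apply_this in zip(matches, replacements, path):
            if apply_this:
                # Copy the text before the match, then the replacement.
                pieces.append(target[anchor:match.start()])
                pieces.append(replacement)
                anchor = match.end()
        pieces.append(target[anchor:])
        yield ''.join(pieces)

digits = re.compile(r'\B(\d+)\B')
template_variants = list(vary_any('abc123def', digits, r'\1 '))
callable_variants = list(vary_any('abc123def', digits, lambda m: m.group(1) + ' '))
space_variants = list(vary_any('a b c', re.compile(r'\s+'), ''))
print(template_variants, callable_variants, space_variants)
```

Because all matches are found in the original string before anything is replaced, context-dependent assertions such as \B are evaluated against the unmodified text.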
