A more elegant way to implement regexp-like quantifiers

I am writing a simple string parser that supports regexp-like quantifiers. An input string might look like this:

    s = "x y{1,2} z"

My parser function translates this string into a list of tuples:

 list_of_tuples = [("x", 1, 1), ("y", 1, 2), ("z", 1, 1)] 
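
The parser itself is not shown in the question. Purely as an illustration of the string-to-tuples step, a minimal sketch could look like the following. It assumes single-character tokens and only the `{min,max}` quantifier form; the names `TOKEN` and `parse_quantified` are made up for this example.

```python
import re

# One non-whitespace character, optionally followed by {min,max}.
# Assumption: tokens are single characters and whitespace is only a separator.
TOKEN = re.compile(r"(\S)(?:\{(\d+),(\d+)\})?")

def parse_quantified(s):
    """Translate a quantified string into (value, min, max) tuples."""
    result = []
    for match in TOKEN.finditer(s):
        char, low, high = match.groups()
        if low is None:
            # No quantifier: the character occurs exactly once.
            result.append((char, 1, 1))
        else:
            result.append((char, int(low), int(high)))
    return result

print(parse_quantified("x y{1,2} z"))
```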

Now the tricky bit is that I need a list of all valid combinations allowed by the quantifiers. All combinations must have the same number of elements, with None used as padding. For this example, the expected result is:

 [["x", "y", None, "z"], ["x", "y", "y", "z"]] 

I have a working solution, but I'm not very happy with it: it uses two nested for loops, and I find the code somewhat obscure and awkward:

    import itertools

    def permute_input(lot):
        outer = []
        # is there something that replaces these nested loops?
        for val, start, end in lot:
            inner = []
            # For each tuple, create a list of constant length
            # Each element contains a different number of
            # repetitions of the value of the tuple, padded
            # by the value None if needed.
            for i in range(start, end + 1):
                x = [val] * i + [None] * (end - i)
                inner.append(x)
            outer.append(inner)
        # Outer is now a list of lists.

        final = []
        # use itertools.product to combine the elements in the
        # list of lists:
        for combination in itertools.product(*outer):
            # flatten the elements in the current combination,
            # and append them to the final list:
            final.append([x for x in itertools.chain.from_iterable(combination)])
        return final

    print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
    [['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

I suspect there is a much more elegant way to do this, perhaps hidden somewhere in the itertools module?

+5
3 answers

One alternative is to use pyparsing and its example regex parser (regex_invert.py), which expands a regular expression into the strings it can match. For your sample string x y{1,2} z it generates two possible strings by expanding the quantifier:

    $ python -i regex_invert.py
    >>> s = "x y{1,2} z"
    >>> for item in invert(s):
    ...     print(item)
    ...
    x y z
    x yy z

The repetition itself supports both the open range and the closed range and is defined as:

    repetition = (
        (lbrace + Word(nums).setResultsName("count") + rbrace) |
        (lbrace + Word(nums).setResultsName("minCount") + "," + Word(nums).setResultsName("maxCount") + rbrace) |
        oneOf(list("*+?"))
    )

To get the desired result, we need to change how the recurseList generator yields its results, so that it returns lists instead of strings:

    for s in elist[0].makeGenerator()():
        for s2 in recurseList(elist[1:]):
            yield [s] + [s2]  # instead of yield s + s2

Then we only need to flatten the result:

    $ ipython3 -i regex_invert.py

    In [1]: import collections.abc

    In [2]: def flatten(l):
       ...:     for el in l:
       ...:         # collections.abc.Iterable: the bare collections.Iterable alias
       ...:         # was removed in Python 3.10
       ...:         if isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes)):
       ...:             yield from flatten(el)
       ...:         else:
       ...:             yield el
       ...:

    In [3]: s = "x y{1,2} z"

    In [4]: for option in invert(s):
       ...:     print(list(flatten(option)))
       ...:
    ['x', ' ', 'y', None, ' ', 'z']
    ['x', ' ', 'y', 'y', ' ', 'z']

Then, if necessary, you can filter out whitespace characters:

    In [5]: for option in invert(s):
       ...:     print([item for item in flatten(option) if item != ' '])
       ...:
    ['x', 'y', None, 'z']
    ['x', 'y', 'y', 'z']
+6

The part that generates the different lists for each tuple can be written using a list comprehension:

    outer = []
    for val, start, end in lot:
        # For each tuple, create a list of constant length
        # Each element contains a different number of
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])

(The whole loop could in turn be written as a list comprehension, but IMHO that makes the code harder to read.)
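
For comparison, folding the outer loop into a comprehension as well might look like the sketch below. The function name `build_outer` is made up for illustration; whether this reads better than the explicit loop is a matter of taste.

```python
def build_outer(lot):
    # Outer comprehension: one entry per (value, start, end) tuple.
    # Inner comprehension: one padded list per allowed repetition count.
    return [
        [[val] * i + [None] * (end - i) for i in range(start, end + 1)]
        for val, start, end in lot
    ]

print(build_outer([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
# [[['x']], [['y', None], ['y', 'y']], [['z']]]
```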

On the other hand, the list comprehension [x for x in itertools.chain.from_iterable(combination)] can be written more concisely. Its only purpose is to build an actual list from the iterable, which list(itertools.chain.from_iterable(combination)) does directly. An alternative would be to use the built-in sum. I'm not sure which is better.
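
A small check that the two flattening approaches give the same result, using one combination from the question's example. One difference worth knowing: sum builds a new list for every addition, so its cost grows quadratically with the number of sublists, while chain.from_iterable makes a single pass.

```python
import itertools

# One combination as produced by itertools.product(*outer) in the question.
combination = (["x"], ["y", None], ["z"])

flat_chain = list(itertools.chain.from_iterable(combination))
flat_sum = sum(combination, [])  # starts from [] and concatenates each sublist

print(flat_chain)              # ['x', 'y', None, 'z']
print(flat_sum == flat_chain)  # True
```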

Finally, the final.append part can be written with a list comprehension:

    # use itertools.product to combine the elements in the list of lists
    # and flatten the elements in each combination:
    return [sum(combination, []) for combination in itertools.product(*outer)]

The final code is essentially your code, slightly reorganized:

    import itertools

    def permute_input(lot):
        outer = []
        for val, start, end in lot:
            # For each tuple, create a list of constant length
            # Each element contains a different number of
            # repetitions of the value of the tuple, padded
            # by the value None if needed.
            outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])
        # use itertools.product to combine the elements in the list of lists
        # and flatten the elements in each combination:
        return [sum(combination, []) for combination in itertools.product(*outer)]
+2

A recursive solution (simple, good for up to about a thousand tuples):

    def permutations(lot):
        if not lot:
            yield []
        else:
            item, start, end = lot[0]
            for prefix_length in range(start, end+1):
                for perm in permutations(lot[1:]):
                    yield [item]*prefix_length + [None] * (end - prefix_length) + perm

It is limited by the recursion depth (about 1000 by default). If that is not enough, there is a simple optimization for the start == end cases. Depending on the expected size of list_of_tuples, this might be enough.
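
The answer does not spell out the start == end optimization. One possible reading, sketched here as an assumption, is to consume runs of fixed-count items in a loop and recurse only at items that actually vary, so the recursion depth depends on the number of variable items rather than on the total length. The name `permutations_fixed_runs` is made up for this example.

```python
def permutations_fixed_runs(lot):
    # Collect leading items with start == end iteratively: they contribute
    # exactly one possibility, so no recursion level is needed for them.
    prefix = []
    i = 0
    while i < len(lot) and lot[i][1] == lot[i][2]:
        item, start, _ = lot[i]
        prefix += [item] * start
        i += 1

    if i == len(lot):
        # Only fixed items were left (or the input was empty).
        yield prefix
        return

    # First item that really varies: branch on its repetition count.
    item, start, end = lot[i]
    for n in range(start, end + 1):
        for perm in permutations_fixed_runs(lot[i + 1:]):
            yield prefix + [item] * n + [None] * (end - n) + perm

print(list(permutations_fixed_runs([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)])))
# [['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]
```

Inputs with more than about a thousand variable items would still hit the recursion limit; for those, a non-recursive approach is the safer choice.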

Test:

    >>> list(permutations(list_of_tuples))  # list() because it is an iterator
    [['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

No recursion (universal but less elegant):

    def permutations(lot):
        source = []
        cnum = 1  # number of possible combinations
        for item, start, end in lot:
            # create full list without Nones; each item occupies `end` slots
            # (the same length the padded lists have in the question)
            source += [item] * end
            cnum *= (end-start+1)

        for i in range(cnum):
            bitmask = [True] * len(source)
            state = i
            pos = 0
            for _, start, end in lot:
                state, m = divmod(state, end-start+1)  # m - number of Nones to insert
                pos += end
                bitmask[pos-m:pos] = [None] * m
            yield [bitmask[i] and c for i, c in enumerate(source)]

The idea behind this solution: in effect, we look at the full string ( xyyz ) through a mask that replaces a certain number of elements with None . We can count the number of possible combinations by computing the product of all (end-start+1) . Then we simply enumerate the combination numbers (a plain range loop) and reconstruct the mask from each number. The mask is reconstructed by repeatedly applying divmod to the state number and using the remainder as the number of Nones at that item's position.
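
The divmod step can be looked at in isolation: it treats the combination number as a mixed-radix number with one digit per tuple, where each digit has (end-start+1) possible values. A small sketch, with the helper name `none_counts` and the sample input made up for illustration:

```python
def none_counts(state, lot):
    """Decode a combination number into the number of Nones per tuple."""
    counts = []
    for _, start, end in lot:
        # The remainder is this tuple's digit; the quotient carries on.
        state, m = divmod(state, end - start + 1)
        counts.append(m)
    return counts

lot = [("x", 1, 1), ("y", 1, 3), ("z", 0, 1)]  # 1 * 3 * 2 = 6 combinations
for number in range(6):
    print(number, none_counts(number, lot))

# For example, number 4 decodes to [0, 1, 1]:
# no None for x, one None among the y slots, one None in the z slot.
```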

+2

Source: https://habr.com/ru/post/1263164/

