Greedy match with a negative lookahead in a regular expression

I have a regular expression with which I am trying to extract every group of letters that is not immediately followed by "(". For example, the following regular expression operates on a mathematical formula containing variable names (x, y and z) and function names (movav and movsum). Both kinds of names consist entirely of letters, but the function names are followed by "(".

 re.findall(r"[a-zA-Z]+(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")

I would like the expression to return the list

 ['x', 'y', 'z'] 

but instead it returns

 ['mova', 'x', 'movsu', 'y', 'z'] 

I understand why the regex returns the second result, but is there a way to change it so that it returns ['x', 'y', 'z']?

4 answers

Another solution that does not depend on word boundaries:

Make sure that the letters are not followed by either ( or another letter.

 >>> re.findall(r'[a-zA-Z]+(?![a-zA-Z(])', "movav(x/2, 2)*movsum(y, 3)*z")
 ['x', 'y', 'z']
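
To see why the extended lookahead helps, it may be useful to run both patterns side by side. This is only a small sketch; the variable names `formula`, `original` and `fixed` are illustrative and not part of the question.

```python
import re

formula = "movav(x/2, 2)*movsum(y, 3)*z"

# Original pattern: the greedy run "movav" fails the lookahead because "(" follows,
# so the engine backtracks one letter to "mova", which is followed by "v" and passes.
original = re.findall(r"[a-zA-Z]+(?!\()", formula)

# Adding letters to the lookahead's character class also rejects the shortened
# candidates, so a match can only end where the run of letters really ends.
fixed = re.findall(r"[a-zA-Z]+(?![a-zA-Z(])", formula)

print(original)  # ['mova', 'x', 'movsu', 'y', 'z']
print(fixed)     # ['x', 'y', 'z']
```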

Add a word boundary \b:

 >>> re.findall(r'[a-zA-Z]+\b(?!\()', "movav(x/2, 2)*movsum(y, 3)*z")
 ['x', 'y', 'z']

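\b matches the empty string at a word boundary, that is, between a word character and a non-word character (or the start or end of the string). So now you are looking for letters followed by a word boundary that is not immediately followed by (. Because the match must end at a word boundary, the engine can no longer backtrack to a shorter run such as "mova". For more information, see the re docs.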
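
One thing to keep in mind with \b, shown in the sketch below: in Python's re module, digits and underscores count as word characters, so there is no boundary between a letter and a following digit or underscore. The second input string is a made-up example that goes beyond the question's assumption that names consist only of letters.

```python
import re

pattern = re.compile(r"[a-zA-Z]+\b(?!\()")

# The formula from the question: only the variable names survive.
letters_only = pattern.findall("movav(x/2, 2)*movsum(y, 3)*z")
print(letters_only)  # ['x', 'y', 'z']

# Hypothetical input with digits and underscores in the names.
# "x" is followed by "2" and "y" by "_", so \b does not match after them.
with_digits = pattern.findall("f(x2) + y_1 + z")
print(with_digits)  # ['z']
```

If names may contain digits or underscores, a character class such as \w would probably fit better than [a-zA-Z], but that is outside what the question asks.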


You need to limit the match to whole words. So use \b to match the beginning or end of a word:

 re.findall(r"\b[a-zA-Z]+\b(?!\()", "movav(x/2, 2)*movsum(y, 3)*z") 

Alternative approach, without a lookahead: find runs of letters followed by either the end of the string or a character that is neither a letter nor "(", then capture only the letters.

 re.findall("([a-zA-Z]+)(?:[^a-zA-Z(]|$)", "movav(x/2, 2)*movsum(y, 3)*z") 
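
This works because re.findall returns only the contents of the capturing group when the pattern has exactly one. A short sketch of the difference between the captured group and the full match (the names `formula`, `pattern`, `captured` and `full_matches` are illustrative):

```python
import re

formula = "movav(x/2, 2)*movsum(y, 3)*z"
pattern = r"([a-zA-Z]+)(?:[^a-zA-Z(]|$)"

# With a single capturing group, findall returns just that group.
captured = re.findall(pattern, formula)
print(captured)  # ['x', 'y', 'z']

# The full match also includes the consumed trailing character, if there is one.
full_matches = [m.group(0) for m in re.finditer(pattern, formula)]
print(full_matches)  # ['x/', 'y,', 'z']
```

Unlike the lookahead versions, this pattern consumes the character after each name, which matters if the match positions are used for anything else.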

Source: https://habr.com/ru/post/900723/

