Confused Regular Expression

The book Programming Collective Intelligence contains this regular expression:

splitter = re.compile('\\W*') 

From the context, it looks like this matches any non-alphanumeric character. But I'm confused because it looks like it matches a backslash, and then zero or more W. What does this really match?

6 answers

Your regular expression is equivalent to \W* . It matches 0 or more non-alphanumeric characters.

You are using an ordinary Python string literal rather than a raw string. In an ordinary string literal the backslash is itself an escape character, so to get a literal backslash into the string you have to escape it: \\ . The regex engine then treats a backslash as an escape character too, so to match a literal backslash you have to escape it once more: \\\\ .

So, to match \ followed by 0 or more W , you would need \\\\W* in an ordinary string literal. You can simplify this by using a raw string, where \\ is enough to match a literal \ . This is because a backslash is not treated specially inside a raw string.

The following example should help you understand this:

 >>> import re
 >>> s = "\\WWWW$$$$"
 >>> # Without raw string
 >>> splitter = re.compile('\\W*')  # Match non-alphanumeric characters
 >>> re.findall(splitter, s)
 ['\\', '', '', '', '', '$$$$', '']
 >>> splitter = re.compile('\\\\W*')  # Match `\` followed by 0 or more `W`
 >>> re.findall(splitter, s)
 ['\\WWWW']
 >>> # With raw string
 >>> splitter = re.compile(r'\W*')  # Same as first one. You need a single `\`
 >>> re.findall(splitter, s)
 ['\\', '', '', '', '', '$$$$', '']
 >>> splitter = re.compile(r'\\W*')  # Same as 2nd. Two `\\` needed.
 >>> re.findall(splitter, s)
 ['\\WWWW']
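
To see the two layers of escaping directly, one can inspect the characters each ordinary string literal contains before the regex engine ever sees them. This is only a minimal sketch; the variable names `plain` and `doubled` are illustrative and do not come from the book.

```python
# Ordinary (non-raw) string literals: Python turns each "\\" into one backslash.
plain = '\\W*'      # what the question's code passes to re.compile
doubled = '\\\\W*'  # the variant that matches a literal backslash

# list() shows the individual characters the regex engine receives.
print(list(plain))    # ['\\', 'W', '*']        -> the regex \W*
print(list(doubled))  # ['\\', '\\', 'W', '*']  -> the regex \\W*
print(len(plain), len(doubled))  # 3 4
```

Each `'\\'` in the printed lists is the repr of a single backslash character, so the first pattern is three characters long and the second is four.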

The first backslash is there as an escape character for programming languages that do not have a good string representation for regular expressions (for example, Java). In Python you can do better; the pattern is equivalent to:

 r'\W*' 

Note the r at the beginning (a raw string), which makes the first escaping \ unnecessary. The second \ is unavoidable: it is part of the \W character class.
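
A quick way to check this equivalence is to compile both spellings and compare them. The sketch below assumes nothing beyond the standard `re` module; `sample` is a hypothetical input chosen for illustration.

```python
import re

escaped = re.compile('\\W*')  # ordinary literal, backslash escaped for Python
raw = re.compile(r'\W*')      # raw literal, no Python-level escaping needed

# Both objects were compiled from the same three-character pattern text.
print(escaped.pattern == raw.pattern)  # True

sample = 'a, b'  # hypothetical input
print(escaped.findall(sample) == raw.findall(sample))  # True
```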


\ is an escape character. Reading from left to right, \\ means \ , which is then followed by W* , so the pattern is \W* : it matches zero or more characters that are neither alphanumeric nor an underscore. If you want to match a literal \ , you need to write \\\\ . If you want the regex to be clearer and simpler, you can use r'\W*' . The r prefix marks a raw string and lets you write fewer backslashes.


This matches non-word characters, that is, anything that is not a letter, a digit, or an underscore. The string compiles to \W, which is the negated version of \w, where \w matches any word character.

So you are right in thinking that it matches non-alphanumeric characters.
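
The relationship between \w and \W can be checked on a small string. This is a sketch with a made-up `sample`; note that the underscore lands on the \w side.

```python
import re

sample = 'a_b c-d!'  # hypothetical input

word_chars = re.findall(r'\w', sample)      # letters, digits, underscore
non_word_chars = re.findall(r'\W', sample)  # everything else

print(word_chars)      # ['a', '_', 'b', 'c', 'd']
print(non_word_chars)  # [' ', '-', '!']
```

Every character of the sample ends up in exactly one of the two lists, which is what "negated version" means here.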

For help on special regular expression characters, you can look here. http://www.regular-expressions.info/reference.html


The \ is used to escape characters, so \\ means \ . Your regex therefore becomes (after escaping):

 \W* 

A better alternative is to use: r'\W*'
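
Since the pattern is named splitter, it is presumably used to split text into words. One caveat worth hedging: according to the Python documentation, since Python 3.7 `re.split` also splits on empty matches, and \W* can match the empty string. The sketch below, with a hypothetical `text`, uses \W+ as one possible adjustment; that is a swapped-in variant, not the pattern from the book.

```python
import re

text = 'Hello, world! foo_bar'  # hypothetical input

# \W+ requires at least one non-word character per separator,
# so it never splits on an empty match.
words = re.split(r'\W+', text)
print(words)  # ['Hello', 'world', 'foo_bar']

# \W* can also match the empty string; what split() does with those
# empty matches depends on the Python version (3.7+ splits on them too).
pieces = re.compile(r'\W*').split(text)
print(pieces)
```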


Written as a raw string, r'\\W*' would match a backslash followed by zero or more W; the non-raw '\\W*' from the question does not. To match zero or more non-word characters, you can write:

 splitter = re.compile(r'\W*') 

Source: https://habr.com/ru/post/1491920/

