Is it possible to mix character classes in Python RegEx?

Special sequences (character classes) in Python RegEx are escape sequences like \w or \d that match a set of characters.

In my case, I need to be able to match all alpha-numeric characters except numbers.

That is, \w minus \d.

I need to use the special \w sequence because I am dealing with non-ASCII characters and must match characters like "Æ" and "Ø".

One would think that I could use this expression: [\w^\d], but it does not seem to match anything, and I'm not sure why.

In short, how can I mix (add / subtract) special sequences in Python regular expressions?


EDIT: I accidentally used [\w^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas, which are not alphanumeric characters as far as I know.

4 answers

You can use r"[^\W\d]", i.e. negate the union of non-word characters and digits.
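A minimal sketch of that pattern in use, assuming Python 3, where str patterns are Unicode-aware by default. The helper name letters_only and the sample strings are made up for illustration:

```python
import re

# Negating the union of \W (non-word) and \d (digit) leaves
# word characters that are not digits.
NON_DIGIT_WORD = re.compile(r"[^\W\d]+")


def letters_only(text):
    """Return every run of word characters that contains no digits."""
    return NON_DIGIT_WORD.findall(text)


print(letters_only("Ærø 42, (Øl) abc123"))  # ['Ærø', 'Øl', 'abc']
```

Note that \w also includes the underscore, so this pattern matches "_" as well.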


You cannot subtract character classes, no.

Your best bet is the newer regex module, which is intended as a replacement for Python's current re module. It supports character classes based on Unicode properties:

 \p{IsAlphabetic} 

This matches any character that the Unicode specification classifies as alphabetic.

Even better, regex supports character class subtraction; it treats classes as sets and lets you take their difference with the -- operator:

 [\w--\d] 

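matches everything in \w except what also matches \d. Set operations are part of the module's version 1 behaviour, so the pattern may need the (?V1) prefix or the regex.V1 flag.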


You can exclude classes using a negative lookahead, such as r'(?!\d)[\w]', to match a word character that is not a digit. For example:

 >>> re.search(r'(?!\d)[\w]', '12bac')
 <_sre.SRE_Match object at 0xb7779218>
 >>> _.group(0)
 'b'

To exclude more than one class, you can use the usual [...] syntax inside the lookahead. For example, r'(?![0-5])[\w]' matches any alphanumeric character except the digits 0-5.

As with [...], the construct above matches a single character. To match several characters, add a repetition operator:

 >>> re.search(r'((?!\d)[\w])+', '12bac15')
 <_sre.SRE_Match object at 0x7f44cd2588a0>
 >>> _.group(0)
 'bac'
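The lookahead approach can be wrapped in a small helper that collects every matching run. This is only a sketch, assuming Python 3; the function name runs_without_low_digits and the sample input are hypothetical:

```python
import re

# Each repetition first checks that the next character is not in 0-5,
# then consumes one word character.
WORD_NOT_0_TO_5 = re.compile(r"(?:(?![0-5])\w)+")


def runs_without_low_digits(text):
    """Return runs of word characters that contain none of the digits 0-5."""
    return [m.group(0) for m in WORD_NOT_0_TO_5.finditer(text)]


print(runs_without_low_digits("12bac15 789Ø6"))  # ['bac', '789Ø6']
```

A non-capturing group (?:...) is used here so the repetition does not create a capture group; the capturing form shown above finds the same text.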

I do not think you can directly combine (logical AND) character sets in one regular expression, whether or not either of them is negated. Otherwise, you could just combine [^\d] and \w.

Note: ^ must be at the beginning of the set, and it applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched." Your set [\w^\d] therefore matches a single character that is a word character, a literal caret, or a digit; the ^ in the middle negates nothing, so the set does not do what you want.
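A quick check of how the re module treats a caret that is not the first character of a set. This is a sketch with a made-up sample string; the helper name chars_matched is hypothetical:

```python
import re


def chars_matched(pattern, text):
    """Return the individual characters of text that the set pattern matches."""
    return re.findall(pattern, text)


sample = "a^1,("

# Caret in the middle: it is just one more literal member of the set.
print(chars_matched(r"[\w^\d]", sample))  # ['a', '^', '1']

# Caret first: the whole set is negated.
print(chars_matched(r"[^\w\d]", sample))  # ['^', ',', '(']
```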

I would do this in two steps, effectively chaining regular expressions. First match the non-digits (the inner regular expression), then match the alphanumerics within that result:

 re.search(r'\w+', re.search(r'([^\d]+)', s).group(0)).group(0)

or variations of this theme.

Note that you would have to wrap this in a try: except: block, since it raises AttributeError: 'NoneType' object has no attribute 'group' if either of the two searches fails. You can, of course, also split this line into several lines.
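One way to split the two-step search over several lines is to test each match for None instead of catching the exception. This is a sketch only; the function name first_word_in_non_digits is invented for illustration:

```python
import re


def first_word_in_non_digits(s):
    """Find the first run of non-digits, then the first word run inside it.

    Returns None if either search finds nothing.
    """
    outer = re.search(r"[^\d]+", s)
    if outer is None:
        return None
    inner = re.search(r"\w+", outer.group(0))
    if inner is None:
        return None
    return inner.group(0)


print(first_word_in_non_digits("12bac15"))  # bac
print(first_word_in_non_digits("123"))      # None
```

Like the one-liner, this only looks inside the first run of non-digits, so later runs in the string are ignored.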


Source: https://habr.com/ru/post/1433333/

