Regex - matching words that fit a pattern, except inside an email address

I am looking for words in a string that match a specific pattern. The problem is that words that are part of an email address should be ignored.

To simplify, the pattern for the "right words" is \w+\.\w+: one or more word characters, a literal period, and another run of word characters.

The sentence causing the problem, for example, is a.a b.b:c.c d.d@e.e.e.

The goal is to match only [a.a, b.b, c.c]. With most of the regexes I build, e.e is returned as well (because I use some form of word-boundary matching).

For instance:

>>> re.findall(r"(?:^|\s|\W)(?<!@)(\w+\.\w+)(?!@)\b", "a.a b.b:c.c d.d@e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']

How can I match only words that do not contain "@"?

+5
3 answers

I would definitely clean up the input first and simplify the regex.

First, split the string into words:

 words = re.split(r':|\s', "a.a b.b:c.c d.d@e.e.e")

Then filter out the words that contain @.

 words = [word for word in words if re.search(r'^((?!@).)*$', word)]
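
As a rough sketch of the whole split-then-filter approach, assuming an email is simply any token containing @ and that the \w+\.\w+ check from the question should still be applied to the remaining tokens (the helper name find_words is made up for illustration):

```python
import re

def find_words(text):
    """Return 'word.word' tokens that are not part of an email-like token."""
    # Split on runs of colons and whitespace; drop empty strings left at the edges.
    tokens = [t for t in re.split(r"[:\s]+", text) if t]
    # Keep tokens without "@" that match the word pattern in full.
    return [t for t in tokens if "@" not in t and re.fullmatch(r"\w+\.\w+", t)]

print(find_words("a.a b.b:c.c d.d@e.e.e"))  # ['a.a', 'b.b', 'c.c']
```

A plain "@" not in t test does the same job here as the negative-lookahead regex and is easier to read.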
+2

Correctly parsing email addresses with a regular expression is extremely difficult, but for the simplified case, where a word is \w+\.\w+ and an email is any sequence that contains @, you may find that this regular expression does what you need:

 >>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d@e.e.e")
 ['a.a', 'b.b', 'c.c']

The trick here is to focus not on what comes in the next or previous word, but on what the current word should look like.

Another trick is to identify the word delimiters correctly. Before the word we allow one or more spaces or colons, or the beginning of the line, consuming these characters without capturing them. After the word we require almost the same thing (with the end of the line instead of the beginning), but we do not consume these characters; we use a lookahead assertion instead.
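
The pieces described above can be laid out one per line with re.VERBOSE. This is only an annotated restatement of the same pattern; the name WORD is chosen here for illustration:

```python
import re

WORD = re.compile(
    r"""
    (?:^|[:\s]+)     # start of string, or a run of colons/whitespace: consumed, not captured
    (\w+\.\w+)       # the word itself, the only capturing group
    (?=[:\s]+|$)     # lookahead: a delimiter or the end of the string must follow, not consumed
    """,
    re.VERBOSE,
)

print(WORD.findall("a.a b.b:c.c d.d@e.e.e"))  # ['a.a', 'b.b', 'c.c']
```

The lookahead matters because adjacent words share a delimiter: if the trailing space after a.a were consumed, the match for b.b would have no leading delimiter left to start from.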

+1

You can match email-like substrings with \S+@\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to return only the captured values, and filter out the empty elements (they appear in the re.findall results whenever an email is matched).

 import re
 rx = r"\S+@\S+\.\S+|(\w+\.\w+)"
 s = "a.a b.b:c.c d.d@e.e.e"
 res = list(filter(None, re.findall(rx, s)))  # list() is needed on Python 3, where filter is lazy
 print(res)  # => ['a.a', 'b.b', 'c.c']
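
To see where the empty elements come from, it may help to look at the unfiltered result. In this small sketch the email branch of the alternation has no capturing group, so re.findall reports an empty string for it:

```python
import re

rx = re.compile(r"\S+@\S+\.\S+|(\w+\.\w+)")
s = "a.a b.b:c.c d.d@e.e.e"

raw = rx.findall(s)
print(raw)                    # ['a.a', 'b.b', 'c.c', '']
print([w for w in raw if w])  # ['a.a', 'b.b', 'c.c']
```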


+1

Source: https://habr.com/ru/post/1270441/

