Regular expression to skip some characters

I am trying to clean a string so that it contains no punctuation marks or digits, only the letters a-z and A-Z. For example, given this string:

"coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

Expected output:

['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

My attempt:

re.findall(r"([A-Za-z]+)" ,string)

My output:

['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']
3 answers

You do not need to use regex:

Convert the string to lowercase (if you want all words in lowercase), split it into words, keep only the words that begin with a letter, and strip the non-letter characters from each of them:

>>> s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x filter returns a filter object rather than a string.
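
For reference, here is a minimal sketch of the same approach written for Python 3. The variable name words is only illustrative. Note that str.isalpha also accepts non-ASCII letters, so if strictly a-z is required, an explicit character check would be needed instead.

```python
s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

# Keep only the words that start with a letter, then drop every
# non-letter character from each of them.
words = [
    "".join(filter(str.isalpha, word))
    for word in s.lower().split()
    if word[0].isalpha()
]
print(words)
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```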


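Remove anything inside <...> together with every character that is not a letter or whitespace, then split on whitespace: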

import re

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print(cleaned)
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
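
One thing to watch for: <.*> is greedy, so on a line with more than one <...> group it removes everything from the first < to the last >. Below is a small sketch of the difference, assuming each tag should be removed on its own. The input s2 and the alternative pattern <[^>]*> are my own additions for illustration, not part of the answer above.

```python
import re

s2 = "alpha <x> beta <y> gamma"  # hypothetical input containing two tags

# Greedy: "<.*>" swallows "<x> beta <y>" in one match.
greedy = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s2).split()

# Per tag: "<[^>]*>" stops at the first closing bracket.
per_tag = re.sub(r'(<[^>]*>|[^a-zA-Z\s]+)', '', s2).split()

print(greedy)   # ['alpha', 'gamma']
print(per_tag)  # ['alpha', 'beta', 'gamma']
```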

Using re, although I'm not sure this is exactly what you need, because you said you did not want "cool" to remain:

import re

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

REGEX = r'([^a-zA-Z\s]+)'

cleaned = re.sub(REGEX, '', s).split()
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']

EDIT

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')

def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()

[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

WORD_REGEX uses a positive lookahead for a word character and a negative lookahead for <...>. Any run of non-whitespace characters that passes both lookaheads is captured in a group:

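(?!<?\S+>) # negative lookahead: reject positions inside a <...> token
(?=\w) # positive lookahead: the token must start with a word character
(\S+) # group: capture the run of non-whitespace characters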

cleaned takes a match object, removes every non-letter character from the captured group using CLEAN_REGEX, and lowercases the result.
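
Here is a self-contained sketch of the same idea wrapped in a function. The name extract_words is hypothetical, and the filter for empty strings is my own addition: because \w also matches digits and underscores, a token such as "123" passes WORD_REGEX and is then cleaned down to an empty string.

```python
import re

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'[^a-zA-Z]')

def extract_words(text):
    """Return lowercase, letters-only words, skipping <...> tokens."""
    words = (
        CLEAN_REGEX.sub('', match.group(1)).lower()
        for match in WORD_REGEX.finditer(text)
    )
    # Drop tokens that contained no letters at all (e.g. "123").
    return [word for word in words if word]

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
print(extract_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```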


Source: https://habr.com/ru/post/1671427/

