Using re, although I'm not sure this is what you need, because you said you didn’t want “cool” to remain.
import re
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
REGEX = r'([^a-zA-Z\s]+)'
cleaned = re.sub(REGEX, '', s).split()
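For reference, here is a quick self-contained check of what this first attempt returns. It is only a sketch using the same input string; the variable name words is mine. Note that it keeps the original casing and that "cool" is still in the list:

```python
import re

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"

# Drop every run of characters that is neither an ASCII letter nor whitespace,
# then split the remainder on whitespace.
words = re.sub(r'([^a-zA-Z\s]+)', '', s).split()
print(words)
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']
```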
EDIT
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')
def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()
[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
WORD_REGEX uses a positive lookahead for any word character and a negative lookahead for <...>. Whatever run of non-whitespace gets past both lookaheads is captured in a group:
(?!<?\S+>)
(?=\w)
(\S+)
cleaned takes a match object, removes every non-letter character from the captured group using CLEAN_REGEX, and lowercases the result.
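Putting the pieces together, a minimal runnable sketch of the second approach might look like this. The wrapper name extract_words is my own and not part of the original snippet; the two patterns are the same ones as above:

```python
import re

# Skip tokens that look like <...>, require a word character at the start,
# then capture the whole run of non-whitespace.
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
# Anything that is not an ASCII letter gets stripped from each captured token.
CLEAN_REGEX = re.compile(r'[^a-zA-Z]')

def extract_words(text):
    """Return lowercased words, skipping <...> tokens and punctuation-only tokens."""
    return [CLEAN_REGEX.sub('', m.group(1)).lower()
            for m in WORD_REGEX.finditer(text)]

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
print(extract_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```

One caveat: \w also matches digits and underscores, so a token such as "123" passes WORD_REGEX and comes out as an empty string after cleaning. If that matters for your input, filter out empty results.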