For the example string you gave, the following regular expression works fine:
>>> import re
>>> a = '%he#llo, my website is: http://www.url.com/abcdef123'
>>> re.findall(r'(http://\S+|\S*[^\w\s]\S*)', a)
['%he#llo,', 'is:', 'http://www.url.com/abcdef123']
... or you can remove these words with re.sub (note that the surrounding whitespace is left in place):
>>> re.sub(r'(http://\S+|\S*[^\w\s]\S*)', '', a)
' my website  '
| means alternation: the group matches the expression on either side of it. The part on the left matches http:// followed by one or more non-whitespace characters. The part on the right matches zero or more non-whitespace characters, followed by one character that is neither a word character nor whitespace, followed by zero or more non-whitespace characters. This ensures that the matched string contains at least one character that is neither a word character nor whitespace.
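To make the two branches easier to see, here is the same pattern written with re.VERBOSE and checked against a few tokens. This is only an illustrative sketch: the sample token list and the names pattern, words and flagged are made up for the example and are not part of the question.

```python
import re

# Same pattern as above, spelled out with comments (re.VERBOSE ignores
# whitespace and "#" comments inside the pattern).
pattern = re.compile(r'''
    (
        http://\S+      # left branch: http:// then one or more non-whitespace characters
      |
        \S*             # right branch: zero or more non-whitespace characters,
        [^\w\s]         # then one character that is neither word nor whitespace,
        \S*             # then zero or more non-whitespace characters
    )
''', re.VERBOSE)

# Hypothetical sample tokens, chosen only to exercise both branches.
words = 'plain a-b http://x.y foo_bar 42 end.'.split()

# Keep the tokens that the pattern matches in full.
flagged = [w for w in words if pattern.fullmatch(w)]
print(flagged)
```

Note that foo_bar is not flagged, because the underscore counts as a word character for \w.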
Update: Of course, as other answers implicitly suggest, since the http:// prefix itself contains a non-word character ( / ), you do not need it as a separate alternative, and you can simplify the regular expression to \S*[^\w\s]\S* . However, perhaps the alternation example above is still useful.
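As a quick check of the simplification, the sketch below compares both patterns on the example string from the question. The variable names are my own, and the final whitespace clean-up step is an optional extra that the question did not ask for.

```python
import re

a = '%he#llo, my website is: http://www.url.com/abcdef123'

# Original pattern with the explicit http:// alternative.
alternation = re.compile(r'(http://\S+|\S*[^\w\s]\S*)')
# Simplified pattern: any token containing a non-word, non-whitespace character.
simplified = re.compile(r'\S*[^\w\s]\S*')

matches_alt = alternation.findall(a)
matches_simple = simplified.findall(a)

# Remove the matched tokens, then (optionally) collapse leftover whitespace.
cleaned = simplified.sub('', a)
tidy = ' '.join(cleaned.split())

print(matches_alt == matches_simple)
print(repr(tidy))
```

On this input both patterns find the same three tokens; that is only shown here for the one example string, not proven for every possible input.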