Regex. Match words containing special characters or "http://"

Question

I want to match words that contain special characters or start with 'http://'.

So this sentence

%he#llo, my website is: http://www.url.com/abcdef123

should turn into this

my website

So far I have this:

re.sub(r"^[^\w]", " ", "%he#llo, my website is: http://www.url.com/abcdef123")

This simply removes the characters, but does not delete the words associated with the character (it also does not remove the ":" and ",") and does not delete the URL.

+4

python regex

user216171 Jan 14 '11 at 19:36


3 answers

You can use a lookahead:

 >>> re.findall(r"(?:\s|^)(\w+)(?=\s|$)", "Start %he#llo, my website is: http://www.url.comabcdef123 End")
 ['Start', 'my', 'website', 'End']

Explanation:

(?:\s|^) means that our word is either at the start of the string or preceded by whitespace (and the whitespace does not belong to the word).
(\w+) matches the word itself (this is the part we are interested in).
(?=\s|$) means that our word is followed by whitespace or the end of the string (again, the whitespace does not belong to the word).
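
As a minimal sketch of how the three pieces above could be put together to produce the sentence the question asks for (the names `PLAIN_WORD` and `keep_plain_words` are made up for illustration and are not part of the answer):

```python
import re

# Pattern from the answer: a run of word characters that is preceded by
# whitespace or the start of the string, and followed by whitespace or the end.
PLAIN_WORD = re.compile(r"(?:\s|^)(\w+)(?=\s|$)")

def keep_plain_words(text):
    # Hypothetical helper: findall returns only the captured group (\w+),
    # so the surrounding whitespace is not part of the results.
    return " ".join(PLAIN_WORD.findall(text))

print(keep_plain_words("%he#llo, my website is: http://www.url.com/abcdef123"))
# prints: my website
```

Note that "is:" is dropped as well, because the colon is attached to the word.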

+4

Antoine Pelisse Jan 14 '11 at 19:51


This does not use regular expressions, but maybe it can work? (I assume ':' and '/' count as special characters, so it will remove the URL implicitly.)

 import string

 def good_word(word):
     # A word is "good" only if every character is an ASCII letter.
     for c in word:
         if c not in string.ascii_letters:
             return False
     return True

 def clean_string(text):
     # Keep only the whitespace-separated words that pass good_word().
     return ' '.join([w for w in text.split() if good_word(w)])

 print(clean_string("%he#llo, my website is: http://www.url.com/abcdef123"))

+2

yan Jan 14 '11 at 19:46


Mark Longair · Accepted Answer · 2011-01-14T20:09:07+0000

For the example string you give, the following regular expression works fine:

 >>> a = '%he#llo, my website is: http://www.url.com/abcdef123'
 >>> re.findall(r'(http://\S+|\S*[^\w\s]\S*)', a)
 ['%he#llo,', 'is:', 'http://www.url.com/abcdef123']

... or you can remove these words with re.sub:

 >>> re.sub(r'(http://\S+|\S*[^\w\s]\S*)', '', a)
 ' my website  '

The | means alternation: it matches the expression on either side of it within the group. The part on the left matches http:// followed by one or more non-whitespace characters. The part on the right matches zero or more non-whitespace characters, followed by one character that is neither a word character nor whitespace, followed by zero or more non-whitespace characters. This ensures that the match is a run of non-whitespace characters containing at least one non-word character.
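
The re.sub call leaves the whitespace around the removed words in place. A minimal sketch of one way to tidy that up, assuming collapsed single spaces are wanted (the names `SPECIAL` and `strip_special_words` are hypothetical, not from the answer):

```python
import re

# Alternation from the answer: either a URL starting with http://, or any
# whitespace-delimited token containing at least one non-word character.
SPECIAL = re.compile(r"(http://\S+|\S*[^\w\s]\S*)")

def strip_special_words(text):
    # Hypothetical helper: remove the matches, then collapse the leftover
    # whitespace by splitting on it and re-joining with single spaces.
    without = SPECIAL.sub("", text)
    return " ".join(without.split())

print(strip_special_words("%he#llo, my website is: http://www.url.com/abcdef123"))
# prints: my website
```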

Update: Of course, as the other answers implicitly suggest, since the http:// prefix contains a non-word character (/), you do not need it as a separate alternative: you can simplify the regular expression to \S*[^\w\s]\S*. However, perhaps the alternation example above is still useful.
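
A quick check of that simplification on the example string, as a sketch (the variable names are only illustrative):

```python
import re

text = "%he#llo, my website is: http://www.url.com/abcdef123"

# Original pattern with the explicit http:// alternative.
with_alternation = re.findall(r"(http://\S+|\S*[^\w\s]\S*)", text)
# Simplified pattern: any token containing at least one non-word character.
simplified = re.findall(r"\S*[^\w\s]\S*", text)

# Both patterns pick out the same tokens for this input.
print(with_alternation == simplified)
# prints: True
```

Keep in mind that \w includes the underscore, so a token such as foo_bar is not treated as containing a special character by either pattern.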
