Regex. Match words containing special characters or "http://"

I want to match (and remove) words that contain special characters or start with 'http://'.

So this sentence

%he#llo, my website is: http://www.url.com/abcdef123

should turn into this

my website

So far I have this

re.sub(r"^[^\w]", " ", "%he#llo, my website is: http://www.url.com/abcdef123") 

This only removes the characters themselves; it does not remove the words that contain them (it also leaves the ":" and ","), and it does not remove the URL.

+4
3 answers

For the example string you give, the following regular expression works fine:

    >>> import re
    >>> a = '%he#llo, my website is: http://www.url.com/abcdef123'
    >>> re.findall(r'(http://\S+|\S*[^\w\s]\S*)', a)
    ['%he#llo,', 'is:', 'http://www.url.com/abcdef123']

... or you can remove these words with re.sub:

    >>> re.sub(r'(http://\S+|\S*[^\w\s]\S*)', '', a)
    ' my website  '

| means alternation: within the group, it matches the expression on either side of it. The part on the left matches http:// followed by one or more non-whitespace characters. The part on the right matches zero or more non-whitespace characters, followed by one character that is neither a word character nor whitespace, followed by zero or more non-whitespace characters. This guarantees that the matched word contains at least one character that is neither a word character nor whitespace.
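To see what each side of the alternation contributes, the two branches can be run separately. This is only a small sketch using the example string from the question; the variable names are made up for illustration.

```python
import re

text = '%he#llo, my website is: http://www.url.com/abcdef123'

# Left branch only: "http://" followed by one or more non-whitespace characters.
url_matches = re.findall(r'http://\S+', text)

# Right branch only: a run of non-whitespace characters that contains at
# least one character that is neither a word character nor whitespace.
special_matches = re.findall(r'\S*[^\w\s]\S*', text)

print(url_matches)
print(special_matches)
```

Note that the right branch already picks up the URL on its own, because the URL contains ':' , '/' and '.'.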

Updated: Of course, as the other answers implicitly suggest, since the http:// prefix itself contains a non-word character ( / ), you do not need it as a separate alternative - you can simplify the regular expression to \S*[^\w\s]\S* . However, perhaps the alternation example above is still useful.
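As a rough sketch of the simplified pattern in use, the snippet below removes the matching words and then collapses the leftover whitespace so the result is exactly "my website". The helper name strip_special_words is hypothetical, and the whitespace cleanup is an extra step that the answer itself does not include.

```python
import re

# Simplified pattern: any whitespace-delimited word that contains at least
# one character that is neither a word character nor whitespace.
SPECIAL_WORD = re.compile(r'\S*[^\w\s]\S*')

def strip_special_words(s):
    """Hypothetical helper: drop matching words, then normalize spaces."""
    without_special = SPECIAL_WORD.sub('', s)
    # split() with no arguments discards leading, trailing and repeated whitespace.
    return ' '.join(without_special.split())

text = '%he#llo, my website is: http://www.url.com/abcdef123'
print(strip_special_words(text))  # my website
```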

+6

You can use a lookahead:

    >>> re.findall(r"(?:\s|^)(\w+)(?=\s|$)", "Start %he#llo, my website is: http://www.url.comabcdef123 End")
    ['Start', 'my', 'website', 'End']

Explanation:

  • (?:\s|^) means that our word is at the start of the string or is preceded by whitespace (the whitespace does not belong to the word).
  • (\w+) matches the word itself (this is the part we are interested in).
  • (?=\s|$) means that our word is followed by whitespace or the end of the string (again, the whitespace does not belong to the word).
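The same idea can also be written with lookarounds on both sides, so that no whitespace is consumed and no capture group is needed. This is a variant, not the pattern from the answer above: (?<!\S) and (?!\S) are swapped in for (?:\s|^) and (?=\s|$).

```python
import re

text = "Start %he#llo, my website is: http://www.url.comabcdef123 End"

# (?<!\S) : the word is not preceded by a non-whitespace character,
#           i.e. it sits at the start of the string or right after whitespace.
# (?!\S)  : the word is not followed by a non-whitespace character,
#           i.e. it sits at the end of the string or right before whitespace.
plain_words = re.findall(r'(?<!\S)\w+(?!\S)', text)

# Join the surviving words back into a sentence.
cleaned = ' '.join(plain_words)
print(cleaned)  # Start my website End
```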
+4

This does not use regular expressions, but maybe it can work? (I assume ':' and '/' count as special characters, so the URL is removed implicitly.)

    import string

    def good_word(word):
        # A word is good only if every character is an ASCII letter.
        for c in word:
            if c not in string.ascii_letters:
                return False
        return True

    def clean_string(text):
        # Keep only the whitespace-separated words that pass good_word.
        return ' '.join([w for w in text.split() if good_word(w)])

    print(clean_string("%he#llo, my website is: http://www.url.com/abcdef123"))
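One difference from the regex answers is worth keeping in mind: the letter-only check also rejects digits and underscores, which \w would accept. A possible sketch that makes the allowed character set explicit is shown below; the allowed parameter and the ALLOWED constant are assumptions added for illustration, not part of the answer above.

```python
import string

# Default set of permitted characters: ASCII letters only, as in the answer.
ALLOWED = set(string.ascii_letters)

def clean_string(text, allowed=ALLOWED):
    """Keep only the words made up entirely of characters in `allowed`."""
    # set(w) <= allowed is True when every character of w is permitted.
    return ' '.join(w for w in text.split() if set(w) <= allowed)

print(clean_string("%he#llo, my website is: http://www.url.com/abcdef123"))  # my website

# Widening the set keeps words that contain digits as well.
with_digits = ALLOWED | set(string.digits)
print(clean_string("room 101 is open", allowed=with_digits))  # room 101 is open
```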
+2

Source: https://habr.com/ru/post/1335601/

