Regular expression to remove URLs from Twitter tweets

I want to find and remove every occurrence of a Twitter URL in a line (a tweet):

Input:

This is a tweet from the URL: http://t.co/0DlGChTBIx

Output:

This is a tweet from the URL:

I tried this:

    p = re.compile(r'\<http.+?\>', re.DOTALL)
    tweet_clean = re.sub(p, '', tweet)
5 answers

Do this:

 result = re.sub(r"http\S+", "", subject) 
  • http matches the literal characters "http"
  • \S+ matches one or more non-whitespace characters (the rest of the URL)
  • the match is replaced with the empty string
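
As a quick check, the substitution can be run on the tweet from the question. This is a minimal sketch: the helper name `remove_urls` and the trailing `strip()` (which drops the space left behind where the URL was) are illustrative additions, not part of the answer above.

```python
import re

def remove_urls(text):
    # "http" followed by one or more non-whitespace characters
    return re.sub(r"http\S+", "", text).strip()

tweet = "This is a tweet from the URL: http://t.co/0DlGChTBIx"
print(remove_urls(tweet))  # This is a tweet from the URL:
```

Because re.sub replaces every non-overlapping match, the same call also handles tweets that contain several links.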

The following regular expression captures two groups: the first contains everything in the tweet before the URL, and the second contains everything after the URL (empty in the example you gave above):

    import re

    tweet = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
    # group 1: text before the URL; group 2: text after the URL
    clean_tweet = re.match(r'(.*?)http\S*\s?(.*)', tweet)
    if clean_tweet:
        print(clean_tweet.group(1))
        print(clean_tweet.group(2))  # will print everything after the URL

You can try the following re.sub call to remove the URL from your string:

    >>> import re
    >>> str = 'This is a tweet with a url: http://t.co/0DlGChTBIx'
    >>> m = re.sub(r':.*$', ":", str)
    >>> m
    'This is a tweet with a url:'

It removes everything from the first : to the end of the string; the : in the replacement string puts the colon back at the end.

Alternatively, this returns everything up to and including the first : character:

    >>> m = re.search(r'^.*?:', str).group()
    >>> m
    'This is a tweet with a url:'
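
Note that both snippets cut at the first colon rather than at the URL, so they rely on the tweet having the same shape as the example. A small sketch (the function name `cut_after_first_colon` and the sample strings are made up for illustration) shows what happens when text follows the URL or when other words sit between the colon and the URL:

```python
import re

def cut_after_first_colon(text):
    # Same substitution as above: drop everything after the first ':'
    return re.sub(r':.*$', ':', text)

# Text after the URL is lost as well
print(cut_after_first_colon('Look: http://t.co/0DlGChTBIx nice photo'))  # Look:
# Words between the colon and the URL are lost too
print(cut_after_first_colon('Note: see http://t.co/0DlGChTBIx'))  # Note:
```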

Try using this:

 text = re.sub(r"http\S+", "", text) 

    # Repeatedly cut out "http..." up to the next whitespace character
    clean_tweet = re.match(r'(.*?)http(.*?)\s(.*)', content)

    while clean_tweet:
        content = clean_tweet.group(1) + "" + clean_tweet.group(3)
        clean_tweet = re.match(r'(.*?)http(.*?)\s(.*)', content)
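
For reference, the loop can be wrapped in a function and tried on a tweet with more than one link. This is only a sketch: the names `PATTERN` and `strip_urls_loop` and the sample text are invented for the example. Because the pattern requires a whitespace character after each URL, a link at the very end of the string is left in place.

```python
import re

# Assumed reconstruction of the pattern used in the loop above
PATTERN = r'(.*?)http(.*?)\s(.*)'

def strip_urls_loop(content):
    clean_tweet = re.match(PATTERN, content)
    while clean_tweet:
        # Keep the text before the URL and the text after the whitespace
        content = clean_tweet.group(1) + "" + clean_tweet.group(3)
        clean_tweet = re.match(PATTERN, content)
    return content

print(strip_urls_loop('one http://t.co/a two http://t.co/b three'))  # one two three
print(strip_urls_loop('end http://t.co/0DlGChTBIx'))  # end http://t.co/0DlGChTBIx
```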


Source: https://habr.com/ru/post/971317/
