Remove URL from string in R

I have a character vector, myStrings, in R, which looks something like this:

 [1] download file from `http://example.com` [2] this is the link to my website `another url` [3] go to `another url` from more info. 

where `another url` is a valid http URL, but Stack Overflow doesn't allow me to embed more than one link, so I write `another url` instead. I want to remove all URLs from myStrings so it looks like this:

 [1] download file from [2] this is the link to my website [3] go to from more info. 

I tried many functions in the stringr package, but nothing works.

4 answers

You can use gsub with a regex that matches URLs.

Set up the vector:

 x <- c("download file from http://example.com",
        "this is the link to my website http://example.com",
        "go to http://example.com from more info.",
        "Another url ftp://www.example.com",
        "And https://www.example.net")

Remove all urls from each line:

 gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
 # [1] "download file from"             "this is the link to my website"
 # [3] "go to from more info."          "Another url"
 # [5] "And"

Update: It would be better if you could post a few different URLs so we know what we're working with. But I think this regex will work for the URLs mentioned in the comments:

 " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)" 

The above expression, explained:

  • ? an optional leading space
  • (f|ht) matches "f" or "ht"
  • tp matches "tp"
  • (s?) optionally matches an "s" if it is there
  • (://) matches "://"
  • (.*) matches everything before the following period or slash
  • [.|/] a period or a forward slash
  • (.*) then everything after it

I am not an expert with regular expressions, but I think I have explained it correctly.
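As a quick sanity check (my own sketch; the sample vector below is hypothetical, not from the question), applying this pattern shows both the removal and a caveat: the greedy (.*) can swallow text that follows the URL whenever that text contains a later period or slash:

```r
# Hypothetical examples; the second one shows the greedy (.*) over-matching.
x <- c("download file from http://example.com",
       "go to http://example.com for more info.")

gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", x)
# [1] "download file from" "go to"
```

Note that in the second element the trailing "for more info." is removed along with the URL, because (.*) matches everything up to the sentence's final period.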

Note: shortened (link-shortener) URLs are no longer allowed in SO answers, so I had to delete that section in my last edit. See the edit history for that part.


I am working on a canned set of regular expressions for common tasks like this, which I have collected into a package, qdapRegex, on GitHub; it will eventually go to CRAN. It can extract the pieces as well as remove them. Feedback on the package is welcome.

Here it is:

 library(devtools)
 install_github("trinker/qdapRegex")
 library(qdapRegex)

 x <- c("download file from http://example.com",
        "this is the link to my website http://example.com",
        "go to http://example.com from more info.",
        "Another url ftp://www.example.com",
        "And https://www.example.net",
        "twitter type: t.co/N1kq0F26tG",
        "still another one https://t.co/N1kq0F26tG :-)")

 rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))
 ## [1] "download file from"             "this is the link to my website"
 ## [3] "go to from more info."          "Another url"
 ## [5] "And"                            "twitter type:"
 ## [7] "still another one :-)"

 rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
 ## [[1]]
 ## [1] "http://example.com"
 ##
 ## [[2]]
 ## [1] "http://example.com"
 ##
 ## [[3]]
 ## [1] "http://example.com"
 ##
 ## [[4]]
 ## [1] "ftp://www.example.com"
 ##
 ## [[5]]
 ## [1] "https://www.example.net"
 ##
 ## [[6]]
 ## [1] "t.co/N1kq0F26tG"
 ##
 ## [[7]]
 ## [1] "https://t.co/N1kq0F26tG"

Edit: I saw that Twitter links were not being removed. Rather than add this to the regular expression specific to the rm_url function, I added it to the dictionary in qdapRegex. So there is no single function for removing both standard and Twitter URLs, but pastex (paste regular expressions) makes it easy to grab regular expressions from the dictionary and paste them together (with the pipe operator, | ). Since all rm_XXX style functions work essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function, as shown below:

 rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
 rm_twitter_url(x)
 rm_twitter_url(x, extract=TRUE)
 str1 <- c("download file from http://example.com",
           "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*', "", str1)
 # [1] "download file from "
 # [2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*', "", str1))  # removes trailing/leading spaces
 # [1] "download file from"
 # [2] "this is the link to my website for more info"

Update

To match ftp as well, I would use the same idea as in @Richard Scriven's post:

 str1 <- c("download file from http://example.com",
           "this is the link to my website https://www.google.com/ for more info",
           "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")

 gsub('(f|ht)tp\\S+\\s*', "", str1)
 # [1] "download file from "
 # [2] "this is the link to my website for more info"
 # [3] "this link to gives more info"

Some of the previous answers remove text beyond the URL itself; adding the "\b" word boundary helps. The pattern can also be extended to cover "sftp://" URLs.

For regular URLs:

 gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny URLs:

 gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x) 
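Putting the two patterns together might look like this (a minimal sketch; the vector x below is my own hypothetical example, not defined in the answer), with a final pass to collapse the double spaces the substitutions leave behind:

```r
# Hypothetical input; not part of the original answer.
x <- c("download sftp://files.example.com/data.zip now",
       "short link bit.ly/abc123 here")

no_urls <- gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)       # regular URLs, incl. sftp://
no_tiny <- gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", no_urls)  # tiny URLs
trimws(gsub("\\s+", " ", no_tiny))                         # tidy leftover whitespace
# [1] "download now"    "short link here"
```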
