I am working on a canned group of regular expressions for common tasks like this, which I have put into a package, qdapRegex, on GitHub; it will eventually go to CRAN. It can extract the matched pieces as well as remove them. Any feedback on the package is welcome.
Here it is:
    library(devtools)
    install_github("trinker/qdapRegex")
    library(qdapRegex)

    x <- c("download file from http://example.com",
           "this is the link to my website http://example.com",
           "go to http://example.com from more info.",
           "Another url ftp://www.example.com",
           "And https://www.example.net",
           "twitter type: t.co/N1kq0F26tG",
           "still another one https://t.co/N1kq0F26tG :-)")

    rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

    ## [1] "download file from"             "this is the link to my website"
    ## [3] "go to from more info."          "Another url"
    ## [5] "And"                            "twitter type:"
    ## [7] "still another one :-)"

    rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

    ## [[1]]
    ## [1] "http://example.com"
    ##
    ## [[2]]
    ## [1] "http://example.com"
    ##
    ## [[3]]
    ## [1] "http://example.com"
    ##
    ## [[4]]
    ## [1] "ftp://www.example.com"
    ##
    ## [[5]]
    ## [1] "https://www.example.net"
    ##
    ## [[6]]
    ## [1] "t.co/N1kq0F26tG"
    ##
    ## [[7]]
    ## [1] "https://t.co/N1kq0F26tG"
Edit: I noticed that Twitter links were not being removed. I will not add this to the regular expression specific to the rm_url function, but I have added it to the dictionary in qdapRegex. So there is no dedicated function for removing both standard URLs and Twitter URLs, but pastex (paste regular expression) makes it easy to grab regular expressions from the dictionary and paste them together (using the pipe operator, |). Since all rm_XXX style functions work essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function, as shown below:
    rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
    rm_twitter_url(x)
    rm_twitter_url(x, extract=TRUE)
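For anyone curious what joining patterns with the pipe amounts to, here is a rough base R sketch of the same idea, with no package needed. The two patterns below are simplified stand-ins I made up for illustration; they are not the actual entries in the qdapRegex dictionary, and the names twitter_pat, url_pat and combined are hypothetical.

```r
# Hypothetical, simplified patterns (NOT the real qdapRegex dictionary entries)
twitter_pat <- "(https?://)?t\\.co/[A-Za-z0-9]+"
url_pat     <- "(https?|ftp)://[^[:space:]]+"

# Join the alternatives with the pipe operator, roughly what pastex is described as doing
combined <- paste(c(twitter_pat, url_pat), collapse = "|")

x <- c("download file from http://example.com",
       "twitter type: t.co/N1kq0F26tG",
       "still another one https://t.co/N1kq0F26tG :-)")

# Remove every match, then trim leading/trailing whitespace
cleaned <- trimws(gsub(combined, "", x, perl = TRUE))

# Extract the matches instead of removing them (one list element per string)
extracted <- regmatches(x, gregexpr(combined, x, perl = TRUE))
```

Note that this sketch only trims the ends of each string, so a URL removed from the middle of a sentence can leave a double space behind; the rm_XXX functions tidy that up for you.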