Extract string elements that may appear multiple times or not at all

Start with a character vector of URLs. The goal is to end up with only the company name, that is, a vector containing "test", "example", and "sample" in the example below.

 urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/") 

Remove ".com" and anything that may follow it, keeping only the first part:

 urls <- sapply(strsplit(urls, split="(?<=.)(?=\\.com)", perl=T), "[", 1)
 urls
 # [1] "http://grand.test" "https://example" "http://.big.time.sample"

My next step is to remove the http:// and https:// prefixes using chained gsub() calls:

 urls <- gsub("^http://", "", gsub("^https://", "", urls))
 urls
 # [1] "grand.test" "example" ".big.time.sample"

But here I need help. How do I handle multiple periods (dots) before the company name, as in the first and third elements of urls? For example, the first call below returns NA for the second element, since the string "example" has no remaining period. Or, if I keep only the first part, I lose the company name.

 urls <- sapply(strsplit(urls, split = "\\."), "[", 2)
 urls
 # [1] "test" NA "big"
 # alternatively, starting again from the same urls as above:
 urls <- sapply(strsplit(urls, split = "\\."), "[", 1)
 urls
 # [1] "grand" "example" ""

Perhaps an ifelse() call that counts the number of remaining periods and uses strsplit only if there is more than one period? Also note that there may be two or more periods before the company name. I do not know how to write lookarounds, which might solve my problem. This attempt did not work:

 strsplit(urls, split="(?=\\.)", perl=T) 

Thanks for any suggestions.

7 answers

Here's an approach that might be easier to understand and generalize than some others:

 pat = "(.*?)(\\w+)(\\.com.*)"
 gsub(pat, "\\2", urls)

It works by splitting each string into three capture groups that together match the entire string, and then substituting back only capture group (2), the one you want.

 pat = "(.*?)(\\w+)(\\.com.*)"
 #      ^    ^     ^
 #      |    |     |
 #     (1)  (2)   (3)

Edit (adding an explanation of the ? modifier):

Note that capture group (1) must include the "non-greedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It essentially tells the regex engine to match as many characters as it can ... without using up any that could otherwise become part of the next capture group (2).

Without the trailing ?, repetition quantifiers are greedy by default. In this case a greedy capture group (.*), since it matches any number of characters of any type, would "eat" as much of the string as it can, leaving capture group (2) only the single character it needs for the match to succeed. That is not the behavior we want!
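To make the difference concrete, here is a minimal sketch that runs the lazy and the greedy form of the pattern on the question's urls vector. It passes perl = TRUE, which is my assumption rather than part of the answer, only so that the backtracking behaviour is unambiguous; the names lazy and greedy are illustrative.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

lazy   <- "(.*?)(\\w+)(\\.com.*)"  # group (1) takes as little as possible
greedy <- "(.*)(\\w+)(\\.com.*)"   # group (1) takes as much as possible

# Keep only capture group (2) in each case
lazy_result   <- sub(lazy,   "\\2", urls, perl = TRUE)
greedy_result <- sub(greedy, "\\2", urls, perl = TRUE)

print(lazy_result)    # [1] "test"    "example" "sample"
print(greedy_result)  # [1] "t" "e" "e"
```

With the greedy form, group (2) is left with just the last word character before ".com".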


I think there should be a simpler way, but this works:

 sub('.*[.]', '', sub('https?:[/]+[.]?(.*)[.]com[/]', '\\1', urls))
 # [1] "test" "example" "sample"

where urls is your original URL vector.


I think there is a way to extract the word before '.com' directly, but perhaps this gives an idea:

 sub(".com", "", regmatches(urls, gregexpr("(\\w+).com", urls))) 

Using strsplit might be worth a try:

 sapply(strsplit(urls, "/|\\."), function(x) tail(x, 2)[1])
 # [1] "test" "example" "sample"

This was a terrific example. Useful answers and some explanations arrived very quickly.

I am not sure that answering my own question is the proper thing to do, but I would like to thank the contributors, offer something that could help others looking at this question, and explain why I chose one answer. A comment did not seem right and would not have been long enough.

The answers follow, together with my explanations (humbly offered, and corrections are welcome), some of which include explanations from the respondents. Working through the answers taught me a lot and helped me choose a preferred answer. Some answers used non-base-R functions, and one relied on a custom function that may well be great but is not so readily available. I liked the second answer because it used only the sub function, but I gave the laurel wreath to the fifth for its elegant use of two techniques that I really liked. Thanks to everyone.

ANS 1

 sub(".com", "", regmatches(urls, gregexpr("(\\w+).com", urls))) 

gregexpr finds one or more word characters, using "\\w+", before ".com" and returns a list of match positions carrying attributes such as match.length and useBytes

regmatches takes what gregexpr found and returns only the matched substrings

sub removes the first ".com" from each string [I'm not sure why gsub would not work, but perhaps a global substitution is a risk when you want only the first instance]
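The intermediate results can be inspected step by step. The sketch below is only an illustration on the question's urls vector: the variable names are mine, unlist() assumes exactly one match per URL, and the final call escapes the dot and anchors the pattern, which is a small variation on the answer rather than the answer itself.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# gregexpr returns a list with one element per input string
hits <- gregexpr("(\\w+).com", urls)
hit_attrs <- names(attributes(hits[[1]]))  # includes "match.length" and "useBytes"

# regmatches pulls out the matched substrings; unlist assumes one match per URL
matched <- unlist(regmatches(urls, hits))
print(matched)   # [1] "test.com"    "example.com" "sample.com"

# Variation: escape the dot and anchor at the end so only a literal ".com" suffix is removed
company <- sub("\\.com$", "", matched)
print(company)   # [1] "test"    "example" "sample"
```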

ANS 2

 sub('.*[.]','', sub('https?:[/]+[.]?(.*)[.]com[/]','\\1',urls)) 

the inner sub handles both "http:" and "https:" with the special question mark character "?", which makes the "s" optional

the inner sub then matches one or more "/" using a character class that contains only a slash, extended with "+", which covers the two slashes in http://

the next part of the inner regular expression, moving to the right, makes a single leading period optional with "[.]?", and "(.*)" then captures everything up to the period before "com"

next, the period preceding "com" is matched with a bracketed character class, "[.]", rather than by escaping it

then comes "com" followed by a slash (I'm not sure I understand this part)

'\\1' keeps only the first capture group, that is, the part matched by "(.*)"

all this returns:

 [1] "grand.test" "example" "big.time.sample" 

the outer (leftmost) sub takes the result of the inner sub and deletes everything up to and including the last period, using ".*" followed by a bracketed period "[.]"
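Splitting the nested call into two steps makes this walk-through easier to follow. This is just the same two sub calls applied one after the other to the question's urls vector; the names inner and outer are illustrative.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: strip the scheme, an optional leading dot, and the trailing ".com/"
inner <- sub('https?:[/]+[.]?(.*)[.]com[/]', '\\1', urls)
print(inner)  # [1] "grand.test"      "example"         "big.time.sample"

# Step 2: drop everything up to and including the last remaining period
outer <- sub('.*[.]', '', inner)
print(outer)  # [1] "test"    "example" "sample"
```

Strings without a period, such as "example", pass through the second step unchanged because the pattern does not match.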

ANS 3

 sapply(strsplit(urls, "/|\\."), function(x) tail(x,2)[1]) 

First, strsplit splits each string at every slash or period, using the vertical bar | for alternation, which produces a list:

 [[1]]
 [1] "http:" "" "grand" "test" "com"
 [[2]]
 [1] "https:" "" "example" "com"
 [[3]]
 [1] "http:" "" "" "big" "time" "sample" "com"

Then the anonymous function takes the last two elements of each list element with the tail function and selects the first of them, thereby neatly dropping each "com"

Wrapping these two steps in sapply applies the anonymous function to all three strings
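The same two steps can be written with vapply, which also checks that every result is a single string. This is a sketch of an equivalent form, not the answer's own code; second_last is a hypothetical helper name.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: split on "/" or "." (a list with one character vector per URL)
parts <- strsplit(urls, "/|\\.")

# Step 2: take the last two pieces and keep the first of them,
# i.e. the piece just before "com"
second_last <- function(x) tail(x, 2)[1]
company <- vapply(parts, second_last, character(1))

print(company)  # [1] "test"    "example" "sample"
```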

ANS 4

 library(stringr)
 word(basename(urls), start = -2, sep = "\\.")

The basename function returns

 [1] "grand.test.com" "example.com" ".big.time.sample.com" 

From the basename() help, we learn that "basename removes all of the path up to and including the last path separator (if any)". This neatly removes the http:// and https:// elements.

Then the word() function takes the second "word" from the end, using a negative index (start = -2), given that the separator is a period (sep = "\\.").
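One thing to keep in mind is that basename() keeps only the last path component, so this approach relies on the URLs having no path after the host. The short sketch below uses a made-up URL with a path (an assumption for illustration, not part of the question's data) to show the difference.

```r
# URL from the question: nothing follows the host except a trailing slash
host_only <- basename("http://grand.test.com/")
print(host_only)   # [1] "grand.test.com"

# Hypothetical URL with a path: basename returns the last path component instead
with_path <- basename("http://grand.test.com/about/index.html")
print(with_path)   # [1] "index.html"
```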

ANS 5

 pat = "(.*?)(\\w+)(\\.com.*)"
 gsub(pat, "\\2", urls)

The regular expression assigned to the pat object splits each string into three capture groups that together match the entire string

The gsub function then searches for pat and substitutes back only capture group (2), the desired part.

Note two techniques here. First, create an object holding your expression and then use it in the regex function call. This makes the code cleaner and easier to read, as the line with the gsub call shows. Second, note the use of capture groups, which are the components of a regular expression enclosed in parentheses. They can be referred to later, as with "\\2" in this example.

 pat = "(.*?)(\\w+)(\\.com.*)"
 #      ^    ^     ^
 #      |    |     |
 #     (1)  (2)   (3)
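To see what each capture group holds, one can substitute back each group in turn. The sketch below does this on the question's urls vector; perl = TRUE is my assumption (the answer uses the default engine), and g1, g2, g3 are illustrative names.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

pat <- "(.*?)(\\w+)(\\.com.*)"

g1 <- sub(pat, "\\1", urls, perl = TRUE)  # everything before the company name
g2 <- sub(pat, "\\2", urls, perl = TRUE)  # the company name
g3 <- sub(pat, "\\3", urls, perl = TRUE)  # ".com" and whatever follows

print(g1)  # [1] "http://grand."     "https://"          "http://.big.time."
print(g2)  # [1] "test"    "example" "sample"
print(g3)  # [1] ".com/" ".com/" ".com/"

# The three groups together reproduce each original string
print(identical(paste0(g1, g2, g3), urls))  # [1] TRUE
```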

ANS 6

 regcapturedmatches(urls, regexpr("([^.\\/]+)\\.com", urls, perl=T)) 

This may be a good solution, but it depends on the regcapturedmatches function, which is not in base R or in a package such as qdap, stringi, or stringr

Mr. Flick says that "if you want just simple vectors for the return value, you can unlist() the results."

He explains that "the idea of the pattern is to capture everything that is not a dot or "/" immediately before ".com"". This is a bracket expression followed by a + sign, which indicates one or more such characters.

perl = TRUE seems like a good argument for all regular expressions
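For readers who prefer to stay within base R, a similar capture can be sketched with regexec() and regmatches(). This is a swapped-in technique, not the regcapturedmatches function from the answer, and the bracket expression is simplified to "[^./]" on my own assumption that backslashes do not occur in the URLs.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# regexec records the full match and each parenthesised group;
# regmatches then returns, per URL, c(full match, group 1)
m <- regmatches(urls, regexec("([^./]+)\\.com", urls))
print(m[[1]])    # [1] "test.com" "test"

# Keep only the captured group (the second element of each result)
company <- vapply(m, function(x) x[2], character(1))
print(company)   # [1] "test"    "example" "sample"
```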


You can use stringr::word() along with basename().

basename() is handy for working with URLs.

 library(stringr)
 word(basename(urls), start = -2, sep = "\\.")
 # [1] "test" "example" "sample"

basename(urls) gives

 [1] "grand.test.com" "example.com" ".big.time.sample.com" 

Then, in the word() function, we take the second word from the end (start = -2), given that the separator is a period (sep = "\\.").


Since you can never have enough regex options, here is one that uses the regcapturedmatches.R function:

 regcapturedmatches(urls, regexpr("([^.\\/]+)\\.com", urls, perl=T)) 

If you need only simple vectors for the return value, you can unlist() the results. The idea of the pattern is to capture anything that is not a dot or "/" immediately before ".com".


Source: https://habr.com/ru/post/971078/

