Multiple regexpr on one line in R

Question

So, I have a very long string, and I want to work with multiple matches. It seems I can only get the position of the first match using regexpr. How can I get the positions of multiple matches in one string?

I am looking for a specific piece of the HTML source code: the auction title, which sits between HTML tags. It is hard to locate.

So far I am using this:

    locationstart <- gregexpr("<span class=\"location-name\">", URL)[[1]] + 28
    locationend <- regexpr("<", substring(URL, locationstart[1], locationstart[1] + 100))
    substring(URL, locationstart[1], locationstart[1] + locationend - 2)

That is, I look for the part that precedes the title and note its position; from there I look for a "<" indicating that the title has ended. I am open to more specific suggestions.

regex r

PascalVKooten May 06 '13 at 15:30

2 answers

gregexpr and regmatches, as suggested in Dason's answer, let you extract multiple instances of a regex pattern from a string. That solution also has the advantage of relying solely on base R rather than on an additional package.

However, I would like to offer an alternative based on the stringr package. In general, this package simplifies working with character strings: it covers most of the functionality of base R's string functions (not just the regular-expression ones) through a set of intuitively named functions with a consistent API. In fact, the stringr functions do not merely replace the base R functions; in many cases they add functionality. For example, the stringr regular-expression functions are vectorized over both the string and the pattern.
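To make the vectorization claim concrete, here is a minimal sketch, assuming stringr is installed. The inputs `strings` and `patterns` are made up for illustration and do not come from the question.

```r
library(stringr)

# Hypothetical inputs: two strings, each paired with its own pattern.
strings  <- c("apple 1", "pear 22")
patterns <- c("[a-z]+", "[0-9]+")

# Element i of `patterns` is applied to element i of `strings`.
paired <- str_extract(strings, patterns)

# A single pattern is recycled across all strings.
recycled <- str_extract(strings, "[0-9]+")
```

Here `paired` holds the letters from the first string and the digits from the second, while `recycled` holds the digits from both.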

In particular, extracting multiple matches of a pattern from a long string can be handled with str_extract_all and str_match_all, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted using list/matrix indexing, unlist, or other approaches such as lapply, sapply, etc. The point is that the stringr functions return structures that can be indexed to pull out only what we want.

    library(stringr)

    # Simulate HTML input. (Bogus <blah> tags mark the target texts; the demo works
    # the same for actual HTML patterns, the regular expression is just a bit more complex.)
    htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
                       "sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
                       "risus ipsum, aenean quis, sapien",
                       "in lorem, condimentum ornare viverra",
                       "suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
                       "dolor mauris tellus, dui leo purus varius")

    # str_extract_all() may need a bit of extra work to remove the leading and trailing parts
    str_extract_all(htmlInput, "(<blah>)([^<]+)<")
    # [[1]]
    # [1] "<blah>MATCH_ONE<"  "<blah>MATCH2<"     "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"

    # str_match_all() returns the capture group in column 2 of the result matrix
    str_match_all(htmlInput, "<blah>([^<]+)<")[[1]][, 2]
    # [1] "MATCH_ONE"  "MATCH2"     "MATCH Nr 3" "LAST MATCH"
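When the input is a vector of strings rather than a single string, the same call can be combined with lapply and unlist, as the text above suggests. The following is only a sketch under that assumption; the `pages` vector is a hypothetical stand-in that reuses the bogus <blah> markers.

```r
library(stringr)

# Hypothetical vector of inputs, using the same bogus <blah> markers.
pages <- c("a<blah>FIRST<blah>b<blah>SECOND<blah>c",
           "no markers here",
           "x<blah>ONLY<blah>y")

# str_match_all() returns one matrix per input string:
# column 1 is the full match, column 2 the capture group.
matches <- str_match_all(pages, "<blah>([^<]+)<")

# Keep only the captured text, one character vector per input string.
captured <- lapply(matches, function(m) m[, 2])

# Flatten to a single vector if the per-string grouping is not needed.
all_captured <- unlist(captured)
```

Strings without a match yield an empty element in `captured`, so the list stays aligned with `pages` until it is flattened.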

mjv Jan 24 '17 at 4:52

Dason · Accepted Answer · 2013-05-06T15:32:56+0000

Using gregexpr allows for several matches.

    > x <- c("only one match", "match1 and match2", "none here")
    > m <- gregexpr("match[0-9]*", x)
    > m
    [[1]]
    [1] 10
    attr(,"match.length")
    [1] 5
    attr(,"useBytes")
    [1] TRUE

    [[2]]
    [1]  1 12
    attr(,"match.length")
    [1] 6 6
    attr(,"useBytes")
    [1] TRUE

    [[3]]
    [1] -1
    attr(,"match.length")
    [1] -1
    attr(,"useBytes")
    [1] TRUE

If you want to extract the matched text rather than the positions, you can use regmatches.

    > regmatches(x, m)
    [[1]]
    [1] "match"

    [[2]]
    [1] "match1" "match2"

    [[3]]
    character(0)
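Applied to the HTML case from the question, the same gregexpr/regmatches pair might look like the sketch below. The `html` string is a made-up stand-in for the real page source, and the lookbehind pattern (which needs perl = TRUE) is one possible way to leave the opening tag out of the match; it is not taken from the answer itself.

```r
# Hypothetical page source; the real one would come from the downloaded HTML.
html <- paste0(
  '<div><span class="location-name">Amsterdam</span></div>',
  '<div><span class="location-name">Rotterdam</span></div>'
)

# The lookbehind (?<=...) requires the opening tag immediately before the match
# without including it; [^<]+ then runs up to the next "<".
pattern <- '(?<=<span class="location-name">)[^<]+'

# gregexpr() finds every occurrence; regmatches() pulls out the matched text.
m <- gregexpr(pattern, html, perl = TRUE)
locations <- regmatches(html, m)[[1]]
```

This avoids the manual offset arithmetic (the + 28 and - 2 in the question), because the match positions already exclude the surrounding tags.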
