gregexpr and regmatches, as suggested in Dason's answer, let you extract multiple instances of a regex pattern from a string. That solution also has the advantage of relying solely on base R, with no additional package required.
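For reference, the base R approach can be sketched as follows. This is only a minimal illustration, not a copy of Dason's code: the input string and the `<blah>` marker tag are made up for the example.

```r
# Hypothetical input; <blah> is a made-up marker tag around the target texts.
x <- "a<blah>ONE<blah> b <blah>TWO<blah> c"

# gregexpr() locates every match; regmatches() extracts the matched substrings.
m <- gregexpr("<blah>[^<]+<", x)
hits <- regmatches(x, m)[[1]]

# Strip the leading tag and the trailing "<" to keep only the target text.
targets <- sub("^<blah>", "", sub("<$", "", hits))
print(targets)
```

The extra `sub()` calls are needed because `regmatches` returns the full match, not the capture group.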
However, I would like to offer an alternative solution based on the stringr package. In general, this package simplifies working with character strings: it covers most of the functionality of base R's various string functions (not just the regular expression ones) with a set of intuitively named functions that offer a consistent API. In fact, stringr functions do not merely replace the base R functions; in many cases they add functionality. For example, stringr's regular expression functions are vectorized over both the string and the pattern.
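As a small illustration of that vectorization, here is a sketch using `str_extract`; the sample strings and patterns are invented for the example.

```r
library(stringr)

# Hypothetical sample data.
strings <- c("apple 12", "banana 7", "cherry")

# One pattern applied to every string; NA marks a string with no match.
digits <- str_extract(strings, "[0-9]+")

# One pattern per string: pattern i is applied to string i.
patterns <- c("[a-z]+", "[0-9]+", "z")
paired <- str_extract(strings, patterns)

print(digits)
print(paired)
```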
In particular, extracting multiple matches of a pattern from a long string can be done with str_extract_all and str_match_all, as shown below. Depending on whether the input is a single string or a vector of strings, the logic can be adapted using list / matrix indexing, unlist, or other approaches such as lapply, sapply, etc. The point is that the stringr functions return structures that can be indexed to access only what we want.
    library(stringr)

    # Simulate HTML input. Bogus <blah> tags mark the target texts; the demo works
    # the same for actual HTML patterns, the regular expression is just a bit more complex.
    htmlInput <- paste("Lorem ipsum dolor<blah>MATCH_ONE<blah> sit amet, purus",
                       "sollicitudin<blah>MATCH2<blah>mauris, <blah>MATCH Nr 3<blah>vitae donec",
                       "risus ipsum, aenean quis, sapien",
                       "in lorem, condimentum ornare viverra",
                       "suscipit <blah>LAST MATCH<blah> ipsum eget ac. Non senectus",
                       "dolor mauris tellus, dui leo purus varius")

    # str_extract_all() returns the full matches, so it may need a bit of extra
    # work to remove the leading and trailing parts
    str_extract_all(htmlInput, "(<blah>)([^<]+)<")
    # [[1]]
    # [1] "<blah>MATCH_ONE<"  "<blah>MATCH2<"     "<blah>MATCH Nr 3<" "<blah>LAST MATCH<"

    # str_match_all() returns one matrix per input string; column 2 holds the
    # first capture group, i.e. just the target text
    str_match_all(htmlInput, "<blah>([^<]+)<")[[1]][, 2]
    # [1] "MATCH_ONE"  "MATCH2"     "MATCH Nr 3" "LAST MATCH"
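When the input is a vector of strings rather than a single string, the same idea might be adapted along these lines. This is a sketch under assumptions: the `docs` vector is invented, and both of its elements contain at least one match.

```r
library(stringr)

# Hypothetical two-element input, again using the made-up <blah> marker.
docs <- c("x<blah>A<blah>y<blah>B<blah>", "<blah>C<blah>")

# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match, column 2 the first capture group.
per_doc <- lapply(str_match_all(docs, "<blah>([^<]+)<"), function(m) m[, 2])

# Flatten when it does not matter which string each match came from.
all_targets <- unlist(per_doc)

print(per_doc)
print(all_targets)
```

Keeping `per_doc` as a list preserves the link between each match and its source string, while `unlist` gives a single flat character vector.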