R equivalent of Python's re.findall

I'm trying to get all the matches of a regular expression from a string, but apparently this is not so easy in R, or I'm missing something. Frankly, it is confusing, and I'm lost among all the options: str_extract, str_match, str_match_all, regexec, grep, gregexpr, and who knows how many others.

What I'm trying to accomplish is simple in Python:

    >>> import re
    >>> re.findall(r'([\w\']+|[.,;:?!])', 'This is starting to get really, really annoying!!')
    ['This', 'is', 'starting', 'to', 'get', 'really', ',', 'really', 'annoying', '!', '!']

The problem with the functions mentioned above is that they either return only a single match or return no match at all.

1 answer

In general, R has no exact equivalent of Python's re.findall, which returns a list of match values or, when the pattern contains capture groups, the capture group contents (as tuples when there are several groups). The closest is str_match_all from the stringr package, but it is actually closer to Python's re.finditer, since it returns the whole match value in the first column and then all the submatches (the capture group contents) in the subsequent columns. It is still not an exact equivalent of re.finditer, because only texts are returned, not match data objects. So, if str_match_all did not return the whole match value, it would be the exact equivalent of Python's re.findall.
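
To make the comparison concrete, here is a minimal Python sketch of the three return shapes described above. The sample string and patterns are made up for illustration; the last list only imitates the row layout of str_match_all (whole match first, then groups) and is not an R API.

```python
import re

text = "a1 b2"  # hypothetical sample input

# No capture groups: findall returns the whole matches.
whole = re.findall(r"[a-z]\d", text)
# ['a1', 'b2']

# Two capture groups: findall returns one tuple of group contents per match.
groups = re.findall(r"([a-z])(\d)", text)
# [('a', '1'), ('b', '2')]

# finditer yields match objects; group(0) is the whole match and
# groups() holds the capture group contents.
rows = [(m.group(0),) + m.groups() for m in re.finditer(r"([a-z])(\d)", text)]
# [('a1', 'a', '1'), ('b2', 'b', '2')]
```

Each entry of `rows` has the whole match followed by the groups, which is the layout str_match_all produces, whereas `groups` drops the whole match.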

Since you use re.findall just to return the matches, not the captures, the capture group in your pattern is redundant and you can remove it. You can then safely use regmatches with gregexpr and the PCRE flavor (perl=TRUE is needed because [\\w'] will not work with the default TRE regex flavor):

    s <- "This is starting to get really, really annoying!!"
    res <- regmatches(s, gregexpr("[\\w']+|[.,;:?!]", s, perl=TRUE))
    ## => [[1]]
    ##  [1] "This"     "is"       "starting" "to"       "get"      "really"
    ##  [7] ","        "really"   "annoying" "!"        "!"
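
The claim that the capture group is redundant can be checked on the Python side. A minimal sketch, using the pattern and string from the question:

```python
import re

s = "This is starting to get really, really annoying!!"

# Pattern from the question: a single group wrapping the whole alternation.
with_group = re.findall(r"([\w']+|[.,;:?!])", s)

# Same pattern without the group.
without_group = re.findall(r"[\w']+|[.,;:?!]", s)

# With one group spanning the entire pattern, the group content is the
# whole match, so both calls return the same list.
print(with_group == without_group)  # True
```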


Or, to make \w Unicode-aware so that it works as in Python 3, add the (*UCP) PCRE verb:

 res <- regmatches(s, gregexpr("(*UCP)[\\w']+|[.,;:?!]", s, perl=TRUE)) 
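
For reference, this is the Python 3 behaviour that (*UCP) is meant to mirror. A small sketch; the sample string is made up for illustration:

```python
import re

s = "na\u00efve caf\u00e9"  # "naïve café", hypothetical sample input

# In Python 3, \w on str patterns is Unicode-aware by default.
unicode_words = re.findall(r"\w+", s)
# ['naïve', 'café']

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", s, flags=re.ASCII)
# ['na', 've', 'caf']
```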


If you want to use the stringr package (which uses the ICU regex library behind the scenes), you need str_extract_all :

    library(stringr)
    res <- str_extract_all(s, "[\\w']+|[.,;:?!]")

Source: https://habr.com/ru/post/1266712/

