Extract the last 4-digit number from a series in R using stringr

I would like to flatten lists extracted from HTML tables. Below is a minimal working example. The example depends on the stringr package in R. The first example demonstrates the desired behavior.

    library(stringr)

    years <- c("2005-", "2003-")
    unlist(str_extract_all(years, "[[:digit:]]{4}"))
    # [1] "2005" "2003"

In the example below, an undesirable result occurs when I try to match the last 4-digit number in a series of other numbers.

    years1 <- c("2005-", "2003-", "1984-1992, 1996-")
    unlist(str_extract_all(years1, "[[:digit:]]{4}$"))
    # character(0)

As I understand the documentation, I have to append $ to the pattern to request a match at the end of the line. For the second example, I would like to get "2005", "2003" and "1996".

4 answers

The stringi package has convenient functions that work with certain parts of the string. So you can find the last occurrence of four consecutive digits with the following.

    library(stringi)

    x <- c("2005-", "2003-", "1984-1992, 1996-")
    stri_extract_last_regex(x, "\\d{4}")
    # [1] "2005" "2003" "1996"

Other ways to get the same result:

    stri_sub(x, stri_locate_last_regex(x, "\\d{4}"))
    # [1] "2005" "2003" "1996"

    ## or, since these count as words
    stri_extract_last_words(x)
    # [1] "2005" "2003" "1996"

    ## or if you prefer a matrix result
    stri_match_last_regex(x, "\\d{4}")
    #      [,1]
    # [1,] "2005"
    # [2,] "2003"
    # [3,] "1996"

You can use base R sub to do this quite easily:

    sub('.*(\\d{4}).*', '\\1', years1)
    ## [1] "2005" "2003" "1996"

The pattern to be matched here is .* (zero or more of any character), followed by \\d{4} (four consecutive digits, which we capture in parentheses), followed by zero or more characters.

sub replaces the matched text with the value of the second argument. In this case, \\1 means that we replace the entire match with the first captured substring (i.e., the four consecutive digits).

The regex is greedy here, so the leading .* consumes any earlier runs of four digits and gives back only as much as \\d{4} needs. Only the last sequence of four consecutive digits is captured.
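To make the greedy behavior concrete, here is a small sketch that contrasts the greedy pattern with a lazy variant. The lazy variant and the variable names are my own additions for illustration, and I pass perl = TRUE to both calls as an assumption, to keep the lazy quantifier on the PCRE engine.

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# Greedy .* gives back as little as possible, so the LAST run of
# four digits ends up in the capture group.
last_year <- sub(".*(\\d{4}).*", "\\1", years1, perl = TRUE)

# Lazy .*? consumes as little as possible, so the FIRST run of
# four digits ends up in the capture group instead.
first_year <- sub(".*?(\\d{4}).*", "\\1", years1, perl = TRUE)

# Caveat: sub() returns its input unchanged when nothing matches.
no_match <- sub(".*(\\d{4}).*", "\\1", "n/a", perl = TRUE)

print(last_year)
print(first_year)
print(no_match)
```

Note that a string without any four-digit run comes back unchanged, so you may want to check for a match first (for example with grepl) if your table cells can be empty.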


The end-of-line anchor $ asserts the position at the end of the line.

Your pattern therefore requires exactly four digits immediately before the end of the line. The engine does find four digits, but when it then tries to assert the end-of-line position it fails, because the digits are followed by "-". It backtracks, retries from later positions, and never finds a match.
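A quick way to see this for yourself is to test the anchor with base R's grepl. This is only an illustrative check; the second pattern, with the literal "-" before $, is something I added to show where the end of the line actually is.

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# Four digits directly at the end of the string: no element qualifies,
# because every string ends with "-".
ends_with_digits <- grepl("[[:digit:]]{4}$", years1)

# Four digits followed by the trailing "-" and then the end: all qualify.
ends_with_dash <- grepl("[[:digit:]]{4}-$", years1)

print(ends_with_digits)
print(ends_with_dash)
```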

To fix this, you can greedily consume all characters up to the last set of digits. The \\K then discards everything matched so far, so only the four digits are returned.

    years1 <- c('2005-', '2003-', '1984-1992, 1996-')
    unlist(str_extract_all(years1, perl('.*\\K\\d{4}')))
    # [1] "2005" "2003" "1996"
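If the perl() helper is not available in your stringr version (an assumption on my part, since it depends on the release you have installed), the same \\K idea can be sketched in base R with regexpr and perl = TRUE. The variable names here are just for illustration.

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# .* greedily runs to the last four-digit group; \K drops what was
# matched so far, so the reported match is only the four digits.
m <- regexpr(".*\\K\\d{4}", years1, perl = TRUE)

# regmatches() keeps only the elements that matched.
last_year <- regmatches(years1, m)

print(last_year)
```

Keep in mind that regmatches silently drops elements without a match, so the result can be shorter than the input.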

Try this pattern:

    \\d{4}[^\\d]*$

It matches four digits followed only by non-digit characters up to the end of the string. That should do it for you. See the demo:

https://regex101.com/r/kG5pN6/2
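One thing to watch when you use this pattern for extraction rather than just matching: the trailing non-digits are part of the match. A possible workaround, sketched here in base R, is to move them into a lookahead. The lookahead variant is my own suggestion, not part of the demo above.

```r
years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# The pattern as given: the match includes the trailing "-".
with_tail <- regmatches(
  years1, regexpr("\\d{4}[^\\d]*$", years1, perl = TRUE)
)

# Lookahead variant: the non-digits must follow the four digits,
# but they are not part of the returned match.
digits_only <- regmatches(
  years1, regexpr("\\d{4}(?=\\D*$)", years1, perl = TRUE)
)

print(with_tail)
print(digits_only)
```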


Source: https://habr.com/ru/post/982766/

