Split on the last occurrence of a digit, take the 2nd part

Question

If I have a string and want to split it on the last digit and keep the last part of the split, how can I do that?

x <- c("ID", paste0("X", 1:10, state.name[1:10]))

I would like to get:

   [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"
   [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"
  [11] "Georgia"

But I would settle for:

   [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"
   [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"
  [11] "Georgia"

I can get the first part:

 unlist(strsplit(x, "[^0-9]*$"))

But I want the second part.

Thanks in advance.

+6

regex r

Tyler Rinker May 24 '12 at 5:57


4 answers

You can do this in one simple step with a regex:

 gsub("(^.*\\d+)(\\w*)", "\\2", x)

Results in:

   [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut"
   [9] "Delaware"    "Florida"     "Georgia"

What the regex does:

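"(^.*\\d+)(\\w*)" : Find two groups of characters.
- The first group (^.*\\d+) matches any characters followed by at least one digit, anchored at the beginning of the string. Because .* is greedy, this group extends up to and including the last digit.
- The second group (\\w*) matches zero or more word (alphanumeric) characters after that digit.
"\\2" as the second argument to gsub() means replacing the matched string with the second group found by the regular expression. A string without any digit, such as "ID", does not match and is returned unchanged.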
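To see how the two groups divide each string, the captures can be inspected directly. This is only a minimal sketch using base R's regexec() and regmatches(); the variable names m and parts are illustrative and not part of the answer above.

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# regexec() records the full match and each parenthesised group
m <- regexec("(^.*\\d+)(\\w*)", x)
parts <- regmatches(x, m)

# "ID" contains no digit, so nothing matches and the element is empty
parts[[1]]

# For "X10Georgia": full match, then group 1 ("X10"), then group 2 ("Georgia")
parts[[11]]
```

Group 2 is what gsub() keeps when the replacement is "\\2".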

+4

Andrie May 24 '12 at 9:01


This seems a bit awkward, but it works:

 state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
 state.pt2[state.pt2!=""]

It would be nice to remove the "" generated by the match at the beginning of the line, but I can't figure it out.
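If the empty strings are the only obstacle, one possible alternative is to delete the prefix instead of splitting on it. The sketch below uses base sub() with the same pattern; it relies on the same assumption as the strsplit() version, namely that exactly one character precedes the digits.

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# Remove the leading character plus the digits that follow it;
# strings without a match (such as "ID") are returned unchanged
state.pt2 <- sub("^.[0-9]+", "", x)
state.pt2
```

Because nothing is split, no "" elements are produced and no subsetting is needed.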

Another method, using substr and gregexpr, which avoids having to subset the results:

 substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
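The one-liner above can be unpacked into its intermediate steps. This is a sketch of the same idea, with vapply() swapped in for unlist(lapply()); the names pos, last_digit and out are illustrative only.

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# For each string, the start positions of every digit (-1 if there is none)
pos <- gregexpr("[0-9]", x)

# Position of the last digit in each string; stays -1 where no digit exists
last_digit <- vapply(pos, max, integer(1))

# Start one character past the last digit. For "ID" the start is 0,
# which substr() treats like 1, so the whole string is returned.
out <- substr(x, last_digit + 1, nchar(x))
out
```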

+2

thelatemail May 24 '12 at 6:16


gsubfn

Try the gsubfn solution:

 > library(gsubfn)
 > strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
  [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"
  [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"
 [11] "Georgia"

The first alternative matches everything up to the last digit followed by word characters, and returns those word characters. If that fails, the second alternative matches the end of the string (to make sure that something always matches). If the first alternative succeeded, return its capture; otherwise the backreference will be empty, so return NA.

Note that the formula is a short way of writing function(z) if (nchar(z)) z else NA, and that function could be used in place of the formula at the cost of a few more keystrokes.

gsub

A similar strategy also works using plain gsub, but it requires two lines and a slightly more complex regex. Here the second alternative mops up the strings that the first alternative does not match:

 > s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
 > ifelse(nchar(s), s, NA)
  [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"
  [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"
 [11] "Georgia"
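If this is needed in more than one place, the two lines could be wrapped in a small helper. The sketch below assumes nothing beyond base R; after_last_digit is a hypothetical name chosen for illustration.

```r
# Hypothetical helper: text after the last digit, or NA if there is none
after_last_digit <- function(x) {
  # First alternative keeps the word characters after the last digit;
  # second alternative consumes strings with no digit, leaving ""
  s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
  ifelse(nchar(s) > 0, s, NA_character_)
}

x <- c("ID", paste0("X", 1:10, state.name[1:10]))
after_last_digit(x)
```

Note that a string ending in a digit, such as "X1", also yields NA here, because nothing follows its last digit.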

EDIT: minor improvements

+2

G. Grothendieck May 24 '12 at 12:12


mnel · Accepted Answer · 2012-05-24T06:08:55+0000

 library(stringr)
 unlist(lapply(str_split(x, "[0-9]"), tail,n=1))

gives

  [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut" "Delaware"
 [10] "Florida"     "Georgia"

I would look at the stringr documentation for a (most likely) even better approach.
