Extract numbers between brackets inside a string

Possible duplicate:
Extract information in all brackets in R (regex)

I imported data from Excel, and one column consists of long strings that contain both numbers and letters. Is there a way to extract only the number in brackets from each entry and save it in a new variable? Unfortunately, some of the entries have two sets of brackets, and I want only the second. Can I use grep for this?

The strings look something like this (their length varies):

"East Kootenay C (5901035) RDA 01011" 

or like this:

 "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020" 

All I want from this is 5901035 and 5933039

Any hints and help would be greatly appreciated.

2 answers

There are many possible regular expressions for this. Here is one of them:

 > x = c("East Kootenay C (5901035) RDA 01011",
 +       "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")
 > gsub('.+\\(([0-9]+)\\).+?$', '\\1', x)
 [1] "5901035" "5933039"

Let's break down the syntax of the pattern '.+\\(([0-9]+)\\).+?$':

  • .+ one or more of any character.
  • \\( parentheses are special characters in regular expressions, so to match a literal ( I need to escape it with \ . I then have to escape the backslash itself for R (hence two \ s).

  • ([0-9]+) I mentioned special characters; here I use two kinds. The first is the parentheses, which mark the group that I want to keep. The second, [ and ], encloses a character class, here the digits 0 to 9, and the + after it means one or more of them. See ?regex for more information.

  • .+?$ The last part matches the rest of the string up to its end ($), so the whole string gets replaced. Because the leading .+ is greedy, the pattern grabs the LAST set of numbers in parens, as noted in the comments.

I could also use * instead of + , which means zero or more rather than one or more, in case the parenthesized number appears at the very beginning or end of the string.
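To make the difference between + and * concrete, here is a small sketch. The input string is made up for illustration (it is not from the question): the parenthesized number sits at the very start, so there is nothing for a leading .+ to match.

```r
# Hypothetical edge case: the parenthesized number opens the string
edge <- "(5901035) RDA 01011"

# With + the pattern needs at least one character before the "(",
# so nothing matches and gsub() returns the input unchanged
plus_result <- gsub(".+\\(([0-9]+)\\).+?$", "\\1", edge)

# With * the surrounding parts may be empty, so the number is extracted
star_result <- gsub(".*\\(([0-9]+)\\).*", "\\1", edge)

print(plus_result)  # "(5901035) RDA 01011"
print(star_result)  # "5901035"
```

Note that when a pattern does not match, gsub() silently hands back the original string, so it is worth checking the result for leftover non-digit characters.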

The second argument to gsub is what I replace the matched text with. I used \\1 , which means: use group 1 (the material inside the ( ) above). Again the backslash has to be doubled, because R needs it escaped before the regular expression sees it.
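Since the question asks about saving the result in a new variable, a minimal sketch of that step might look like the following. The data frame df and its column names region and code are assumptions standing in for the imported Excel data; only the pattern comes from the answer above.

```r
# Hypothetical data frame standing in for the imported Excel sheet
df <- data.frame(
  region = c("East Kootenay C (5901035) RDA 01011",
             "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020"),
  stringsAsFactors = FALSE
)

# Same pattern as above; sub() is enough because the pattern
# consumes the whole string in a single match
code_chr <- sub(".+\\(([0-9]+)\\).+?$", "\\1", df$region)

# Store the extracted digits as a numeric column
df$code <- as.numeric(code_chr)

print(df$code)
```

If the codes can start with a zero, keeping code_chr as character instead of converting to numeric avoids losing the leading digit.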

Clear as mud, to be sure! Enjoy your data project!


Here is the gsubfn solution:

 library(gsubfn)
 strapplyc(x, "[(](\\d+)[)]", simplify = TRUE)

[(] matches an open paren, (\\d+) matches a string of digits and creates a back-reference because of the parentheses around it, and finally [)] matches a close paren. The back-reference is what gets returned.
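If installing gsubfn is not an option, roughly the same idea can be sketched in base R with gregexpr() and regmatches(). This is an alternative I am swapping in, not part of the answer above; the helper last_code is a hypothetical name, and it returns NA when a string contains no parenthesized number.

```r
x <- c("East Kootenay C (5901035) RDA 01011",
       "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")

# Find every "(digits)" occurrence in each string
hits <- regmatches(x, gregexpr("[(][0-9]+[)]", x))

# Hypothetical helper: keep the last occurrence and strip the parens
last_code <- function(h) {
  if (length(h) == 0) return(NA_character_)
  gsub("[()]", "", h[length(h)])
}

codes <- vapply(hits, last_code, character(1), USE.NAMES = FALSE)
print(codes)  # "5901035" "5933039"
```

Taking the last occurrence explicitly mirrors the requirement in the question that only the second set of brackets is wanted when there are two.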


Source: https://habr.com/ru/post/1437884/

