Extract text in parentheses in R

Two related questions. I have vectors of text data like

"a(b)jk(p)" "ipq" "e(ijkl)" 

and I want to easily split them into a vector containing the text OUTSIDE the parentheses:

 "ajk" "ipq" "e" 

and a vector containing the text INSIDE the parentheses:

 "bp" "" "ijkl" 

Is there an easy way to do this? An added complication is that the strings can get quite long and contain an arbitrary (unlimited) number of parenthesized groups. So I cannot just grab the text before and after a single pair of parentheses; I need a smarter solution.

2 answers

Text outside the parentheses:

 > x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
 > gsub("\\([^()]*\\)", "", x)
 [1] "ajk" "ipq" "e"

Text inside the parentheses:

 > x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
 > gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=TRUE)
 [1] "bp" "" "ijkl"

(?<=\\()[^()]*(?=\\)) matches the characters inside a pair of parentheses, and (*SKIP)(*F) then forces that match to fail while preventing the engine from retrying at any position inside the skipped text. The engine next tries the pattern after the | against the remaining characters, so the dot . matches every character that was not skipped. Replacing all of those matched characters with an empty string leaves only the text inside the parentheses.

Or, using a capture group:

 > gsub("\\(([^()]*)\\)|.", "\\1", x, perl=TRUE)
 [1] "bp" "" "ijkl"

This regular expression captures the characters inside each pair of parentheses in group 1, and the |. alternative matches every other character one at a time. Each match is replaced with the contents of group 1 (\\1), which is empty whenever the . alternative matched, so only the text inside the parentheses remains.
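The two gsub calls above can be wrapped into one small helper. This is only a sketch: the name split_parens is hypothetical (it does not come from any package), and it assumes the parentheses are balanced and not nested.

```r
# Hypothetical helper; the name is illustrative, not from any package.
# Assumes balanced, non-nested parentheses.
split_parens <- function(x) {
  list(
    # Delete every "(...)" group, keeping the text outside.
    outside = gsub("\\([^()]*\\)", "", x),
    # Replace each "(...)" group with its contents (group 1);
    # every other character is replaced with an empty group.
    inside = gsub("\\(([^()]*)\\)|.", "\\1", x, perl = TRUE)
  )
}

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
res <- split_parens(x)
res$outside
res$inside
```

Both results are ordinary character vectors of the same length as x, so they can go straight into a data frame.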


The rm_round function in the qdapRegex package, which I maintain, was born to do this.

First, install and load the package via pacman:

 if (!require("pacman")) install.packages("pacman")
 pacman::p_load(qdapRegex)

Then we can use it to remove or extract the desired parts:

 x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

 rm_round(x)
 ## [1] "ajk" "ipq" "e"

 rm_round(x, extract=TRUE)
 ## [[1]]
 ## [1] "b" "p"
 ##
 ## [[2]]
 ## [1] NA
 ##
 ## [[3]]
 ## [1] "ijkl"

To collapse b and p into a single string, use:

 sapply(rm_round(x, extract=TRUE), paste, collapse="")
 ## [1] "bp" "NA" "ijkl"
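
Note that the second element comes back as the string "NA", because paste converts the missing value to text, whereas the question asked for an empty string there. As a rough base R sketch (no extra packages, and again assuming non-nested parentheses), regmatches with gregexpr yields "" for strings without parentheses:

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Find every "(...)" group in each string; strings with no match
# yield character(0).
groups <- regmatches(x, gregexpr("\\([^()]*\\)", x))

# Strip the surrounding parentheses from each group and paste the
# pieces together; pasting character(0) with collapse gives "".
inside <- vapply(
  groups,
  function(g) paste(gsub("^\\(|\\)$", "", g), collapse = ""),
  character(1)
)
inside
```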

Source: https://habr.com/ru/post/983560/

