Extract template substrings from a text file in R

I want to use R to extract all the unique substrings from a text file that match the form "matrixname [rowname, column number]". I have had only limited success with grep and str_extract_all (stringr): they return the whole string, not just the matching substring. An attempt to strip the unwanted text with gsub also failed. Here is an example of the code I used.

#Read in file
txt<-read.table("Project_R_code.R")
#keep only the lines that contain this pattern
txt2<-grep("param\\[.*1\\]",txt$V1, value=TRUE)
#remove all text that does not match the above pattern
gsub("[^param\\[.*1\\]]","", txt2,perl=TRUE)

The second line works (but again, it returns whole lines rather than the matching substrings). However, the gsub call meant to remove the non-matching text keeps the lines and turns them into something like this:

[200] "[p.p]param[ama1]param[ama11]*[r1]param[ama1]...

and I have no idea why. I realize that trimming each line down to something more manageable is a tedious approach, but it is the only way I know to get at the patterns.

Ideally, I would like R to return a list of all the unique substrings in the text file that match my pattern, but I don't know which command does this. Any help is greatly appreciated.

1 answer

If you want to extract individual components, try str_match:

test <- c("aaa[name1,1]", "bbb[name2,3]", "ccc[name3,3]")
# \\[ and \\] match literal square brackets
stringr::str_match(test, "([a-zA-Z0-9_]+)\\[([a-zA-Z0-9_]+),.*?(\\d+)\\]")
##      [,1]           [,2]  [,3]    [,4]
## [1,] "aaa[name1,1]" "aaa" "name1" "1" 
## [2,] "bbb[name2,3]" "bbb" "name2" "3" 
## [3,] "ccc[name3,3]" "ccc" "name3" "3" 

Otherwise, if you only need the whole match, use str_extract.
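As a minimal sketch of the str_extract route (the input strings and the names pattern and whole are made up for illustration), the same pattern without capture groups returns just the matching substring of each element, or NA where nothing matches:

```r
library(stringr)

# Hypothetical input: the target text is embedded in longer strings
test <- c("y <- aaa[name1,1] * 2", "bbb[name2,3]", "no match here")

# Same pattern as in str_match, minus the capture groups;
# \\[ and \\] match literal square brackets
pattern <- "[a-zA-Z0-9_]+\\[[a-zA-Z0-9_]+,.*?\\d+\\]"

# One result per input element: the first match, or NA if there is none
whole <- str_extract(test, pattern)
whole
## [1] "aaa[name1,1]" "bbb[name2,3]" NA
```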

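Note that [ is special in ERE/TRE, so to match a literal [ you have to escape it or put it in a bracket expression, i.e. [[]. The ICU engine used by current versions of stringr rejects [[] as an unclosed bracket expression, so \\[ (as written in an R string) is the safer spelling.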
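Square brackets delimit a bracket expression (a character class), which also appears to explain the gsub output in the question: the pattern there is read as one negated class, so gsub deletes every character that is not listed in it instead of deleting text that fails to match "param[...1]". The sketch below uses base R only; the input line and the names cleaned and found are made up for illustration.

```r
# Base R only; the input line is a made-up example
line <- "x <- param[gamma1] * 2"

# [^...] is a negated bracket expression: this pattern matches any single
# character that is NOT one of p, a, r, m, [, ., *, 1 or ]
cleaned <- gsub("[^param\\[.*1\\]]", "", line, perl = TRUE)
cleaned
## [1] "param[amma1]*"

# Extracting the match directly, instead of deleting everything else
found <- regmatches(line, regexpr("param\\[.*?1\\]", line, perl = TRUE))
found
## [1] "param[gamma1]"
```

Only the characters named inside the class survive the gsub call, which matches the shape of the output shown in the question.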

Also, if a string may contain several matches, use str_match_all or str_extract_all.
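To get the list of unique substrings the question asks for, one possible sketch is to combine str_extract_all with unlist and unique. The lines below are hypothetical stand-ins for readLines("Project_R_code.R"), and the pattern is a slightly tightened variant (\\s* instead of .*? after the comma) so that one match cannot run across two bracketed terms on the same line; both choices are assumptions, not part of the original answer.

```r
library(stringr)

# Hypothetical lines standing in for readLines("Project_R_code.R")
lines <- c(
  "y <- param[alpha,1] * x + param[beta,1]",
  "z <- param[alpha,1] + other[gamma,2]",
  "# no match on this line"
)

# name, literal [, row name, comma, optional whitespace, digits, literal ]
pattern <- "[a-zA-Z0-9_]+\\[[a-zA-Z0-9_]+,\\s*\\d+\\]"

# str_extract_all returns a list with one character vector per line;
# unlist flattens it and unique drops the repeated matches
matches <- unique(unlist(str_extract_all(lines, pattern)))
matches
## [1] "param[alpha,1]" "param[beta,1]"  "other[gamma,2]"
```

str_match_all works the same way when the individual components are needed, returning one matrix of capture groups per input line.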

Source: https://habr.com/ru/post/1544152/

