Extracting substring characters from a column in a data.table in R

Question


Is there a more idiomatic "R" way to extract two significant characters from a longer string in a column of a data.table?

I have a data.table with a column of "degree" strings: an abbreviated code for the degree someone received, followed by the year they finished.

> srcDT<- data.table(
    alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
    degree=c("W72","WG95","W88")
    )

> srcDT
               alum degree
1:      Paul Lennon    W72
2:  Stevadora Nicks   WG95
3:     Fred Murcury    W88

I need to extract the year digits from the degree and put them in a new column called "degree_year".

No problem:

> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]

> srcDT
                alum degree degree_year
 1:      Paul Lennon    W72          72
 2:  Stevadora Nicks   WG95          95
 3:     Fred Murcury    W88          88

If only it were always that simple. The problem is that the degree strings only sometimes look like the ones above. More often they look like this:

srcDT<- data.table(
  alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
  degree=c("W72 C73","WG95 L95","W88 WG90")
)

I am only interested in the two digits next to the codes I care about: W and WG (and if both W and WG are present, I only care about WG).

Here's how I solved it:

x <-srcDT$degree                     ##grab just the degree column
z <-character()                       ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
                                     ## define a vector of regex's, in the order
                                     ## I want them

for(i in 1:length(x)){               ## loop thru all elements in degree column
  matched=F                          ## at the start of the loop, reset flag to F
  for(j in 1:length(degree.grep.pattern)){
                                     ## loop thru all elements of the pattern vector

    if(length(grep(degree.grep.pattern[j],x[i]))>0){
                                     ## see if you get a match

      m <- regexpr(degree.grep.pattern[j],x[i])
                                     ## if you do, great! grab the index of the match
      y<-regmatches(x[i],m)          ## then subset down.  y will equal "WG95"
      matched=T                      ## set the flag to T
      break                          ## stop looping
    }
                                     ## if no match, go on to next element in pattern vector
  }

  if(matched){                       ## after finishing the loop, check if you got a match
    yr <- substr(y,nchar(y)-1,nchar(y))
                                     ## if yes, then grab the last 2 characters of it
  }else{
    ## if you run thru the whole list and don't match any pattern at all, just
    ## take the last two characters from the degree string
    yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
  }
  z<-c(z,yr)                         ## add this result (95) to the character vector
}
srcDT$degree_year<-z                ## set the column to the results.

> srcDT
             alum   degree degree_year
1: Ringo Harrison  W72 C73          72
2:   Brian Wilson WG95 L95          95
3:   Mike Jackson W88 WG90          90

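It works 100% of the time. The problem is speed: with 10k or 100k rows, the loop becomes slow.

Is there a better way? I come from a "C" background and am new to "R."

Suggestions?

Edit: the example is simplified. In the real data, degree.grep.pattern has 7 or 8 elements rather than the 2 shown here.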


regex r data.table

Ben Adams · asked Jan 26 '16 at 0:02


David Arenburg · Answer 1 · 2016-01-26T14:27:30+0000

Going by the (OP's) description that only "WG" and "W" matter, a single regex may be enough:

srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
#              alum   degree degree_year
# 1: Ringo Harrison  W72 C73          72
# 2:   Brian Wilson WG95 L95          95
# 3:   Mike Jackson W88 WG90          90
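
One caveat worth spelling out: the greedy `.*` makes this pattern return the digits after the last W or WG code in the string, so it relies on WG appearing after W. The sketch below contrasts it with an ordered alternation that tries WG first wherever it sits. It assumes codes are separated by spaces and are followed by exactly two digits; the fourth sample value is made up for illustration and does not come from the question.

```r
degree <- c("W72 C73", "WG95 L95", "W88 WG90", "WG90 W88")  # last value is hypothetical

# Greedy pattern from the answer: the last W/WG code in the string wins
greedy <- gsub(".*WG?(\\d+).*", "\\1", degree)

# Ordered alternation: try WG anywhere in the string first, and only then W
ordered <- sub("^(?:.*\\bWG|.*\\bW)([0-9]{2})\\b.*$", "\\1", degree, perl = TRUE)

greedy   # "72" "95" "90" "88"
ordered  # "72" "95" "90" "90"
```

Strings that contain neither code are returned unchanged by `sub()`, so they would still need the question's last-two-characters fallback.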

MichaelChirico · Answer 2 · 2016-01-26T13:13:29+0000

Using a lookbehind to grab the two digits that follow W or WG, then keeping the largest:

regex <- "(?<=W|(?<=W)G)[0-9]{2}"

srcDT[ , degree_year := 
         sapply(regmatches(degree, 
                           gregexpr(regex, degree, perl = TRUE)),
                function(x) max(as.integer(x)))]

> srcDT
             alum   degree degree_year
1: Ringo Harrison  W72 C73          72
2:   Brian Wilson WG95 L95          95
3:   Mike Jackson W88 WG90          90

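Regarding the edit:

degree.grep.pattern has 2 elements in the example. In reality, it has 7 or 8.

If so, what are the other codes besides W and WG?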
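
If the real list does hold 7 or 8 codes, one option is to keep the question's idea of an ordered pattern vector but loop over the handful of codes instead of over the rows, so each pass is vectorized. This is only a sketch: `priority` holds just the two codes known from the question, the tokens are assumed to be space-separated with two-digit years, and the names `priority`, `yr`, `hit` and `use` are illustrative.

```r
library(data.table)

srcDT <- data.table(
  alum   = c("Ringo Harrison", "Brian Wilson", "Mike Jackson"),
  degree = c("W72 C73", "WG95 L95", "W88 WG90")
)

priority <- c("WG", "W")   # most preferred first; extend with the real codes

x  <- srcDT$degree
yr <- rep(NA_character_, length(x))

for (code in priority) {                      # loops over codes, not rows
  m     <- regexpr(paste0("(?<=\\b", code, ")[0-9]{2}\\b"), x, perl = TRUE)
  start <- as.integer(m)                      # -1 where the code is absent
  hit   <- substring(x, start, start + attr(m, "match.length") - 1L)
  use   <- start != -1L & is.na(yr)           # matched here and not yet filled
  yr[use] <- hit[use]
}

# Fallback from the question: last two characters when no code matched
none <- is.na(yr)
yr[none] <- substring(x[none], nchar(x[none]) - 1L)

srcDT[, degree_year := yr]
srcDT
```

Unlike the `max()` approach, this returns the year of the highest-priority code even when a lower-priority degree was awarded later.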

Karolis Koncevičius · Answer 3 · 2016-01-26T00:35:23+0000

Another approach:

# split all words from degree and order so that WG is before W
words <- lapply(strsplit(srcDT$degree, " "), sort, decreasing=TRUE)

# obtain tags for each row (getting only first. But works since ordered)
tags <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)

# simple gsub to remove WG and W
(result <- gsub("^WG|^W", "", tags))
# [1] "72" "95" "90"

This should cope with 100k rows.

Moody_Mudskipper · Answer 4 · 2016-01-26T01:07:01+0000

First split each degree string and store the years in one column per degree code:

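degreeyear_split <- strsplit(srcDT$degree, " ")   # list of tokens per row
for (i in seq_len(nrow(srcDT))) {
  for (degree_year in degreeyear_split[[i]]) {
    n <- nchar(degree_year)
    degree <- substr(degree_year, 1, n - 2)        # degree code, e.g. "WG"
    year <- substr(degree_year, n - 1, n)          # two-digit year, e.g. "95"
    # create the column for this code, filled with blanks, the first time it is seen
    if (!degree %in% names(srcDT)) srcDT[, (degree) := ""]
    set(srcDT, i = i, j = degree, value = year)
  }}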

Then take W as the year, and overwrite it with WG wherever WG is present.

srcDT$year <- srcDT$W
srcDT$year[srcDT$WG!=""]<-srcDT$WG[srcDT$WG!=""]

You then get:

srcDT
             alum   degree  W  C WG  L year
1: Ringo Harrison  W72 C73 72 73         72
2:   Brian Wilson WG95 L95       95 95   95
3:   Mike Jackson W88 WG90 88    90      90
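
The same per-code layout can probably be reached without a row loop by going through a long format and casting it wide. A minimal sketch, assuming space-separated tokens that each end in a two-digit year and a data.table version that provides `fifelse()`; the names `id`, `long`, `wide`, `token`, `code` and `yr` are illustrative, and missing codes come out as NA rather than blanks:

```r
library(data.table)

srcDT <- data.table(
  alum   = c("Ringo Harrison", "Brian Wilson", "Mike Jackson"),
  degree = c("W72 C73", "WG95 L95", "W88 WG90")
)

srcDT[, id := .I]                                   # row key for the join back

# One row per degree token
long <- srcDT[, .(token = unlist(strsplit(degree, " ", fixed = TRUE))), by = id]
long[, code := sub("[0-9]{2}$", "", token)]         # e.g. "WG"
long[, yr   := substr(token, nchar(token) - 1L, nchar(token))]  # e.g. "95"

# One column per code, NA where a row lacks that code
wide <- dcast(long, id ~ code, value.var = "yr")
wide[, year := fifelse(is.na(WG), W, WG)]           # prefer WG, fall back to W

srcDT[wide, year := i.year, on = "id"]              # bring the result back
srcDT
```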
