Pattern matching in the context of a data frame

Question


I have a data frame whose first 5 lines look like this:

    Sample CCT6       GAT1                  IMD3         PDR3         RIM15
    001    0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
    002    1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
    003    0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
    004    000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
    005    0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000

The complete data set contains 2000 samples. I am trying to write code that determines, for every sample, whether the string in each of the 5 columns is uniform (i.e., all 1s or all 0s). Ideally, I would also like to distinguish between 1 and 0 in cases where the answer is TRUE. For my example, the expected results are:

    Sample CCT6     GAT1     IMD3     PDR3     RIM15
    001    TRUE (0) TRUE (1) FALSE    FALSE    FALSE
    002    TRUE (1) FALSE    TRUE (0) FALSE    TRUE (0)
    003    FALSE    TRUE (0) FALSE    TRUE (0) TRUE (1)
    004    FALSE    FALSE    FALSE    TRUE (1) FALSE
    005    FALSE    TRUE (1) TRUE (1) FALSE    TRUE (0)

I am not wedded to using logical values; symbols would be fine as long as they distinguish between the different classes. Ideally, I would like to return the results in a similar data frame.

I am having problems with the very basic first step here, which is to get R to tell me whether a string consists of a single repeated value. I've tried various expressions using grep and regexpr, but could not get a result that I could then apply to the whole data frame using ddply or something similar. Here are some examples of what I tried for this step:

    a = as.character("111111111111")
    b = as.character("000000000000")
    c = as.character("000000011110")

    > grep("1",a)
    [1] 1
    > grep("1",c)
    [1] 1
    > regexpr("1",a)
    [1] 1
    attr(,"match.length")
    [1] 1
    > regexpr("1",c)
    [1] 8
    attr(,"match.length")
    [1] 1

I'd really appreciate any help to get me started with this problem, or help with the bigger goal.

+6

r pattern-matching dataframe

Sam globus Oct 27 '11 at 3:50


3 answers

Here is a regular expression that matches a string of one or more characters consisting entirely of zeros or entirely of ones:

 (^[0]+$)|(^[1]+$)

The following will match: 0000, 0, 111111, 11, 1

This will not match: 000001
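As a minimal sketch of how this pattern might be used in R: `grepl()` returns one logical value per string, and for the strings that match, the first character says which digit it is. The vector `x` and the names `uniform` and `digit` are made up for illustration and do not come from the question.

```r
# Hypothetical test strings mirroring the examples above
x <- c("0000", "0", "111111", "000001", "0N100111NNNN")

# TRUE where the whole string is all zeros or all ones
uniform <- grepl("(^[0]+$)|(^[1]+$)", x)
print(uniform)

# For uniform strings, the first character identifies the repeated digit;
# non-uniform strings get NA
digit <- ifelse(uniform, substr(x, 1, 1), NA_character_)
print(digit)
```

Because `grepl()` is vectorised, the same call could be applied to each column of a data frame with `lapply()`, provided the columns are stored as character.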

+5

Dan Oct 27 '11 at 4:12


One possible approach would be to use strsplit and unique:

    > unique(unlist(strsplit("111111111122","")))
    [1] "1" "2"

and then check whether the result has length one, and whether it is "1" or "0".
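The check described above could be sketched roughly as follows. The helper name `classify` and the test vector `x` are hypothetical, and the output labels simply copy the format asked for in the question.

```r
# Hypothetical helper: returns "TRUE (0)", "TRUE (1)", or "FALSE" for one string
classify <- function(s) {
  # Split into single characters and keep the distinct ones
  chars <- unique(strsplit(s, "")[[1]])
  if (length(chars) == 1 && chars %in% c("0", "1")) {
    paste0("TRUE (", chars, ")")
  } else {
    "FALSE"
  }
}

x <- c("111111111111", "000000000000", "000000011110", "111111111122")
result <- vapply(x, classify, character(1), USE.NAMES = FALSE)
print(result)
```

Since `classify` handles one string at a time, it has to be wrapped in `vapply()` (or similar) to process a whole column.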

+2

joran Oct 27 '11 at 4:06


Josh o'brien · Accepted Answer · 2011-10-27T08:42:05+0000

Here is a complete solution. Probably overkill, but also fun.

The key bit is the markTRUE function. It uses a backreference ( \\1 ) to refer to the substring (either 0 or 1 ) that was previously matched by the first parenthesized subexpression.

The regular expression "^(0|1)(\\1)+$" says: "Match any string that begins with either 0 or 1 and is then followed (to the end of the string) by 1 or more repetitions of the same character, whatever it was." Later in the same gsub() call, I use the same backreference in the replacement to produce "TRUE (0)" or "TRUE (1)", as appropriate.
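To see the backreference at work in isolation, here is a small sketch on a hypothetical vector `x` (not part of the original answer). Strings that do not match are returned unchanged by `gsub()`. A one-character string such as "0" is also left alone, because the `+` demands at least one repetition after the first character.

```r
# Hypothetical inputs: two uniform strings, two mixed ones, and a single character
x <- c("0000000000", "1111111111", "0NNNN00000", "0111100000", "0")

# \\1 in the pattern repeats whatever (0|1) matched;
# \\1 in the replacement inserts that same digit into the label
out <- gsub("^(0|1)(\\1)+$", "TRUE (\\1)", x)
print(out)
```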

First read the data:

    # colClasses = "character" keeps every column as text. Without it, an
    # all-digit column such as IMD3 is read as numeric, so 000000000000
    # becomes 0 and no longer matches the pattern below.
    dat <- read.table(textConnection("
    Sample CCT6       GAT1                  IMD3         PDR3         RIM15
    001    0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
    002    1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
    003    0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
    004    000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
    005    0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000"),
        header = TRUE, colClasses = "character")

Then apply the regular expressions:

    # Replace uniform strings with "TRUE (0)" or "TRUE (1)"; others are left unchanged
    markTRUE <- function(X) {
        gsub(X, pattern = "^(0|1)(\\1)+$", replacement = "TRUE (\\1)")
    }
    # Anything that was not marked TRUE becomes "FALSE"
    markFALSE <- function(X) {
        X[!grepl("TRUE", X)] <- "FALSE"
        return(X)
    }

    # Apply to every column except Sample
    dat[-1] <- lapply(dat[-1], markTRUE)
    dat[-1] <- lapply(dat[-1], markFALSE)
    dat
    #   Sample     CCT6     GAT1     IMD3     PDR3    RIM15
    # 1    001 TRUE (0) TRUE (1)    FALSE    FALSE    FALSE
    # 2    002 TRUE (1)    FALSE TRUE (0)    FALSE TRUE (0)
    # 3    003    FALSE TRUE (0)    FALSE TRUE (0) TRUE (1)
    # 4    004    FALSE    FALSE    FALSE TRUE (1)    FALSE
    # 5    005    FALSE TRUE (1) TRUE (1)    FALSE TRUE (0)
