How to split a data table into groups and use a subset of occults in columns?

I have a large dataset, 287046 x 18, that looks like this (only partial view):

tdf geneSymbol peaks 16 AK056486 Pol2_only 13 AK310751 no_peak 7 BC036251 no_peak 10 DQ575786 no_peak 4 DQ597235 no_peak 5 DQ599768 no_peak 11 DQ599872 no_peak 12 DQ599872 no_peak 2 FAM138F no_peak 15 FAM41C no_peak 34116 GAPDH both 283034 GAPDH Pol2_only 6 LOC100132062 no_peak 9 LOC100133331 no_peak 14 LOC100288069 both 8 M37726 no_peak 3 OR4F5 no_peak 17 SAMD11 both 18 SAMD11 both 19 SAMD11 both 20 SAMD11 both 21 SAMD11 both 22 SAMD11 both 23 SAMD11 both 24 SAMD11 both 25 SAMD11 both 1 WASH7P Pol2_only 

What I want to do is extract (1) geneSymbols, which are either "Pol2_only" or "both", and then; (2) just the Symbols genes, which are "Pol2_only", but not "both." For example, GAPDH will fulfill condition 1, but not 2.

I tried plyr with something like this (there is an additional condition there, please ignore it):

 ## grab genes with both peaks pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE) ## grab genes pol2 only peaks pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE) 

But it takes a lot of time and still returns the wrong answer. For example, the answer for 2:

 pol2.only.peaks geneSymbol peaks 1 AK056486 Pol2_only 2 GAPDH Pol2_only 3 WASH7P Pol2_only 

As you can see, GAPDH should not be there. My implementation in the data.table (which is much preferable and therefore preferable) also gives the same result:

 filem.dt <- as.data.table(tdf) setkey(filem.dt, "geneSymbol") test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]] test.dt geneSymbol peaks 1: AK056486 Pol2_only 2: GAPDH Pol2_only 3: WASH7P Pol2_only 

It seems that the problem is that the subset works on a consistent basis, whereas I need it to be applied to the subgroup of the Symbol gene as a whole.

Can a subset in a group help me? A data.table solution would be welcome because it is faster, but plyr (or even basic R) is fine. A solution that adds an extra column indicating the nature of the peak would be ideal. Here is what I mean:

 tdf geneSymbol peaks newCol 16 AK056486 Pol2_only Pol2_only 13 AK310751 no_peak no_peak 7 BC036251 no_peak no_peak 10 DQ575786 no_peak no_peak 4 DQ597235 no_peak no_peak 5 DQ599768 no_peak no_peak 11 DQ599872 no_peak no_peak 12 DQ599872 no_peak no_peak 2 FAM138F no_peak no_peak 15 FAM41C no_peak no_peak 34116 GAPDH both both 283034 GAPDH Pol2_only both 6 LOC100132062 no_peak no_peak 9 LOC100133331 no_peak no_peak 14 LOC100288069 both both 8 M37726 no_peak no_peak 3 OR4F5 no_peak no_peak 17 SAMD11 both both 18 SAMD11 both both 19 SAMD11 both both 20 SAMD11 both both 21 SAMD11 both both 22 SAMD11 both both 23 SAMD11 both both 24 SAMD11 both both 25 SAMD11 both both 1 WASH7P Pol2_only Pol2_only 

Check the GAPDH again, which is now both in 2 lines. Here are the data:

 dput(tdf) structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251", "DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F", "FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069", "M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak", "no_peak", "both", "no_peak", "no_peak", "both", "both", "both", "both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol", "peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L, 2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame") 

Thanks!

edit ** I found a workaround for the problem. The choice was made in turn. All that is needed is a hack, that is, that ALL values ​​are true in the returned logical vector. So here is what I did with the plyr function:

 ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE) geneSymbol peaks 1 AK056486 Pol2_only 2 WASH7P Pol2_only 

Pay attention to the use of all in combination with the conditions. Now the results are expected, i.e. Only "Pol2_only" (redundant warning). What remains to be done is the implementation in data.table, which I tried but could not. Any help?

I did not write the answer to my question in anticipation that someone comes up with a better solution in data.table.

+4
source share
2 answers

You do not need plyr for this.

 a <- tdf$geneSymbol[tdf$peaks %in% c("both", "Pol2_only")] b <- tdf$geneSymbol[tdf$peaks != "Pol2_only"] result <- setdiff(a, b) 

And to create a new column in your data frame:

 tdf$newcol <- with(tdf, ifelse(geneSymbol %in% result, "Pol2 only", ifelse(geneSymbol %in% a, "both", "no_peak"))) 
+1
source

As you requested a solution to data.table.

 # set the key to be "peaks TDF <- data.table(tdf, key = c('geneSymbol','peaks')) # use unique to get unique combinations, then for each geneSymbol get the first # match (we have keyed by peak soboth < Pol2_only < no_peak within each # geneSymbol ) # then exclude those with "peak == "no_peak") unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak'] # geneSymbol peaks # 1: AK056486 Pol2_only # 2: GAPDH both # 3: LOC100288069 both # 4: SAMD11 both # 5: WASH7P Pol2_only 
+4
source

Source: https://habr.com/ru/post/1494241/


All Articles