Include only outliers from each column in the data frame

I have a dataframe as follows:

    chr leftPos TBGGT 12_try 324Gtt AMN2
    1   24352   34    43     19     43
    1   53534   2     1      -1     -9
    2   34      -15   7      -9     -18
    3   3443    -100  -4     4      -9
    3   3445    -100  -1     6      -1
    3   3667    5     -5     9      5
    3   7882    -8    -9     1      3

I need to create a loop that:

a) Computes the upper and lower limit (UL and LL) for each column from the third column forward.
b) Includes only rows that go beyond UL and LL (Zoutliers).
c) Counts the number of rows in which the Zoutlier has the same direction (i.e., positive or negative) as the previous or next row for the same chr.
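Steps a) and b) can be sketched for a single column as follows. This is only an illustration: `outlier_flags` is a hypothetical helper name, the limits are assumed to be median ± alpha × IQR (as in the code further down), and `alpha = 1` is an arbitrary placeholder because its real value is not stated.

```r
# Hypothetical helper: flag values outside median +/- alpha * IQR.
# alpha is an assumed tuning parameter; its real value is not given.
outlier_flags <- function(x, alpha = 1) {
  centre <- median(x, na.rm = TRUE)
  spread <- IQR(x, na.rm = TRUE)
  upper <- centre + alpha * spread  # UL
  lower <- centre - alpha * spread  # LL
  # NA values are never treated as outliers
  !is.na(x) & (x > upper | x < lower)
}

# TBGGT column from the sample data
tbggt <- c(34, 2, -15, -100, -100, 5, -8)
flags <- outlier_flags(tbggt, alpha = 1)

# Rows that survive step b) for this column
which(flags)
```

Returning a logical vector rather than row indices keeps the flags aligned with the original rows, which makes it easier to pair them with `chr` in step c).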

Thus, the output will be as follows:

    ZScore1 TBGGT 12_try 324Gtt AMN2
    nrow    4     6      4      4

So far I have the code as follows:

    library(data.table) # v1.9.5

    f1 <- function(df, ZCol){
      # A) Determine the UL and LL and then generate the Zoutliers
      UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
      LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
      Zoutliers <- which(ZCol > UL | ZCol < LL)
      # B) Exclude Zoutliers per chr if same direction as previous or subsequent row
      na.omit(as.data.table(df)[, {
        tmp = sign(eval(as.name(ZCol)))
        .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]
      }, by=chr])[, list(.N)]
    }

    nm1 <- paste0(names(df)
    setnames(do.call(cbind,lapply(nm1, function(x) f1(df, x))), nm1)[]

The code is pieced together from different sources. My problem is combining parts A) and B) of the code to get the result I want.

1 answer

Can you try this function? I was not sure about the value of alpha, so I could not reproduce the expected output; I included it as an argument of the function instead.

    # read your data per copy&paste
    d <- read.table("clipboard", header = T)
    # or, as Frank mentioned in a comment, via fread
    d <- data.table::fread("chr leftPos TBGGT 12_try 324Gtt AMN2
    1 24352 34 43 19 43
    1 53534 2 1 -1 -9
    2 34 -15 7 -9 -18
    3 3443 -100 -4 4 -9
    3 3445 -100 -1 6 -1
    3 3667 5 -5 9 5
    3 7882 -8 -9 1 3")

    # set up the function
    foo <- function(x, alpha, chr){
      # your code for task a) and b)
      UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
      LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
      Zoutliers <- which(x > UL | x < LL)
      # part c)
      # factor which specifies the direction. 0 values are set as positives
      pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
      # count the occurrence per chromosome and direction.
      aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
    }

    # apply over the columns and get a list of dataframes with number of outliers per chr and direction.
    apply(d[,3:ncol(d)], 2, foo, 0.95, d$chr)
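If part c) is read literally, i.e. counting outlier rows whose sign matches the neighbouring outlier row on the same chr, a base R variant might look like the sketch below. This is one possible interpretation, not a confirmed solution: `count_same_direction` is a hypothetical name, rows are assumed to be sorted by chr and leftPos already, "previous or next line" is assumed to mean the neighbouring outlier row (after non-outliers are dropped), and `alpha = 0.95` is just the placeholder value used above. It is not expected to reproduce the counts in the question, since the real alpha is unknown.

```r
# Hypothetical helper: among the outlier rows of one column, count those whose
# sign equals that of the previous or next outlier row on the same chromosome.
# Assumes x and chr are already ordered by chromosome and position.
count_same_direction <- function(x, chr, alpha) {
  centre <- median(x, na.rm = TRUE)
  spread <- IQR(x, na.rm = TRUE)
  is_out <- !is.na(x) &
    (x > centre + alpha * spread | x < centre - alpha * spread)

  s <- sign(x[is_out])  # direction of each outlier
  g <- chr[is_out]      # chromosome of each outlier
  n <- length(s)
  if (n < 2) return(0L)

  # TRUE where an outlier and the one after it agree in sign and chromosome
  same_as_next <- s[-n] == s[-1] & g[-n] == g[-1]
  # an outlier counts if it matches its next or its previous neighbour
  sum(c(same_as_next, FALSE) | c(FALSE, same_as_next))
}

# Two columns of the sample data, rebuilt by hand to keep the sketch self-contained
d2 <- data.frame(
  chr   = c(1, 1, 2, 3, 3, 3, 3),
  TBGGT = c(34, 2, -15, -100, -100, 5, -8),
  AMN2  = c(43, -9, -18, -9, -1, 5, 3)
)

res <- sapply(d2[c("TBGGT", "AMN2")], count_same_direction,
              chr = d2$chr, alpha = 0.95)
res
```

With these inputs, TBGGT has two neighbouring negative outliers on chr 3, while the two AMN2 outliers sit on different chromosomes and are therefore not counted.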

Source: https://habr.com/ru/post/985169/

