Find overlapping ranges between two interval data sets

I have one table of approx. 500,000 fragments with coordinates (start, end) and another table of 60,000 single coordinates, which I would like to match against those fragments. That is, for each record in the dtCoords table I need to find the records in the dtFrags table with the same chr and start <= coord <= end (and get type from that dtFrags record). Is R a good fit for this at all, or should I look at other languages?

Here is my example:

    require(data.table)
    dtFrags <- fread(
    "id,chr,start,end,type
     1,1,100,200,exon
     2,2,300,500,intron
     3,X,400,600,intron
     4,2,250,600,exon
    ")
    dtCoords <- fread(
    "id,chr,coord
     10,1,150
     20,2,300
     30,Y,500
    ")

In the end, I would like to have something like this:

    "idC,chr,coord,idF,type
      10,  1,  150,  1, exon
      20,  2,  300,  2, intron
      20,  2,  300,  4, exon
      30,  Y,  500, NA, NA
    "

I can simplify the task a bit by splitting the tables into subsets by chr, so that I only have to deal with the coordinates:

    setkey(dtCoords, 'chr')
    setkey(dtFrags, 'chr')
    for (chr in unique(dtCoords$chr)) {
        dtCoordsSub <- dtCoords[chr];
        dtFragsSub <- dtFrags[chr];
        dtCoordsSub[, {
            # ????
        }, by=id]
    }

but it is still not clear to me what to do inside the loop. I would be very grateful for any tips.

UPD: Just in case, I put my real tables in an archive here. After unpacking it into the working directory, the tables can be loaded with the following code:

    dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
    dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)
2 answers

In general, it is very convenient to use the IRanges Bioconductor package to solve interval problems. It does this efficiently by implementing an interval tree. GenomicRanges is another package built on top of IRanges, specifically for handling, well, "genomic ranges".

    require(GenomicRanges)
    gr1 = with(dtFrags, GRanges(Rle(factor(chr,
              levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
    gr2 = with(dtCoords, GRanges(Rle(factor(chr,
              levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
    olaps = findOverlaps(gr2, gr1)
    dtCoords[, grp := seq_len(nrow(dtCoords))]
    dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
    setkey(dtCoords, grp)
    setkey(dtFrags, grp)
    dtFrags[, list(grp, id, type)][dtCoords]

    #    grp id   type id.1 chr coord
    # 1:   1  1   exon   10   1   150
    # 2:   2  2 intron   20   2   300
    # 3:   2  4   exon   20   2   300
    # 4:   3 NA     NA   30   Y   500
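If installing Bioconductor packages is not an option, the same interval lookup can probably be done inside data.table itself. The following is only a sketch of a different technique, not the method from the answer above: it assumes data.table 1.9.4 or later, where `foverlaps()` is available, and it renames the `id` columns to `idC` and `idF` (names taken from the desired output in the question) so the result is unambiguous.

```r
library(data.table)

# Example data from the question, with id renamed to idF / idC (assumption).
dtFrags <- data.table(
  idF   = 1:4,
  chr   = c("1", "2", "X", "2"),
  start = c(100L, 300L, 400L, 250L),
  end   = c(200L, 500L, 600L, 600L),
  type  = c("exon", "intron", "intron", "exon")
)
dtCoords <- data.table(
  idC   = c(10L, 20L, 30L),
  chr   = c("1", "2", "Y"),
  coord = c(150L, 300L, 500L)
)

# foverlaps() expects intervals on both sides, so each single coordinate
# is treated as the zero-width interval [coord, coord].
dtCoords[, `:=`(start = coord, end = coord)]

# The lookup table must be keyed, with the two interval columns last.
setkey(dtFrags, chr, start, end)

# nomatch = NA keeps coordinates that fall into no fragment.
res <- foverlaps(dtCoords, dtFrags,
                 by.x = c("chr", "start", "end"),
                 type = "any", nomatch = NA)

# Keep only the columns asked for in the question.
out <- res[, .(idC, chr, coord, idF, type)]
print(out)
```

With `type = "any"` a coordinate matches every fragment whose closed interval contains it, so a coordinate lying in several fragments appears once per fragment.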

Does this work? You can use merge first and then subset:

    kk <- merge(dtFrags, dtCoords, by="chr", all.x=TRUE)
    kk
    #    chr id.x start end   type id.y coord
    # 1:   1    1   100 200   exon   10   150
    # 2:   2    2   300 500 intron   20   300
    # 3:   2    4   250 600   exon   20   300
    # 4:   X    3   400 600 intron   NA    NA

    kk[coord>=start & coord<=end]
    #    chr id.x start end type id.y coord
    # 1:   1    1   100 200 exon   10   150
    # 2:   2    4   250 600 exon   20   300
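Note that the question also asks to keep coordinates that hit no fragment (the chr Y row with NA). A possible variant of the merge-then-subset idea that keeps them is sketched below. It assumes the `id` columns are renamed to `idC` and `idF` as in the desired output, and it passes `allow.cartesian=TRUE`, which data.table's `merge()` may require once a chromosome has many rows on both sides. On the full tables the intermediate result can get very large, since every coordinate is paired with every fragment on the same chromosome.

```r
library(data.table)

# Example data from the question, with id renamed to idF / idC (assumption).
dtFrags <- data.table(
  idF   = 1:4,
  chr   = c("1", "2", "X", "2"),
  start = c(100L, 300L, 400L, 250L),
  end   = c(200L, 500L, 600L, 600L),
  type  = c("exon", "intron", "intron", "exon")
)
dtCoords <- data.table(
  idC   = c(10L, 20L, 30L),
  chr   = c("1", "2", "Y"),
  coord = c(150L, 300L, 500L)
)

# Step 1: pair every coordinate with every fragment on the same chromosome.
kk <- merge(dtCoords, dtFrags, by = "chr", all.x = TRUE,
            allow.cartesian = TRUE)

# Step 2: keep only the pairs where the coordinate lies inside the fragment.
# Rows with NA start/end (no fragment on that chromosome) are dropped here.
hits <- kk[coord >= start & coord <= end, .(idC, idF, type)]

# Step 3: join the hits back so that unmatched coordinates reappear with NA.
res <- merge(dtCoords, hits, by = "idC", all.x = TRUE,
             allow.cartesian = TRUE)
print(res)
```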

Source: https://habr.com/ru/post/957267/

