Coverage of larger genomic intervals by smaller genomic intervals in R

I want to intersect two sets of genomic intervals in R and get coverage statistics for the smaller intervals over the larger ones.

The large intervals are in a data frame like this:

Chr  Start     End       Name         Val    Strand
chr7 145444998 146102295 CCDS5889.1   0      +
chr7 146102406 146167735 CCDS5889.1   0      +
chr7 146167929 146371931 CCDS5889.1   0      +

The small intervals, in a data frame with more than 2 million rows, look like this:

Chr  Start     End       Name         Val    Strand PhyloP   
chr7 145444386 145444387 CCDS5889.1   0      +      0.684764
chr7 145444387 145444388 CCDS5889.1   0      +      0.684764
chr7 145444388 145444389 CCDS5889.1   0      +      0.684764
chr7 145444389 145444390 CCDS5889.1   0      +      0.684764

The interval coordinates are in the second (Start) and third (End) columns of both data frames.

The situation looks like this:

Large Interval:    [-----]   [-----]     [--------------]   [-------------------]
Small Interval: |||  ||||  |||||||||||  ||||||||   ||||  || |||||||||   ||    ||||||||
  • I want to know how much of each larger interval is covered by smaller intervals.
  • In addition, I would like to associate the PhyloP values of the intersecting small intervals with each large interval for later use.
1 answer

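The Bioconductor packages IRanges and GenomicRanges are designed for this, with functions such as findOverlaps, countOverlaps, and coverage. Represent the large intervals as a GRanges subject and the small intervals as a GRanges query. For an introduction, see browseVignettes("GenomicRanges").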
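As a minimal sketch of what countOverlaps and findOverlaps return, here is a toy example. The coordinates and the names large, small, n_hits, and pairs are invented for illustration and are not taken from the question's data.

```r
library(GenomicRanges)

# Toy coordinates, invented for illustration only
large <- GRanges("chr1", IRanges(c(1, 101), c(50, 150)))
small <- GRanges("chr1", IRanges(c(5, 20, 120, 300), width = 10))

# Number of small intervals overlapping each large interval
n_hits <- countOverlaps(large, small)

# Pairwise hits: which small interval (query) overlaps which large one (subject)
hits <- findOverlaps(small, large)
pairs <- data.frame(small = queryHits(hits), large = subjectHits(hits))
```

The fourth small interval overlaps nothing, so it does not appear in pairs.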

First, read in the example data:

sdf <- read.table(textConnection(
"Chr  Start     End       Name         Val    Strand
chr7 145444998 146102295 CCDS5889.1   0      +
chr7 146102406 146167735 CCDS5889.1   0      +
chr7 146167929 146371931 CCDS5889.1   0      +"), header=TRUE)

qdf <- read.table(textConnection(
"Chr  Start     End       Name         Val    Strand PhyloP   
chr7 145444386 145444387 CCDS5889.1   0      +      0.684764
chr7 145444387 145444388 CCDS5889.1   0      +      0.684764
chr7 145444388 145444389 CCDS5889.1   0      +      0.684764
chr7 145444389 145444390 CCDS5889.1   0      +      0.684764"), header=TRUE)

Then create GRanges objects from the data frames and intersect them:

library(GenomicRanges)
subj <-
    with(sdf, GRanges(Chr, IRanges(Start, End), Strand, Val=Val))
query <-
    with(qdf, GRanges(Chr, IRanges(Start, End), Strand, Val=Val,
                      PhyloP=PhyloP, names=Name))
intersect(query, subj)

> intersect(query, subj)
GRanges with 0 ranges and 0 elementMetadata values
     seqnames ranges strand |

seqlengths
 chr7
   NA

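The example query ranges all lie before the first subject range, so the intersection is empty. For a more interesting example, create a synthetic query of regularly spaced tiles: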

tiles <- successiveIRanges(rep(100, 950), 900, 145444998)
query <- GRanges("chr7", tiles, "+")
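To see what the tiling call above produces, here is a small sketch with three tiles instead of 950. The name small_tiles and the starting position 1 are chosen only for illustration.

```r
library(IRanges)

# Three tiles of width 100, separated by gaps of 900, starting at position 1
small_tiles <- successiveIRanges(rep(100, 3), gapwidth = 900, from = 1)

start(small_tiles)  # 1 1001 2001
end(small_tiles)    # 100 1100 2100
```

So the synthetic query covers 100 bases out of every 1000, beginning at the start of the first subject range.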

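Then intersect the query with the subject, map each intersected range back to its subject range, and sum the widths per subject:

# Clip the query tiles to the subject ranges
int <- intersect(query, subj)
# Map each clipped range to its subject, then sum the widths per subject
hits <- findOverlaps(int, subj)
tapply(width(int)[queryHits(hits)], subjectHits(hits), sum)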

    1     2     3 
65800  6500 20400 
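The two things the question asks for, the covered fraction of each large interval and the PhyloP values associated with it, could be sketched along the following lines. This is only an illustration: the PhyloP scores are random numbers made up here, and the names covered, fraction, and meanPhyloP are hypothetical. Taking the mean is just one possible summary.

```r
library(GenomicRanges)

# Subject ("large") intervals from the question
subj <- GRanges("chr7",
                IRanges(c(145444998, 146102406, 146167929),
                        c(146102295, 146167735, 146371931)),
                strand = "+")

# Synthetic query tiles, with made-up PhyloP scores (assumption, not real data)
tiles <- successiveIRanges(rep(100, 950), 900, 145444998)
query <- GRanges("chr7", tiles, strand = "+")
set.seed(1)
query$PhyloP <- runif(length(query))

# Covered bases per subject; subjects with no overlap keep a count of 0
int <- intersect(query, subj)
hits <- findOverlaps(int, subj)
sums <- tapply(width(int)[queryHits(hits)], subjectHits(hits), sum)
covered <- numeric(length(subj))
covered[as.integer(names(sums))] <- sums

# Fraction of each large interval covered by small intervals
fraction <- covered / width(subj)

# Mean PhyloP of the query ranges overlapping each subject
ov <- findOverlaps(query, subj)
meanPhyloP <- tapply(query$PhyloP[queryHits(ov)], subjectHits(ov), mean)
```

Replacing mean with list or c in the last tapply call would keep all PhyloP values per large interval instead of summarizing them. A query range that only partly overlaps a subject contributes its full score here; weighting by overlap width would need extra work.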

Source: https://habr.com/ru/post/1796427/

