The fastest way to detect range overlap in Perl

I have two sets of ranges. Each range is a pair of integers (beginning and end) representing a certain subrange of one larger range. Two sets of ranges are in a structure similar to this (of course ... s will be replaced by actual numbers).

$a_ranges = { a_1 => { start => ..., end => ..., }, a_2 => { start => ..., end => ..., }, a_3 => { start => ..., end => ..., }, # and so on }; $b_ranges = { b_1 => { start => ..., end => ..., }, b_2 => { start => ..., end => ..., }, b_3 => { start => ..., end => ..., }, # and so on }; 

I need to determine which ranges from set A overlap, with which they vary from set B. Given the two ranges, it is easy to determine whether they overlap. I just used a double loop to do this: swipe all the elements in set A in the outer loop, swipe all the elements in set B in the inner loop, and keep track of which ones overlap.

I have two problems with this approach. Firstly, overlap space is extremely rare - even if there are thousands of ranges in each set, I expect each range from set A to overlap, possibly with 1 or 2 ranges from set B. My approach lists every single possibility that is overkill . This leads to my second problem - that it is very weak. The code ends very quickly (will be crushed) when there are hundreds of ranges in each set, but it takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.

Is there a better way to index these ranges so that I don't do so many extra checks for overlapping?

Update . The result I'm looking for is two hashes (one for each set of ranges), where the keys are the identifiers of the range and the values ​​are the identifiers of the ranges from another set that overlap with the given range in that set.

+4
source share
5 answers

This sounds like an ideal use of the interval tree , which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, you can build one of them in the tree of time intervals O (m lg m), and then execute n queries for the intersection in time O (n lg m + k), where k is total number of intersections found. This gives a net runtime of O ((m + n) log m + k). Remember that in the worst case, k = O (nm), so this is not better than yours, but for cases where the number of intersections is sparse, it can be significantly better than the execution time O (mn) Now.

I don't have much experience with spanning trees (and zero experience in Perl, sorry!), But from the description it seems that they shouldn't be so hard to build. I would be very surprised if this were not.

Hope this helps!

+10
source

If you are not attached exclusively to perl; The IRanges package in R deals with interval arithmetic. It has very powerful primitives, it would probably be easy to code a solution with them.

The second observation is that the problem can become very simple if the intervals have an additional structure; for example, if there is no overlap within each set of ranges (in this case a linear approach is possible, sifting through two ordered sets at the same time). Even in the absence of such a structure, at least you can sort one set of ranges by start point and the other by end point, and then exit the inner loop as soon as the match is no longer possible. Of course, existing and general algorithms and data structures, such as the spanning tree mentioned earlier, are the most powerful.
+4
source

There are several existing CPAN modules that solve this problem, I developed 2 of them: Data :: Range :: Compare and Data :: Range :: Compare :: Stream

Data :: Range :: Compare only works with arrays in memory, but supports common range types.

Data :: Range :: Compare :: Stream Works with data streams through iterators, but requires an understanding of OO Perl to propagate to common data types. Data :: Range :: Compare :: Stream is recommended if you are processing very large data arrays.

Here is the Expression form from the sample folder Data :: Range :: Compare :: Stream.

Given these three data sets:

 Numeric Range set: A contained in file: source_a.src +----------+ | 1 - 11 | | 13 - 44 | | 17 - 23 | | 55 - 66 | +----------+ Numeric Range set: B contained in file: source_b.src +----------+ | 0 - 1 | | 2 - 29 | | 88 - 133 | +----------+ Numeric Range set: C contained in file: source_c.src +-----------+ | 17 - 29 | | 220 - 240 | | 241 - 250 | +-----------+ 

Expected Result:

 +--------------------------------------------------------------------+ | Common Range | Numeric Range A | Numeric Range B | Numeric Range C | +--------------------------------------------------------------------+ | 0 - 0 | No Data | 0 - 1 | No Data | | 1 - 1 | 1 - 11 | 0 - 1 | No Data | | 2 - 11 | 1 - 11 | 2 - 29 | No Data | | 12 - 12 | No Data | 2 - 29 | No Data | | 13 - 16 | 13 - 44 | 2 - 29 | No Data | | 17 - 29 | 13 - 44 | 2 - 29 | 17 - 29 | | 30 - 44 | 13 - 44 | No Data | No Data | | 55 - 66 | 55 - 66 | No Data | No Data | | 88 - 133 | No Data | 88 - 133 | No Data | | 220 - 240 | No Data | No Data | 220 - 240 | | 241 - 250 | No Data | No Data | 241 - 250 | +--------------------------------------------------------------------+ 

Source code can be found here.

 #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use lib qw(./ ../lib); # custom package from FILE_EXAMPLE.pl use Data::Range::Compare::Stream::Iterator::File; use Data::Range::Compare::Stream; use Data::Range::Compare::Stream::Iterator::Consolidate; use Data::Range::Compare::Stream::Iterator::Compare::Asc; my $source_a=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_a.src'); my $source_b=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_b.src'); my $source_c=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_c.src'); my $consolidator_a=new Data::Range::Compare::Stream::Iterator::Consolidate($source_a); my $consolidator_b=new Data::Range::Compare::Stream::Iterator::Consolidate($source_b); my $consolidator_c=new Data::Range::Compare::Stream::Iterator::Consolidate($source_c); my $compare=new Data::Range::Compare::Stream::Iterator::Compare::Asc(); my $src_id_a=$compare->add_consolidator($consolidator_a); my $src_id_b=$compare->add_consolidator($consolidator_b); my $src_id_c=$compare->add_consolidator($consolidator_c); print " +--------------------------------------------------------------------+ | Common Range | Numeric Range A | Numeric Range B | Numeric Range C | +--------------------------------------------------------------------+\n"; my $format=' | %-12s | %-13s | %-13s | %-13s |'."\n"; while($compare->has_next) { my $result=$compare->get_next; my $string=$result->to_string; my @data=($result->get_common); next if $result->is_empty; for(0 .. 2) { my $column=$result->get_column_by_id($_); unless(defined($column)) { $column="No Data"; } else { $column=$column->get_common->to_string; } push @data,$column; } printf $format,@data; } print " +--------------------------------------------------------------------+\n"; 
+3
source

Try Tree :: RB, but find mutually exclusive ranges, don't overlap

Performance is not bad if I had about 10,000 segments and I had to find a segment for each discrete number. My entry had 300 million entries. I did not want to put them in separate buckets. Like data sharing. Tree :: RB works great.

 $var = [ [0,90], [91,2930], [2950,8293] . . . ] 

my search value was 10, 99, 991 ...

and basically I needed a range position for a given number

Using this comparison function below, mine uses something like this:

 my $cmp = sub { my ($a1, $b1) = @_; if(ref($b1) && ref($a1)) { return ($$a1[1]) <=> ($$b1[0]); } my $ret = 0; if(ref($a1) eq 'ARRAY') { # if($$a1[0] <= $b1 && $b1 >= $$a1[1]) { $ret = 0; } if($$a1[0] < $b1) { $ret = -1; } if($$a1[1] > $b1) { $ret = 1; } } else { if($$b1[0] <= $a1 && $a1 >= $$b1[1]) { $ret = 0; } if($$b1[0] > $a1) { $ret = -1; } if($$b1[1] < $a1) { $ret = 1; } } return $ret; } 
+1
source

I have to check the time to see if its the fastest way, but according to the structure of your data, you should try the following:

 use strict; my $fromA = 12; my $toA = 15; my $fromB = 7; my $toB = 35; my @common_range = get_common_range($fromA, $toA, $fromB, $toB); my $common_range = $common_range[0]."-".$common_range[-1]; sub get_common_range { my @A = $_[0]..$_[1]; my %B = map {$_ => 1} $_[2]..$_[3]; my @common = (); foreach my $i (@A) { if (defined $B{$i}) { push (@common, $i); } } return sort {$a <=> $b} @common; } 
+1
source

Source: https://habr.com/ru/post/1335362/


All Articles