Finding a set of ranges whose number falls

I have a list of lines in 200 thousand lines, such as start_position, stop position. The list includes all types of floors in addition to non-overlapping ones.

the list is as follows

  • [3,5]
  • [10,30]
  • [15.25]
  • [5.15]
  • [25.35]
  • ...

I need to find ranges that include a specific number. And repeat it for 100 thousand numbers. For example, if 18 is a given number with a list above, then the function should return [10,30] [15,25]

I am doing this in an overly complicated way using bisect, can someone let me know how to do this faster.

thanks

+6
source share
6 answers

There seems to be a problem with range coverage. Since you have a large number of requests for processing, you need to give the result as quickly as possible. There is an algorithm associated with this problem, you can take a look at the Segment Tree .

The idea is to first build the Segment Tree based on the given intervals, and then for each query you can do as fast as log(N) , where N is the number of intervals.

To get all possible intervals of k , first find the parent node spanning all the intervals with log(N) , and then go to its children to get all k intervals. Thus, the time complexity of extracting all intervals for a given number is limited by log(N) + O(k) , where k << n .

This algorithm can be degraded to as slow as O(N) when all intervals contain a given number. In this case, since the output size is N , there is no better solution. Since the complexity of the algorithm should be at least in the same order or higher than the output size.

Hope this helps.

+10
source

The best way to solve this is to build an interval tree. (The range tree defined by sza is static. This means that after you create it, you cannot add / remove nodes. If this is normal with you, then its answer will be enough.)

Here you can find out about the interval tree:

Youtube: https://www.youtube.com/watch?v=q0QOYtSsTg4

Wiki: https://en.wikipedia.org/wiki/Interval_tree

The span tree is basically a binary tree based on the left value of the range tuple. ([left, right]). What is special is that each node in the tree has a value called max. The maximum value contains the largest right value of the nodes that are below this node, including the node itself. In other words, the current node will have a max value set to the largest right value of the subtree, which the current node is the root of.

To expand this for your problem, we can add a minimum value. If the min value will contain the smallest left value of the entire subtree, where this node is the root.

Edited screenshot from video: (sorry for the quality)

enter image description here

This means that when we request a number, you:

 1) add current node to set of answers if number is inside range 2) go left if number is less than max value of left child. 3) go right if number is greater than min value of right child. 
+3
source
 import random as r rng = range(99) lefts = [r.choice(rng) for x in range(0, 200000)] segments = [(x, r.choice(range(x+1, 100))) for x in lefts] leftList = sorted(segments, key=lambda x:x[0]) rightList = sorted(segments, key=lambda x:x[1]) def returnSegments(item, segments, leftList, rightList): for leftN in range(len(leftList)): if leftList[leftN][0] > item: break for rightN in range(len(rightList)): if rightList[rightN][1] > item: break return list(set(leftList[:leftN]) & set(rightList[rightN:])) def testReturnSegments(): for item in range(0,100): item = item % 100 returnSegments(item, segments, leftList, rightList) 

This code uses 200 thousand segments and 100 numbers. It worked on my 2.3 GHz i5 macbook in 9.3 seconds, i.e. you would need 2.5 hours to fully launch 200k x 100k. Probably not fast enough, but there may be ways to speed up this approach - I'll think about it.

+2
source

So here is the tree implementation:

 import itertools class treeNode: def __init__(self, segments): self.value = None self.overlapping = None self.below = None self.above = None if len(segments) > 0: self.value = sum(itertools.chain.from_iterable(segments))/(2*len(segments)) self.overlapping = set() belowSegs = set() aboveSegs = set() for segment in segments: if segment[0] <= self.value: if segment[1] < self.value: belowSegs.add(segment) else: self.overlapping.add(segment) else: aboveSegs.add(segment) self.below = treeNode(belowSegs) self.above = treeNode(aboveSegs) else: self.overlapping = None def getOverlaps(self, item): if self.overlapping is None: return set() if item < self.value: return {x for x in self.overlapping if x[0]<=item and item<=x[1]} | self.below.getOverlaps(item) elif item > self.value: return {x for x in self.overlapping if x[0]<=item and item<=x[1]} | self.above.getOverlaps(item) else: return self.overlapping 

Demo

 import random as r maxInt = 100 numSegments = 200000 rng = range(maxInt-1) lefts = [r.choice(rng) for x in range(0, numSegments)] segments = [(x, r.choice(range(x+1, maxInt))) for x in lefts] # Construct a list of 200,000 random segments from integers between 0 and 100 tree = treeNode(segments) # Builds the tree structure based on a list of segments/ranges def testReturnSegments(): for item in range(0,100000): item = item % maxInt overlaps = tree.getOverlaps(item) import cProfile cProfile.run('testReturnSegments()') 

This uses 200 thousand segments and finds overlapping segments for 100 thousand numbers and works after 94 seconds on my 2.3 GHz Intel i5 processor. Here is the output of cProfile:

OUTPUT

  1060004 function calls (580004 primitive calls) in 95.700 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 95.700 95.700 <string>:1(<module>) 580000/100000 11.716 0.000 93.908 0.001 stackoverflow.py:27(getOverlaps) 239000 43.461 0.000 43.461 0.000 stackoverflow.py:31(<setcomp>) 241000 38.731 0.000 38.731 0.000 stackoverflow.py:33(<setcomp>) 1 1.788 1.788 95.700 95.700 stackoverflow.py:46(testReturnSegments) 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} 1 0.004 0.004 0.004 0.004 {range} 
+1
source
 def get_containing_intervals(x): for start, end in intervals: if start <= x <= end: yield x 

The use of <= here is based on the assumption that the intervals are inclusive (closed).

If you are going to call this function many times, you probably want to do some preprocessing, for example. what @sza offers.

0
source

What about,

  • sort by first column O (n log n)

  • binary search to search for indices outside the valid range O (log n)

  • throw values ​​out of range

  • sort by second column O (n log n)

  • binary search to search for indices outside the valid range O (log n)

  • throw values ​​out of range

  • you stay with values ​​in the range

It should be O (n log n)

You can sort rows and columns using np.sort, and a binary search should only consist of a few lines of code.

If you have many queries, you can save the first sorted copy for subsequent calls, but not the second. Depending on the number of queries, it may be better to perform a linear search than to sort and then search.

0
source

Source: https://habr.com/ru/post/951530/


All Articles