Data structure / query algorithm: filter by A, sort by B, return N results

Imagine that you have a large set of m objects, each with properties A and B. What data structure (or algorithm) can you use as an index to improve the performance of the following query?

 find all objects where A between X and Y, order by B, return first N results; 

That is, filter by a range on A and sort by B, but return only the first few results (say, no more than 1000). Inserts are very rare, so heavy precomputation is acceptable. I am not happy with the following options:

  • With records (or an index) sorted by B: scan the records/index in B order and return the first N whose A falls in the X–Y range. In the worst case (few objects match the X–Y range, or the matches are near the end of the records/index), this becomes O(m), which is too slow for large data sets of size m.

  • With records (or an index) sorted by A: binary-search for the first object that matches the X–Y range, scan forward and build an array of references to all k objects that match the range, sort the array by B, and return the first N. This is O(log m + k + k log k). If k is small, it is indeed O(log m), but if k is large, the cost of sorting becomes even worse than a linear scan over all m objects.

  • Adaptive 2/1: binary-search for the first match of the X–Y range (using the index on A), then binary-search for the last match. If the range is small, continue with algorithm 2; otherwise fall back to algorithm 1. The problem is precisely that fallback: although we have verified that "many" objects pass the filter, which is the good case for algorithm 1, this "many" is bounded only by a constant factor (asymptotically, the O(m) scan always beats the O(k log k) sort once k is large enough). So for some queries we are still stuck with an O(m) algorithm.
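For concreteness, the two baseline options might be sketched like this (a hypothetical sketch: objects are modeled as (A, B) tuples and the "indexes" as pre-sorted lists):

```python
import bisect

def query_via_b_index(objects_sorted_by_b, x, y, n):
    """Option 1: scan in B order, collect matches; O(m) in the worst case."""
    out = []
    for a, b in objects_sorted_by_b:
        if x <= a <= y:
            out.append((a, b))
            if len(out) == n:
                break
    return out

def query_via_a_index(objects_sorted_by_a, x, y, n):
    """Option 2: binary-search the A-range, then sort all k matches by B;
    O(log m + k + k log k)."""
    lo = bisect.bisect_left(objects_sorted_by_a, (x,))
    hi = bisect.bisect_right(objects_sorted_by_a, (y, float("inf")))
    return sorted(objects_sorted_by_a[lo:hi], key=lambda ob: ob[1])[:n]
```

Both return the same results; they differ only in which input distribution hits their worst case.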

Is there an algorithm or data structure that can answer this query in sublinear time?

If not, what could be a good compromise to achieve the required performance? For example, if I do not guarantee perfectly ranked results for property B (i.e., accept recall < 1.0), I could scan only part of index B. But can I do this while bounding the loss of result quality in some way?

+6
6 answers

The question you ask is essentially a more general version of:

Q. You have a sorted list of words, each with an associated weight, and you want all the words that share a prefix with a query q, sorted by the associated weight.

Am I right?

If so, you can check out this article, which discusses how to do it in O(k log n) time, where k is the number of elements in the desired output and n is the number of entries in the original input set. It assumes k > log n.

http://dhruvbird.com/autocomplete.pdf

(I am the author).

Update: Another point I can add is that what you are asking for is related to 2D range searching, where you want everything in a given X range and the top K of that set sorted by the Y coordinate.

A 2D range search lets you find everything within an X/Y range (when both ranges are known). In your case you only know the X range, so you would need to run the query repeatedly, binary-searching on the Y range until you have K results. Each query takes O(log n) time if you use fractional cascading, or O(log^2 n) with the naive approach. Either of these is sublinear, so you should be fine.

In addition, listing all the matching entries adds an extra O(k) term to the runtime.

+2

Assuming N << k < n, this can be done in O(log n + k + N log N), similar to what you proposed in option 2, but it saves some time: you do not need to sort all k elements, only N of them, which is much smaller!

The database is sorted by A.

 (1) Binary-search for the first and last elements that match the range, giving a list of the k matching elements. (2) Find the N-th element by B using a selection algorithm (*), and with a second pass build a new list of size N containing those top N elements. (3) Sort this last list by B. 

(*) Selection algorithm: find the N-th element. Here it is O(k) rather than O(n), since the list has only k elements.

Complexity:
Step 1 is a trivial O(log n + k). Step 2 is O(k) for the selection, and the second pass is also O(k), since the list has only k elements. Step 3 is O(N log N): a simple sort of a list that contains only N elements.
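A minimal Python sketch of these steps, assuming the "database" is a list of (A, B) tuples pre-sorted by A; here `heapq.nsmallest` stands in for the select-then-sort of steps 2–3 (its bounded-heap cost of O(k log N) is close to the O(k + N log N) selection bound):

```python
import bisect
import heapq

def query(objects_sorted_by_a, x, y, n):
    """Filter A in [x, y], return the first n results ordered by B.

    objects_sorted_by_a: list of (a, b) tuples, pre-sorted by a.
    """
    # (1) binary search for the boundaries of the A-range -> O(log m)
    lo = bisect.bisect_left(objects_sorted_by_a, (x,))
    hi = bisect.bisect_right(objects_sorted_by_a, (y, float("inf")))
    in_range = objects_sorted_by_a[lo:hi]            # the k matching elements
    # (2)+(3) keep only the n smallest by B, already sorted by B
    return heapq.nsmallest(n, in_range, key=lambda ob: ob[1])
```

The key saving over option 2 is that the full k-element sort never happens: only n elements ever get ordered.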

+2

If the number of elements you want to return is small, up to about 1% of the total number of elements, then a simple heap selection algorithm works well. See When theory fits practice . But this is not sublinear.

For expected sublinear performance, sort the elements by A. At query time, use a binary search to find the first element where A >= X, then scan sequentially through the elements until A > Y, applying the heap selection technique mentioned in that blog post.

This gives O(log n) for the initial search, and then O(m log k), where m is the number of elements with X <= A <= Y and k is the number of elements returned. Yes, for some queries it will be O(n log k). The deciding factor is the size of m.
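A sketch of this scan with a bounded max-heap (assumed layout: a list of (A, B) tuples sorted by A); the heap never exceeds k entries, which gives the O(m log k) bound:

```python
import bisect
import heapq

def top_n_by_b(objects_sorted_by_a, x, y, n):
    """Scan the A-range, keeping a bounded max-heap of the n smallest B values."""
    start = bisect.bisect_left(objects_sorted_by_a, (x,))
    heap = []                          # max-heap via negated B, capped at n entries
    for a, b in objects_sorted_by_a[start:]:
        if a > y:
            break                      # left the A-range; stop scanning
        if len(heap) < n:
            heapq.heappush(heap, (-b, a))
        elif -b > heap[0][0]:          # b smaller than the current worst kept
            heapq.heapreplace(heap, (-b, a))
    # unwind the heap into ascending-B order
    return [(a, -nb) for nb, a in sorted(heap, reverse=True)]
```

Most of the work is the cheap `-b > heap[0][0]` comparison; the log k heap update only fires when an element actually belongs in the current top n.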

+2

Build a segment tree over A, and for each segment precompute the top N elements in B order. To answer a query, split the query range into O(log m) segments and merge the precomputed lists. Query time is O(N log log m + log m); space is O(m log N).
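A hypothetical Python sketch of this idea: a segment tree over the objects sorted by A, where every node caches its segment's top-N elements in B order, and a query merges the cached lists of the O(log m) covering segments:

```python
import bisect
import heapq

class TopNSegmentTree:
    """Segment tree over (a, b) pairs sorted by a; each node caches its
    top-n elements in b order. A sketch, not a tuned implementation."""

    def __init__(self, objects, n):
        self.objects = sorted(objects)     # sorted by a
        self.n = n
        self.tree = {}                     # (lo, hi) -> list of (b, a), sorted by b
        if self.objects:
            self._build(0, len(self.objects))

    def _build(self, lo, hi):
        if hi - lo == 1:
            node = [(b, a) for a, b in self.objects[lo:hi]]
        else:
            mid = (lo + hi) // 2
            # merge the children's sorted top-n lists, keep only the best n
            node = list(heapq.merge(self._build(lo, mid),
                                    self._build(mid, hi)))[: self.n]
        self.tree[(lo, hi)] = node
        return node

    def query(self, x, y):
        """Top n by B among objects with x <= a <= y."""
        lo = bisect.bisect_left(self.objects, (x,))
        hi = bisect.bisect_right(self.objects, (y, float("inf")))
        pieces = self._cover(0, len(self.objects), lo, hi)
        return [(a, b) for b, a in list(heapq.merge(*pieces))[: self.n]]

    def _cover(self, nlo, nhi, qlo, qhi):
        # split [qlo, qhi) into O(log m) precomputed tree segments
        if qlo >= nhi or qhi <= nlo:
            return []
        if qlo <= nlo and nhi <= qhi:
            return [self.tree[(nlo, nhi)]]
        mid = (nlo + nhi) // 2
        return (self._cover(nlo, mid, qlo, qhi) +
                self._cover(mid, nhi, qlo, qhi))
```

Since inserts are rare, rebuilding the whole tree on each insert is acceptable, which keeps the structure simple.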

+2

This is not a complete solution, just an idea. How about building a quadtree on the A and B axes? You would descend the tree, say breadth-first; then:

  • whenever you find a subtree whose A-values all lie outside the given range [X, Y], you drop that subtree (and do not recurse into it);
  • whenever you find a subtree whose A-values all lie within the given range [X, Y], you add it to a set S that you are building, and do not recurse;
  • whenever you find a subtree with some A-values inside the range [X, Y] and some outside, you recurse into it.

Now you have a set S of all maximal subtrees whose A-coordinates lie between X and Y; there are at most O(sqrt(m)) such subtrees, as I show below.

Some of these subtrees may contain O(m) entries (indeed, taken together they contain all matching entries), so we cannot afford to do per-entry work over every subtree. Instead, put the subtrees in S into a heap keyed by each subtree's B-minimum, so that a subtree's B-minimum is smaller than the B-minimums of its children in the heap. Now extract B-minimal elements from the top of the heap until you have N of them; whenever you extract an element from a subtree with k elements, you need to decompose that subtree into O(log k) subtrees that do not contain the just-extracted element.

Now consider the complexity. Finding the O(sqrt(m)) subtrees takes at most O(sqrt(m)) steps (an exercise for the reader, using the arguments in the proof below). Putting them into a heap as we find them takes O(sqrt(m) * log(sqrt(m))) = O(sqrt(m) * log(m)) steps. Extracting one element from a k-element subtree in the heap takes O(sqrt(k)) time to find the element, and then inserting the O(log(sqrt(k))) = O(log(k)) resulting subtrees into a heap of size O(sqrt(m)) takes O(log(k) * log(m)). We could probably be smarter with a potential argument, but we can at least bound k by m, which leaves N * O(sqrt(k) + log(k) * log(m)) = O(N * (sqrt(m) + log(m)^2)) = O(N * sqrt(m)) for the extractions, and O(sqrt(m) * (N + log(m))) steps in total... which is sublinear in m.
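A rough, hypothetical sketch of this subtree-heap idea in Python, using the alternating kd-tree described in the proof below. Each node caches its subtree's A-bounds (for pruning) and B-minimum (for the heap); extracting a minimum walks down to the leaf that realizes it and pushes the bypassed pieces, i.e. the decomposition, back onto the heap:

```python
import heapq
import itertools

class Node:
    def __init__(self, point, left, right):
        self.point, self.left, self.right = point, left, right
        kids = [c for c in (left, right) if c]
        # cached subtree aggregates: A-bounds for pruning, B-minimum for the heap
        self.minb = min([point[1]] + [c.minb for c in kids])
        self.mina = min([point[0]] + [c.mina for c in kids])
        self.maxa = max([point[0]] + [c.maxa for c in kids])

def build(points, depth=0):
    # alternate splitting on the A-axis (even depth) and B-axis (odd depth)
    if not points:
        return None
    pts = sorted(points, key=lambda p: p[depth % 2])
    mid = len(pts) // 2
    return Node(pts[mid], build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))

def collect(node, x, y, out):
    # gather the maximal subtrees fully inside [x, y] on A (the set S)
    if node is None or node.maxa < x or node.mina > y:
        return
    if x <= node.mina and node.maxa <= y:
        out.append(node)
        return
    if x <= node.point[0] <= y:
        out.append(Node(node.point, None, None))   # the split point itself
    collect(node.left, x, y, out)
    collect(node.right, x, y, out)

def top_n(root, x, y, n):
    subtrees = []
    collect(root, x, y, subtrees)
    tie = itertools.count()                        # tie-breaker; nodes never compared
    heap = [(t.minb, next(tie), t) for t in subtrees]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < n:
        _, _, node = heapq.heappop(heap)
        # walk down to the element realizing this subtree's B-minimum,
        # pushing the bypassed pieces (the decomposition) back on the heap
        while node.point[1] != node.minb:
            nxt = node.left if node.left and node.left.minb == node.minb else node.right
            other = node.right if nxt is node.left else node.left
            heapq.heappush(heap, (node.point[1], next(tie), Node(node.point, None, None)))
            if other:
                heapq.heappush(heap, (other.minb, next(tie), other))
            node = nxt
        result.append(node.point)
        for child in (node.left, node.right):
            if child:
                heapq.heappush(heap, (child.minb, next(tie), child))
    return result
```

This sketch rebuilds decomposed pieces as fresh Node objects for simplicity; a serious implementation would reuse the existing nodes rather than reallocate.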


Here is the proof of the O(sqrt(m)) bound on the number of subtrees. There are several strategies for constructing the quadtree, but for ease of analysis suppose we build a binary tree: at the root node we split the data set on the A-coordinate around the point with the median A-coordinate; one level down we split each half on the B-coordinate around its point with the median B-coordinate (the median among the points contained in that half); and we continue alternating directions level by level.

The height of the tree is log(m). Now consider how many subtrees we recurse into. We only need to recurse if a subtree's A-range contains the coordinate X, or contains Y, or both. At the (2*k)-th level down there are 2^(2*k) subtrees. By then each subtree has had its A-range split k times already, and each time only half of the resulting trees contain the A-coordinate X. Therefore at most 2^k subtrees contain the A-coordinate X, and similarly at most 2^k contain the A-coordinate Y. This means that in total we recurse into at most 2 * sum(2^k, k = 0..log(m)/2) = 2 * (2^(log(m)/2 + 1) - 1) = O(sqrt(m)) subtrees.

Since we recurse into at most 2^k subtrees at the (2*k)-th level down, we can also add at most 2^k subtrees to S at that level. This gives the final bound.

+1

The result that you describe is what most search engines are built for (sorting, filtering, paging). If you have not done so already, check out a search engine such as Norch or Solr.

+1

Source: https://habr.com/ru/post/900147/

