Find medians in multiple subranges of an unordered list

eg. given an unordered list of N elements, find the medians for the subbands 0..100, 25..200, 400..1000, 10..500, ... I donโ€™t see a better way than to go through each subband and run the standard median algorithms search.

A simple example: [5 3 6 2 4] The median for 0..3 is 5. (Not 4, since we are asking for the median of the first three elements of the original list)

+6
source share
4 answers

TARGET ELEMENTS:

If your item type is an integer, then the best way is to have the bucket for each number in any of your subranges, where each bucket is used to count the number of its associated integer found in your input elements (for example, bucket[100] stores the number 100 in your input sequence). You can basically achieve this in the following steps:

  • create buckets for each number in any of your subranges.
  • iterate over all elements for each number n , if we have bucket[n] , then bucket[n]++ .
  • calculate medians based on the aggregated values โ€‹โ€‹stored in your buckets.

Put it another way, suppose you have a subrange [0, 10] and you want to calculate the median. The bucket approach basically calculates how many 0 are present in your inputs, and how many 1 are in your inputs, and so on. Suppose that n numbers are in the range [0, 10] , then the median is n/2 th is the largest element that can be identified by finding i so that bucket[0] + bucket[1] ... + bucket[i] greater than or equal to n/2 , but bucket[0] + ... + bucket[i - 1] less than n/2 .

The best part is that even your input elements are stored on several machines (i.e. in a distributed case), each machine can support its own buckets, and only aggregated values โ€‹โ€‹should go through the intranet.

You can also use hierarchical buckets that include several passes. In each pass, bucket[i] counts the number of elements in your input that lies in a certain range (for example, [i * 2^K, (i+1) * 2^K] ), and then narrows down the problem space, determining which bucket will be located after each step, then decrease K by 1 in the next step and repeat until you can identify the media correctly.


FLOATING ELEMENTS

All elements can fit into memory:

If all of your elements can fit into memory, the best option would be to first sort the element N, and then search for the medians for each subband. A linear time heap solution also works well in this case if the number of your subbands is less than logN .

All elements cannot be inserted into memory, but stored in one machine:

Typically, this external sorting as an example) focused on this problem.

+1
source

I think that as the number of sub-bands increases, you quickly find that it is faster to sort and then retrieve the item numbers you need.

In practice, because there will be optimized sorting procedures that you can call.

In theory and perhaps in practice, too, because since you are dealing with integers, you do not need to pay n log n for sorting - see http://en.wikipedia.org/wiki/Integer_sorting .

If your data is actually floating points, not NaN, then hiding a bit will actually allow you to use integer sorting for them - from - http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers - The binary representation has a special property, which, excluding NaNs, can be compared with any two numbers, as signed and integer values โ€‹โ€‹(although this is not directly applicable to modern computer processors): if the sign bit is different, a negative number precedes a positive number (except that negative zero and positive zero must be considered equal), otherwise the relative order coincides with the lexicographic order, but is inverted for two negative numbers; there are problems with the content.

So, you can check NaN and other funs, pretend that floating point numbers are signs + integer values, subtract when negative ones correct the order of negative numbers, and then treat them like normal signed integers, sort them, and then reverse the process.

0
source

My idea:

  • Sort the list into an array (using any suitable sorting algorithm)

  • For each range, find the beginning and end indices of the range using binary search

  • Find the median by simply adding your indices and dividing by 2 (that is, the median of the range [x,y] is equal to arr[(x+y)/2] )

Preprocessing time: O (n log n) for the general sorting algorithm (e.g. quick-sort) or the execution time of the selected sorting procedure

Request Time: O (log n)

Dynamic list:

The above assumes the list is static. If the elements can be freely added or deleted between queries, the modified binary search tree tree can work, while each node saves the number of numbers of its descendants. This will allow you to use the same runtime, as described above, with a dynamic list.

0
source

Ultimately, the answer will be "dependent." There are many approaches, any of which is likely to be suitable in most cases that you may encounter. The problem is that each will work differently for different inputs. Where one may work better for one class of inputs, the other will work better for another class of inputs.

As an example, the approach of sorting and then doing a binary search at the extreme values โ€‹โ€‹of your ranges and then directly calculating the median will be useful when the number of ranges you should check is greater than log (N). On the other hand, if the number of ranges is less than log (N), it may be better to move the elements of a given range to the beginning of the array and use the linear time algorithm to find the median.

It all comes down to profiling to avoid premature optimization. If your approach does not become a bottleneck for your system performance, figuring out how to improve it will not be a useful exercise in optimizing those parts of your program that are bottlenecks.

0
source

Source: https://habr.com/ru/post/951529/


All Articles