Combining percentiles from different data sets: how can this be done?

I need to compute the Nth percentile of a series of related but segmented datasets.

The combined dataset is too large to process all at once due to memory limitations, but a framework for performing piecewise computations already exists. So how can I perform calculations on each dataset and then combine the results to find the required percentile?

Other information about the data:

  • The data often has outliers.

  • The individual datasets are usually about the same size, but not always.

  • The individual datasets are not expected to share the same distribution.

Is it possible to calculate the combined median, mean, and standard deviation, and then estimate any percentile from those?

1 answer

The median, mean, and standard deviation alone are unlikely to be enough, especially if you have outliers.

If exact percentiles are required, this is a parallel computation problem. Some work has been done in this direction, for example in the parallel mode of the C++ STL library.

If only approximate percentiles are required, a question on Cross Validated ("Estimating quantiles of given quantiles of a subset") suggests a subsampling approach. You take some (but not all) of the data from each dataset, build a new combined dataset that is small enough to fit on one machine, and calculate the percentiles of that.
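A minimal sketch of the subsampling idea in Python, under some assumptions: each segment can be loaded into memory on its own, the same sampling fraction is used for every segment (so larger segments contribute proportionally more points), and a nearest-rank percentile is good enough. The function name `subsample_percentile` and the `fraction` and `seed` parameters are made up for illustration.

```python
import math
import random

def subsample_percentile(segments, q, fraction=0.1, seed=0):
    """Approximate the q-th percentile (0-100) of all segments combined.

    segments: iterable of sequences, visited one at a time.
    fraction: share of each segment kept in the pooled sample (assumed value).
    """
    rng = random.Random(seed)
    pooled = []
    for seg in segments:
        seg = list(seg)
        # Same fraction everywhere keeps the pool proportional to segment sizes.
        k = max(1, round(len(seg) * fraction))
        pooled.extend(rng.sample(seg, k))
    pooled.sort()
    # Nearest-rank percentile of the pooled sample.
    rank = max(1, math.ceil(q * len(pooled) / 100))
    return pooled[rank - 1]
```

With `fraction=1.0` this degenerates to the exact nearest-rank percentile of the full data. Smaller fractions trade accuracy for memory, and extreme percentiles suffer most when outliers are rare.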

Another approximate approach, useful if the percentiles of each segment are already available, is to approximate the cumulative distribution function of each segment as a step function of its percentiles. The overall distribution is then a finite mixture of the segment distributions, and its cumulative distribution function is the weighted sum of the segments' cumulative distribution functions, with weights proportional to the segment sizes. The quantile function (i.e., the percentiles) can be calculated by numerically inverting this cumulative distribution function.
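The mixture approach can be sketched in Python as follows. This is only an illustration under stated assumptions: each segment is summarized as a hypothetical `(size, levels, values)` tuple, where `levels` are ascending percentile levels in [0, 100] and `values` are the matching percentile values. Because the mixture CDF is a step function that only jumps at stored percentile values, the inversion here is a simple scan over those values.

```python
from bisect import bisect_right

def segment_cdf(levels, values, x):
    """Step-function CDF of one segment built from its stored percentiles."""
    i = bisect_right(values, x)  # how many stored percentile values are <= x
    return levels[i - 1] / 100 if i else 0.0

def combined_percentile(segments, q):
    """Smallest stored value whose mixture CDF reaches q percent.

    segments: list of (size, levels, values) tuples (assumed layout).
    """
    total = sum(size for size, _, _ in segments)
    # The mixture CDF can only change at these points.
    candidates = sorted({v for _, _, values in segments for v in values})
    for x in candidates:
        cdf = sum(size / total * segment_cdf(levels, values, x)
                  for size, levels, values in segments)
        if cdf >= q / 100 - 1e-12:  # small tolerance for float rounding
            return x
    return candidates[-1]
```

The result can be no finer than the stored percentile grid, so storing more levels per segment (especially in the tails) improves the estimate.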


Source: https://habr.com/ru/post/1381396/

