Inverse CDF

I would like to calculate the inverse cumulative distribution function (inverse CDF) of a given PDF. The PDF is given directly as a histogram, i.e. a vector of N equidistant components.

My current approach:

    cdf = cumsum(pdf);
    K = 3;            %// some upsampling factor
    maxVal = 1;       %// just for my own usage - a scaling factor
    M = length(cdf);
    N = M*K;          %// increase resolution for higher accuracy
    y = zeros(N, 1);
    cursor = 2;
    for i = 1:N
        desiredF = (i-1)/(N-1)*maxVal;
        while (cursor < M && cdf(cursor) < desiredF)
            cursor = cursor + 1;
        end
        if (cdf(cursor) == cdf(cursor-1))
            y(i) = cursor - 1;
        else
            alpha = min(1, max(0, (desiredF - cdf(cursor-1))/(cdf(cursor) - cdf(cursor-1))));
            y(i) = ((cursor-1)*(1-alpha) + alpha*cursor)/maxVal;
        end
    end
    y = resample(y, 1, K, 0);

In other words, I build the CDF, invert it with linear interpolation, and then downsample back to the histogram's resolution. This is rather ugly code, it is not very reliable (if I change the resampling factor, I can get completely different results), and it is hopelessly slow... can anyone suggest a better approach?

Note: the generalized inverse I am trying to compute (for the case where the CDF is not invertible) is:

F^{-1}(t) = \inf\{x \in \mathbb{R} : F(x) > t\}

where F is the cumulative distribution function.
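For concreteness, a minimal MATLAB sketch of that definition (my own illustration, assuming pdf is the histogram vector and x its grid of sample locations):

    cdf = cumsum(pdf);
    cdf = cdf / cdf(end);              %// normalize so F ends at 1
    t = 0.5;                           %// probability at which to evaluate F^{-1}
    idx = find(cdf > t, 1, 'first');   %// smallest index with F(x) > t
    Finv = x(idx);                     %// the generalized inverse at t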

[EDIT: actually K = 1 (i.e. without increasing the sampling rate) seems to give more accurate results...]

Thanks!

2 answers

If your input comes from a histogram of actual data, then you can simply use the built-in quantile() function: quantile() computes the data value at a specified quantile, which is exactly the inverse CDF. If the histogram has been normalized by the number of data points (making it a probability vector), just multiply it back by the number of data points first. See the quantile() documentation for details. Essentially, with your histogram/data held fixed as the first argument, quantile() becomes a function of the probability values p alone. You can easily write a wrapper function to make this more convenient if necessary; a sketch follows the code below. This avoids computing the CDF explicitly with cumsum().

Added

If we assume that the histogram, bins, and the number of data points are h, b, and N respectively, then:

    h1 = N*h;   %// Only if histogram frequencies have been normalized.
    data = [];
    for kk = 1:length(h1)
        data = [data repmat(b(kk), 1, h1(kk))];   %// replicate each bin value by its count
    end
    %// Set p to the probability you want the inv-cdf for...
    p = 0.5;
    inv_cdf = quantile(data, p)
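For instance, a hedged sketch of such a wrapper (the name invcdf_from_hist and its signature are my own invention, not a built-in):

    function q = invcdf_from_hist(h, b, N, p)
        %// h: histogram frequencies (possibly normalized), b: bin values,
        %// N: number of data points, p: probability (scalar or vector).
        h1 = round(N * h);       %// undo the normalization, guarding against round-off
        data = repelem(b, h1);   %// reconstruct the raw sample (repelem needs R2015a+)
        q = quantile(data, p);   %// the inverse CDF evaluated at p
    end

Calling, say, invcdf_from_hist(h, b, N, 0.5) then reproduces the snippet above without rebuilding data by hand each time.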

Added

For a solution that must work from an existing PDF vector, we can do the following. Suppose that x_old and pdf_old are the histogram bin locations and the histogram frequencies, respectively.

    p = 0.5;                   %// the inv-cdf probability that I want
    num_points_i_want = 100;   %// the number of points I want in my histogram vector
    x_new = linspace(min(x_old), max(x_old), num_points_i_want);
    pdf_new = interp1(x_old, pdf_old, x_new);   %// resample the PDF on the finer grid
    cdf_new = cumsum(pdf_new);                  %// running sum as the (possibly unnormalized) CDF
    inv_cdf = min(x_new(cdf_new >= p));         %// smallest x whose CDF reaches p

Alternatively, we could build the CDF first with cumsum() and then apply interp1() to it, if interpolating the PDF first is undesirable.
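A sketch of that alternative under the same x_old/pdf_old assumptions; the unique() guard is my addition, since interp1() requires strictly increasing sample points:

    cdf_old = cumsum(pdf_old);
    cdf_old = cdf_old / cdf_old(end);         %// normalize to [0, 1]
    [cdf_u, iu] = unique(cdf_old);            %// drop duplicate CDF values
    inv_cdf = interp1(cdf_u, x_old(iu), p);   %// invert by swapping the axes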


Well, I think I found a much shorter version that runs at least as fast and is just as accurate:

    cdf = cumsum(pdf);
    M = length(cdf);
    xx = linspace(0, 1, M);
    invcdf = interp1(cdf, xx, xx)

[EDIT: No, it's actually still two to three times slower than the original code... don't ask me why! And it does not handle non-strictly monotonic CDFs: those trigger the interp1 error "The values of X should be distinct".]
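One way around that error, assuming the flat stretches of the CDF are the culprit, is to deduplicate the CDF before interpolating (a sketch of mine, not the poster's code):

    cdf = cumsum(pdf);
    cdf = cdf / cdf(end);                 %// force the CDF to end at exactly 1
    M = length(cdf);
    xx = linspace(0, 1, M);
    [cdfu, iu] = unique(cdf, 'first');    %// keep the first point of each flat region
    invcdf = interp1(cdfu, xx(iu), xx);   %// sample points are now strictly increasing

Query points below cdfu(1) lie outside the sample range and come back as NaN; passing 'linear', 0 as the last two arguments of interp1 substitutes an extrapolation value if that matters.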

