Convert a draw in Matlab from a Gaussian mixture to a uniform one

Consider the following code for a 2x1 random vector in Matlab whose probability distribution is a mixture of two Gaussian components.

 P = 10^3; % number of draws
 v = 1;
 % First component
 mu_a = [0, 0.5];
 sigma_a = [v, 0; 0, v];
 % Second component
 mu_b = [0, 8.2];
 sigma_b = [v, 0; 0, v];
 % Combine
 MU = [mu_a; mu_b];
 SIGMA = cat(3, sigma_a, sigma_b);
 w = ones(1,2)/2; % equal weights 0.5
 obj = gmdistribution(MU, SIGMA, w);
 % Draws
 RV_temp = random(obj, P); % Px2
 % Transform each component of RV_temp into a uniform on [0,1] by estimating the cdf
 RV1 = ksdensity(RV_temp(:,1), RV_temp(:,1), 'function', 'cdf');
 RV2 = ksdensity(RV_temp(:,2), RV_temp(:,2), 'function', 'cdf');

Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing

 ecdf(RV1)
 ecdf(RV2)

we see that RV1 is uniformly distributed on [0,1] (its empirical cdf is close to the 45-degree line), but RV2 is not.

I do not understand why. It seems that the farther apart mu_a(2) and mu_b(2) are, the worse ksdensity performs with a reasonable number of draws. Why?
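As a point of comparison (this is an addition, not part of the original question): since the mixture parameters are known here, the probability integral transform can be applied exactly with normcdf instead of being estimated with ksdensity. A sketch using the same parameters as above:

```matlab
% Exact probability integral transform for the known two-component mixture.
% Each marginal cdf is F(x) = 0.5*Phi((x-m1)/sqrt(v)) + 0.5*Phi((x-m2)/sqrt(v)).
v = 1;
P = 10^3;
mu_a = [0, 0.5]; mu_b = [0, 8.2];
obj = gmdistribution([mu_a; mu_b], cat(3, v*eye(2), v*eye(2)), [0.5 0.5]);
RV_temp = random(obj, P);

% Apply the exact marginal mixture cdf to each component
U1 = 0.5*normcdf(RV_temp(:,1), mu_a(1), sqrt(v)) + ...
     0.5*normcdf(RV_temp(:,1), mu_b(1), sqrt(v));
U2 = 0.5*normcdf(RV_temp(:,2), mu_a(2), sqrt(v)) + ...
     0.5*normcdf(RV_temp(:,2), mu_b(2), sqrt(v));

% U1 and U2 are exactly U[0,1]-distributed, whatever the separation of the means
ecdf(U2)
```

With this transform the empirical cdf of U2 hugs the 45-degree line even for widely separated means, which suggests the problem lies in the cdf estimation step rather than in the transform itself.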

2 answers

If you have a mixture of N(0.5, v) and N(8.2, v), the range of the generated data is larger than it would be if the means were closer together, e.g. N(0, v) and N(0, v) as in the other dimension. You then ask ksdensity to approximate the cdf using the same P points spread over that wider range.

Just as in ordinary linear interpolation, where denser points give a better approximation of the function within their range, the same holds here. So with N(0.5, v) and N(8.2, v), where the points are sparser, the approximation is worse than with N(0, v) and N(0, v), where the points are denser.
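One rough way to see this effect (a sketch added here, not part of the original answer) is to compare ksdensity's cdf estimate against the exact mixture cdf as the separation between the two means grows, keeping the number of draws fixed:

```matlab
% Max error of ksdensity's cdf estimate vs. the exact mixture cdf,
% for increasing separation between the two means (illustration only)
v = 1; P = 10^3;
for sep = [0.5, 4, 8.2]
    % 50/50 mixture of N(0, v) and N(sep, v)
    x = [randn(P/2,1)*sqrt(v); sep + randn(P/2,1)*sqrt(v)];
    Fhat = ksdensity(x, x, 'function', 'cdf');               % estimated cdf
    Fex  = 0.5*normcdf(x, 0, sqrt(v)) + ...
           0.5*normcdf(x, sep, sqrt(v));                     % exact cdf
    fprintf('sep = %.1f: max |Fhat - F| = %.3f\n', sep, max(abs(Fhat - Fex)));
end
```

With the default bandwidth, the reported error should grow with sep, consistent with the sparser coverage of the wider range.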

As a side note, is there a reason why you are not applying ksdensity directly to the two-dimensional data? Also, I cannot reproduce your comment that 5e2 points also work well. A final remark: 1e3 is usually preferable to 10^3.


I think this is just down to the number of samples you use. In the first component, the means of the two Gaussians are relatively close, so a thousand samples are enough to get a cdf that closely matches the U[0,1] cdf. In the second component, however, the separation is larger and you need more samples. With 100,000 samples I got the following result:

Result with 100,000 samples

With 1,000 samples I got this:

Result with 1000 samples

This is clearly different from the uniform cdf. Try increasing the number of samples to a million and see whether the result gets close again.


Source: https://habr.com/ru/post/1270152/
