Without any restriction on the intended distribution, the empirical distribution function is a natural candidate. For this distribution there are strong theorems on convergence to the true distribution (see the Dvoretzky-Kiefer-Wolfowitz inequality).
With this choice, sampling is especially simple. If dataset is a list of the current samples, then dataset[rand(1:length(dataset),sample_size)] is a collection of new samples drawn (with replacement) from the empirical distribution. With Distributions, this can be written more readably, for example:
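For reference, the inequality can be stated as follows. The symbols here are my own notation, not from the text above: F is the true distribution function, F_n is the empirical distribution function built from n independent samples, and ε > 0 is any tolerance.

$$
P\left(\sup_{x} \lvert F_n(x) - F(x) \rvert > \varepsilon\right) \le 2 e^{-2 n \varepsilon^{2}}
$$

So the worst-case gap between the empirical and the true distribution function shrinks exponentially fast in n, without any assumption on the shape of F.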
using Distributions
new_sample = sample(dataset, sample_size)
Finally, a kernel density estimate is also a good choice, but it requires choosing parameters (the kernel and its width), which expresses a preference for some family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. Draw a sample from the empirical distribution; 2. Perturb each sample with a draw from the kernel function.
For example, if the kernel function is a normal distribution with standard deviation w, then the perturbed sample can be calculated as:
new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size)
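Both approaches can be wrapped into small functions. This is only a minimal sketch, assuming one-dimensional real-valued data and a Gaussian kernel; the names empirical_sample and kde_sample, the example dataset and the bandwidth 0.1 are made up for illustration.

```julia
using Random

# Hypothetical helper: resample with replacement, i.e. draw from the
# empirical distribution of `dataset`.
empirical_sample(rng, dataset, n) = dataset[rand(rng, 1:length(dataset), n)]

# Hypothetical helper: draw from a Gaussian kernel density estimate by
# perturbing each resampled point with normal noise of standard deviation w.
kde_sample(rng, dataset, n, w) =
    empirical_sample(rng, dataset, n) .+ w .* randn(rng, n)

rng = MersenneTwister(1)              # fixed seed, only for reproducibility
dataset = [1.0, 2.0, 4.0, 8.0]        # illustrative data

resampled = empirical_sample(rng, dataset, 5)   # values taken from dataset
smoothed  = kde_sample(rng, dataset, 5, 0.1)    # values near points of dataset
```

With w = 0 the second function reduces to the first, which shows that the kernel width only controls how far the new samples may stray from the observed ones.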