If you need a sample of an exact size, use
a.takeSample(false, 1000)
But note that this returns an Array, not an RDD.
As for why a.sample(false, 0.1) does not return the same sample size every time: Spark internally uses Bernoulli sampling to take the sample. The fraction argument is not a fraction of the actual size of the RDD. It is the probability that each element of the population is selected for the sample, and as Wikipedia says:
Since each element of the population is considered separately for the sample, the sample size is not fixed, but rather follows the binomial distribution.
In other words, the number of sampled elements is not fixed and varies from run to run.
If you set the first argument to true, sampling is done with replacement and Spark uses Poisson sampling instead, which also gives a sample size that is not fixed.
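To make the point about varying sizes concrete, here is a minimal sketch in plain Scala (no Spark required) that imitates Bernoulli sampling on a local collection. The object name BernoulliSampleDemo and the helper bernoulliSample are made up for illustration and are not part of Spark's API. The sketch only mirrors the idea of keeping each element independently with probability fraction.

```scala
import scala.util.Random

object BernoulliSampleDemo {
  // Hypothetical helper (not Spark API): keep each element independently
  // with probability `fraction`, as Bernoulli sampling does.
  def bernoulliSample[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
    data.filter(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val population = 1 to 10000
    val rng = new Random(42L)
    // Draw five independent samples with the same fraction and print their sizes.
    val sizes = Seq.fill(5)(bernoulliSample(population, 0.1, rng).size)
    println(sizes.mkString(", "))
  }
}
```

Running it should print five sizes that are close to 1000 but generally differ from each other, which is the binomial behaviour described above.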
Update
If you want to stick with the sample method, you can pass a fraction that is larger than you strictly need, and then call take, as in:
a.sample(false, 0.2).take(1000)
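This should in most cases, but not always, give you exactly 1000 elements. It only works if the population is large enough that the sampled RDD contains at least 1000 elements, that is, if fraction times the size of the RDD is comfortably above 1000. Otherwise take simply returns fewer elements.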
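As a rough guide to how large the fraction should be, the following sketch picks it so that the expected sample size sits a few standard deviations above the target. It is a heuristic based on the binomial distribution and a normal approximation, not something Spark provides: the object OversampleFraction, the helper fractionFor and the default of 5 standard deviations are all assumptions made for this example.

```scala
object OversampleFraction {
  // Hypothetical helper (not Spark API): choose a fraction p so that the
  // expected sample size n * p lies `sigmas` standard deviations above `target`.
  // It uses sqrt(n * p) as an upper bound on the binomial standard deviation.
  def fractionFor(target: Int, populationSize: Long, sigmas: Double = 5.0): Double = {
    require(target > 0 && populationSize > 0, "target and populationSize must be positive")
    // Solve m - sigmas * sqrt(m) = target for m = n * p (a quadratic in sqrt(m)).
    val rootM = (sigmas + math.sqrt(sigmas * sigmas + 4.0 * target)) / 2.0
    val expectedSize = rootM * rootM
    // A fraction above 1.0 is meaningless without replacement, so cap it.
    math.min(1.0, expectedSize / populationSize)
  }

  def main(args: Array[String]): Unit = {
    val p = fractionFor(target = 1000, populationSize = 10000L)
    println(f"fraction = $p%.4f")
    // With a real RDD `a` of 10000 elements one would then call:
    //   a.sample(false, p).take(1000)
  }
}
```

If the helper returns 1.0, the population is too small for this trick and takeSample is the safer choice.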