How to get a sample with the exact sample size in Spark RDD?

Why does the "sample" function on a Spark RDD return a different number of elements each time, even when the fraction parameter is the same? For example, if my code looks like this:

 val a = sc.parallelize(1 to 10000, 3)
 a.sample(false, 0.1).count

Each time I run the second line, it returns a different number, never exactly 1000. In fact, I expect to see 1000 each time, even though the 1000 elements themselves may differ from run to run. Can someone tell me how I can get a sample whose size is exactly 1000? Thank you very much.

2 answers

If you need a sample with an exact size, try

 a.takeSample(false, 1000) 

But note that this returns an Array, not an RDD.
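A minimal usage sketch (the third argument, a seed, is optional; the value 42L is just an illustrative choice):

 val a = sc.parallelize(1 to 10000, 3)
 // Collects exactly 1000 elements to the driver as an Array[Int]
 val exact: Array[Int] = a.takeSample(false, 1000, 42L)
 println(exact.length) // always 1000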

As for why a.sample(false, 0.1) does not return the same sample size each time: that is because Spark internally uses something called Bernoulli sampling to take the sample. The fraction argument does not represent a fraction of the actual size of the RDD; it represents the probability with which each element of the population is selected for the sample, and as Wikipedia says:

Since each element of the population is considered separately for the sample, the sample size is not fixed, but rather follows the binomial distribution.

And this essentially means that the sample size does not remain fixed.
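You can observe this empirically; a small sketch, reusing the same a as in the question:

 // Each call draws a fresh Bernoulli sample, so the counts vary from
 // run to run, binomially distributed around n * p = 10000 * 0.1 = 1000
 for (_ <- 1 to 5) {
   println(a.sample(false, 0.1).count())
 }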

If you set the first argument to true (sampling with replacement), it will use something called Poisson sampling, which also leads to a non-deterministic sample size.

Update

If you want to stick with the sample method, you can specify a larger fraction than you need and then call take, as in:

 a.sample(false, 0.2).take(1000) 

This will give you a sample size of 1000 in most cases, but not always; it is more likely to work if you have a large enough population.
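A slightly more defensive variant of the same idea (a sketch; the wanted name and the fallback to takeSample are illustrative additions, not part of the original answer):

 val wanted = 1000
 // Over-sample, and cache so count() and take() reuse the same sample
 val oversampled = a.sample(false, 0.2).cache()
 // If the Bernoulli draw happens to come up short, fall back to takeSample
 val result: Array[Int] =
   if (oversampled.count() >= wanted) oversampled.take(wanted)
   else a.takeSample(false, wanted)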


Another way is to takeSample first and then turn the result back into an RDD. This can be slow with large datasets.

 sc.makeRDD(a.takeSample(false, 1000, 1234)) 
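If the partitioning of the rebuilt RDD matters, makeRDD also accepts a slice count (the 3 here is just an illustrative choice):

 // 1234 is the sampling seed; 3 is the desired number of partitions
 val sampled = sc.makeRDD(a.takeSample(false, 1000, 1234), 3)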
