Collect RDD with buffer in pyspark

I would like to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is big enough that it cannot fit into memory on the name node, so running collect() would cause an error.

Is there a way to recreate the collect() operation, but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100,000 rows at a time from a cached RDD, but I don't think take() allows you to specify a starting position?
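For reference, a minimal sketch of the take()-style batching idea above, assuming a live SparkContext sc; a starting position can be emulated with zipWithIndex(), though each batch still launches a full Spark job:

    # Illustrative sketch only (not from the original post): emulate a starting
    # position for batched collection by filtering on a row index produced by
    # zipWithIndex(). Assumes an existing SparkContext `sc`.
    rdd = sc.parallelize(range(1000000)).cache()
    indexed = rdd.zipWithIndex()  # pairs of (row, position)

    batch_size = 100000
    total = rdd.count()
    for start in range(0, total, batch_size):
        end = start + batch_size
        batch = (indexed
                 .filter(lambda kv, lo=start, hi=end: lo <= kv[1] < hi)
                 .map(lambda kv: kv[0])
                 .collect())
        # process `batch` locally here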

1 answer

The best option is to use RDD.toLocalIterator, which collects only one partition at a time. It returns a standard Python generator:

    rdd = sc.parallelize(range(100000))
    iterator = rdd.toLocalIterator()
    type(iterator)
    ## generator

    even = (x for x in iterator if not x % 2)
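Because the result is an ordinary generator, it can also be consumed in fixed-size chunks on the driver; a small usage sketch (not part of the original answer) using itertools.islice:

    from itertools import islice

    iterator = rdd.toLocalIterator()
    while True:
        batch = list(islice(iterator, 1000))  # pull at most 1000 rows locally
        if not batch:
            break
        # process `batch` here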

You can adjust the amount of data collected in a single batch by using a specific partitioner and tuning the number of partitions.
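As a hedged sketch of that tuning, one could repartition so that each partition holds roughly the desired batch size, since toLocalIterator() fetches one partition per step (the numbers here are illustrative assumptions, and repartition() shuffles, so row order is not preserved):

    batch_size = 100000
    num_partitions = max(1, rdd.count() // batch_size)  # assumes fairly even row sizes

    batched = rdd.repartition(num_partitions)  # note: this shuffles the data
    for row in batched.toLocalIterator():      # each partition is fetched as one batch
        pass  # process rows here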

Unfortunately, it comes at a price. To collect small batches you have to start multiple Spark jobs, which is quite expensive. So, generally speaking, collecting one element at a time is not an option.


Source: https://habr.com/ru/post/1236362/

