Collect RDD with buffer in pyspark

I would like to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is big enough that it cannot fit into memory on the name node, so running collect() would cause an error.

Is there a way to recreate the collect() operation, but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100,000 rows at a time from a cached RDD, but I don't think take() allows you to specify a starting position?
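For reference, a minimal sketch of the take()-style batching idea above, assuming a live SparkContext sc; a starting position can be emulated with zipWithIndex(), though each batch still launches a full Spark job:

    # Illustrative sketch only (not from the original post): emulate a starting
    # position for batched collection by filtering on a row index produced by
    # zipWithIndex(). Assumes an existing SparkContext `sc`.
    rdd = sc.parallelize(range(1000000)).cache()
    indexed = rdd.zipWithIndex()  # pairs of (row, position)

    batch_size = 100000
    total = rdd.count()
    for start in range(0, total, batch_size):
        end = start + batch_size
        batch = (indexed
                 .filter(lambda kv, lo=start, hi=end: lo <= kv[1] < hi)
                 .map(lambda kv: kv[0])
                 .collect())
        # process `batch` locally here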

1 answer

The best option is to use RDD.toLocalIterator, which collects only one partition at a time. It returns a standard Python generator:

    rdd = sc.parallelize(range(100000))
    iterator = rdd.toLocalIterator()
    type(iterator)
    ## generator

    even = (x for x in iterator if not x % 2)
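Because the result is an ordinary generator, it can also be consumed in fixed-size chunks on the driver; a small usage sketch (not part of the original answer) using itertools.islice:

    from itertools import islice

    iterator = rdd.toLocalIterator()
    while True:
        batch = list(islice(iterator, 1000))  # pull at most 1000 rows locally
        if not batch:
            break
        # process `batch` here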

You can adjust the amount of data collected in a single batch by using a specific partitioner and tuning the number of partitions.
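As a hedged sketch of that tuning, one could repartition so that each partition holds roughly the desired batch size, since toLocalIterator() fetches one partition per step (the numbers here are illustrative assumptions, and repartition() shuffles, so row order is not preserved):

    batch_size = 100000
    num_partitions = max(1, rdd.count() // batch_size)  # assumes fairly even row sizes

    batched = rdd.repartition(num_partitions)  # note: this shuffles the data
    for row in batched.toLocalIterator():      # each partition is fetched as one batch
        pass  # process rows here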

Unfortunately, it comes at a price. To collect small batches you have to start multiple Spark jobs, which is quite expensive. So, generally speaking, collecting one element at a time is not an option.


Source: https://habr.com/ru/post/1236362/

