The best option is to use RDD.toLocalIterator, which only collects one partition at a time. It creates a standard Python generator:
rdd = sc.parallelize(range(100000))
iterator = rdd.toLocalIterator()
type(iterator)
# <class 'generator'>
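As a minimal sketch of how you might consume it, reusing the rdd and the active SparkContext sc from above:

# Iterate lazily: each partition is fetched from the cluster
# only when the iterator reaches it, so the full RDD is never
# held in driver memory at once.
total = 0
for value in rdd.toLocalIterator():
    total += value
print(total)  # 4999950000, i.e. sum(range(100000))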
Since data is fetched one partition at a time, you can control how much is collected in each batch by adjusting the partitioning of the RDD, for example by changing the number of partitions.
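A rough sketch of that idea, again reusing the rdd defined above (the partition count of 10 is an arbitrary choice for illustration):

# Fewer partitions -> larger batches and fewer Spark jobs;
# more partitions -> smaller batches and more jobs.
batched = rdd.repartition(10)  # ~10,000 elements per partition
for value in batched.toLocalIterator():
    pass  # process one element at a time; data arrives partition by partition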
Unfortunately, this comes at a price: assembling small batches means running multiple Spark jobs (one per partition), which is quite expensive. So, generally speaking, collecting one element at a time is not a practical option.