Is this a distributed operation? Will each partition report its count, with the totals then summed and reported? Or is the whole RDD pulled into the driver and counted there?
count is distributed. In Spark terminology, count is an "action". All actions are distributed. In fact, only a few operations bring all the data to the driver node, and they are usually well documented (for example, take, collect, etc.).
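A minimal sketch of what that means in practice, assuming you are in spark-shell where `sc` is an existing SparkContext and the input path is hypothetical: each partition counts its own records on the executors, and only the small per-partition totals travel to the driver to be summed.

```scala
val rdd = sc.textFile("hdfs:///some/hypothetical/path")

// Built-in distributed action: per-partition counts are summed on the driver.
val total = rdd.count()

// Roughly equivalent, spelled out: count each partition locally,
// then reduce the per-partition counts.
val manualTotal = rdd
  .mapPartitions(iter => Iterator(iter.size.toLong))
  .reduce(_ + _)
```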
After running count(), will the RDD still be in memory, or do I have to cache it explicitly?
No, the data will not be in memory. If you want that, you need to cache explicitly before counting. Because of Spark's lazy evaluation, no computation happens until an action is invoked, and no data is kept in memory after the action unless cache was called.
Is there a better way to do what I want, namely count the records before working on them?
Cache, count, then work looks like a solid plan.
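A sketch of that cache → count → work pattern, again assuming `sc` is an existing SparkContext and the path and downstream transformation are hypothetical:

```scala
val records = sc.textFile("hdfs:///some/hypothetical/path")

// Mark the RDD for caching *before* the first action, so the data
// materialized by count() is kept in executor memory.
records.cache()

// First action: computes the RDD and populates the cache.
val n = records.count()
println(s"Processing $n records")

// Subsequent work reuses the cached partitions instead of re-reading the input.
val lengths = records.map(_.length)
lengths.take(10).foreach(println)
```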