Is counting the records in an RDD an expensive task in Spark?

In Hadoop, when I use an InputFormat reader, the job-level logs tell me how many records were read, along with the number of bytes read, etc.

In Spark, when I use the same input format, I get the wrong metrics.

So I am thinking of using the InputFormat reader to populate the RDD, and then simply reporting the number of records in that RDD (its size).

I know that rdd.count() returns the number of elements in the RDD.

However, the cost of using count() is not clear to me. For instance:

  • Is this a distributed function? Will each partition report its count, with the totals summed and reported? Or is the entire RDD brought to the driver and counted there?
  • After calling count(), will the RDD remain in memory, or do I have to cache it explicitly?
  • Is there a better way to do what I want, namely counting the records before operating on them?
+5
1 answer

Is this a distributed function? Will each partition report its count, with the totals summed and reported? Or is the entire RDD brought to the driver and counted there?

Count is distributed. In Spark terminology, count is an "action". All actions are distributed. There are really only a few operations that bring all the data to the driver node, and they are usually well documented (for example, take, collect, etc.).
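
As a minimal sketch (the path and app name below are placeholders, not from the question): count() launches a job in which each partition counts its own records, and only the per-partition totals are sent back to the driver, where they are summed.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("count-example"))

    // textFile is just an example source; any RDD behaves the same way
    val rdd = sc.textFile("hdfs:///path/to/input")

    // count() is an action: each partition computes its own record count,
    // and only those partial counts travel to the driver to be summed.
    val n: Long = rdd.count()
    println(s"records: $n")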

After running count(), will the RDD still remain in memory, or do I have to cache it explicitly?

No, the data will not stay in memory. If you want that, you need to call cache explicitly before counting. Spark's lazy evaluation does no computation until an action is invoked, and no data is kept in memory after the action unless cache was called.
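
For example (a rough sketch, not from the answer): without cache(), a second action recomputes the RDD from the source, while cache() keeps the computed partitions in memory for reuse.

    // No cache: each action re-reads and re-transforms the input
    val lengths = sc.textFile("hdfs:///path/to/input").map(_.length)
    lengths.count()   // reads the input
    lengths.count()   // reads the input again

    // With cache: the first action materializes the partitions in memory
    // (storage permitting) and later actions reuse them
    val cachedLengths = lengths.cache()
    cachedLengths.count()   // computes and caches
    cachedLengths.count()   // served from the cached partitions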

Is there a better way to do what I want, namely counting the records before operating on them?

Cache, then count, then work with the data sounds like a solid plan.
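
A sketch of that plan, assuming the data comes from a Hadoop InputFormat (the TextInputFormat, the paths, and the final map step are just examples, not from the question):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Read via the new-API InputFormat, convert the Writables to plain
    // strings before caching (Hadoop reuses Writable instances), then cache.
    val records = sc.newAPIHadoopFile(
        "hdfs:///path/to/input",            // placeholder path
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text]
      )
      .map { case (_, value) => value.toString }
      .cache()

    // Count first (this also materializes the cache) ...
    val recordCount = records.count()
    println(s"read $recordCount records")

    // ... then keep working with the cached RDD
    val processed = records.map(_.toUpperCase)
    processed.saveAsTextFile("hdfs:///path/to/output")   // placeholder path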

+10

Source: https://habr.com/ru/post/1247447/
