How does the number of partitions affect `wholeTextFiles` and `textFile`?

Question

In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure when to use which. Here is what I know so far:

When working with files that are not split by line, use wholeTextFiles; otherwise, use textFile.

I would think that, by default, wholeTextFiles and textFile partition by file content and by line, respectively. But both of them allow you to change the minPartitions parameter.

So how does changing partitions affect how they are processed?

For example, let's say I have one very large file with 100 lines. What is the difference between processing it with wholeTextFiles using 100 partitions and processing it with textFile (which partitions it line by line) using 100 partitions?

What is the difference between the two?

python apache-spark pyspark

Sother Nov 25 '15 at 0:40

1 answer

climbage · Answer 1 · 2015-11-25T21:05:38+0000

For reference, wholeTextFiles uses WholeTextFileInputFormat, which extends CombineFileInputFormat.

A few notes on wholeTextFiles.

Each record in the RDD returned by wholeTextFiles holds a file name and the entire contents of that file. This means that a file cannot be split across partitions.
Because it extends CombineFileInputFormat, it tries to combine groups of small files into a single partition.

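If a directory holds two small files, both may end up in a single partition. With minPartitions=2, you will likely get two partitions instead.

With minPartitions=3, you will still get only two partitions, because wholeTextFiles guarantees that each record in the RDD contains an entire file.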
