I have a cluster on which I am running wholeTextFiles to read about a million text files, roughly 10 GB in total. The cluster has one NameNode and two DataNodes with 30 GB of RAM and 4 cores each. The data is stored in HDFS.
I have not set any special parameters, and reading the data takes 5 hours. Is that expected? Are there any parameters that would speed up the read (Spark configuration, partitioning, number of executors)?
I'm just getting started, and I have never had to optimize a job until now.
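For reference, this is roughly what I run at the moment, with no tuning at all (a minimal sketch; the HDFS path and the app name are placeholders, not my real values):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of the current job: a plain read, nothing tuned.
    val conf = new SparkConf().setAppName("read-small-files")
    val sc   = new SparkContext(conf)

    // ~1 million small files, ~10 GB total, stored in HDFS.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*")
    println(files.count())   // this read is the part that takes ~5 hours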
EDIT: Also, can anyone explain how the wholeTextFiles function works internally? (Not how to use it, but how it is programmed.) I am particularly interested in understanding the partition parameter, etc.
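To be clear, this is the overload I mean (the value 32 below is arbitrary, just to show the parameter):

    // From the SparkContext API: each record is (filePath, fileContent).
    //   def wholeTextFiles(path: String,
    //                      minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
    val perFile: org.apache.spark.rdd.RDD[(String, String)] =
      sc.wholeTextFiles("hdfs:///data/small-files/*", minPartitions = 32)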
EDIT 2:
So, I tried repartitioning after wholeTextFiles, but the problem stays the same, because the initial read still uses the default number of partitions, so there is no performance improvement. Once the data is loaded, the cluster works very well ... I get the following warning when working with the data (for 200k files) read via wholeTextFiles:
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Will this cause poor performance? How can I work around it?
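This is the repartitioning attempt I described above (a sketch; the path is a placeholder and the partition count of 200 was chosen arbitrarily):

    // Repartition right after the read. It only spreads the data for later
    // stages; the initial read itself still runs with the default partition
    // count, so the 5-hour read is unchanged.
    val files  = sc.wholeTextFiles("hdfs:///data/small-files/*")
    val spread = files.repartition(200)
    spread.map { case (_, content) => content.length }.count()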
For comparison: according to the Ambari console, saveAsTextFile runs at about 19 ... , while wholeTextFiles runs at about 300 ... .
As suggested, I also tried wholeTextFile(path, partitions), setting partitions to 8 (the number of cores). But ...
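Concretely, that attempt looked like this (again a sketch with a placeholder path; 8 presumably matches my total core count, 2 DataNodes x 4 cores):

    // Pass an explicit minPartitions hint on the read itself.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*", 8)
    files.count()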