Spark: sc.wholeTextFiles takes a long time to execute

I have a cluster on which I run wholeTextFiles, which should read about a million text files totalling approximately 10 GB. I have one NameNode and two DataNodes with 30 GB of RAM and 4 cores each. The data is stored in HDFS.

I pass no special parameters, and it takes 5 hours just to read the data. Is that expected? Are there any parameters that would speed up the read (Spark configuration, partitioning, number of executors)?
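
For reference, a minimal sketch of this kind of read (the path, app name, and partition hint below are hypothetical placeholders, not taken from the setup above):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("read-small-files"))

    // Default call: Spark picks a small number of partitions on its own.
    val docs = sc.wholeTextFiles("hdfs:///data/docs")

    // The optional second argument is a minimum-partitions hint; a larger
    // value splits the scan into more, smaller tasks.
    val docsSplit = sc.wholeTextFiles("hdfs:///data/docs", 64)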

I'm just getting started with Spark and have never had to optimize a job before.

EDIT: Also, can anyone explain how the wholeTextFiles function works? (Not how to use it, but how it is implemented.) I am very interested in understanding the partitions parameter, among other things.
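
A rough, hedged sketch of the mechanism (paraphrasing the observable behavior, not the actual Spark source): wholeTextFiles reads through a combining input format that packs many small files into each split, and the minPartitions argument is effectively turned into a maximum split size, roughly:

    // Illustration only: approximately how a minPartitions hint could
    // translate into a split size (not Spark's real code).
    def approxMaxSplitSize(totalBytes: Long, minPartitions: Int): Long =
      math.ceil(totalBytes.toDouble / math.max(minPartitions, 1)).toLong

    // 10 GB of input with a hint of 64 gives ~160 MB splits, so each task
    // reads a bundle of small files adding up to roughly that size.
    approxMaxSplitSize(10L * 1024 * 1024 * 1024, 64) // 167772160 bytes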

EDIT 2:

So, I tried repartitioning after wholeTextFiles; the problem is the same, because the initial read still uses the default number of partitions, so there is no performance improvement. Once the data is loaded, the cluster performs very well... I get the following warning when working with the data (for 200k files), on the wholeTextFiles call:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Will this cause poor performance? How can I avoid it?
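
One hedged sketch of the distinction at play here (the path is a placeholder, and sc is the context from the first snippet): repartitioning after the read only reshuffles data that was already scanned with the default split count, while a hint at read time changes how the scan itself is divided, which should also shrink the per-task split metadata behind the warning above:

    // Repartitioning afterwards does not speed up the read itself:
    val late = sc.wholeTextFiles("hdfs:///data/docs").repartition(64)

    // A hint at read time splits the initial scan into more tasks, so each
    // task carries metadata for fewer files:
    val early = sc.wholeTextFiles("hdfs:///data/docs", 64)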

Also, to be more precise: with saveAsTextFile, the speed according to the Ambari console reaches 19 MB/s, while reading with wholeTextFiles I get only 300 KB/s...

It also seems that increasing the partitions argument in wholeTextFiles(path, partitions) helps, but the job still runs only 8 tasks at a time (my number of cores), ...
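
A quick way to compare how the read was split with how much parallelism the application actually has (standard Spark API; the path and hint are placeholders):

    val rdd = sc.wholeTextFiles("hdfs:///data/docs", 64)
    println(rdd.partitions.length)  // how many tasks the read is split into
    println(sc.defaultParallelism)  // roughly the total cores available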


Answer:

  • HDFS is a poor fit for storing many small files. First, the NameNode keeps all file metadata in memory, so the number of files and blocks you can have is limited (~100 million blocks is the maximum for a typical server). Second, each time you read a file, you first query the NameNode for the block locations and then connect to the DataNode that stores the file. The overhead of these connections and responses is really huge.
  • The default settings should always be reviewed. By default, Spark starts on YARN with 2 executors (--num-executors) with 1 core each (--executor-cores) and 512 MB of RAM each (--executor-memory), giving you only 2 threads with 512 MB of RAM apiece, which is far too little for real-world jobs.

So, my recommendation is to start Spark with more executors, more cores per executor, and more memory, for example:
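
A hypothetical configuration along those lines, sized for the cluster from the question (two workers with 4 cores and 30 GB of RAM each); the exact numbers are an assumption, not a prescription:

    import org.apache.spark.SparkConf

    // Roughly equivalent to launching with:
    //   spark-submit --num-executors 4 --executor-cores 2 --executor-memory 12g
    val conf = new SparkConf()
      .setAppName("wholeTextFiles-read")
      .set("spark.executor.instances", "4") // 2 executors per worker
      .set("spark.executor.cores", "2")     // 4 x 2 = 8 concurrent tasks
      .set("spark.executor.memory", "12g")  // 2 x 12g = 24g per 30 GB worker

This would give 8 tasks running in parallel, matching the 8 cores available across the two DataNodes.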


Source: https://habr.com/ru/post/1617353/

