I have a cluster on which I am running wholeTextFiles to read about a million text files, roughly 10 GB in total. The cluster has one NameNode and two DataNodes with 30 GB of RAM and 4 cores each. The data is stored in HDFS.
I have not set any special parameters, and reading the data takes 5 hours. Is that expected? Are there any parameters that would speed up the read (Spark configuration, partitioning, number of executors)?
I'm just getting started, and I have never had to optimize a job until now.
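For reference, this is roughly what I run at the moment, with no tuning at all (a minimal sketch; the HDFS path and the app name are placeholders, not my real values):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch of the current job: a plain read, nothing tuned.
    val conf = new SparkConf().setAppName("read-small-files")
    val sc   = new SparkContext(conf)

    // ~1 million small files, ~10 GB total, stored in HDFS.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*")
    println(files.count())   // this read is the part that takes ~5 hours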
EDIT: Also, can anyone explain how the wholeTextFiles function works internally? (Not how to use it, but how it is programmed.) I am particularly interested in understanding the partition parameter, etc.
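To be clear, this is the overload I mean (the value 32 below is arbitrary, just to show the parameter):

    // From the SparkContext API: each record is (filePath, fileContent).
    //   def wholeTextFiles(path: String,
    //                      minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
    val perFile: org.apache.spark.rdd.RDD[(String, String)] =
      sc.wholeTextFiles("hdfs:///data/small-files/*", minPartitions = 32)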
EDIT 2:
So, I tried repartitioning after wholeTextFiles, but the problem stays the same, because the initial read still uses the default number of partitions, so there is no performance improvement. Once the data is loaded, the cluster works very well ... I get the following warning when working with the data (for 200k files) read via wholeTextFiles:
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Will this cause poor performance? How can I work around it?
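This is the repartitioning attempt I described above (a sketch; the path is a placeholder and the partition count of 200 was chosen arbitrarily):

    // Repartition right after the read. It only spreads the data for later
    // stages; the initial read itself still runs with the default partition
    // count, so the 5-hour read is unchanged.
    val files  = sc.wholeTextFiles("hdfs:///data/small-files/*")
    val spread = files.repartition(200)
    spread.map { case (_, content) => content.length }.count()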
For comparison: according to the Ambari console, saveAsTextFile runs at about 19 ... , while wholeTextFiles runs at about 300 ... .
As suggested, I also tried wholeTextFile(path, partitions), setting partitions to 8 (the number of cores). But ...
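Concretely, that attempt looked like this (again a sketch with a placeholder path; 8 presumably matches my total core count, 2 DataNodes x 4 cores):

    // Pass an explicit minPartitions hint on the read itself.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*", 8)
    files.count()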