Clearing Spark history logs

We have a long-running EMR cluster to which we submit Spark jobs. I see that, over time, HDFS fills up with Spark application logs, which sometimes seems to lead to a host being reported as unhealthy in the EMR / YARN view.

Running hadoop fs -ls -R -h / shows [1], which makes it clear that the application logs have never been deleted.

We set spark.history.fs.cleaner.enabled to true (and confirmed this in the Spark UI) and hoped that the defaults for the other cleaner settings, i.e. a cleaner interval of 1d and a maximum log age of 7d as documented at http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options, would take care of clearing these logs. They did not.
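For reference, these are the relevant cleaner settings and their documented defaults; spark.history.fs.cleaner.enabled is the only one we changed from its default:

 spark.history.fs.cleaner.enabled false
 spark.history.fs.cleaner.interval 1d
 spark.history.fs.cleaner.maxAge 7d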

Any ideas?

[1]

 -rwxrwx--- 2 hadoop spark 543.1 M 2017-01-11 13:13 /var/log/spark/apps/application_1484079613665_0001
 -rwxrwx--- 2 hadoop spark 7.8 G 2017-01-17 10:51 /var/log/spark/apps/application_1484079613665_0002.inprogress
 -rwxrwx--- 2 hadoop spark 1.4 G 2017-01-18 08:11 /var/log/spark/apps/application_1484079613665_0003
 -rwxrwx--- 2 hadoop spark 2.9 G 2017-01-20 07:41 /var/log/spark/apps/application_1484079613665_0004
 -rwxrwx--- 2 hadoop spark 125.9 M 2017-01-20 09:57 /var/log/spark/apps/application_1484079613665_0005
 -rwxrwx--- 2 hadoop spark 4.4 G 2017-01-23 10:19 /var/log/spark/apps/application_1484079613665_0006
 -rwxrwx--- 2 hadoop spark 6.6 M 2017-01-23 10:31 /var/log/spark/apps/application_1484079613665_0007
 -rwxrwx--- 2 hadoop spark 26.4 M 2017-01-23 11:09 /var/log/spark/apps/application_1484079613665_0008
 -rwxrwx--- 2 hadoop spark 37.4 M 2017-01-23 11:53 /var/log/spark/apps/application_1484079613665_0009
 -rwxrwx--- 2 hadoop spark 111.9 M 2017-01-23 13:57 /var/log/spark/apps/application_1484079613665_0010
 -rwxrwx--- 2 hadoop spark 1.3 G 2017-01-24 10:26 /var/log/spark/apps/application_1484079613665_0011
 -rwxrwx--- 2 hadoop spark 7.0 M 2017-01-24 10:37 /var/log/spark/apps/application_1484079613665_0012
 -rwxrwx--- 2 hadoop spark 50.7 M 2017-01-24 11:40 /var/log/spark/apps/application_1484079613665_0013
 -rwxrwx--- 2 hadoop spark 96.2 M 2017-01-24 13:27 /var/log/spark/apps/application_1484079613665_0014
 -rwxrwx--- 2 hadoop spark 293.7 M 2017-01-24 17:58 /var/log/spark/apps/application_1484079613665_0015
 -rwxrwx--- 2 hadoop spark 7.6 G 2017-01-30 07:01 /var/log/spark/apps/application_1484079613665_0016
 -rwxrwx--- 2 hadoop spark 1.3 G 2017-01-31 02:59 /var/log/spark/apps/application_1484079613665_0017
 -rwxrwx--- 2 hadoop spark 2.1 G 2017-02-01 12:04 /var/log/spark/apps/application_1484079613665_0018
 -rwxrwx--- 2 hadoop spark 2.8 G 2017-02-03 08:32 /var/log/spark/apps/application_1484079613665_0019
 -rwxrwx--- 2 hadoop spark 5.4 G 2017-02-07 02:03 /var/log/spark/apps/application_1484079613665_0020
 -rwxrwx--- 2 hadoop spark 9.3 G 2017-02-13 03:58 /var/log/spark/apps/application_1484079613665_0021
 -rwxrwx--- 2 hadoop spark 2.0 G 2017-02-14 11:13 /var/log/spark/apps/application_1484079613665_0022
 -rwxrwx--- 2 hadoop spark 1.1 G 2017-02-15 03:49 /var/log/spark/apps/application_1484079613665_0023
 -rwxrwx--- 2 hadoop spark 8.8 G 2017-02-21 05:42 /var/log/spark/apps/application_1484079613665_0024
 -rwxrwx--- 2 hadoop spark 371.2 M 2017-02-21 11:54 /var/log/spark/apps/application_1484079613665_0025
 -rwxrwx--- 2 hadoop spark 1.4 G 2017-02-22 09:17 /var/log/spark/apps/application_1484079613665_0026
 -rwxrwx--- 2 hadoop spark 3.2 G 2017-02-24 12:36 /var/log/spark/apps/application_1484079613665_0027
 -rwxrwx--- 2 hadoop spark 9.5 M 2017-02-24 12:48 /var/log/spark/apps/application_1484079613665_0028
 -rwxrwx--- 2 hadoop spark 20.5 G 2017-03-10 04:00 /var/log/spark/apps/application_1484079613665_0029
 -rwxrwx--- 2 hadoop spark 7.3 G 2017-03-10 04:04 /var/log/spark/apps/application_1484079613665_0030.inprogress
1 answer

I ran into this problem on emr-5.4.0; setting spark.history.fs.cleaner.interval to 1h got the cleaner running.

For reference, here is the end of my spark-defaults.conf file:

 spark.history.fs.cleaner.enabled true
 spark.history.fs.cleaner.maxAge 12h
 spark.history.fs.cleaner.interval 1h

After making the changes, restart the Spark history server.
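On EMR 5.x the history server runs as an upstart-managed service on the master node, so (assuming a stock EMR install) the restart looks something like:

 sudo stop spark-history-server
 sudo start spark-history-server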

One more clarification: setting these values at application submission time, i.e. passing them to spark-submit via --conf, has no effect. Either set them when the cluster is created using the EMR configuration API, or manually edit spark-defaults.conf, set these values, and restart the Spark history server.
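As an illustration, here is a minimal sketch of the configurations JSON for the first approach, passed at cluster creation (e.g. via aws emr create-cluster --configurations file://spark-config.json; the file name is just an example):

 [
   {
     "Classification": "spark-defaults",
     "Properties": {
       "spark.history.fs.cleaner.enabled": "true",
       "spark.history.fs.cleaner.maxAge": "12h",
       "spark.history.fs.cleaner.interval": "1h"
     }
   }
 ]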


Source: https://habr.com/ru/post/1265508/

