I would like to propose an alternative approach that avoids any timeout / shutdown logic inside the application itself, which would make the application more complex than necessary — although I'm obviously late to the party. Maybe this will prove useful to someone in the future.
You can:
- write a Python script that wraps the regular YARN commands
- execute those YARN commands through the `subprocess` library
- parse their output however you see fit
- decide which YARN applications should be killed
More on what I mean by that below.
A Python script that runs YARN commands through the `subprocess` library
```python
import subprocess

running_apps = subprocess.check_output(
    ['yarn', 'application', '--list', '--appStates', 'RUNNING'],
    universal_newlines=True)
```
This snippet will give you output similar to the following:
```
Total number of applications (application-types: [] and states: [RUNNING]):1
                Application-Id      Application-Name      Application-Type      User     Queue      State      Final-State     Progress     Tracking-URL
application_1554703852869_0066      HIVE-645b9a64-cb51-471b-9a98-85649ee4b86f      TEZ      hadoop      default      RUNNING      UNDEFINED      0%      http:
```
You can parse this output (remember that several applications may be running at once) and extract the application IDs.
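The parsing step can be sketched with a simple regular expression — YARN application IDs follow the stable `application_<clusterTimestamp>_<sequenceNumber>` format, so a regex is enough here:

```python
import re


def extract_app_ids(listing):
    """Extract YARN application IDs from `yarn application --list` output."""
    # IDs look like application_1554703852869_0066
    return re.findall(r'application_\d+_\d+', listing)
```

The header row contains the literal word `Application-Id`, but the pattern only matches the digit-based IDs, so the header is skipped automatically.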
Then, for each of these application identifiers, you can call another yarn command to get more detailed information about a specific application:
```python
app_status_string = subprocess.check_output(
    ['yarn', 'application', '--status', app_id],
    universal_newlines=True)
```
The output of this command should be something like this:
```
Application Report :
    Application-Id : application_1554703852869_0070
    Application-Name : com.organization.YourApp
    Application-Type : HIVE
    User : hadoop
    Queue : default
    Application Priority : 0
    Start-Time : 1554718311926
    Finish-Time : 0
    Progress : 10%
    State : RUNNING
    Final-State : UNDEFINED
    Tracking-URL : http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:40817
    RPC Port : 36203
    AM Host : ip-xx-xxx-xxx-xx.eu-west-1.compute.internal
    Aggregate Resource Allocation : 51134436 MB-seconds, 9284 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
    Log Aggregation Status : NOT_START
    Diagnostics :
    Unmanaged Application : false
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : CORE
```
From this, you can extract the application's start time, compare it with the current time, and see how long it has been running. If it has been running longer than some threshold number of minutes, for example, you kill it.
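A minimal sketch of that check: the `Start-Time` field is epoch milliseconds, so the elapsed time is just a subtraction. The 60-minute threshold below is a placeholder, not part of the original answer:

```python
import re
import time

THRESHOLD_MINUTES = 60  # placeholder -- pick a limit that suits your jobs


def running_minutes(app_status_string):
    """Minutes since the application's Start-Time (epoch milliseconds)."""
    match = re.search(r'Start-Time\s*:\s*(\d+)', app_status_string)
    if match is None:
        return None  # field missing -- leave the app alone
    start_ms = int(match.group(1))
    return (time.time() - start_ms / 1000.0) / 60.0


# kill the app when running_minutes(app_status_string) > THRESHOLD_MINUTES
```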
How do you kill it? Easy:
```python
kill_output = subprocess.check_output(
    ['yarn', 'application', '--kill', app_id],
    universal_newlines=True)
```
That takes care of the application-killing step.
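Putting the three commands together, the whole check fits in one small function. The regexes and the 60-minute threshold are my own assumptions, not part of the original answer:

```python
import re
import subprocess
import time

THRESHOLD_MINUTES = 60  # assumption: adjust to your workload


def kill_long_running_apps():
    """List RUNNING YARN apps, check their age, kill the ones over the limit."""
    listing = subprocess.check_output(
        ['yarn', 'application', '--list', '--appStates', 'RUNNING'],
        universal_newlines=True)
    for app_id in re.findall(r'application_\d+_\d+', listing):
        status = subprocess.check_output(
            ['yarn', 'application', '--status', app_id],
            universal_newlines=True)
        match = re.search(r'Start-Time\s*:\s*(\d+)', status)
        if match is None:
            continue  # no start time reported -- skip this app
        minutes_running = (time.time() - int(match.group(1)) / 1000.0) / 60.0
        if minutes_running > THRESHOLD_MINUTES:
            subprocess.check_output(
                ['yarn', 'application', '--kill', app_id],
                universal_newlines=True)
```

Call `kill_long_running_apps()` from the script's entry point; cron takes care of running it periodically.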
Automating the approach
AWS EMR has a great feature called "bootstrap actions": scripts that run while the EMR cluster is being created. These can be used to automate this approach.
Add a bash script as a bootstrap action that will:
- upload the Python script you just wrote to the cluster (master node)
- add the Python script to the crontab
And that's it.
P.S. I assumed that Python 3 is available for this purpose.