Amazon EMR - How to Set a Timeout for a Step

Is there a way to set a timeout for a step in Amazon AWS EMR?

I am running an Apache Spark batch job in EMR, and I would like the job to stop with a timeout if it does not end within 3 hours.

I can't find a way to set this timeout in Spark, in YARN, or in the EMR configuration.

Thank you for your help!

2 answers

Well, as many have already answered, an EMR step cannot currently be killed / stopped / terminated through an API call.

But to achieve your goal, you can implement a timeout as part of your application code. When you submit an EMR step, a child process is created to run the application (whether it is a MapReduce application, a Spark application, etc.), and the completion status of the step is determined by the exit code that this child process, i.e. your application, returns.

For example, if you are submitting a MapReduce application, you can use something like:

 FileInputFormat.addInputPath(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));

 final Runnable stuffToDo = new Thread() {
     @Override
     public void run() {
         try {
             job.submit();
         } catch (Exception e) {
             // job.submit() throws checked exceptions; rethrow so future.get() sees them
             throw new RuntimeException(e);
         }
     }
 };

 final ExecutorService executor = Executors.newSingleThreadExecutor();
 final Future<?> future = executor.submit(stuffToDo);
 executor.shutdown(); // This does not cancel the already-scheduled task.

 try {
     future.get(180, TimeUnit.MINUTES);
 } catch (InterruptedException ie) {
     // Handle the interruption. Or ignore it.
 } catch (ExecutionException ee) {
     // Handle the error. Or ignore it.
 } catch (TimeoutException te) {
     // Handle the timeout, e.g. job.killJob();
     // otherwise waitForCompletion() below will keep blocking.
 }

 System.exit(job.waitForCompletion(true) ? 0 : 1);

Reference: Java: set timeout on a certain block of code?

Hope this helps.


I would like to propose an alternative approach, without any timeout/shutdown logic that makes the application itself more complex than necessary, although I'm obviously late to the party. Maybe this will prove useful to someone in the future.

You can:

  • write a Python script and use it as a wrapper around the regular YARN commands
  • execute those YARN commands through the subprocess lib
  • parse their output as you see fit
  • decide which YARN applications should be killed

More on what I mean below.

Python wrapper script and running YARN commands through the subprocess lib

 import subprocess

 running_apps = subprocess.check_output(
     ['yarn', 'application', '--list', '--appStates', 'RUNNING'],
     universal_newlines=True)

This snippet will give you output similar to this:

 Total number of applications (application-types: [] and states: [RUNNING]):1
                 Application-Id      Application-Name    Application-Type      User     Queue       State   Final-State   Progress                        Tracking-URL
 application_1554703852869_0066  HIVE-645b9a64-cb51-471b-9a98-85649ee4b86f   TEZ    hadoop   default   RUNNING   UNDEFINED   0%   http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:45941/ui/

You can parse this output (remember that several applications may be running at once) and extract the application ID values.
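As an illustration, here is one minimal way to pull the application IDs out of that listing; the helper names are my own, and the parsing assumes the output format shown above (data rows start with `application_`):

```python
import subprocess

def parse_running_app_ids(listing):
    """Extract application IDs from `yarn application --list` output."""
    return [line.split()[0]
            for line in listing.splitlines()
            if line.startswith('application_')]

def get_running_app_ids():
    """Ask YARN for all RUNNING applications and return their IDs."""
    listing = subprocess.check_output(
        ['yarn', 'application', '--list', '--appStates', 'RUNNING'],
        universal_newlines=True)
    return parse_running_app_ids(listing)
```

Keeping the parsing in its own function makes it easy to test without a live cluster.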

Then, for each of these application IDs, you can call another YARN command to get more detailed information about the specific application:

 app_status_string = subprocess.check_output(
     ['yarn', 'application', '--status', app_id],
     universal_newlines=True)

The output of this command should be something like this:

 Application Report :
 	Application-Id : application_1554703852869_0070
 	Application-Name : com.organization.YourApp
 	Application-Type : HIVE
 	User : hadoop
 	Queue : default
 	Application Priority : 0
 	Start-Time : 1554718311926
 	Finish-Time : 0
 	Progress : 10%
 	State : RUNNING
 	Final-State : UNDEFINED
 	Tracking-URL : http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:40817
 	RPC Port : 36203
 	AM Host : ip-xx-xxx-xxx-xx.eu-west-1.compute.internal
 	Aggregate Resource Allocation : 51134436 MB-seconds, 9284 vcore-seconds
 	Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
 	Log Aggregation Status : NOT_START
 	Diagnostics :
 	Unmanaged Application : false
 	Application Node Label Expression : <Not set>
 	AM container Node Label Expression : CORE

From this you can also extract the application's start time, compare it with the current time, and see how long it has been running. If it has been running longer than some threshold number of minutes, for example, you kill it.
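A sketch of that check, assuming the report format shown above (`Start-Time` is epoch milliseconds); the threshold value and function names are illustrative:

```python
import re
import time

THRESHOLD_MINUTES = 180  # hypothetical 3-hour limit

def parse_start_time_ms(status_report):
    """Pull Start-Time (epoch milliseconds) out of a `yarn application --status` report."""
    match = re.search(r'Start-Time\s*:\s*(\d+)', status_report)
    return int(match.group(1)) if match else None

def running_too_long(status_report, now_ms=None):
    """True if the application has been running longer than THRESHOLD_MINUTES."""
    start_ms = parse_start_time_ms(status_report)
    if start_ms is None:
        return False
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - start_ms) / 60000.0 > THRESHOLD_MINUTES
```

The `now_ms` parameter exists only so the decision logic can be exercised with a fixed clock.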

How do you kill that? Easy.

 kill_output = subprocess.check_output(
     ['yarn', 'application', '--kill', app_id],
     universal_newlines=True)

That should be it in terms of killing the step/application.
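Putting the pieces together, the whole watchdog could look roughly like this. The structure and names are my own sketch; the `run` callable is injectable only so the logic can be tested without a real cluster:

```python
import subprocess
import time

THRESHOLD_MINUTES = 180  # hypothetical 3-hour limit

def watchdog(run=lambda cmd: subprocess.check_output(cmd, universal_newlines=True),
             now_ms=None):
    """List RUNNING YARN apps, check their age, kill those past the threshold.

    Returns the list of killed application IDs."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    killed = []
    listing = run(['yarn', 'application', '--list', '--appStates', 'RUNNING'])
    for line in listing.splitlines():
        if not line.startswith('application_'):
            continue  # skip the header lines
        app_id = line.split()[0]
        status = run(['yarn', 'application', '--status', app_id])
        for field in status.splitlines():
            key, _, value = field.partition(':')
            if key.strip() == 'Start-Time':
                start_ms = int(value.strip())
                if (now_ms - start_ms) / 60000.0 > THRESHOLD_MINUTES:
                    run(['yarn', 'application', '--kill', app_id])
                    killed.append(app_id)
    return killed

if __name__ == '__main__':
    watchdog()
```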

Approach automation

AWS EMR has a nice feature called "bootstrap actions". These run a set of commands/scripts while the EMR cluster is being created, and can be used to automate this approach.

Add a bash script as a bootstrap action that will:

  • upload the Python script you just wrote to the cluster (master node)
  • add the Python script to crontab
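The bootstrap script itself could be a sketch like the following. The bucket name, script path, and schedule are placeholders for your own setup; it assumes the watchdog script was uploaded to S3 beforehand, and uses the instance metadata file EMR writes on each node to run only on the master:

```shell
#!/bin/bash
# Hypothetical bootstrap action: install the watchdog on the master node.
set -euo pipefail

# Bootstrap actions run on every node; only the master should host the watchdog.
if grep -q '"isMaster": true' /mnt/var/lib/info/instance.json; then
    aws s3 cp s3://your-bucket/yarn_watchdog.py /home/hadoop/yarn_watchdog.py
    # Run the watchdog every 5 minutes, keeping any existing crontab entries.
    (crontab -l 2>/dev/null; echo "*/5 * * * * python3 /home/hadoop/yarn_watchdog.py") | crontab -
fi
```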

That should be it.

P.S. I assumed that Python 3 is available for this purpose.


Source: https://habr.com/ru/post/1267013/

