Hadoop streaming failure: process exit with non-zero status 137

I've been banging my head against this for several days, and I'm hoping someone out there will have an idea.

I wrote a streaming map-reduce job in Perl that is prone to having one or two reduce tasks take an extremely long time. This is due to a natural asymmetry in the data: some of the reduce keys have over a million rows, while most have only a few dozen.

I have had trouble with long tasks before, and I was incrementing counters throughout to make sure map-reduce didn't time them out. But now they are failing with an error message I haven't seen before:

    java.io.IOException: Task process exit with nonzero status of 137.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

This is not the standard timeout error message, but exit code 137 = 128 + 9 suggests that my reducer script received a kill -9 (SIGKILL) from Hadoop. The tasktracker log contains the following:

    2011-09-05 19:18:31,269 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201109051336_0003_m_000029_1,7) failed :
    org.mortbay.jetty.EofException
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
        at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
        at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
        at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2940)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
    Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
        at sun.nio.ch.IOUtil.write(IOUtil.java:43)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
        at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
        ... 24 more
    2011-09-05 19:18:31,289 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.92.8.202:50060, dest: 10.92.8.201:46436, bytes: 7340032, op: MAPRED_SHUFFLE, cliID: attempt_201109051336_0003_m_000029_1
    2011-09-05 19:18:31,292 ERROR org.mortbay.log: /mapOutput
    java.lang.IllegalStateException: Committed
        at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
        at org.mortbay.jetty.Response.sendError(Response.java:240)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2963)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

I have been digging through the forums, and it seems that these warnings often show up in jobs that run without error and can usually be ignored. The error makes it look like the reducer lost contact with the map output, but I can't figure out why. Does anyone have any ideas?

Here is a potentially important fact: these long-running tasks were making my job take days when it should take minutes. Since I can live without the output for one or two keys, I tried to implement a ten-minute timeout in my reducer, as follows:

    eval {
        local $SIG{ALRM} = sub {
            print STDERR "Processing of new merchant names in $prev_merchant_zip timed out...\n";
            print STDERR "reporter:counter:tags,failed_zips,1\n";
            die "timeout";
        };
        alarm 600;

        # Code that could take a long time to execute

        alarm 0;
    };
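
For completeness, the usual way to finish this pattern off is to check $@ after the eval; roughly like the sketch below (trimmed down, so it is not the exact code from my reducer):

    # Rough sketch of the post-eval check; the "timeout" match corresponds to
    # the die message in the ALRM handler above.
    if ($@) {
        die $@ unless $@ =~ /timeout/;   # re-throw anything that isn't the alarm
        # otherwise: skip the rest of this key and move on to the next one
    }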

This timeout code works like a charm when I test the script locally, but these strange map-reduce failures only started after I introduced it. However, the failures seem to occur well after the first timeout, so I'm not sure whether it's related.
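
(In case it matters: the counter trick I mentioned above is just the streaming reporter protocol written to STDERR. Below is a rough sketch with placeholder group and counter names; as far as I understand, lines in this format count as progress and keep the task from being timed out.)

    # Keep-alive sketch: emit reporter lines periodically so that a huge key
    # still registers progress. Group/counter names are placeholders.
    my $lines = 0;
    while (my $line = <STDIN>) {
        # ... actual reduce work on $line ...
        if (++$lines % 10_000 == 0) {
            print STDERR "reporter:status:processed $lines lines\n";
            print STDERR "reporter:counter:progress,lines_processed,10000\n";
        }
    }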

Thanks in advance for your help!

+6
3 answers

Two possibilities come to mind:

  • RAM usage: if a task uses too much RAM, the OS can kill it (possibly after some horrendous swapping); see the sketch after this list.
  • Are you using any non-reentrant libraries? Perhaps the alarm is firing at an unfortunate point in the middle of a library call.
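
On the RAM point: with reduce keys that span over a million rows, a streaming reducer that buffers every value for the current key can easily grow big enough to be killed. A rough Perl sketch of the incremental alternative, assuming tab-separated key/value input and a count-style aggregation (your real reducer no doubt does something more interesting):

    #!/usr/bin/perl
    # Sketch: aggregate incrementally instead of buffering all values per key.
    use strict;
    use warnings;

    my ($prev_key, $count) = (undef, 0);

    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;

        if (defined $prev_key && $key ne $prev_key) {
            print "$prev_key\t$count\n";    # flush the finished key
            $count = 0;
        }
        $prev_key = $key;
        $count++;                           # O(1) memory, even for a million-row key
        # push @buffer, $value;             # buffering like this is what blows up RAM
    }
    print "$prev_key\t$count\n" if defined $prev_key;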
+10

Exit code 137 is the typical signature of the infamous OOM killer. You can easily check this with dmesg, looking for messages like these:

    [2094250.428153] CPU: 23 PID: 28108 Comm: node Tainted: GCO 3.16.0-4-amd64 #1 Debian 3.16.7-ckt20-1+deb8u2
    [2094250.428155] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    [2094250.428156] ffff880773439400 ffffffff8150dacf ffff881328ea32f0 ffffffff8150b6e7
    [2094250.428159] ffff881328ea3808 0000000100000000 ffff88202cb30080 ffff881328ea32f0
    [2094250.428162] ffff88107fdf2f00 ffff88202cb30080 ffff88202cb30080 ffff881328ea32f0
    [2094250.428164] Call Trace:
    [2094250.428174] [<ffffffff8150dacf>] ? dump_stack+0x41/0x51
    [2094250.428177] [<ffffffff8150b6e7>] ? dump_header+0x76/0x1e8
    [2094250.428183] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
    [2094250.428186] [<ffffffff8114088d>] ? oom_kill_process+0x21d/0x370
    [2094250.428188] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
    [2094250.428193] [<ffffffff811a053a>] ? mem_cgroup_oom_synchronize+0x52a/0x590
    [2094250.428195] [<ffffffff8119fac0>] ? mem_cgroup_try_charge_mm+0xa0/0xa0
    [2094250.428199] [<ffffffff81141040>] ? pagefault_out_of_memory+0x10/0x80
    [2094250.428203] [<ffffffff81057505>] ? __do_page_fault+0x3c5/0x4f0
    [2094250.428208] [<ffffffff8109d017>] ? put_prev_entity+0x57/0x350
    [2094250.428211] [<ffffffff8109be86>] ? set_next_entity+0x56/0x70
    [2094250.428214] [<ffffffff810a2c61>] ? pick_next_task_fair+0x6e1/0x820
    [2094250.428219] [<ffffffff810115dc>] ? __switch_to+0x15c/0x570
    [2094250.428222] [<ffffffff81515ce8>] ? page_fault+0x28/0x30

You can also easily check whether it is enabled:

    $ cat /proc/sys/vm/overcommit_memory
    0

Basically, the OOM killer tries to kill the process that is eating the most memory. And you probably don't want to disable it:

The OOM killer can be completely disabled with the following command. This is not recommended for production environments, because if an out-of-memory condition does present itself, there could be unexpected behavior depending on the available system resources and configuration. This unexpected behavior could be anything from a kernel panic to a hang, depending on the resources available to the kernel at the time of the OOM condition.

    sysctl vm.overcommit_memory=2
    echo "vm.overcommit_memory=2" >> /etc/sysctl.conf

The same situation can happen if you use, for example, cgroups to limit memory. When the process exceeds the specified limit, it will be killed without warning.
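
Either way, the 128 + signal arithmetic behind status 137 is easy to confirm outside of Hadoop. Here is a small Perl sketch: it sends SIGKILL (the same signal the OOM killer uses) to a harmless sleeping child and decodes the wait status:

    #!/usr/bin/perl
    # Sketch: show that a SIGKILL-ed child corresponds to exit status 137.
    use strict;
    use warnings;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exec 'sleep', '60' or die "exec failed: $!";
    }

    kill 'KILL', $pid;          # what the OOM killer does to its victim
    waitpid($pid, 0);

    my $signal = $? & 127;      # low bits of the wait status hold the signal (9 = SIGKILL)
    print "terminated by signal $signal\n";
    print "reported as exit status ", 128 + $signal, "\n";    # prints 137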

+5

I got this error too. I lost a day on it, and in the end it turned out to be an infinite loop somewhere in the code.

0

Source: https://habr.com/ru/post/896664/

