I banged my head about this for several days, and I hope that someone there will have an idea.
I wrote that a stream map reduces perl performance, which is prone to the fact that one or two reduction tasks take a lot of time. This is due to the natural asymmetry of the data: some of the decrease keys have more than a million lines, most of which have only a few dozen.
I had problems with long tasks before, and I incremented the counts at all to ensure that the map reduction does not expire. But now they fail with an error message that I have not seen before:
java.io.IOException: Task process exit with nonzero status of 137. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
This is not a standard timeout error message, but error code 137 = 128 + 9 indicates that my script reducer received kill -9 from Hadoop. The tasktracker log contains the following:
2011-09-05 19:18:31,269 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201109051336_0003_m_000029_1,7) failed : org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2940) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:324) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72) at sun.nio.ch.IOUtil.write(IOUtil.java:43) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169) at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221) at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721) ... 24 more 2011-09-05 19:18:31,289 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.92.8.202:50060, dest: 10.92.8.201:46436, bytes: 7340032, op: MAPRED_SHUFFLE, cliID: attempt_201109051336_0003_m_000029_1 2011-09-05 19:18:31,292 ERROR org.mortbay.log: /mapOutput java.lang.IllegalStateException: Committed at org.mortbay.jetty.Response.resetBuffer(Response.java:994) at org.mortbay.jetty.Response.sendError(Response.java:240) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2963) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:324) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
I have been looking through the forums, and it seems that warnings are usually found in tasks that run without errors and can usually be ignored. The error pretends that the reducer has lost contact with the output of the card, but I cannot understand why. Does anyone have any ideas?
Here's a potentially important fact: these lengthy tasks made my work take days, where it should take minutes. Since I can live without the output of one or two keys, I tried to implement a ten minute timeout in my reducer as follows:
eval { local $SIG{ALRM} = sub { print STDERR "Processing of new merchant names in $prev_merchant_zip timed out...\n"; print STDERR "reporter:counter:tags,failed_zips,1\n"; die "timeout"; }; alarm 600;
This timeout code works like a charm when I test the script locally, but strange transcoding errors start after I introduced it. However, failures seem to occur after the first timeout, so I'm not sure if this is related.
Thanks in advance for your help!