If you are seeing this error all of a sudden, it is most likely caused by clock drift on your virtual machines. All virtual machines are prone to temporary clock drift.
On a long-running cluster, system time can drift by several minutes if it is not kept in sync with a known good time source. As a result, each of your cluster nodes, relying on its own system clock, can slowly drift apart from the others.
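If you want to confirm that clock drift is the culprit, a rough check is to read the clock on every node and compare the results. This is only a sketch: it assumes passwordless SSH to each node and a hypothetical nodes.txt file listing your hostnames.

    # Print the current Unix timestamp on every node; large differences indicate drift
    for host in $(cat nodes.txt); do
        printf '%s: ' "$host"
        ssh "$host" date +%s
    done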
Your Hadoop jobs may keep running successfully as long as the drift stays small enough to go unnoticed. However, on a long-running cluster, once a worker's clock drifts so far from the master's clock that the difference exceeds the 10-minute interval, tasks stop completing: the YARN containers allocated on those workers are marked EXPIRED as soon as the AM receives them.
The key part:
"For each container, if the corresponding NM does not report to the RM that the container has started running within a configured interval of time, by default 10 minutes, the container is deemed dead and is expired by the RM."
You can learn more about how the YARN ResourceManager allocates and expires containers here: http://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/
So, jobs will run again if you increase yarn.resourcemanager.rm.container-allocation.expiry-interval-ms in the yarn-site.xml configuration file, as shown below.
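For example, a yarn-site.xml entry raising the interval from the 10-minute default to 20 minutes might look like the following. The 1200000 ms value is only an illustration; pick something comfortably larger than the drift you are seeing.

    <!-- Give NMs more time to report container launch before the RM expires the allocation -->
    <property>
      <name>yarn.resourcemanager.rm.container-allocation.expiry-interval-ms</name>
      <value>1200000</value>  <!-- 20 minutes; the default is 600000 (10 minutes) -->
    </property>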
But this is only a temporary solution.
To fix the actual problem, you need a clock synchronization mechanism such as NTP.
NTP keeps the clocks of your master and worker nodes synchronized against global time servers.
You need to make sure that the NTP daemon is installed and running on all nodes of the cluster, and that it stays "synchronised" (check with ntpstat) for the entire life of the cluster; a quick verification sketch follows the list below. Some obvious issues that can knock NTP out of sync:
- Your firewall may be blocking UDP port 123, which NTP uses.
- Your nodes may be in an Active Directory environment that enforces its own time synchronization, conflicting with NTP.
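A minimal per-node check, assuming a classic ntpd setup (on some distributions the service is named ntp rather than ntpd, and chrony-based setups use chronyc instead):

    # Confirm the daemon is running and the clock is synchronised
    systemctl status ntpd     # or: service ntpd status
    ntpstat                   # should report "synchronised to NTP server ..."
    ntpq -p                   # show the peers being used and their offsets
    # Query a public server without setting the clock, to verify UDP 123 is not blocked
    ntpdate -q pool.ntp.org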
See also the NTP troubleshooting guide: http://support.ntp.org/bin/view/Support/TroubleshootingNTP