Too many open files in EMR

I get the following exception in my reducers:

EMFILE: Too many open files
	at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
	at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
	at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296)
	at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

About 10,000 files are created in the reducer. Is there a way I can set the ulimit on each node?

I tried using the following command as a bootstrap script: ulimit -n 1000000

But it did not help.

I also tried the following in a bootstrap action to replace the ulimit command in /usr/lib/hadoop/hadoop-daemon.sh:

 #!/bin/bash
 set -e -x
 sudo sed -i -e "/^ulimit /s|.*|ulimit -n 134217728|" /usr/lib/hadoop/hadoop-daemon.sh

But even after logging into the master node, I see that ulimit -n still returns 32768. I also confirmed that the desired change was made to /usr/lib/hadoop/hadoop-daemon.sh, which now contains: ulimit -n 134217728.

Is there a Hadoop configuration option for this? Or is there a workaround?

My main goal is to split records into files according to each record's identifier; there are 1.5 billion records right now, and that number will certainly grow.

Is it possible to edit this file before the daemon is launched on each slave node?

+4
4 answers

OK, it seems the ulimit is set by the Amazon EMR setup by default: 32768 is already a lot, and if any job needs more than that, you should reconsider your logic. So instead of writing each file directly to S3, I wrote them locally and moved them to S3 in batches of 1024 files. This solved the "too many open files" problem.
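In case it's useful, here is a rough sketch of what I mean by batching. This is not my actual code; the class, paths, and batch size are made up, and it assumes the Hadoop FileSystem API is used for the S3 copy:

// Sketch: write output files to local disk first, then push them to S3 in
// batches so that only a bounded number of descriptors is open at once.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchedS3Uploader {
    private static final int BATCH_SIZE = 1024;

    private final FileSystem s3Fs;   // file system backing the S3 destination
    private final Path s3Dir;        // e.g. s3://my-bucket/output/ (hypothetical)
    private final List<Path> pending = new ArrayList<Path>();

    public BatchedS3Uploader(Configuration conf, Path s3Dir) throws IOException {
        this.s3Fs = s3Dir.getFileSystem(conf);
        this.s3Dir = s3Dir;
    }

    /** Call after each local file has been written and closed. */
    public void add(Path localFile) throws IOException {
        pending.add(localFile);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    /** Copy the pending local files to S3 and clear the batch. */
    public void flush() throws IOException {
        for (Path local : pending) {
            // copyFromLocalFile(delSrc, src, dst): delete the local copy after upload
            s3Fs.copyFromLocalFile(true, local, new Path(s3Dir, local.getName()));
        }
        pending.clear();
    }
}

In the reducer you would call flush() one last time in close()/cleanup() so the final partial batch is uploaded as well.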

Perhaps when file descriptors were opened for writing to S3 they were not freed/closed promptly, as they are when writing to local files. A better explanation is welcome.

+2

Perhaps there is a way to do this with bootstrap actions, in particular one of the predefined ones. And if the predefined ones don't work, custom scripts can do anything you could normally do on a Linux cluster. But first I would ask why you are outputting so many files. HDFS/Hadoop is definitely optimized for fewer, larger files. If you are hoping to do some kind of indexing, writing raw files with different names is probably not the best approach.

+1

I had this problem, but on a regular Linux setup.

I solved it by following the steps here:

http://www.cyberciti.biz/faq/linux-unix-nginx-too-many-open-files/

0

I think the right approach here is to use a single SequenceFile whose values are your individual files, keyed by file name. Splitting records into files is fine, but those files can be stored as blobs, keyed by file name, inside one large SequenceFile.
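To make this concrete, here is a minimal sketch of writing blobs into one SequenceFile keyed by file name. The output path and contents are placeholders; in a real job the writer would sit inside your reducer:

// Sketch: store each would-be output file as one record in a single
// SequenceFile, keyed by file name, with the contents as a BytesWritable blob.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlobsToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path(args[0]);              // e.g. hdfs:///user/hadoop/blobs.seq
        FileSystem fs = out.getFileSystem(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            // In a real job these would come from your records grouped by identifier.
            for (int id = 0; id < 10; id++) {
                byte[] contents = ("contents of file " + id).getBytes("UTF-8");
                writer.append(new Text("file-" + id),
                              new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

This way only one output file is open per reducer, no matter how many logical files you pack into it.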

0
