Python dependency management in EMR

I am sending code to Amazon EMR via the mrjob / boto modules. I have some external Python dependencies (e.g. numpy, boto) and currently have to download the source code of each package and send it as a tarball in the "python_archives" field of the mrjob.conf file.

This makes dependency management messier than I would like, and I wonder whether I can instead reuse the same requirements.txt file that I use for my virtualenv installation to load an EMR instance with my dependencies. Is it possible to set up virtualenv on EMR instances and do something like:

pip install -r requirements.txt 

as I would locally?

2 answers

One way to achieve this is to use a bootstrap action. Bootstrap actions let you run shell scripts on the cluster instances at startup.

If you have a Python setup script that does something like:

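    # Generate pip.sh, a shell script that installs every requirement with pip.
    with open("requirements.txt") as requirements, open("pip.sh", "w") as shell_script:
        # -y keeps apt-get from waiting for confirmation during bootstrap.
        shell_script.write("sudo apt-get install -y python-pip\n")
        for line in requirements:
            line = line.strip()
            if line:  # skip blank lines
                shell_script.write("sudo pip install -I " + line + "\n")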

Then you can simply run the generated pip.sh as a bootstrap action, without uploading your requirements.txt.
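One caveat with writing requirement lines straight into a shell script: a version specifier such as numpy>=1.6 contains ">", which the shell treats as output redirection. The following is a minimal sketch of a more defensive generator, not a definitive implementation. The function name build_bootstrap_script is hypothetical, and it assumes a simple requirements.txt (one specifier per line, no line continuations) and a Debian-style image where apt-get is available, as in the answer above.

```python
import re
import shlex

# Same idea as pip's own comment handling: "#" starts a comment only at the
# beginning of a line or after whitespace.
COMMENT_RE = re.compile(r"(^|\s+)#.*$")


def build_bootstrap_script(requirement_lines):
    """Return the text of a shell script that pip-installs each requirement."""
    commands = ["#!/bin/bash", "set -e", "sudo apt-get install -y python-pip"]
    for raw in requirement_lines:
        spec = COMMENT_RE.sub("", raw).strip()
        if not spec or spec.startswith("-"):
            # Skip blank lines, comments, and pip options such as "-r other.txt".
            continue
        # Quote the specifier so ">" or "<" is not read as a shell redirect.
        commands.append("sudo pip install -I " + shlex.quote(spec))
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    sample = ["numpy>=1.6\n", "# a comment\n", "\n", "boto\n"]
    print(build_bootstrap_script(sample), end="")
```

In practice you would pass the open requirements.txt file object instead of the sample list and write the returned text to pip.sh.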


If you are using mrjob, I have had some success by simply placing pip calls directly in my .mrjob.conf file as bootstrap commands. This is not as elegant as using a requirements.txt file (it installs the same modules for all of your jobs). For example, my conf file looks like this:

    runners:
      emr:
        aws_access_key_id: xx
        aws_secret_access_key: xx
        ec2_key_pair: xx
        ec2_key_pair_file: xx
        ssh_tunnel_to_job_tracker: true
        bootstrap_cmds:
        - sudo apt-get install -y python-pip
        - sudo pip install pgnparser
        - sudo pip install boto

and it will install pgnparser and boto for use in my mrjob job.
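If you would rather not maintain that list by hand, the bootstrap_cmds entries could be generated from a list of requirement specifiers. This is only a rough sketch: the helper bootstrap_cmds_yaml is a hypothetical name, and it relies on the fact that a JSON string is also a valid YAML double-quoted scalar, so no YAML library is needed.

```python
import json
import shlex


def bootstrap_cmds_yaml(requirements, indent="    "):
    """Render requirement specifiers as YAML list items for bootstrap_cmds."""
    cmds = ["sudo apt-get install -y python-pip"]
    # Shell-quote each specifier so that e.g. "numpy>=1.6" is not a redirect.
    cmds += ["sudo pip install " + shlex.quote(req) for req in requirements]
    # json.dumps yields a double-quoted string that YAML parses as a scalar.
    return "\n".join(indent + "- " + json.dumps(cmd) for cmd in cmds)


if __name__ == "__main__":
    print("    bootstrap_cmds:")
    print(bootstrap_cmds_yaml(["pgnparser", "boto"]))
```

The printed lines can be pasted under the emr runner section of the conf file, adjusting the indent argument to match the nesting used there.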


Source: https://habr.com/ru/post/949055/

