Streaming Data: Importing Modules on EMR

This previous question dealt with how to import modules such as nltk for use in Hadoop Streaming jobs.

The steps described are:

    zip -r nltkandyaml.zip nltk yaml
    mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod

Now you can import the nltk and yaml modules for use in your Python script:

    import zipimport

    importer = zipimport.zipimporter('nltkandyaml.mod')
    yaml = importer.load_module('yaml')
    nltk = importer.load_module('nltk')
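
For context, here is a minimal sketch (not from the original post) of how such a streaming mapper might look end to end; it assumes the archive is shipped alongside the mapper script, and the word-count logic is only an illustration:

    #!/usr/bin/env python
    # Minimal streaming-mapper sketch (an assumption, not from the original post):
    # load yaml and nltk from the zipped archive, then read records from stdin
    # and emit tab-separated key/value pairs, as Hadoop Streaming expects.
    import sys
    import zipimport

    importer = zipimport.zipimporter('nltkandyaml.mod')
    yaml = importer.load_module('yaml')
    nltk = importer.load_module('nltk')

    for line in sys.stdin:
        # toy example: emit each whitespace-separated token with a count of 1
        for token in line.split():
            print('%s\t%s' % (token, 1))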

I have a job that I want to run on Amazon EMR, and I'm not sure where to put the archived files. Do I need to create a loading script under the formatting options, or should I put the tar.gz in S3 and then pass it in the additional arguments? I am new to this, and an answer that could walk me through the process would be greatly appreciated.

1 answer

You have the following options:

  • Create a bootstrap script action and place it on S3. This script can download the module in whatever format you prefer and put it where it will be available to your mapper/reducer. To find out exactly where to place the files, start a cluster configured not to shut down after the job completes, SSH into it, and look at the directory structure. (A rough sketch is given below.)

  • Use mrjob to run your workflows. When launching a job with mrjob, you can specify bootstrap_python_packages, and mrjob will install the packages automatically by unpacking each .tar.gz and running setup.py install. (See the sketch below.)

http://packages.python.org/mrjob/configs-runners.html

I would prefer option 2, because mrjob also helps a great deal when developing MapReduce jobs in Python. In particular, it lets you run jobs locally (with or without Hadoop) as well as on EMR, which simplifies debugging.
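
For option 1, a bootstrap action is just a script that EMR runs on every node before the job starts. As a rough sketch (the package names, the use of easy_install, and passwordless sudo are my assumptions, not something the answer specifies), it could simply install the packages system-wide:

    #!/usr/bin/env python
    # Hypothetical bootstrap action: install the Python packages on each node
    # so mappers/reducers can simply "import nltk" / "import yaml".
    # easy_install and passwordless sudo are assumed to be available on the AMI.
    import subprocess

    for pkg in ('pyyaml', 'nltk'):
        subprocess.check_call(['sudo', 'easy_install', pkg])

Upload this script to S3 and reference it as a bootstrap action when creating the job flow.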
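
For option 2, a minimal mrjob job looks roughly like the sketch below (the word-count logic and file name are mine, not from the answer):

    # mr_word_count.py -- minimal mrjob job; run locally with
    #   python mr_word_count.py input.txt
    # or on EMR with
    #   python mr_word_count.py -r emr input.txt
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

The package tarballs would then be listed under the EMR runner's bootstrap_python_packages option in mrjob's config file (see the link above for the exact format), so they are installed on the nodes before the job runs.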

