I'm trying to use mrjob to run a Hadoop job on EMR and can't figure out how to configure logging (user-generated logs in the map/reduce steps) so that I can access the logs after the cluster terminates.
I tried to configure logging with the logging module, print, and sys.stderr.write(), but so far no luck. The only option that works for me is writing the logs to a file, then SSHing into the machine and reading them, but that is cumbersome. I would like my logs to go to stderr/stdout/syslog and be collected automatically to S3, so I can view them after the cluster terminates.
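For reference, this is roughly how I launch the job. The S3 path is a placeholder, and I'm only assuming that the EMR runner's cloud_log_dir option (--cloud-log-dir on the command line; older mrjob versions call it s3_log_uri) is what controls where the cluster logs end up:

# Hypothetical launcher for the job defined in word_freq.py (shown below).
from word_freq import MRWordFreqCount

job = MRWordFreqCount(args=[
    '-r', 'emr',
    '--cloud-log-dir', 's3://my-bucket/logs/',  # placeholder bucket; assumed option name
    'input.txt',
])
with job.make_runner() as runner:
    runner.run()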
Here is the word_freq example with logging added:
"""The classic MapReduce job: count the frequency of words. """ from mrjob.job import MRJob import re import logging import logging.handlers import sys WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def mapper_init(self): self.logger = logging.getLogger() self.logger.setLevel(logging.INFO) self.logger.addHandler(logging.FileHandler("/tmp/mr.log")) self.logger.addHandler(logging.StreamHandler()) self.logger.addHandler(logging.StreamHandler(sys.stdout)) self.logger.addHandler(logging.handlers.SysLogHandler()) def mapper(self, _, line): self.logger.info("Test logging: %s", line) sys.stderr.write("Test stderr: %s\n" % line) print "Test print: %s" % line for word in WORD_RE.findall(line): yield (word.lower(), 1) def combiner(self, word, counts): yield (word, sum(counts)) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run()