Write to multiple outputs by key with Scalding on Hadoop, in one MapReduce job

Question

How can I write to multiple outputs, chosen by key, using Scalding (or Cascading) in a single MapReduce job? I could of course use .filter for every possible key, but that is a horrible hack that would launch many jobs.

+6

scala mapreduce hadoop cascading scalding

samthebest Jun 02 '14 at 12:16

3 answers

Use MultipleOutputFormat, and extrapolate from these other SO questions to write a custom output class that uses that output format: "Create Scalding Source like TextLine that combines multiple files into single mappers" and "Compress Output Scalding / Cascading TsvCompressed".

0

samthebest Jun 02

This discussion in the Cascading user group suggests using Cascading's TemplateTap. I am not sure how to hook this up to Scalding.

0

Sasha O Jun 02 '14 at 18:27

morazow · Accepted Answer · 2014-06-25 12:04

Scalding (version 0.9.0rc16 and later) has TemplatedTsv, which is built on Cascading's TemplateTap.

    Tsv(args("input"), ('COUNTRY, 'GDP))
      .read
      .write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
    // In Hadoop mode, this creates one directory per country under the "output" path.
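For context, the snippet above has to live inside a Scalding Job. A minimal sketch of what that might look like follows; the class name GdpByCountryJob is made up for illustration, and the --input and --output argument names are simply the ones used in the snippet.

```scala
import com.twitter.scalding._

// Hypothetical job name, for illustration only.
// Reads a two-column TSV (COUNTRY, GDP) and writes the rows through TemplatedTsv,
// so each distinct COUNTRY value gets its own sub-directory under the output path.
class GdpByCountryJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('COUNTRY, 'GDP))
    .read
    // "%s" is the path template; it is filled in from the 'COUNTRY field.
    .write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
}
```

Assuming the job is packaged in a jar together with Scalding, it would typically be launched through com.twitter.scalding.Tool with the --hdfs flag and the --input and --output arguments.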
