Hadoop MapReduce reads a dataset once for multiple jobs

I have a dataset made up of many small files (about 30-40 MB each on average). I want to run MapReduce analytics on them, but with each job the mapper reads the files all over again, which puts a heavy load on I/O performance (overhead, etc.).

I wanted to know whether it is possible to run the mapper once and distribute several different outputs to different reducers. From what I have read, multiple reducers per map output are not possible; the only option is job chaining. However, I want to run these jobs in parallel, not sequentially, since they will all use the same dataset as input and run different analytics on it. So, in general, what I want looks something like this:

             / Reducer = Analytics1
    Mapper --- Reducer = Analytics2
             \ Reducer = Analytics3 ...

Is this possible? Or do you have any suggestions for a workaround? Please give me some ideas. Re-reading these small files again and again creates huge overhead and degrades the performance of my analysis.

Thanks in advance!

Edit: I forgot to mention that I am using Hadoop v2.1.0-beta with YARN.

2 answers

You can:

  • Have the reducer(s) do all the analytics (1-3) in the same run/job. EDIT: From your comment, I see that this alternative is not useful to you, but I leave it here for future reference, as in some cases this can be done.
  • Use a more general model than MapReduce. For example, Apache Tez (still an incubator project) could suit your use case.

The Apache Tez project site and its design documents are useful starting points.

EDIT: Added the following regarding Alternative 1:

You can also have the mapper emit a key that indicates which analytics task the output is intended for. Hadoop automatically groups records by that key and sends them all to the same reducer. The value emitted by the mapper would be a tuple of the form <k,v>, where k is the key you originally intended to emit. So the mapper produces records of the form <k_analytics, <k,v>>. The reducer's reduce method reads the key and, depending on it, calls the corresponding analytics method (within your reducer class). This approach works, but only if your reducers do not have to deal with a huge amount of data, since you will likely need to hold it in memory (in a list or a hash table) while running the analytics (the <k,v> tuples are not sorted by their key). If that is more than your reducer can handle, the custom partitioner suggested by @praveen-sripati might be worth examining. A sketch of this approach follows.
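A minimal sketch of this tagged-key idea in Java, assuming TAB-separated text input; the class names, the "A1"/"A2"/"A3" tags, and the runAnalyticsN stubs are all illustrative, not from the original post:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Tags every record with the analytics task it is meant for; the original
    // <k,v> pair is packed into the value.
    public class TaggedMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] kv = line.toString().split("\t", 2); // assumed input format
            Text packed = new Text(kv[0] + "\t" + kv[1]);
            ctx.write(new Text("A1"), packed); // one copy per analytics task
            ctx.write(new Text("A2"), packed);
            ctx.write(new Text("A3"), packed);
        }
    }

    // In a separate file: dispatches on the tag. Each analytics method sees
    // every <k,v> pair for its task, unsorted, so it may have to buffer them.
    public class AnalyticsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text tag, Iterable<Text> pairs, Context ctx)
                throws IOException, InterruptedException {
            switch (tag.toString()) {
                case "A1": runAnalytics1(pairs, ctx); break;
                case "A2": runAnalytics2(pairs, ctx); break;
                case "A3": runAnalytics3(pairs, ctx); break;
            }
        }

        private void runAnalytics1(Iterable<Text> pairs, Context ctx)
                throws IOException, InterruptedException {
            long n = 0;                     // trivial placeholder analytics:
            for (Text ignored : pairs) n++; // count the records for this task
            ctx.write(new Text("A1-count"), new Text(Long.toString(n)));
        }

        private void runAnalytics2(Iterable<Text> pairs, Context ctx) { /* ... */ }
        private void runAnalytics3(Iterable<Text> pairs, Context ctx) { /* ... */ }
    }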

EDIT: As suggested by @judge-mental, alternative 1 can be improved further by having the mappers emit <<k_analytics, k>, value>; in other words, putting the original key inside the analytics-type part of the key rather than in the value, so that a reducer receives all the keys for a single analytics task grouped together and can stream over the values without keeping them in RAM.
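A sketch of that refinement (names again illustrative): the original key is folded into the map output key as "tag#k", so Hadoop groups by the composite key and each reduce() call streams over the values of exactly one (task, key) pair:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits <"tag#k", v> instead of <"tag", "k\tv">; the reducer splits the
    // key on '#' to recover the tag and the original key, and never has to
    // hold a whole task's records in memory.
    public class CompositeKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final String[] TAGS = {"A1", "A2", "A3"};

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] kv = line.toString().split("\t", 2); // assumed input format
            for (String tag : TAGS) {
                ctx.write(new Text(tag + "#" + kv[0]), new Text(kv[1]));
            }
        }
    }

If, in addition, all keys of one analytics task should land on the same reducer instance, this combines naturally with a partitioner on the tag prefix, as in the next answer.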


This may be possible with a custom partitioner. A custom partitioner redirects the mapper output to the appropriate reducer based on the key, so the map output keys would be something like R1*, R2*, R3*. The pros and cons of this approach need to be explored.
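A minimal sketch of such a partitioner, assuming the map output keys carry an R1/R2/R3 prefix as described (the class name is illustrative):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each key to a fixed reducer based on its analytics prefix.
    public class AnalyticsPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String k = key.toString();
            if (k.startsWith("R1")) return 0;
            if (k.startsWith("R2")) return 1 % numPartitions;
            return 2 % numPartitions; // R3 and anything else
        }
    }

In the driver, set job.setPartitionerClass(AnalyticsPartitioner.class) and job.setNumReduceTasks(3) so that each analytics task gets a reducer of its own.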

As mentioned, Tez is one of the alternatives, but it is still in the incubator phase.

