You can:
- Ask the reducer(s) to do all the analytics (1-3) in the same run/job. EDIT: From your comment, I see that this alternative is not useful for you, but I leave it here for future reference, since in some cases this can be done.
- Use a more general model than MapReduce. For example, Apache Tez (still an incubator project) can be used for your use case.
Some useful links to Apache Tez:
EDIT: Added the following regarding Alternative 1:
You can also force mapper to generate a key that indicates for which analytics the output is being processed. Hadoop automatically groups records by this key and sends them to all the same reducers. The value generated by the cartographers will be a tuple of the form <k,v> , where key ( k ) is the source key that you intended to generate. Thus, the converter generates records <k_analytics, <k,v>> . The gearbox has a gearbox method that reads the key, and depending on the key, calls the corresponding analytics method (within your gearbox class). This approach will work, but only if your gearboxes do not have to deal with a huge amount of data, since you probably need to store it in memory (in a list or hash table) while you are doing the analytics process (since <k,v> tuples are not sorted by their key). If this is not what your reducer can handle, then you might need to examine the custom delimiter suggested by @ praveen-sripati.
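A minimal sketch of this tagged-key approach, simulating the shuffle in-process in Python rather than against a real Hadoop cluster; the two analytics (word count and character count) and all function names are illustrative assumptions, not part of any Hadoop API:

```python
from collections import defaultdict

# Mapper: tag each record with the analytic it belongs to.
# Emits (k_analytics, (k, v)); k_analytics selects the analytic,
# (k, v) is the record that analytic actually needs.
def mapper(line):
    word = line.strip()
    yield ("word_count", (word, 1))
    yield ("char_count", (word, len(word)))

# Simulated shuffle: Hadoop groups records by key for us.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reducer: dispatch on the analytics key. All (k, v) tuples for
# one analytic arrive unsorted, so the analytic must buffer its
# state in memory -- the drawback discussed above.
def reducer(k_analytics, values):
    totals = defaultdict(int)
    for k, v in values:          # held in RAM per analytic
        totals[k] += v
    return dict(totals)

lines = ["hive", "pig", "hive"]
grouped = shuffle(p for line in lines for p in mapper(line))
results = {k: reducer(k, vs) for k, vs in grouped.items()}
# results["word_count"] -> {"hive": 2, "pig": 1}
```

In a real job the dispatch would typically be a switch on k_analytics inside one reduce method, calling a separate helper per analytic.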
EDIT: As suggested by @jud-mental, alternative 1 can be further improved by having the mappers emit <<k_analytics, k>, value> ; in other words, put the key inside the analytic-type part of the composite key rather than inside the value, so that each reducer receives all the keys for one analytic task grouped together and can perform streaming operations on the values without storing them in RAM.
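A sketch of this composite-key variant, again simulating Hadoop's sort/shuffle in-process; the analytics and names are the same illustrative assumptions as before. Because the framework sorts on the full (k_analytics, k) key, each group arrives consecutively and the reducer can stream it:

```python
from itertools import groupby

# Mapper: the composite key (k_analytics, k) carries both the
# analytic type and the record key; the value stays small.
def mapper(line):
    word = line.strip()
    yield (("word_count", word), 1)
    yield (("char_count", word), len(word))

# Simulated sort/shuffle: Hadoop sorts by the composite key, so
# all records for one (analytic, k) pair arrive consecutively.
def sort_phase(pairs):
    return sorted(pairs, key=lambda kv: kv[0])

# Streaming reduce: one running total per group, no per-analytic
# buffering, since the sort already did the grouping.
def reducer(stream):
    for (k_analytics, k), group in groupby(stream, key=lambda kv: kv[0]):
        yield k_analytics, k, sum(v for _, v in group)

lines = ["hive", "pig", "hive"]
out = list(reducer(sort_phase(p for line in lines for p in mapper(line))))
# out contains e.g. ("word_count", "hive", 2)
```

Note that on a real cluster you may also want the partitioner and grouping comparator to account for the composite key, so related groups land where you expect them.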
cabad