What are the advantages and disadvantages of working in Hadoop using different languages?

Question

I have used Pig and Java MapReduce exclusively to run jobs against a Hadoop cluster. I recently tried Python MapReduce through Hadoop Streaming, and that was pretty cool. All of this makes sense to me, but I'm a little foggy on when I would want to use one implementation over another. I use Java MapReduce almost exclusively when I need speed, but when would I ever want to use something like Python streaming instead of just writing the same thing in fewer, more understandable lines in Pig/Hive? In short, what are the pros and cons of each?

+6

mapreduce hadoop apache-pig

Eli Mar 05 '12 at 15:14

3 answers

Regarding Java versus Pig: I would use Pig in most cases (along with Java UDFs) for its flexibility, and let someone else (Pig) figure out the best way to split the work into maps, reduces, combiners, etc.

I use Java when I absolutely want to control every aspect of the work.

As for Python (or other languages), I would use it if the developers are more comfortable with those languages. Note that you can also mix Pig and streaming.

+2

Arnon Rotem-Gal-Oz Mar 05 '12 at 20:12

There is Scala, in which you can write much simpler code for your jobs. For example, check: https://github.com/NICTA/scoobi

You may have some incentive to use C++ for tasks that are more memory- or CPU-intensive. You can read what Hypertable wrote about its choice of C++: http://code.google.com/p/hypertable/wiki/WhyWeChoseCppOverJava

Java is also problematic on the serialization side, as it creates a new object for every object it reads from the input stream. Be careful not to use Java serialization simply because you have a Java implementation.

+1

Guy Mar 05 '12 at 20:54

David Gruzman · Accepted Answer · 2012-03-05T18:09:17+0000

I will address Java vs. Python first, and then MR vs. Hive/Pig separately, since I see these as two different problems.
Hadoop is built around Java: many of its capabilities are exposed through the Java API, and Hadoop is extended mostly with Java classes.

Hadoop does have the ability to run MR jobs written in other languages; this is called streaming. This model lets us define only a mapper and a reducer, with some limitations that are not present in Java. At the same time, input/output formats and other plugins still have to be written as Java classes. Therefore, I would frame the decision as follows: a) Use Java unless you have a serious existing code base in another language that needs to run inside your MR job. b) Consider using Python when you need to create some simple ad hoc jobs.
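
As a rough illustration of the streaming model described above, here is a minimal word-count sketch in Python. The names `map_lines` and `reduce_lines` and the "map"/"reduce" command-line argument are made up for this example. The only thing assumed from Hadoop Streaming is its default contract: the mapper and reducer read lines from standard input, write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is sorted by key, which is what the shuffle
    phase hands to a streaming reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention for this sketch: the script does nothing unless
# it is given "map" or "reduce" as its first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

On a cluster, a script like this would be supplied through the streaming jar's `-mapper` and `-reducer` options. Locally, the shuffle can be approximated with `sorted`, for example `list(reduce_lines(sorted(map_lines(["a b a"]))))`.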

As for Pig/Hive, these are also Java-centric, higher-level systems. Hive can be used without any programming at all, but it can be extended with Java. Pig requires Java from the start. I think these systems are almost always preferable to MR whenever they can be applied. These are usually cases where the processing is SQL-like.

Performance of streaming relative to native Java. Streaming feeds input to the mapper through its standard input. This is inter-process communication, which is inherently less efficient than the in-process data passing between the record reader and the mapper in the Java case.
From this I can draw the following conclusions: a) For light processing (for example, searching for a substring, counting, ...), this overhead can be significant, and the Java solution will be more efficient.
b) For heavy processing that can be implemented more efficiently in a language other than Java, the streaming solution may have some edge.
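
The overhead argument above can be made tangible with a small local experiment. This is only a sketch, not a Hadoop benchmark: it runs the same light task (counting lines that contain a substring) once as an in-process function call and once by piping the data into a child Python process. The function names and the `needle` test data are invented for the example, and the piped variant also pays process start-up cost, which a long-running streaming task would pay only once.

```python
import subprocess
import sys
import time

# Child process: counts lines containing "needle" read from its stdin.
CHILD = "import sys; print(sum('needle' in line for line in sys.stdin))"


def count_in_process(lines):
    """Light processing done directly, with no process boundary."""
    return sum("needle" in line for line in lines)


def count_via_pipe(lines):
    """Same processing, but the data crosses a pipe into another process."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input="".join(lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def timed(fn, lines):
    """Return (result, elapsed seconds) for one call of fn(lines)."""
    start = time.perf_counter()
    value = fn(lines)
    return value, time.perf_counter() - start


if __name__ == "__main__":
    data = [
        f"row {i} {'needle' if i % 10 == 0 else 'hay'}\n"
        for i in range(200_000)
    ]
    for fn in (count_in_process, count_via_pipe):
        value, seconds = timed(fn, data)
        print(f"{fn.__name__}: {value} matches in {seconds:.3f}s")
```

Both variants return the same count; only the timings differ, and the gap matters most when the per-record work is as cheap as it is here.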

Pig/Hive performance considerations.
Pig and Hive implement SQL processing primitives. In other words, they implement the elements of an execution plan from the RDBMS world. These implementations are good and well tuned. At the same time, Hive (the one I know better) is an interpreter. It does not generate code; it executes the execution plan inside pre-built MR jobs. This means that if you have complicated conditions and write code specifically for them, that code has every chance of doing much better than Hive, which reflects the performance advantage of a compiler over an interpreter.
