I will relate separately to Java vs Python, and then separately relate to MR vs Hive / Pig - since I see this as two different problems
Hadoop is built around Java, and many of its features are available through the Java API, and Hadoop can basically be extended using Java classes.
Hadoop do has the ability to work with MR jobs created in other languages โโ- it is called streaming. This model allows us to define a mapping and a reducer with some limitations that are not present in java. At the same time, I / O formats and other plugins should be written as Java classes. Therefore, I would define decision making as follows: a) Use Java if you do not have a serious code base that needs to be executed in your MR task. b) Consider using python when you need to create some simple special tasks.
As for Pig / Hive, these are also higher-level java-oriented systems. Hive can be used without any programming at all, but it can be extended with java. Pigs require Java from the start. I think that these systems are almost always preferable to MR when they can be applied. These are usually cases where processing is similar to SQL.
Performance ratio between streaming and native Java. Streaming supplies the input signal to the converter through its input stream. This is interprocess communication, which is inherently less efficient than in-process data transferred between a reader and a display device in the case of java.
I can draw the following conclusions from above: a) In the case of some light processing (for example, to search for a substring, count ...), these overheads can be significant, and the java solution will be more efficient.
b) In the case of some heavy processing, which can potentially be implemented in a language other than Java, more efficiently - the streaming solution may have some edge.
Pig / Hive performance considerations.
Pig / Hive implements SQL processing primitives. In other words, they implement elements of an implementation plan in the RDBMS world. These implementations are good and well tuned. At the same time, the hive (something that I know better) is the interpreter. It does not generate code generation - it is a execution plan inside pre-created MR tasks. This means that if you have complex difficult conditions and you will write code specifically for them - it has every chance to do much better than Hive - representing the performance advantages of the compiler and interpreter.
source share