Fine-tuning Pig for local execution

I use Pig Latin for processing logs, because its expressiveness pays off when the data is not large enough to justify setting up a whole Hadoop cluster. I run Pig in local mode, but it does not seem to use all the cores available to it (16 at the moment): CPU monitoring shows at most 200% CPU usage, i.e., roughly two cores.

Is there a tutorial or any recommendations for fine-tuning Pig for local execution? I am sure all the mappers could use all available cores with some simple configuration. (In my script I have already set the default_parallel parameter to 20; a minimal sketch follows.)
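Something like this minimal sketch, where the input path and schema are simplified placeholders rather than my real script:

    -- only the SET line is from my actual script; the rest is illustrative
    SET default_parallel 20;
    logs   = LOAD 'access.log' USING PigStorage('\t')
             AS (ts:chararray, level:chararray, msg:chararray);
    errors = FILTER logs BY level == 'ERROR';
    counts = FOREACH (GROUP errors BY msg) GENERATE group AS msg, COUNT(errors) AS n;
    DUMP counts;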

Thanks.

+4
1 answer

The Pig documentation makes it clear that local mode runs single-threaded, using different code paths for certain operations that would otherwise use a distributed sort. So tuning Pig's local mode looks like the wrong solution to the problem at hand.

Have you considered running a local "pseudo-distributed" cluster instead of investing in a full cluster setup? You can follow the Hadoop instructions for pseudo-distributed operation, then point Pig at localhost. This should get you the desired result at the cost of a two-step startup and shutdown.
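For reference, a sketch of what "pointing Pig at localhost" can look like, assuming classic Hadoop 1.x (MRv1) property names and the conventional ports from the Hadoop pseudo-distributed docs:

    <!-- $HADOOP_HOME/conf/core-site.xml -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- $HADOOP_HOME/conf/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

After formatting the namenode (hadoop namenode -format) and running start-all.sh, launching your script with pig -x mapreduce will submit jobs to this local cluster; that is the two-step startup and shutdown mentioned above.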

You want to increase the default number of map and reduce tasks so that they consume all the cores available on your machine. Fortunately, this is fairly well documented (admittedly in the cluster configuration documentation); just define mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in your local copy of $HADOOP_HOME/conf/mapred-site.xml, as sketched below.
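For a 16-core machine that could look roughly like this; the 16/8 split between map and reduce slots is my assumption, not a documented recommendation, so tune it to your workload:

    <!-- $HADOOP_HOME/conf/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>16</value> <!-- assumption: one map slot per core -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>8</value> <!-- assumption: fewer reduce slots than cores -->
      </property>
    </configuration>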

+5

Source: https://habr.com/ru/post/1332344/

