Local pig mode, group or join = java.lang.OutOfMemoryError: Java heap space

Question


Using Apache Pig version 0.10.1.21 (reported), CentOS release 6.3 (Final), jdk1.6.0_31 (The Hortonworks Sandbox v1.2 on Virtualbox with 3.5 GB of RAM)

    $ cat data.txt
    11,11,22
    33,34,35
    47,0,21
    33,6,51
    56,6,11
    11,25,67

    $ cat GrpTest.pig
    A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
    B = GROUP A BY f1;
    DESCRIBE B;
    DUMP B;

pig -x local GrpTest.pig

    [Thread-12] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    [Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    [Thread-13] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@19a9bea3
    [Thread-13] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
    [Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
    java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
    [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
    [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B

The java.lang.OutOfMemoryError: Java heap space error occurs every time I use GROUP or JOIN in a Pig script executed in local mode. There is no error when the same script is executed in mapreduce mode on HDFS.

Question 1: Why does the OutOfMemoryError occur when the data sample is tiny and local mode should use fewer resources than mapreduce mode?

Question 2: Is there a way to successfully run small Pig scripts with GROUP or JOIN in local mode?

+6

apache-pig

Polymerase May 11, '13 at 16:36


2 answers

The reason is that your local machine has less memory available than the machines in your Hadoop cluster. This is actually a fairly common error in Hadoop. It mainly happens when you create a very long relation at any point in a Pig script, because Pig always wants to load the whole relation into memory and does not load it lazily in any way.

When you do something like GROUP BY where the tuple you are grouping on is not sparse across many records, you often end up creating long relations, at least temporarily, since you are basically taking a whole bunch of individual relations and cramming them all into one long relation. Either change your code so that you never end up with a single very long relation (i.e., group by something more sparse), or increase the memory available to Java.

0

Eli May 15, '13 at 3:29


Polymerase · Accepted Answer · 2013-05-18T18:15:32+0000

Solution: force Pig to allocate less memory for the Java property io.sort.mb. I set it to 10 MB here and the error goes away. I am not sure what the best value would be, but at least this lets you practice Pig syntax in local mode.

    $ cat GrpTest.pig
    -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
    set io.sort.mb 10;
    A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
    B = GROUP A BY f1;
    DESCRIBE B;
    DUMP B;
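Why shrinking io.sort.mb helps even with six input rows: the log in the question prints io.sort.mb = 100 right before the failure inside the MapOutputBuffer constructor, which suggests the map-side sort buffer is requested up front, independent of input size. The sketch below only illustrates that arithmetic and is not Hadoop code. The class SortBufferCheck and its methods are hypothetical names, and the 64 MB heap mentioned afterwards is an assumed example, not a value measured on the sandbox.

```java
// Illustrative sketch only: compares a requested sort buffer with the heap
// this JVM could still provide. Names here are hypothetical, not Hadoop APIs.
final class SortBufferCheck {

    // io.sort.mb is expressed in megabytes; convert it to bytes.
    public static long sortBufferBytes(int ioSortMb) {
        return (long) ioSortMb << 20;
    }

    // True if a buffer of ioSortMb megabytes is smaller than the given
    // amount of heap (in bytes) that is still available.
    public static boolean fitsInHeap(int ioSortMb, long availableHeapBytes) {
        return sortBufferBytes(ioSortMb) < availableHeapBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Rough upper bound on what can still be allocated:
        // maximum heap minus the heap currently in use.
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used;

        // 100 is the value shown in the log; 10 is the value from the fix.
        for (int mb : new int[] {100, 10}) {
            System.out.printf("io.sort.mb=%d needs %d bytes, fits=%b%n",
                    mb, sortBufferBytes(mb), fitsInHeap(mb, available));
        }
    }
}
```

Running it with a deliberately small heap, for example `java -Xmx64m SortBufferCheck`, should report that the 100 MB buffer does not fit while the 10 MB one does. That matches what the script above shows: the input size is irrelevant, only the buffer size relative to the heap matters.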
