Hadoop: heap and GC problems

I am currently working on a project where I need an in-memory structure for my map task. I have done some calculations, and I estimate that each map task should not need more than 600 MB of memory. But after a while I run into Java heap space errors or GC overhead limit exceeded errors. I do not understand how this is possible.

Here are a few more details. I have a system with two quad-core CPUs and 12 GB of RAM, which means I can run up to 8 map tasks at a time. I am building a tree, so I have an iterative algorithm that runs a map-reduce job for each level of the tree. My algorithm works fine for small datasets, but for a medium-sized dataset it runs into heap space problems. The algorithm reaches a certain tree level and then runs out of heap space or hits the GC overhead limit. At that point I did some calculations and saw that each task should need no more than 100 MB of memory, so with 8 tasks I am using about 800 MB of memory. I do not know what is going on. I even updated the hadoop-env.sh file with these lines:

 export HADOOP_HEAPSIZE=8000
 export HADOOP_OPTS=-XX:+UseParallelGC

What is the problem? Do these lines even override the Java parameters for my system? Using ParallelGC is something I saw recommended on the Internet for machines with multiple cores.
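A quick way to sanity-check whether such options actually reach the running JVMs is to list the Java processes together with their arguments. This is a generic JDK check, not something specific to my setup; it assumes jps is on the PATH, and in Hadoop 1.x the map tasks usually show up as org.apache.hadoop.mapred.Child:

 # List every Java process with its main class and JVM arguments;
 # the task JVMs should show their -Xmx and GC flags here.
 jps -lvm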

Edit:

OK, here are some observations after watching the heap and the overall memory usage. I use about 3500 MB of RAM while running 6 tasks at the same time. That means the jobtracker, tasktracker, namenode, datanode, secondary namenode, my operating system and the 6 tasks together use 3500 MB of RAM, which is a perfectly reasonable amount. So why do I get a GC overhead limit exceeded error? I follow the same algorithm for each level of the tree; the only thing that changes is the number of nodes per level. Having many nodes at a given level does not add that much overhead to my algorithm, so why can't the GC cope?
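To see what the collector is actually doing inside the task JVMs, one option is to turn on GC logging for the child tasks. This is only a sketch: it assumes a Hadoop 1.x-style setup where mapred.child.java.opts in mapred-site.xml controls the task JVM options, and the -Xmx value is just a placeholder:

 <!-- mapred-site.xml: example task JVM options with GC logging enabled -->
 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx512m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</value>
 </property>

The GC output should then appear in each task's stdout log; a long run of full collections that reclaim almost nothing right before the failure is what typically triggers the GC overhead limit error.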

1 answer

If the maximum heap size has not been changed, it will default to 1/4 of main memory, i.e. about 3 GB, and with some non-heap overhead that can come to around 3.5 GB.

I suggest you try

 export HADOOP_OPTS="-XX:+UseParallelGC -Xmx8g" 

to set the maximum heap size to 8 GB.


By default, the maximum heap size is 1/4 of main memory (unless you are using a 32-bit JVM on Windows). So if your heap setting is being ignored, you would still end up with about 3 GB.
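One way to confirm what default the JVM actually picks on a given machine (a generic HotSpot check, independent of Hadoop) is:

 # Print the JVM's chosen MaxHeapSize (in bytes) for this machine.
 java -XX:+PrintFlagsFinal -version | grep -i maxheapsize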

Which garbage collector you use will not make much difference if you are running out of memory.

I suggest you take a heap dump with -XX:+HeapDumpOnOutOfMemoryError and load it into a profiler such as VisualVM to understand why it uses so much memory.
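As a concrete sketch of that, assuming the task JVM options are controlled by mapred.child.java.opts (Hadoop 1.x property name; the heap size and dump path are just examples):

 <!-- mapred-site.xml: dump the heap of a task JVM when it runs out of memory -->
 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp</value>
 </property>

The resulting .hprof file can then be opened in VisualVM (or any other heap analyzer) to see which objects dominate the heap.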


Source: https://habr.com/ru/post/1401384/

