Why does a large unused HashMap improve Java performance?

I have a performance problem that I can't wrap my head around. I am writing a Java application that parses huge (> 20 million lines) text files and stores certain information in a set. I measure performance in seconds per million lines. Since I need a lot of memory, I usually run the program with -Xmx6000m and -Xms4000m.

If I just run the program, it parses 1 million lines in 6 seconds. However, after some performance testing, I realized that if I add the following code before the actual parsing procedure, the time drops to less than 3 seconds per million lines:

BufferedReader br = new BufferedReader(new FileReader("graphs.nt"));
HashMap<String, String> foo = new HashMap<String, String>();
String line;
while ((line = br.readLine()) != null) {
    foo.put(line, "foo");
}
foo = null;
br.close();
br = null;

The graphs.nt file is about 9 million lines long. The performance improvement remains even if I do not set foo to null, which basically demonstrates that the map is not actually used by the program.

The rest of the code is completely unrelated. I use a parser from OpenRDF Sesame to read a different file (not graphs.nt) and store the extracted information in a new HashSet created by another object. In the rest of the code, I create a Parser object and pass it a Handler object.

This really puzzles me. My guess is that it somehow makes the JVM allocate more memory for my program, and I can see hints of that when I run top. Without the HashMap, it allocates about 1 GB of memory; if I initialize the HashMap, it allocates > 2 GB.

My question is: does this sound reasonable? Is it possible that creating such a large object causes more memory to be allocated for the rest of the program's run? Shouldn't -Xmx and -Xms control the memory allocation, or are there additional arguments that might play a role here?

I know this may seem like a strange question, and that there is not enough information here, but this is all the information I have found related to the problem. If any additional information would be useful, I am more than happy to provide it.

3 answers

Memory and GC behavior can affect performance. If possible, you should run with -Xms == -Xmx to disable heap resizing and give the JVM plenty of space at startup. Your application may then finish before any major GC is required.
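As an illustration of the -Xms/-Xmx point, here is a minimal standard-library probe (a sketch, not the asker's code) that shows how much heap the JVM has actually committed versus its configured maximum; running it with and without -Xms == -Xmx shows the difference:

```java
public class HeapProbe {
    public static void main(String[] args) {
        // e.g. run with: java -Xms4000m -Xmx6000m HeapProbe
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.println("max heap (-Xmx): " + rt.maxMemory() / mb + " MB");
        // Committed heap starts near -Xms and grows toward -Xmx as needed.
        System.out.println("committed heap:  " + rt.totalMemory() / mb + " MB");
        System.out.println("used heap:       "
                + (rt.totalMemory() - rt.freeMemory()) / mb + " MB");
    }
}
```

With -Xms == -Xmx the committed heap equals the maximum from the start, so the JVM never has to pause to resize it mid-run.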


Unless you do something to prevent it, "foo" will eventually go out of scope and be garbage-collected, even if you never null the pointer, and even if the method containing the above code never exits. But it will have caused the heap to grow larger, and that reduces the relative GC overhead.

(It would be an interesting experiment to reference "foo" at the end of your program, to keep it in scope.)
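To observe the "bigger heap, fewer collections" effect directly, one can count GC cycles around an allocation-heavy loop using the standard management beans. This is a hypothetical stand-in for the parsing workload, not the original code:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcCycles {
    // Sum collection counts across all collectors (a count of -1 means "unavailable").
    static long gcCount() {
        long n = 0;
        for (GarbageCollectorMXBean b : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = b.getCollectionCount();
            if (c > 0) n += c;
        }
        return n;
    }

    public static void main(String[] args) {
        long before = gcCount();
        List<String> sink = new ArrayList<>();
        // Allocation-heavy loop standing in for the line-parsing work.
        for (int i = 0; i < 2_000_000; i++) {
            sink.add("line-" + i);
            if (sink.size() > 100_000) sink.clear(); // let most strings become garbage
        }
        System.out.println("GC cycles during workload: " + (gcCount() - before));
    }
}
```

Running this twice, once with the default heap and once with a large -Xms, should show noticeably fewer cycles in the second run.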


Could this be file caching? Your graphs.nt file is probably being cached in RAM, either by the OS or by the JVM. The GC trades higher memory consumption for performance; if you add a forced collection, System.gc(), immediately after the preloading step, you can find out whether the caching happens in the JVM or in the OS.
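The suggested experiment could be sketched like this (it reuses the file name graphs.nt from the question; whether the file exists is an assumption). If the committed heap collapses after System.gc() while top still shows high resident memory, the OS page cache is the likelier explanation:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

public class CachingProbe {
    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();

        // Priming read, as in the question.
        BufferedReader br = new BufferedReader(new FileReader("graphs.nt"));
        HashMap<String, String> foo = new HashMap<String, String>();
        String line;
        while ((line = br.readLine()) != null) {
            foo.put(line, "foo");
        }
        br.close();
        foo = null;

        long before = rt.totalMemory();
        System.gc(); // request a full collection (the JVM may ignore it)
        long after = rt.totalMemory();
        System.out.println("committed heap before GC: " + before / (1024 * 1024) + " MB");
        System.out.println("committed heap after GC:  " + after / (1024 * 1024) + " MB");
        // Compare these numbers with the resident size shown by `top` at the same points.
    }
}
```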


Source: https://habr.com/ru/post/1492757/

