A large number of objects in Java (using HashMap)

Hello,

I am currently working on word prediction in Java. For this, I use a model based on NGram, but I have memory problems ...

For the first time I had such a model:

public class NGram implements Serializable { private static final long serialVersionUID = 1L; private transient int count; private int id; private NGram next; public NGram(int idP) { this.id = idP; } } 

But it requires a lot of memory, so I thought that I needed optimization, and I thought that if I "greet the world" and "hello people", instead of getting two ngrams, I could keep in one that save " Hi, and then there are two possibilities: "people" and "peace."

To be more clear, this is my new model:

 public class BNGram implements Serializable { private static final long serialVersionUID = 1L; private int id; private HashMap<Integer,BNGram> next; private int count = 1; public BNGram(int idP) { this.id = idP; this.next = new HashMap<Integer, BNGram>(); } } 

But it seems that my second model consumes twice as much memory ... I think this is due to the HashMap, but I can not reduce it? I tried to use various Map implementations, such as Trove or others, but this does not change anything.

To give you an idea, for text with 9 MB with 57818 different words (different, but this is not the total number of words), after generating NGram, my javaw process consumes 1.2 GB of memory ... If I save it using GZIPOutputStream, it takes about 18 MB of disk space.

So my question is: how can I use less memory ? Can I do something with compression (like Serialization). I need to add this to another application, so I need to reduce memory usage to ...

Thank you very much and sorry for my bad english ...

Ziath

+6
source share
2 answers

You need a specialized structure to achieve what you want.

Take a look at Apache PatriciaTrie . It is similar to Map , but it is memory friendly and works with String s. It is also very fast: O(k) operations, with k being the number of bits of the largest key.

It has an operation that suits your immediate needs: prefixMap() , which returns the SortedMap for a trie containing a String that has a given key prefix.

Short use example:

 public class Patricia { public static void main(String[] args) { PatriciaTrie<String> trie = new PatriciaTrie<>(); String world = "hello the world"; String people = "hello the people"; trie.put(world, null); trie.put(people, null); SortedMap<String, String> map1 = trie.prefixMap("hello"); System.out.println(map1.keySet()); // [hello the people, hello the world] SortedMap<String, String> map2 = trie.prefixMap("hello the w"); System.out.println(map2.keySet()); // [hello the world] SortedMap<String, String> map3 = trie.prefixMap("hello the p"); System.out.println(map3.keySet()); // [hello the people] } } 

There are also tests that contain more examples.

+5
source

Here I am first trying to explain why you are observing such excessive memory consumption and what you can do about it (if you want to stick to the HashMap ):

A HashMap created using the default constructor will have an initial capacity of 16. This means that it will have space for 16 entries, even if it is empty. In addition, you seem to be creating a map, regardless of whether you need it or not.

Thus, reduce memory consumption in your case will be

  • Create a map only when necessary
  • Create it with a smaller initial capacity

For your class, this might look something like this:

 public class BNGram { private int id; private Map<Integer,BNGram> next; public BNGram(int idP) { this.id = idP; // (Do not create a new `Map` here!) } void doSomethingWhereTheMapIsNeeded(Integer key, BNGram value) { // Create a map, when required, with an initial capacity of 1 if (next == null) { next = new HashMap<Integer, BNGram>(1); } next.put(key, value); } } 

But...

... conceptually, it is doubtful to have a large “tree-like” structure consisting of many, many maps, each of which has only “several” entries. This suggests that a different data structure is more appropriate here. Therefore, you definitely need to choose a solution such as Magnamag's answer , or (if this is not applicable to you, as indicated in your comments), pay attention to the alternative data structure - perhaps even formulating it as a new question that does not suffer from problems xy .

+4
source

Source: https://habr.com/ru/post/984082/


All Articles