I need to model a collection of n-grams (sequences of n words) and their contexts (the words that appear adjacent to each n-gram, together with their frequencies). My idea was this:
public class Ngram {
    private String[] words;
    private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();
}
Then, to count all the distinct n-grams, I use another HashMap, for example
HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();
and I add to it as I read the text. The problem is that once the number of n-grams exceeds 10,000 or so, the JVM heap fills up (it is capped at 1.5 GB) and everything slows down dramatically.
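Roughly, the code that fills the map looks like this (a simplified sketch: addOccurrence, the String.join key-building, and the Ngram constructor and getContextCount() accessor are placeholders I'm using here, not part of the class above):

    // Sketch of the update step as I read the text.
    // Assumes Ngram has a constructor taking the word array
    // and a getContextCount() accessor (not shown above).
    void addOccurrence(String[] words, String contextWord) {
        String key = String.join(" ", words);   // the n-gram's map key
        Ngram ngram = ngrams.get(key);
        if (ngram == null) {
            ngram = new Ngram(words);
            ngrams.put(key, ngram);
        }
        // bump the frequency of this context word
        HashMap<String, Integer> counts = ngram.getContextCount();
        Integer old = counts.get(contextWord);
        counts.put(contextWord, old == null ? 1 : old + 1);
    }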
Is there a better way to do this that avoids such memory consumption? Also, contexts should be easy to compare between n-grams, and I'm not sure my solution supports that.