Modeling n-grams with a Java HashMap

I need to model a collection of n-grams (sequences of n words) and their contexts (the words that appear next to each n-gram, together with their frequencies). My idea was this:

    public class Ngram {
        private String[] words;  // the n words of this n-gram
        // maps each neighbouring word to how often it appears next to this n-gram
        private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();
    }

Then, to count all the different n-grams, I use another HashMap, for example:

    HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();

and I add to it as I read through the text. The problem is that once the number of n-grams exceeds 10,000 or so, the JVM heap fills up (it is capped at 1.5 GB) and everything slows down badly.
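For reference, the update step looks roughly like this (a sketch: the Ngram constructor and the addContext helper are assumed, not shown above):

    // Sketch of the update step for one observed n-gram and the word after it.
    // The Ngram(String[]) constructor and the addContext(String) helper are
    // assumed here; addContext would increment the entry in contextCount.
    String key = String.join(" ", ngramWords);   // e.g. "in the" for a bigram
    Ngram entry = ngrams.get(key);
    if (entry == null) {
        entry = new Ngram(ngramWords);
        ngrams.put(key, entry);
    }
    entry.addContext(nextWord);                  // nextWord follows this n-gram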

Is there a better way to do this that avoids such memory consumption? In addition, contexts should be easy to compare between n-grams, and I'm not sure my solution allows that.
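By "comparable" I mean being able to compute something like a cosine similarity between two contextCount maps. A rough sketch of the kind of comparison I have in mind (cosine is just one candidate measure, not something I have settled on):

    import java.util.Map;

    // Sketch: cosine similarity between two context-count maps. Assumes the
    // maps are accessible (e.g. via a getter on Ngram, not shown above).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        long dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += (long) e.getValue() * other;
        }
        double normA = 0, normB = 0;
        for (int v : a.values()) normA += (double) v * v;
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }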

2 answers

You can use Hadoop MapReduce for a huge dataset (it is typically used for big data): use a mapper to split the input into n-grams, and a combiner and a reducer to count them or do whatever else you want with those n-grams.

Hadoop works with <key, value> pairs, much like what you are already doing with a HashMap.

This is something like a classification workload, so it fits well. But it does require a cluster.
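A minimal sketch of the mapper and reducer (the classic word-count pattern applied to n-grams; job configuration is omitted, and the whitespace tokenization and the order N = 2 are assumptions of mine):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NgramCount {
        static final int N = 2; // n-gram order; an assumption for this sketch

        public static class NgramMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text ngram = new Text();

            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] tokens = line.toString().split("\\s+");
                // emit each n-gram in the line with a count of 1
                for (int i = 0; i + N <= tokens.length; i++) {
                    StringBuilder sb = new StringBuilder(tokens[i]);
                    for (int j = 1; j < N; j++) sb.append(' ').append(tokens[i + j]);
                    ngram.set(sb.toString());
                    ctx.write(ngram, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }

The reducer can also be registered as the combiner, since summing counts is associative, which cuts the data shuffled between nodes. The same pattern extends to (n-gram, context-word) pairs if you need contexts rather than plain counts.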

If possible, it is best to start with Hadoop: The Definitive Guide (O'Reilly).


You may have already found a solution to your problem, but this paper takes a very good approach to large-scale language models:

Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf
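The core idea is to trade exact storage for a probabilistic, fixed-size structure. As a rough illustration only (the paper's actual scheme stores log-quantized frequencies, which this toy version does not attempt), a plain Bloom filter for n-gram membership might look like:

    import java.util.BitSet;

    // Toy Bloom filter: a fixed-size bit array plus k hash probes per key.
    // Queries can return false positives but never false negatives.
    public class NgramBloomFilter {
        private final BitSet bits;
        private final int size;       // number of bits
        private final int numHashes;  // hash probes per key

        public NgramBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        // derive the i-th probe position via double hashing
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.reverse(h1) | 1;  // crude second hash, forced odd
            return Math.floorMod(h1 + i * h2, size);
        }

        public void add(String ngram) {
            for (int i = 0; i < numHashes; i++) bits.set(index(ngram, i));
        }

        // false means definitely absent; true may be a false positive
        public boolean mightContain(String ngram) {
            for (int i = 0; i < numHashes; i++)
                if (!bits.get(index(ngram, i))) return false;
            return true;
        }
    }

With around 10 bits per stored n-gram and 7 hash probes, the false-positive rate stays below about 1%, and the cost per n-gram is constant regardless of string length, which is where the memory savings come from.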

