Modeling n-grams with a Java HashMap

I need to model a collection of n-grams (sequences of n words) and their contexts (the words that appear next to each n-gram, together with their frequencies). My idea was this:

    public class Ngram {
        private String[] words;  // the n words of this n-gram
        // maps each neighbouring word to how often it appears next to this n-gram
        private HashMap<String, Integer> contextCount = new HashMap<String, Integer>();
    }

Then, to count all the different n-grams, I use another HashMap, for example:

    HashMap<String, Ngram> ngrams = new HashMap<String, Ngram>();

and I add to it as I read through the text. The problem is that once the number of n-grams exceeds 10,000 or so, the JVM heap fills up (it is capped at 1.5 GB) and everything slows down badly.
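For reference, the update step looks roughly like this (a sketch: the Ngram constructor and the addContext helper are assumed, not shown above):

    // Sketch of the update step for one observed n-gram and the word after it.
    // The Ngram(String[]) constructor and the addContext(String) helper are
    // assumed here; addContext would increment the entry in contextCount.
    String key = String.join(" ", ngramWords);   // e.g. "in the" for a bigram
    Ngram entry = ngrams.get(key);
    if (entry == null) {
        entry = new Ngram(ngramWords);
        ngrams.put(key, entry);
    }
    entry.addContext(nextWord);                  // nextWord follows this n-gram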

Is there a better way to do this that avoids such memory consumption? In addition, contexts should be easy to compare between n-grams, and I'm not sure my solution allows that.
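By "comparable" I mean being able to compute something like a cosine similarity between two contextCount maps. A rough sketch of the kind of comparison I have in mind (cosine is just one candidate measure, not something I have settled on):

    import java.util.Map;

    // Sketch: cosine similarity between two context-count maps. Assumes the
    // maps are accessible (e.g. via a getter on Ngram, not shown above).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        long dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += (long) e.getValue() * other;
        }
        double normA = 0, normB = 0;
        for (int v : a.values()) normA += (double) v * v;
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }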

2 answers

You can use Hadoop MapReduce for a huge dataset (it is typically used for big data): use a mapper to split the input into n-grams, and a combiner and a reducer to count them or do whatever else you want with those n-grams.

Hadoop works with <key, value> pairs, much like what you are already doing with a HashMap.

This is something like a classification workload, so it fits well. But it does require a cluster.
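A minimal sketch of the mapper and reducer (the classic word-count pattern applied to n-grams; job configuration is omitted, and the whitespace tokenization and the order N = 2 are assumptions of mine):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NgramCount {
        static final int N = 2; // n-gram order; an assumption for this sketch

        public static class NgramMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text ngram = new Text();

            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] tokens = line.toString().split("\\s+");
                // emit each n-gram in the line with a count of 1
                for (int i = 0; i + N <= tokens.length; i++) {
                    StringBuilder sb = new StringBuilder(tokens[i]);
                    for (int j = 1; j < N; j++) sb.append(' ').append(tokens[i + j]);
                    ngram.set(sb.toString());
                    ctx.write(ngram, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }

The reducer can also be registered as the combiner, since summing counts is associative, which cuts the data shuffled between nodes. The same pattern extends to (n-gram, context-word) pairs if you need contexts rather than plain counts.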

If possible, it is best to start with Hadoop: The Definitive Guide (O'Reilly).


You may have already found a solution to your problem, but this paper takes a very good approach to large-scale language models:

Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf
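The core idea is to trade exact storage for a probabilistic, fixed-size structure. As a rough illustration only (the paper's actual scheme stores log-quantized frequencies, which this toy version does not attempt), a plain Bloom filter for n-gram membership might look like:

    import java.util.BitSet;

    // Toy Bloom filter: a fixed-size bit array plus k hash probes per key.
    // Queries can return false positives but never false negatives.
    public class NgramBloomFilter {
        private final BitSet bits;
        private final int size;       // number of bits
        private final int numHashes;  // hash probes per key

        public NgramBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        // derive the i-th probe position via double hashing
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.reverse(h1) | 1;  // crude second hash, forced odd
            return Math.floorMod(h1 + i * h2, size);
        }

        public void add(String ngram) {
            for (int i = 0; i < numHashes; i++) bits.set(index(ngram, i));
        }

        // false means definitely absent; true may be a false positive
        public boolean mightContain(String ngram) {
            for (int i = 0; i < numHashes; i++)
                if (!bits.get(index(ngram, i))) return false;
            return true;
        }
    }

With around 10 bits per stored n-gram and 7 hash probes, the false-positive rate stays below about 1%, and the cost per n-gram is constant regardless of string length, which is where the memory savings come from.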

