Big dictionary implementation in Java

I am in the middle of a Java project that will use a "big dictionary" of words. By "dictionary" I mean a map from Strings to specific numbers (int), and by "big" I mean a file on the order of 100 MB. The first solution I came up with is probably also the simplest: on initialization, I read the whole file and build a large HashMap, which is later used to look up the strings.
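For reference, a minimal sketch of that straightforward approach; the file name and the one-"word<TAB>id"-pair-per-line format are my assumptions, not fixed requirements:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DictionaryLoader {
    // Reads the whole file once and builds an in-memory map from word to id.
    static Map<String, Integer> load(String path) throws IOException {
        Map<String, Integer> dictionary = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2); // "word<TAB>id" format assumed
                dictionary.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        return dictionary;
    }
}
```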

Is there an efficient way to do this without having to read the whole file during initialization? Maybe not, but what if the file is really large, say, on the order of the available RAM? So basically I am looking for a way to efficiently look things up in a large dictionary stored in memory.

Thanks for the answers so far; they made me realize I could be more specific in my question. As you probably guessed, the application is related to text mining, in particular representing text as sparse vectors (although some of you had other inventive ideas :)). So what matters is the ability to look up strings in the dictionary and get their keys as quickly as possible. The initial overhead of reading the dictionary file or indexing it into a database is not that important, as long as the string lookup time is optimized. Again, assume the dictionary size is large, comparable to the size of the available RAM.

+6
4 answers

Consider ChronicleMap ( https://github.com/OpenHFT/Chronicle-Map ) in non-replicated mode. It is an off-heap implementation of the Java Map interface, or, from another point of view, an embedded NoSQL key-value store.

What is useful for your task out of the box (a usage sketch follows the list):

  • Persistence to disk via memory-mapped files (see Michał Kosmulski's comment)
  • Lazy loading (disk pages are loaded only on demand) → fast startup
  • If the amount of data is larger than the available memory, the operating system will automatically evict rarely used pages
  • Several JVMs can use the same map, since the off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce framework, e.g. Hadoop.
  • Strings are stored in UTF-8 form → ~50% memory savings if the strings are mostly ASCII (as noted by maaartinus)
  • int or long values take only 4 (8) bytes, as if you had a primitive-specialized map implementation
  • Very little per-entry overhead, much less than in the standard HashMap and ConcurrentHashMap
  • Good configurable concurrency via locking, if you already need it or are going to parallelize the text processing in the future
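A minimal sketch of how this could look, assuming Chronicle Map 3.x is on the classpath; the map name, sizing hints, and file name are illustrative, not from the original post:

```java
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;

public class ChronicleDictionary {
    public static void main(String[] args) throws IOException {
        // Off-heap, memory-mapped map from words to int ids.
        ChronicleMap<String, Integer> dictionary = ChronicleMap
                .of(String.class, Integer.class)
                .name("word-dictionary")
                .averageKey("example")   // average key length hint (assumed)
                .entries(10_000_000)     // expected number of entries (assumed)
                .createPersistedTo(new File("dictionary.dat"));

        dictionary.put("hello", 42);
        System.out.println(dictionary.get("hello")); // 42
        dictionary.close();
    }
}
```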
+3

Once your data structure grows to several hundred MB, up to the order of your RAM, you are better off not building it at run time but instead using a database that supports indexing (which most do these days). Indexing is going to be one of the only ways to ensure fast text lookup once the file gets that big and you run up against the -Xmx setting of your JVM: if the file is as large as (or much larger than) your maximum heap size, you will inevitably crash the JVM.

As for reading the entire file during initialization: eventually you will have to do it so that you can efficiently search and analyze text in your code. If you know you will only ever query a specific part of the file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load the entire file into the database up front. You can add parallelism to this process if there are other parts of your code whose execution is independent of it.
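A hedged sketch of the "index it in a database" idea using the embedded H2 database; the JDBC URL, table name, and column names are illustrative assumptions, not part of the original answer:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DictionaryDb {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:h2:./dictionary")) {
            try (Statement s = c.createStatement()) {
                // PRIMARY KEY on the word column gives an index, so lookups avoid a full scan.
                s.execute("CREATE TABLE IF NOT EXISTS dictionary(" +
                          "word VARCHAR PRIMARY KEY, id INT NOT NULL)");
            }
            try (PreparedStatement q =
                     c.prepareStatement("SELECT id FROM dictionary WHERE word = ?")) {
                q.setString(1, "hello");
                try (ResultSet rs = q.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("id = " + rs.getInt(1));
                    }
                }
            }
        }
    }
}
```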

Please let me know if you have any questions!

+2

As stated in the comments, a trie will save you a lot of memory.

You should also consider using byte instead of char, as that saves you a factor of 2 for plain ASCII text, or when using a national encoding with no more than 256 distinct letters.

At first glance, combining this low-level optimization with tries makes no sense, since their node size is dominated by pointers. But there is a way if you want to go low-level.

So what matters is the ability to look up strings in the dictionary and get their keys as quickly as possible.

Then forget about any database; they are slow compared to HashMaps.

If it does not fit into memory, the cheapest solution is usually to just get more of it. Otherwise, consider loading only the most common words and doing something slower for the rest (e.g., a memory-mapped file).


I was asked to point to a good implementation, in particular an off-heap one. I do not know of any.

Assuming the OP does not need mutability, and especially not mutability of keys, it all looks very simple.

I assume the whole dictionary can easily be packed into a single ByteBuffer. Assuming mostly ASCII text and a bit of bit-level hacking, an arc would need 1 byte per arc-label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., the difference between the current node and the child), which would make most of them fit into a single byte when stored in base 128 encoding.
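For illustration, a hedged sketch of base-128 (varint) encoding for such small relative offsets; the class and method names are my own:

```java
import java.nio.ByteBuffer;

public final class VarInt {
    // 7 payload bits per byte; the high bit marks "more bytes follow".
    static void write(ByteBuffer buf, int value) {
        while ((value & ~0x7F) != 0) {
            buf.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        buf.put((byte) value);
    }

    static int read(ByteBuffer buf) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```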

I can only guess at the total memory consumption, but I would say something like < 4 bytes per word. The compression above would make lookups slower, but still nowhere near the cost of a single disk access.
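To make the trie idea concrete, here is a minimal, uncompressed byte-based trie mapping words to int ids; it deliberately skips the relative-pointer packing described above, and the class name is illustrative:

```java
import java.nio.charset.StandardCharsets;

class ByteTrie {
    private static final class Node {
        Node[] children = new Node[256]; // one slot per possible byte value
        int value = -1;                  // -1 means "no word ends here"
    }

    private final Node root = new Node();

    void put(String word, int id) {
        Node node = root;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            int i = b & 0xFF;
            if (node.children[i] == null) node.children[i] = new Node();
            node = node.children[i];
        }
        node.value = id;
    }

    int get(String word) {
        Node node = root;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            node = node.children[b & 0xFF];
            if (node == null) return -1;
        }
        return node.value;
    }
}
```

The 256-slot child array per node is exactly the pointer overhead this answer warns about, which is why packing the nodes into a single ByteBuffer with relative offsets pays off.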

+2

That sounds like too much to store in memory. Either store it in a relational database (easy, and fast with a hash index), or in a NoSQL solution such as Solr (a small learning curve, very fast).

Even though NoSQL is very fast, if you really want to tune performance and some records are looked up more often than others, consider using a limited-size cache that holds the most recently used (say) 10,000 lookups.
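A hedged sketch of such a bounded cache using LinkedHashMap in access order; the class name is illustrative, and the 10,000 limit echoes the figure in the answer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LookupCache extends LinkedHashMap<String, Integer> {
    private static final int MAX_ENTRIES = 10_000;

    LookupCache() {
        super(16, 0.75f, true); // accessOrder = true -> least recently used evicted first
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
        return size() > MAX_ENTRIES; // drop the oldest entry once the limit is exceeded
    }
}
```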

0

Source: https://habr.com/ru/post/976026/
