How to reduce memory usage for a HashMap<String, Integer> data structure

Before explaining my problem, I should mention that I am not looking for a way to increase the Java heap size; I strictly need to store these objects in memory.

I am working on storing a huge amount (5-10 GB) of DNA sequences and their counts (integers) in a hash table. The DNA sequences (length 32 or less) are composed of the characters "A", "C", "G", "T" and "N" (undefined). As you know, Java is far less memory-efficient than lower-level languages such as C and C++ when storing large numbers of objects. So when I store each sequence as a String (on the order of 100 bytes for a sequence of length ~30), I run out of memory.

I tried encoding the nucleic acids as 'A' = 00, 'C' = 01, 'G' = 10, 'T' = 11, neglecting 'N' (since it is a fifth symbol and breaks the 2-bits-per-character encoding), and then packing these 2-bit codes into a byte array. This brought some improvement, but unfortunately I hit the same error again after a couple of hours. I need a proper solution, or at least a workaround, for this error. Thank you in advance.
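For reference, a simplified sketch of that 2-bit packing (note that a raw byte[] compares by identity, so as a HashMap key it still needs a wrapper with proper equals/hashCode):

```java
// Simplified sketch of the 2-bit packing: A=00, C=01, G=10, T=11, 'N' neglected.
static byte[] pack(String seq) {
    byte[] packed = new byte[(seq.length() + 3) / 4];    // 4 bases per byte
    for (int i = 0; i < seq.length(); i++) {
        int code;
        switch (seq.charAt(i)) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default:  code = 0; break;                   // 'N' neglected (treated like 'A')
        }
        packed[i / 4] |= code << ((i % 4) * 2);          // 2 bits per base
    }
    return packed;
}
```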

+5
3 answers

This may be a bit involved and perhaps a strange idea, and it will take quite a bit of work, but I will give it a try:

You have already identified two separate subproblems of your overall task:

  • The default implementation of HashMap may be suboptimal for such a large collection.
  • you need to store something other than Strings

Map implementation

I would recommend writing your own hash map implementation of the Map<String, Long> interface. Internally, you do not need to store the Strings. Unfortunately, 5^32 > 2^64, so there is no way to pack your whole string into a single long; so let's settle for two longs per key. You can do the String ↔ long[2] conversion quite efficiently on the fly whenever a String key is handed to your map implementation (use bit shifts, etc.).

Regarding packing the values, here are a few considerations:

A standard hash map needs, for its key-value pairs, an array of N bucket references, where N is the current capacity; once the bucket is found from the key's hash code, it needs a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case, you can try to optimize this as follows:

  • use a long[] of size 3N, where N is the capacity, to store both keys and values in one contiguous array
  • in this array, at positions 3 * (hashcode % N) and 3 * (hashcode % N) + 1, store the long[2] representation of the first key that maps to this bucket (or zero if the bucket is still empty), and at position 3 * (hashcode % N) + 2 store the corresponding count
  • whenever another key leads to the same hash code, and thus to the same bucket, keep that data in an ordinary HashMap<Long2KeyWrapper, Long> instead. The idea is to keep the capacity of the array mentioned above high enough (resizing accordingly) that most of the data actually lives in this contiguous array and not in the fallback map. This will significantly reduce the storage overhead of the hash map (see the sketch after this list).
  • do not grow the capacity in N = 2N steps; take smaller growth steps, e.g. 10-20%. This will cost performance while filling the map, but it will keep your memory footprint under control.
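To make the layout concrete, here is a rough sketch of such a map. The class and method names (PackedCountMap, Long2Key) and the hash function are invented for this example; resizing with the small growth steps suggested above is left out:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the contiguous-array counting map described above.
class PackedCountMap {
    private final int capacity;                                    // N
    private final long[] table;                                    // 3N slots: key0, key1, count
    private final Map<Long2Key, Long> overflow = new HashMap<>();  // colliding keys

    PackedCountMap(int capacity) {
        this.capacity = capacity;
        this.table = new long[3 * capacity];
    }

    void increment(long[] key) {            // key = the long[2] encoding of a sequence
        int bucket = 3 * Math.floorMod(hash(key), capacity);
        if (table[bucket + 2] == 0) {                              // empty bucket: claim it
            table[bucket] = key[0];
            table[bucket + 1] = key[1];
            table[bucket + 2] = 1;
        } else if (table[bucket] == key[0] && table[bucket + 1] == key[1]) {
            table[bucket + 2]++;                                   // same key: bump the count
        } else {                                                   // different key, same bucket
            overflow.merge(new Long2Key(key[0], key[1]), 1L, Long::sum);
        }
    }

    long count(long[] key) {
        int bucket = 3 * Math.floorMod(hash(key), capacity);
        if (table[bucket] == key[0] && table[bucket + 1] == key[1]) {
            return table[bucket + 2];
        }
        return overflow.getOrDefault(new Long2Key(key[0], key[1]), 0L);
    }

    private static int hash(long[] key) {
        long h = 31 * key[0] + key[1];
        return (int) (h ^ (h >>> 32));
    }

    // Minimal wrapper so colliding keys can live in an ordinary HashMap.
    private static final class Long2Key {
        final long hi, lo;
        Long2Key(long hi, long lo) { this.hi = hi; this.lo = lo; }
        @Override public boolean equals(Object o) {
            return o instanceof Long2Key && ((Long2Key) o).hi == hi && ((Long2Key) o).lo == lo;
        }
        @Override public int hashCode() { return Long.hashCode(31 * hi + lo); }
    }
}
```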

Keys

Given the inequality 5^32 > 2^64, your idea of using bits to encode the 5 letters seems to be the best I can think of right now: use 3 bits per letter and, correspondingly, a long[2] per key.
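A minimal sketch of that encoding (the letter-to-code mapping and the reserved 0 value are my own choices here, not something from your question):

```java
// Sketch: pack a sequence of up to 32 letters over {A,C,G,T,N} into a long[2],
// using 3 bits per letter (32 * 3 = 96 bits, which fits into two longs).
static long[] encode(String seq) {
    long[] packed = new long[2];
    for (int i = 0; i < seq.length(); i++) {
        long code;
        switch (seq.charAt(i)) {
            case 'A': code = 1; break;           // 0 is reserved for "no letter",
            case 'C': code = 2; break;           // so shorter sequences stay distinct
            case 'G': code = 3; break;
            case 'T': code = 4; break;
            case 'N': code = 5; break;
            default: throw new IllegalArgumentException("unexpected letter: " + seq.charAt(i));
        }
        packed[i / 21] |= code << ((i % 21) * 3);  // 21 letters fit in the low 63 bits of a long
    }
    return packed;
}
```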

+1

I recommend that you familiarize yourself with the Trove4j Collections API; it offers collections of primitives that use less memory than their wrapper classes.

In particular, you should check out their TObjectIntHashMap.
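For example, counting with it might look like this (assuming Trove 3.x on the classpath; the class name KmerCounter is just for the example):

```java
import gnu.trove.map.hash.TObjectIntHashMap;

public class KmerCounter {
    public static void main(String[] args) {
        // Keys are objects, values are primitive ints (no Integer boxing).
        TObjectIntHashMap<String> counts = new TObjectIntHashMap<>();
        // adjustOrPutValue adds 1 to an existing entry, or inserts 1 for a new key.
        counts.adjustOrPutValue("ACGTN", 1, 1);
        counts.adjustOrPutValue("ACGTN", 1, 1);
        System.out.println(counts.get("ACGTN"));   // prints 2
    }
}
```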

Also, I would not recommend storing these as String or char[] values until JDK 9 is out, since the char array backing a String is encoded as UTF-16, i.e. two bytes per char. JDK 9's compact Strings store Latin-1-only content (which covers A, C, G, T and N) with just one byte per character.

+2

If you are dealing with ~10 GB of data, or at least ~10 GB worth of memory, you may need to think about writing the data you do not need at the moment out to disk and loading parts of your data set into memory to work on them.

I had this exact problem a few years ago when I was doing research with Monte Carlo simulations, so I wrote a Java data structure to solve it. You can clone or check out the source here: github.com/tylerparsons/surfdep

The library supports both MySQL and SQLite as the underlying database. If you don't have either installed, I would recommend SQLite, as it is much quicker to set up.
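For reference, if you wanted to roll something similar by hand, a bare-bones SQLite-backed counter over plain JDBC could look like the sketch below. This assumes the org.xerial sqlite-jdbc driver and an SQLite version with upsert support; the table and file names are made up. Batching many upserts inside a single transaction makes this much faster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteCounts {
    public static void main(String[] args) throws SQLException {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:counts.db")) {
            try (Statement s = db.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS counts (seq TEXT PRIMARY KEY, n INTEGER)");
            }
            // Insert the sequence with count 1, or bump the existing count (SQLite upsert).
            String upsert = "INSERT INTO counts(seq, n) VALUES(?, 1) "
                          + "ON CONFLICT(seq) DO UPDATE SET n = n + 1";
            try (PreparedStatement ps = db.prepareStatement(upsert)) {
                ps.setString(1, "ACGTNACGT");
                ps.executeUpdate();
            }
        }
    }
}
```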

Full disclosure: this is not the most efficient implementation, but it will handle very large data sets if you let it run for a few hours. I tested it with matrices of up to a billion elements on my Windows laptop.

0

Source: https://habr.com/ru/post/1266722/
