Array of abbreviated integers

Just so as not to invent hot water, I ask here ...

I have an application with a lot of arrays and it runs out of memory.

So the idea is to compress the List<int> to something else that will have the same interface (e.g. IList<T> ), but instead of int I could use short integers.

For example, if my range of values ​​is 0 - 100,000,000, I only need ln2 (1,000,000) = 20 bits. Therefore, instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.

Do you know about the implementation of such an array. C ++ and java would be fine too, as I could easily convert them to C #.

Additional requirements (as everyone starts to get from the idea):

  • integers in ARE unique list
  • they do not have a special property, so they are not compressed in any other way, and then reduce the number of bits
  • If the range of values ​​is one million, for example, lists will have from 2 to 1000 elements in size, but there will be many, so BitSets
  • the new data container should behave like an overridable array (relative to the O () method)

EDIT:

Please do not tell me DO NOT do this. The requirement for this is well thought out, and this is the only option that remains.

In addition, a 1M range of values ​​and 20 bits for it is ONLY EXAMPLE. I have cases with all the different ranges and whole sizes.

In addition, I could have even shorter integers, for example 7 bit integers, then the packaging would be

 00000001 11111122 22222333 33334444 444..... 

for the first 4 items packed in 5 bytes.


Almost finished coding it - will be published soon ...

+4
source share
4 answers

Since you can only allocate memory in byte quanta, you are essentially asking if / how you can put integers in 3 bytes instead of 4 (but see No. 3 below). This is not a good idea .

  • Since there is no integer type of 3 bytes in size, you will need to use something else (for example, an opaque 3-byte buffer) in its place. This would require that you wrap all access to the contents of the list in the code that performs the conversion, so that you can put "ints" and pull out "ints" anyway.
  • Depending on the architecture and memory allocator, requesting 3-byte fragments may not affect the amount of memory in your program at all (it may just clog your heap with unused 1-byte "holes").
  • Re-executing the list from scratch to work with an opaque byte array, since its backup storage will avoid the two previous problems (and this may also allow you to compress every last bit of memory instead of whole bytes), but this is a high order and quite error prone.

Instead, you can try something like:

  • Do not store all this data in memory at the same time. With 4 bytes per int, you need to reach hundreds of millions of integers before the memory runs out. Why do you need all of them at the same time?
  • Compression of the data set by the inability to save duplicates. There are some if you are up to hundreds of millions.
  • Changing the data structure so that it retains the differences between sequential values ​​(deltas), if possible. It may not be very difficult, but you can really expect something at the stadium with a 50% improvement (this may not be enough), and this will completely destroy your ability to index to the list at all times.
+3
source

One option that will get from 32 bits to 24 bits is to create a custom structure that stores an integer of three bytes:

 public struct Entry { byte b1; // low byte b2; // middle byte b3; // high public void Set(int x) { b1 = (byte)x; b2 = (byte)(x >> 8); b3 = (byte)(x >> 16); } public int Get() { return (b3 << 16) | (b2 << 8) | b1; } } 

Then you can simply create a List<Entry> .

 var list = new List<Entry>(); var e = new Entry(); e.Set(12312); list.Add(e); Console.WriteLine(list[0].Get()); // outputs 12312 
+1
source

It reminds me of base64 and similar types of binary text encoding . They take 8-bit bytes and execute a bunch of bit-bits to pack them into 4-, 5-, or 6-bit printable characters. It also reminds me of the standard Zork information exchange code (ZSCII), which contains 3 letters in 2 bytes, where each letter takes 5 bits. It looks like you want to take a bunch of 10- or 20-bit integers and pack them into a buffer of 8-bit bytes.

The source code is available for many libraries that process a packed array of single bits ( a b c d e ).

Perhaps you could (a) download this source code and change the source (starting with some BitArray or another packed encoding), recompile to create a new library that handles the packaging and unpacking of 10- or 20-bit integers, rather than individual bits . This may take less time for programming and testing (b) to write a library that works the same way as (a) from the outside, but internally it splits 20-bit integers into 20 separate bits and then saves them using the (unmodified) class BitArray

+1
source

Change Given that your integers are unique, you can do the following: store unique integers until the number of integers you store is half the maximum number. Then switch to storing integers that you don’t have. This will reduce space by 50%.

It might be worth trying other simplification methods before trying to use 20-bit ints.

How do you handle duplicate integers? If you have a lot of duplicates, you can reduce the storage size by storing integers in Dictionary<int, int> , where the keys are unique integers and the values ​​are the corresponding values. Note that this assumes that you do not need the order of your integers.

Are your integers unique? Perhaps you store many unique integers in the range from 0 to 100 million. In this case, you can try to save integers that you don’t have. Then, when you determine if there is an integer i , just ask if there is any in your collection.

0
source

Source: https://habr.com/ru/post/1387126/


All Articles