Manipulating very large binary strings in C #

I am working on a genetic algorithm project where I encode my chromosome in a binary string, where every 32 bits represent a value. The problem is that the functions that I optimize have more than 3000 parameters, which means that I have more than 96000 bits in my bit string, and the manipulations I do on this just slow down ...

I continued with the following: I have a binary class where I create a boolean array that represents my large binary string. Then I manipulate this binary string with various changes and movements.

My question is, is there a better way to do this? Speed ​​just kills. I am sure that it would be nice if I did this on only one bit string, but I need to do manipulations with 25-bit strings for more than 1000 generations.

+4
source share
5 answers

I would take a step back. Your analysis seems to be related to detailing the implementation, namely that you selected bool [] as the way you represent the string of bits.

Clear your mind of bools and arrays and make a complete list of the operations you really need to perform, how often they happen, and how fast they should be. Ideally, consider whether your speed requirement is average speed or worst speed. (There are many data structures that achieve high average speed, having one expensive operation for every thousand cheap operations, if any expensive operations are unacceptable, then you need to know what lies ahead.)

Once you have this list, you can research which data structures work well.

For example, suppose your list of operations is:

  • build bit sequences of order 32 bits
  • combine a sequence of 3000 bits in series to form new sequences of bits
  • add new bit sequences to existing long bit sequences in specific places quickly

Given only this list of operations, I think the data structure you want is a catenable deque . Catenable deques supports quick insertion at both ends and can be effectively split into two levels. Then it’s easy to insert the material in the middle of the deque: break the deque up, insert the element at the end of one half and rejoin them.

However, if you then add another operation to the problem, say, “find a specific bit string anywhere in the sequence of 90,000 bits, and find the result in sublinear time, ” and then just catenable deque is not going to do this. Search deck is slow. There are other data structures that support this operation.

+9
source

If I understood correctly, you are encoding a bit array in bool[] . The first obvious optimization would be to change this to int[] (or even long[] ) and, if possible, use bit operations for the entire machine word.

For example, this would make the shifts more efficient by ~ factor 4.

+2
source

Is a BitArray class without help?

+1
source

A BitArray will probably be faster than a boolean array, but you still won’t get the built-in support for shifting 96,000 bits.

GPUs are extremely good at massive bit operations. Perhaps Brahma , CUDA.NET or Accelerator might be useful?

0
source

Let me understand; you are currently using a sequence of 32-bit values ​​for the chromosome. Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?

If it is DNA, you are dealing with 4 meanings; A, C, G, T. This can be encoded in 2 bits, forcing the byte to contain 4 values. Your 3000-element chromosome sequence can be stored in a byte array of 750 elements; that nothing really.

Your two most expensive operations relate to and from compressed bitstream. I would recommend an enumeration with a byte key:

 public enum DnaMarker : byte { A, C, G, T }; 

Then you go from 4 to byte with one operation:

 public static byte ToByteCode(this DnaMarker[] markers) { byte output = 0; for(byte i=0;i<4;i++) output = (output << 2) + (byte)markers[i]; } 

... and analyze them with something like this:

 public static DnaMarker[] ToMarkers(this byte input) { var result = new byte[4]; for(byte i=0;i<4;i++) result[i] = (DnaMarker)(input - (input >> (2*(i+1)))); return result; } 

You can see a slight increase in performance using four parameters (print them if necessary), and also select and use the array on the heap. But you lose the iteration, which makes the code more compact.

Now, because you are packing them into four-byte “blocks”, if the length of your sequence is not always a short four-fold, you will end up “filling” the end of your block with zero values ​​(A). The work around this is messy, but if you have a 32-bit integer that indicates the exact number of tokens, you can simply discard everything you find in the byte stream.

Hence the endless possibilities; you can convert an array of enumerations to a string, just call ToString () on each of them, and you can also feed a string and get an array of enumerations by repeating with Enum.Parse ().

And always remember, if the memory does not have premium (usually it is not), it almost always processes data faster in a user-friendly format, and not in the most compact format. The only big exception is network transmission; if you needed to send 750 bytes versus 12K via the Internet, the obvious advantage is a smaller size.

0
source

Source: https://habr.com/ru/post/1346075/


All Articles