Java: BaseN universal encoder / decoder working with large data sizes

I am looking for a suitable BaseN encoder (with custom encoding) in Java, which is not limited by the size of the input data (byte array).

Something like this:

https://github.com/mklemm/base-n-codec-java

But for "unlimited" data length, without an undue memory / performance penalty and without abusing "BigInteger" magic. Just something that works like the standard BASE64 encoders, but is universal for any base / alphabet. Any solution or idea on how to achieve this is welcome.

Maybe someone has experience with the Apache Commons Codec BaseNCodec:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/BaseNCodec.html

It looked promising, but it is an abstract class, and writing a concrete implementation on top of it looks harder than starting from scratch.


I need this to encode binary data with a custom character set, where the number of characters in the set varies ("ABCDE" = Base5, "ABCDE-+*/." = Base10, ...).

Update: "Base N Codec" from GitHub (above) seems to be buggy, so in the end I used the following code:

https://dzone.com/articles/base-x-encoding

+5
3 answers

General answer: No. Special case: Yes, for bases that are a power of 2.

Why? Because the requirements in the question are in "strong competition" (actually, probably a "contradiction").

Can you obtain the results of multiplication and division without performing the multiplication and division? No, that is a contradiction: if you have the results then, by definition, you performed the calculation.

So it’s not a question of whether you can avoid calculations, but a question of how to arrange them.

  • If the bases involved (N and / or M) are powers of 2, then multiplication / division can be done with simple bit shifts: the same calculation, but heavily streamlined. This avoids BigInteger calculations entirely.
  • Otherwise, you can cache repeated calculations by storing intermediate results in an array or HashMap: the same calculations, but optimized. BigInteger-style calculations are still required; you only avoid the duplicates.
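The power-of-2 case from the first bullet can be sketched roughly as follows. This is only an illustration, not code from the question or from any library: the class name `PowerOfTwoCodec` is hypothetical, the alphabet is assumed to have 2, 4, 8, ..., 256 distinct characters, and no padding characters are emitted. Only a small bit accumulator is kept, so memory use does not grow with the input beyond the output itself.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: base 2^k codec with a custom alphabet, using only shifts and masks.
final class PowerOfTwoCodec {
    private final String alphabet;
    private final int bitsPerDigit;

    PowerOfTwoCodec(String alphabet) {
        int n = alphabet.length();
        if (n < 2 || n > 256 || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("alphabet size must be a power of 2 in [2, 256]");
        }
        this.alphabet = alphabet;
        this.bitsPerDigit = Integer.numberOfTrailingZeros(n); // k, where n = 2^k
    }

    String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int mask = alphabet.length() - 1;
        int buffer = 0; // bit accumulator
        int bits = 0;   // number of valid low bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= bitsPerDigit) {
                bits -= bitsPerDigit;
                out.append(alphabet.charAt((buffer >>> bits) & mask));
            }
            buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
        if (bits > 0) { // pad the last partial digit with zero bits
            out.append(alphabet.charAt((buffer << (bitsPerDigit - bits)) & mask));
        }
        return out.toString();
    }

    byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;
        int bits = 0;
        for (int i = 0; i < text.length(); i++) {
            int digit = alphabet.indexOf(text.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("character not in alphabet: " + text.charAt(i));
            }
            buffer = (buffer << bitsPerDigit) | digit;
            bits += bitsPerDigit;
            if (bits >= 8) {
                bits -= 8;
                out.write((buffer >>> bits) & 0xFF);
            }
            buffer &= (1 << bits) - 1;
        }
        return out.toByteArray(); // leftover padding bits are dropped
    }
}
```

With the alphabet "0123456789ABCDEF" this behaves like a hex encoder; with the 64-character Base64 alphabet it produces unpadded Base64 digits.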

Hope this helps your approach. :)

+4

Base-N encoding is quite efficient if N is a power of 2, since the conversion can then work on fixed-size groups of digits that correspond to fixed-size groups of bytes.

Base64: 64 = 2^6, so 6 bits per digit; therefore 4 digits = 24 bits = 3 bytes.
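The same arithmetic generalizes to any base 2^k: the smallest group that lines up on both digit and byte boundaries has lcm(8, k) bits. A minimal sketch (the class and method names are hypothetical, chosen only for this illustration):

```java
// Hypothetical helper: smallest digit/byte group for a base that is a power of 2.
final class GroupSize {
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Returns { digitsPerGroup, bytesPerGroup } for base = 2^k.
    static int[] groupFor(int base) {
        if (base < 2 || Integer.bitCount(base) != 1) {
            throw new IllegalArgumentException("base must be a power of 2");
        }
        int k = Integer.numberOfTrailingZeros(base); // bits per digit
        int g = gcd(8, k);
        // The group has lcm(8, k) = 8 * k / g bits: 8 / g digits and k / g bytes.
        return new int[] { 8 / g, k / g };
    }
}
```

For base 64 this gives 4 digits per 3 bytes, for base 32 it gives 8 digits per 5 bytes, and for base 16 it gives 2 digits per byte.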

Otherwise, schoolbook multiplication / division has to run over the entire length of the data, which amounts to a "BigInteger" calculation.
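For illustration, here is one way the schoolbook calculation over the whole array can look without the BigInteger class. It is a sketch under assumptions, not the code the asker ended up with: the class name `BaseNLongDivision` is hypothetical, the alphabet is assumed to have 2 to 256 distinct characters, and leading zero bytes are represented by repeating the first alphabet character. The running time is quadratic in the input length, which is exactly the cost described above.

```java
import java.util.Arrays;

// Hypothetical sketch: base-N conversion by schoolbook long division / multiplication.
final class BaseNLongDivision {
    static String encode(byte[] data, String alphabet) {
        int base = checkBase(alphabet);
        int zeros = 0; // leading zero bytes are preserved as leading "zero" digits
        while (zeros < data.length && data[zeros] == 0) zeros++;
        byte[] num = Arrays.copyOf(data, data.length); // working copy, big-endian
        StringBuilder sb = new StringBuilder();
        int start = zeros;
        while (start < num.length) {
            int remainder = 0;
            // Divide the whole remaining number by base, one byte at a time.
            for (int i = start; i < num.length; i++) {
                int cur = (remainder << 8) | (num[i] & 0xFF);
                num[i] = (byte) (cur / base);
                remainder = cur % base;
            }
            sb.append(alphabet.charAt(remainder)); // least significant digit first
            while (start < num.length && num[start] == 0) start++;
        }
        for (int i = 0; i < zeros; i++) sb.append(alphabet.charAt(0));
        return sb.reverse().toString();
    }

    static byte[] decode(String text, String alphabet) {
        int base = checkBase(alphabet);
        int zeros = 0;
        while (zeros < text.length() && text.charAt(zeros) == alphabet.charAt(0)) zeros++;
        // Each digit carries at most 8 bits (base <= 256), so this buffer is large enough.
        byte[] num = new byte[text.length()];
        for (int k = zeros; k < text.length(); k++) {
            int carry = alphabet.indexOf(text.charAt(k));
            if (carry < 0) throw new IllegalArgumentException("character not in alphabet");
            // num = num * base + digit, from the least significant byte upwards.
            for (int i = num.length - 1; i >= 0; i--) {
                int cur = (num[i] & 0xFF) * base + carry;
                num[i] = (byte) cur;
                carry = cur >>> 8;
            }
        }
        int first = 0;
        while (first < num.length && num[first] == 0) first++;
        byte[] out = new byte[zeros + num.length - first];
        System.arraycopy(num, first, out, zeros, num.length - first);
        return out;
    }

    private static int checkBase(String alphabet) {
        int base = alphabet.length();
        if (base < 2 || base > 256) throw new IllegalArgumentException("base must be in [2, 256]");
        return base;
    }
}
```

Because every output digit depends on every input byte, this form cannot be streamed the way Base64 can; the whole input has to be in memory.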

A bit faster than, for example, multiplying / dividing by the base N many times is to keep an array of the powers of N.

To encode a byte array into digits, you can use N^0, N^1, N^2, N^3, ... stored as byte arrays of shorter or equal length, together with repeated subtraction.

Because bytes are signed in Java, short may be a more suitable element type. Say the high byte of the number is 98 and the high byte of the N-power is 12: then the digit is about 7.

The same powers can be used to decode the digits back into a byte array.

Good luck.

+4

You mentioned two very different approaches. The BaseN algorithm in the GitHub implementation uses the mathematical notion of converting an integer between bases. This is the same as saying that decimal 10 is written 12 in base 8 or 1010 in base 2. The algorithm interprets the byte stream as one large number and converts it to the designated base.
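The number-conversion view can be tried directly with the JDK. The snippet below is only a demonstration of the idea, not the GitHub project's code; the class name `RadixDemo` and the helper `toRadix` are made up for this example. Note that `BigInteger.toString(int)` only supports radixes 2 to 36 with its fixed digit set and that leading zero bytes are lost, so it is not a replacement for a custom-alphabet codec.

```java
import java.math.BigInteger;

// Hypothetical demo: a byte stream read as one big unsigned number, printed in another base.
final class RadixDemo {
    static String toRadix(byte[] data, int radix) {
        // signum 1 = treat the bytes as an unsigned big-endian magnitude
        return new BigInteger(1, data).toString(radix);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toString(10, 8));              // 12
        System.out.println(Integer.toString(10, 2));              // 1010
        System.out.println(toRadix(new byte[] {0x01, 0x00}, 10)); // 256
    }
}
```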

Base64 is a completely different approach; you can see an example on the Base64 Wikipedia page. The algorithm basically splits the input stream into an array of 6-bit elements. 2^6 = 64, hence the name Base64. It has a table of 64 different characters and maps each element of the array (6 bits) to the corresponding character in that table.

I think you need to choose one of the two approaches, since they are very different and incompatible with each other. As for implementation details, the second method should be easier to implement, since you basically split the stream into fixed-size parts and encode each part according to your own table.

The first method can become quite complicated, since arbitrary-precision arithmetic relies on rather complex constructions. You can look at existing software, again on Wikipedia: List of arbitrary-precision arithmetic software.

Actually, I think that at some point it will be difficult to find enough symbols for your alphabet (as the base grows, i.e. as the number of bits per digit increases), unless you use the whole Unicode range :).

I hope this helps a little.

+1

Source: https://habr.com/ru/post/1259505/

