Store arbitrary binary data on a system that accepts only valid UTF-8

I have arbitrary binary data. I need to save it on a system that expects valid UTF-8. The data will never be interpreted as text; I just need to store it, retrieve it, and restore my original bytes.

Base64 will obviously work, but I can't afford that much size inflation.

How can I easily achieve this in Python 2.7?

+5
3 answers

You will need to express your data using only ASCII characters. Base64 is the most efficient method available in the Python standard library for turning binary data into printable text that is also safe for UTF-8. It inflates the data by 33%, but the other standard-library options cost even more.

You can combine this with compression to limit the amount of space used, but make the compression optional (test it on the data) and only actually apply it when the result is smaller.

    import zlib
    import base64

    def pack_utf8_safe(data):
        # Compress only when it actually saves space.
        is_compressed = False
        compressed = zlib.compress(data)
        if len(compressed) < (len(data) - 1):
            data = compressed
            is_compressed = True
        base64_encoded = base64.b64encode(data)
        if is_compressed:
            # '.' is outside the Base64 alphabet, so it can safely
            # flag a compressed payload.
            base64_encoded = '.' + base64_encoded
        return base64_encoded

    def unpack_utf8_safe(base64_encoded):
        decompress = False
        if base64_encoded.startswith('.'):
            base64_encoded = base64_encoded[1:]
            decompress = True
        data = base64.b64decode(base64_encoded)
        if decompress:
            data = zlib.decompress(data)
        return data

The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
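A quick round-trip check (a minimal sketch; os.urandom merely stands in for your real binary data):

    import os

    data = os.urandom(1024)
    packed = pack_utf8_safe(data)
    assert unpack_utf8_safe(packed) == data
    packed.decode('utf8')   # output is pure ASCII, so this never fails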

You could further shave off the 1 or 2 '=' padding characters from the end of the Base64-encoded data; they can be re-added when decoding (append '=' * (-len(encoded) % 4) to the end), but I'm not sure it is worth the bother.
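As a sketch (the helper names are my own, not from any library); apply these to the Base64 text itself, i.e. before adding or after removing the '.' marker:

    def strip_padding(encoded):
        # Base64 output length is always a multiple of 4 characters.
        return encoded.rstrip('=')

    def restore_padding(stripped):
        return stripped + '=' * (-len(stripped) % 4)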

You can achieve additional savings by switching to Base85 encoding, a 4-bytes-to-5-characters ASCII encoding for binary data, so only 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added b85encode() and b85decode() to the base64 module). You can use the python-mom project on 2.7:

    from mom.codec import base85

and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode().
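A minimal round-trip sketch, assuming the python-mom package is installed (binary_data is a placeholder for your own bytes):

    from mom.codec import base85

    encoded = base85.b85encode(binary_data)
    assert base85.b85decode(encoded) == binary_data

One caveat if you keep the '.' compression marker from the snippet above: check that '.' is not part of the alphabet of the Base85 variant your library uses; Ascii85, for instance, does include '.' in its alphabet.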

If you are 100% sure that nothing along the way will process your data as text (perhaps changing line endings, or interpreting and altering other control codes), you could also use Base128 encoding, reducing the overhead to 14.3% (8 characters for every 7 bytes). However, I cannot recommend a readily installable Python module; there is one hosted on GitHub, but I have not tested it.
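For illustration only, here is a from-scratch sketch of the 7-bytes-to-8-characters idea in pure Python (my own sketch, not the GitHub module mentioned above). Note that the output can contain control characters, which is exactly why this is only safe on a path that never treats the data as text:

    def b128encode(data):
        # 7 input bytes (56 bits) become 8 output bytes in the range 0-127.
        out = []
        for i in range(0, len(data), 7):
            chunk = data[i:i + 7]
            nbits = 8 * len(chunk)
            units = -(-nbits // 7)             # ceil(nbits / 7)
            n = 0
            for b in chunk:
                n = (n << 8) | ord(b)
            n <<= 7 * units - nbits            # left-align a partial chunk
            out.extend(chr((n >> (7 * (units - 1 - j))) & 0x7F)
                       for j in range(units))
        return ''.join(out)

    def b128decode(text):
        out = []
        for i in range(0, len(text), 8):
            chunk = text[i:i + 8]
            units = len(chunk)
            nbytes = 7 * units // 8            # exact inverse of the ceil above
            n = 0
            for c in chunk:
                n = (n << 7) | ord(c)
            n >>= 7 * units - 8 * nbytes       # drop the alignment padding
            out.extend(chr((n >> (8 * (nbytes - 1 - j))) & 0xFF)
                       for j in range(nbytes))
        return ''.join(out)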

+4

You can decode your bytes as ISO-8859-1 data, which always produces a valid Unicode string, and then encode that string as UTF-8:

    utf8_data = my_bytes.decode('iso8859-1').encode('utf8')
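Restoring the original bytes is the mirror image of that line:

    my_bytes = utf8_data.decode('utf8').encode('iso8859-1')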

On average, half of your bytes will fall in the range 0-127, which UTF-8 encodes as one byte, and half in the range 128-255, which UTF-8 encodes as two bytes, so the result will be about 50% larger than the input.

If your data has any structure at all, compressing it with zlib, as Martijn suggests, can reduce the size.

0

If your application really requires you to represent 256 different byte values in graphically distinguishable form, all you actually need is a set of 256 Unicode code points. Problem solved.

ASCII codes 33-126 are no-brainers; Unicode code points 160-255 are also good candidates for representing themselves, though you may want to exclude the few that are hard to tell apart (if you want OCR software or human readers to handle them reliably, á, å, ä and the like may be too similar). Fill out the rest of the set with code points that can be represented in two bytes: a fairly large pool, although again many of them are graphically indistinguishable from other glyphs in most renderings.

This scheme makes no attempt at compression. I expect you would get better results by compressing your data before encoding it, if size is a concern.
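A minimal sketch of that idea in Python 2.7. The code point choices here are illustrative, not a vetted alphabet; in practice you would swap out the confusable and invisible characters as described above:

    # 94 printable ASCII chars (33-126) plus 96 Latin-1 chars (160-255)
    # leaves 66 slots, filled here from the Cyrillic block (U+0410-U+0451).
    ALPHABET = ([unichr(c) for c in range(33, 127)] +
                [unichr(c) for c in range(160, 256)] +
                [unichr(c) for c in range(0x410, 0x452)])
    assert len(ALPHABET) == 256
    REVERSE = dict((ch, i) for i, ch in enumerate(ALPHABET))

    def encode(data):
        # One code point per byte; each is 1 or 2 bytes in UTF-8.
        return u''.join(ALPHABET[ord(b)] for b in data).encode('utf8')

    def decode(text):
        return ''.join(chr(REVERSE[ch]) for ch in text.decode('utf8'))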

0

Source: https://habr.com/ru/post/1201475/

