You need to express your data using ASCII characters only. Base64 is the most efficient method available in the Python standard library for turning binary data into printable text that is also UTF-8 safe. Sure, it takes 33% more space to express the same data (4 characters for every 3 bytes), but the other methods in the standard library take more space still.
You can combine this with compression to limit how much space is used, but make the compression optional (flag the compressed data) and only actually use it when the compressed result is smaller.
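To make the overhead concrete, here is a small sketch (assuming Python 3; the variable names are only for illustration) that encodes some arbitrary bytes and checks that the result is plain ASCII, and therefore also valid UTF-8:

```python
import base64

raw = bytes(range(256)) * 3          # 768 bytes of arbitrary binary data
encoded = base64.b64encode(raw)      # bytes drawn from the Base64 alphabet only
text = encoded.decode('ascii')       # pure ASCII, hence also valid UTF-8

print(len(raw), len(encoded))        # 768 1024, i.e. 4 characters per 3 bytes
assert base64.b64decode(text) == raw
```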
    import zlib
    import base64

    def pack_utf8_safe(data):
        # Only keep the compressed form if it saves more than the
        # one-byte marker costs.
        is_compressed = False
        compressed = zlib.compress(data)
        if len(compressed) < (len(data) - 1):
            data = compressed
            is_compressed = True
        base64_encoded = base64.b64encode(data)
        if is_compressed:
            # b'.' is a plain str on Python 2 and bytes on Python 3
            base64_encoded = b'.' + base64_encoded
        return base64_encoded

    def unpack_utf8_safe(base64_encoded):
        # A leading '.' marks data that was compressed before encoding.
        decompress = False
        if base64_encoded.startswith(b'.'):
            base64_encoded = base64_encoded[1:]
            decompress = True
        data = base64.b64decode(base64_encoded)
        if decompress:
            data = zlib.decompress(data)
        return data
The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
You could go further and shave off the 1 or 2 '=' padding characters from the end of the Base64 encoded data; they can be re-added when decoding (append '=' * (-len(encoded) % 4) to the end), but I'm not sure it is worth the bother.
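A minimal sketch of that padding trick, assuming Python 3 and plain Base64 bytes without the '.' marker; the helper names b64encode_nopad and b64decode_nopad are made up for this example:

```python
import base64

def b64encode_nopad(data):
    # Drop the trailing '=' padding (0, 1 or 2 characters).
    return base64.b64encode(data).rstrip(b'=')

def b64decode_nopad(encoded):
    # Re-add the padding so the length is a multiple of 4 again.
    return base64.b64decode(encoded + b'=' * (-len(encoded) % 4))

print(b64encode_nopad(b'a'))                    # b'YQ' instead of b'YQ=='
print(b64decode_nopad(b64encode_nopad(b'a')))   # b'a'
```

The saving is at most 2 characters per value, so it only matters when you store a great many short values.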
You can achieve further savings by switching to Base85 encoding, an ASCII-safe encoding for binary data with a 4-to-5 ratio (5 characters for every 4 bytes), so a 25% overhead instead of 33%. For Python 2.7, this is only available in an external library (Python 3.4 added it to the base64 module). You can use the python-mom project in 2.7:
    from mom.codec import base85
and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls.
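On Python 3.4 and later no external library should be needed. The sketch below uses the standard library's base64.b85encode() and base64.b85decode() (not python-mom) and compares the encoded sizes for the same input:

```python
import base64

data = bytes(range(256))                  # 256 bytes of binary data

encoded64 = base64.b64encode(data)
encoded85 = base64.b85encode(data)        # standard library, Python 3.4+

print(len(encoded64), len(encoded85))     # 344 320
assert base64.b85decode(encoded85) == data
```

As far as I know the alphabet used by b85encode() does not contain '.', so the compression marker from the functions above should keep working, but check that against the encoder you actually use.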
If you are 100% certain that nothing along the way is going to treat your data as text (possibly altering line breaks, or interpreting and altering other control codes), you could also use Base128 encoding, which reduces the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot recommend an installable Python module for it, however; there is a module hosted on GitHub, but I have not tested it.
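To show what the 7-bits-per-character idea amounts to, here is a rough sketch in Python 3. It is not the GitHub module mentioned above and does not follow any standard format; b128encode and b128decode are hypothetical names. Every output byte is below 128, so the result is valid UTF-8, but it freely contains control characters such as NUL and newlines.

```python
def b128encode(data):
    """Pack 8-bit bytes into 7-bit values (8 output bytes per 7 input bytes)."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of unread bits held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1          # keep only the unread bits
    if nbits:
        # Left-align the leftover bits in a final 7-bit value.
        out.append((acc << (7 - nbits)) & 0x7F)
    return bytes(out)

def b128decode(encoded):
    """Reverse b128encode(); trailing padding bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for value in encoded:
        acc = (acc << 7) | (value & 0x7F)
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)

payload = bytes(range(256))
packed = b128encode(payload)
assert b128decode(packed) == payload
assert all(value < 128 for value in packed)
```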