Python, can someone guess the file type only by its base64 encoding?

Question

Python, can someone guess the file type only by its base64 encoding?

Say I have the following:

image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""

This is just a point image (from https://en.wikipedia.org/wiki/Data_URI_scheme ). But I do not know if this is an image or text, etc. Is it possible to understand that this is only with this encoded string? I try this in Python, but this is also a general question. Therefore, any understanding in both cases is very welcome.

+5

python base64

george Dec 15 '15 at 11:17

source share

3 answers

Of course you can. There are some very simple approaches to the problem that I can think of:

Partial decoding

Each base64 character encodes 6 input bits, so you can relate them as follows:

 Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH Data: xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee

If you want to extract 4 bytes of data starting at offset 1, for example:

  ................................ Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH Data: xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Then, to decode only the parts you want, you need to know the distance bit. They are easy to compute, just multiply your byte distances by 8. Now that you know you want 32 bits, starting at bit 8, you can find which base64 character contains your start bits. To do this, divide offset and offset+length by 6:

 start = bit 8 = char 1 + bit 2 end = bit 40 = char 6 + bit 4

Well, this is comparable to the above diagram: your range starts after 1 full base64 char and 2 bits and ends after 6 full base64 characters and 4 bits.

Now that you know the exact base64 characters you need, you need to decode them. To do this, it makes sense to use existing base64 decoders, so we will not need to deal with base64 encoding. And to do this, you should know that every 4 characters of base64 code corresponds to 3 bytes of data. So, here's the trick: you can add and add gibberish to your extracted base code, until the base64 and byte boundaries are aligned - and knowing which invalid input the base64 decoder will generate, throw away the excess.

So, how much to add depends on the value of the remainder of the bits. If the residual bit of the bit is 0, this means that A and x aligned, so no changes are required:

  |========================== Base64: ...AAAAAABBBBBBCCCCCCDDDDDD... Data: ...xxxxxxxxyyyyyyyyzzzzzzzz... |==========================

If the remainder bit is 2, you need to add one base64 char and throw out one leading byte after decoding:

  ##|================== Base64: ...AAAAAABBBBBBCCCCCCDDDDDD... Data: ...xxxxxxxxyyyyyyyyzzzzzzzz... |==================

If the remainder of the bits is 4, you need to add two base64 characters and throw out the two leading bytes after decoding:

  ####|========== Base64: ...AAAAAABBBBBBCCCCCCDDDDDD... Data: ...xxxxxxxxyyyyyyyyzzzzzzzz... |==========

The same goes for trailing. If the final bit of the remainder is zero, no change:

  ===| Base64: ...AAAAAABBBBBBCCCCCCDDDDDD... Data: ...xxxxxxxxyyyyyyyyzzzzzzzz... ===|

If the end bit of the bit is 2, you need to add two base64 characters and throw out two bytes:

  =========##| Base64: ...AAAAAABBBBBBCCCCCCDDDDDD... Data: ...xxxxxxxxyyyyyyyyzzzzzzzz... ===========|

If the remainder at the end of the bit is 4, you need to add one base64 char and throw out one final byte:

  ===============####| Base64: ...AAAAAABBBBBBCCCCCCDDDDDD... Data: ...xxxxxxxxyyyyyyyyzzzzzzzz... ===================|

So, for the synthetic example above, one character should be added (instead of A ) and one character added (instead of H ):

  ................................ Base64: ??????BBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGG?????? Data: ????????yyyyyyyyzzzzzzzzqqqqqqqqwwwwwwww???????? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, after decoding, throw out the extra bytes from the head and tail, and you're done.

Practical example

Imagine you have magic like ?PNG\r\n??????IHDR . Then, to check if the base64 encoded string matches your magic, you can determine the bytes in the magic that are known and their bit offsets and lengths:

 "PNG\r\n" -> offset = 8, length = 40 "IHDR" -> offset = 96, length = 32

So, using our ideas above:

 "PNG\r\n" -> start = 8 ( char 1, bits = 2 ), end = 48 ( char 8, bits = 0 ) "IHDR" -> start = 96 ( char 16, bits = 0 ), end = 128 ( char 21, bits = 2 )

To decode the "PNG\r\n" , you need to take 7 full base64 characters starting with char 1, then add 1 char, decode, throw 1 leading byte and compare.

To decode the "IHDR" , you need to take 6 base64 characters, starting with char 16, then add 2 characters, decode, throw 2 bytes and compare them.

Translate magic

An alternative approach to what I described above, instead of translating data, translates magic.

So, if you have some magic ?PNG\r\n??????IHDR (I replaced \r and \n for presentation purposes), as in the example above, when it is encoded in base64, it looks like this:

 Data: [?PN] [Grn] [???] [???] [IHD] [R??] Base64: (?~BO) (Rw0K) (????) (????) (SUhE) (Ug==)

In the ?~BO part, the sign ~ is only partially random. Let’s look at this construction in a dimensional manner:

 Data: ????????PPPPPPPPNNNNNNNN Base64: ??????~~~~~~BBBBBBOOOOOO

So, only the two least significant bits ~ really unknown, which means that you can use this information when checking magic against data to narrow the scope of magic.

In this particular case, here is a complete list of all encodings:

 Data: ??????00PPPPPPPPNNNNNNNN Base64: ??????FFFFFFBBBBBBOOOOOO => ?FBO Data: ??????01PPPPPPPPNNNNNNNN Base64: ??????VVVVVVBBBBBBOOOOOO => ?VBO Data: ??????10PPPPPPPPNNNNNNNN Base64: ??????llllllBBBBBBOOOOOO => ?lBO Data: ??????11PPPPPPPPNNNNNNNN Base64: ??????111111BBBBBBOOOOOO => ?1BO

The same applies to the final group R?? but due to the presence of 4 undefined bits instead of 2, the list of permutations is longer:

 Ug?? <= 0000???? ???????? Uh?? <= 0001???? ???????? Ui?? <= 0010???? ???????? Uj?? <= 0011???? ???????? Uk?? <= 0100???? ???????? Ul?? <= 0101???? ???????? Um?? <= 0110???? ???????? Un?? <= 0111???? ???????? Uo?? <= 1000???? ???????? Up?? <= 1001???? ???????? Uq?? <= 1010???? ???????? Ur?? <= 1011???? ???????? Us?? <= 1100???? ???????? Ut?? <= 1101???? ???????? Uu?? <= 1110???? ???????? Uv?? <= 1111???? ????????

So, in regexp, your base64-magic for ?PNG\r\n??????IHDR will look like this:

 rx = re.compile(b'^.[FVl1]BORw0K........SUhEU[gv]') if rx.match(base64.b64encode(b'xPNG\r\n123456IHDR789foobar')): print('Yep, it works!')

+2

toriningen Dec 15 '15 at 12:55

source share

His PNG image

 import base64 encoded_string = 'iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==' decoded_string = base64.b64decode(encoded_string) print 'Decoded :', decoded_string

Output:

 python base_decode.py Decoded :  PNG

-2

devnull Dec 15 '15 at 11:30

source share

Martijn pieters · Accepted Answer · 2015-12-15T11:24:31+0000

You cannot, at least not without decoding, because the bytes that help identify the file type are propagated by base64 characters, which do not directly align with all bytes. Each character encodes 6 bits , which means that for every 4 characters 3 bytes are encoded.

File type identification requires access to these bytes in different block sizes. For example, a JPEG image can be identified from FF D8 or FF D9 bytes, but these are two bytes; the next third byte should also be encoded as part of a 4-character block.

What you can do is simply decode the base64 string to fingerprint your file. So you can decode the first 4 characters to get 3 bytes, and then use the first two to see if the object is a JPEG image. A large number of file formats can be identified only from the first or last series of bytes (a PNG image can be identified using the first 8 bytes, GIF by the first 6, etc.). Decoding only these bytes from the base64 string is trivial.

Your sample is a PNG image; you can test image types with the imghdr module:

 >>> import imghdr >>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==""" >>> sample = image_data[:44].decode('base64') # 33 bytes / 3 times 4 is 44 base64 chars >>> for tf in imghdr.tests: ... res = tf(sample, None) ... if res: ... break ... >>> print res png

I used only the first 33 bytes from base64 data to repeat what the imghdr.what() function will read from the file you are transmitting (it reads 32 bytes, but this number does not divide by 3).

There is an equivalent soundhdr module , and there is also a python-magic project that allows you to pass the number of bytes to determine the file type.

Python, can someone guess the file type only by its base64 encoding?

Partial decoding

Practical example

Translate magic

More articles: