Of course you can. There are some very simple approaches to the problem that I can think of:
Partial decoding
Each base64 character encodes 6 input bits, so you can relate them as follows:
Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH Data: xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee
If you want to extract 4 bytes of data starting at offset 1, for example:
................................ Base64: AAAAAABBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGGHHHHHH Data: xxxxxxxxyyyyyyyyzzzzzzzzqqqqqqqqwwwwwwwweeeeeeee ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Then, to decode only the parts you want, you need to know the distance bit. They are easy to compute, just multiply your byte distances by 8. Now that you know you want 32 bits, starting at bit 8, you can find which base64 character contains your start bits. To do this, divide offset and offset+length by 6:
start = bit 8 = char 1 + bit 2 end = bit 40 = char 6 + bit 4
Well, this is comparable to the above diagram: your range starts after 1 full base64 char and 2 bits and ends after 6 full base64 characters and 4 bits.
Now that you know the exact base64 characters you need, you need to decode them. To do this, it makes sense to use existing base64 decoders, so we will not need to deal with base64 encoding. And to do this, you should know that every 4 characters of base64 code corresponds to 3 bytes of data. So, here's the trick: you can add and add gibberish to your extracted base code, until the base64 and byte boundaries are aligned - and knowing which invalid input the base64 decoder will generate, throw away the excess.
So, how much to add depends on the value of the remainder of the bits. If the residual bit of the bit is 0, this means that A and x aligned, so no changes are required:
|
If the remainder bit is 2, you need to add one base64 char and throw out one leading byte after decoding:
#
If the remainder of the bits is 4, you need to add two base64 characters and throw out the two leading bytes after decoding:
#
The same goes for trailing. If the final bit of the remainder is zero, no change:
If the end bit of the bit is 2, you need to add two base64 characters and throw out two bytes:
=========
If the remainder at the end of the bit is 4, you need to add one base64 char and throw out one final byte:
===============
So, for the synthetic example above, one character should be added (instead of A ) and one character added (instead of H ):
................................ Base64: ??????BBBBBBCCCCCCDDDDDDEEEEEEFFFFFFGGGGGG?????? Data: ????????yyyyyyyyzzzzzzzzqqqqqqqqwwwwwwww???????? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, after decoding, throw out the extra bytes from the head and tail, and you're done.
Practical example
Imagine you have magic like ?PNG\r\n??????IHDR . Then, to check if the base64 encoded string matches your magic, you can determine the bytes in the magic that are known and their bit offsets and lengths:
"PNG\r\n" -> offset = 8, length = 40 "IHDR" -> offset = 96, length = 32
So, using our ideas above:
"PNG\r\n" -> start = 8 ( char 1, bits = 2 ), end = 48 ( char 8, bits = 0 ) "IHDR" -> start = 96 ( char 16, bits = 0 ), end = 128 ( char 21, bits = 2 )
To decode the "PNG\r\n" , you need to take 7 full base64 characters starting with char 1, then add 1 char, decode, throw 1 leading byte and compare.
To decode the "IHDR" , you need to take 6 base64 characters, starting with char 16, then add 2 characters, decode, throw 2 bytes and compare them.
Translate magic
An alternative approach to what I described above, instead of translating data, translates magic.
So, if you have some magic ?PNG\r\n??????IHDR (I replaced \r and \n for presentation purposes), as in the example above, when it is encoded in base64, it looks like this:
Data: [?PN] [Grn] [???] [???] [IHD] [R??] Base64: (?~BO) (Rw0K) (????) (????) (SUhE) (Ug==)
In the ?~BO part, the sign ~ is only partially random. Letβs look at this construction in a dimensional manner:
Data: ????????PPPPPPPPNNNNNNNN Base64: ??????~~~~~~BBBBBBOOOOOO
So, only the two least significant bits ~ really unknown, which means that you can use this information when checking magic against data to narrow the scope of magic.
In this particular case, here is a complete list of all encodings:
Data: ??????00PPPPPPPPNNNNNNNN Base64: ??????FFFFFFBBBBBBOOOOOO => ?FBO Data: ??????01PPPPPPPPNNNNNNNN Base64: ??????VVVVVVBBBBBBOOOOOO => ?VBO Data: ??????10PPPPPPPPNNNNNNNN Base64: ??????llllllBBBBBBOOOOOO => ?lBO Data: ??????11PPPPPPPPNNNNNNNN Base64: ??????111111BBBBBBOOOOOO => ?1BO
The same applies to the final group R?? but due to the presence of 4 undefined bits instead of 2, the list of permutations is longer:
Ug?? <= 0000???? ???????? Uh?? <= 0001???? ???????? Ui?? <= 0010???? ???????? Uj?? <= 0011???? ???????? Uk?? <= 0100???? ???????? Ul?? <= 0101???? ???????? Um?? <= 0110???? ???????? Un?? <= 0111???? ???????? Uo?? <= 1000???? ???????? Up?? <= 1001???? ???????? Uq?? <= 1010???? ???????? Ur?? <= 1011???? ???????? Us?? <= 1100???? ???????? Ut?? <= 1101???? ???????? Uu?? <= 1110???? ???????? Uv?? <= 1111???? ????????
So, in regexp, your base64-magic for ?PNG\r\n??????IHDR will look like this:
rx = re.compile(b'^.[FVl1]BORw0K........SUhEU[gv]') if rx.match(base64.b64encode(b'xPNG\r\n123456IHDR789foobar')): print('Yep, it works!')