I have the following sequence that I would like to Huffman-encode and store efficiently in a bit array:

    >>> print sequence
    GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|
The symbol frequencies in the sequence are:
    >>> print freqTuples
    [(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]
I translated this into a Huffman code dictionary:
    >>> print codeDict
    {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
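For reference, a table like this can be produced with the textbook heap-based Huffman construction (a Python 3 sketch; tie-breaking among equal counts may yield different bit patterns than mine, but the code lengths come out the same):

```python
import heapq
from collections import Counter

def huffman_codes(seq):
    """Standard heap-based Huffman construction over symbol counts."""
    counts = Counter(seq)
    # Each heap entry: (count, tie-break index, partial code table).
    heap = [(n, i, {sym: ''}) for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n0, _, t0 = heapq.heappop(heap)  # two least-frequent subtrees
        n1, _, t1 = heapq.heappop(heap)
        # Prefix the codes of one subtree with '0' and the other with '1'.
        merged = {s: '0' + c for s, c in t0.items()}
        merged.update({s: '1' + c for s, c in t1.items()})
        heapq.heappush(heap, (n0 + n1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"
codes = huffman_codes(sequence)
```

On this input, 'A' gets a 1-bit code, 'T', 'C', 'G' get 3 bits each, and 'N', '|' get 4 bits each, matching the dictionary above.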
Then I used the Python bitstring package to translate the string, character by character, into an instance of the BitArray class (which I call bitArray), containing the bits for each character encoded with its Huffman code:

    >>> print bitArray.bin
    0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
Here is the bit array in bytes:

    >>> print bitArray.tobytes()
    !I\254\363[^D\260^Z\240Ap
I have to use tobytes() instead of the bytes property, since the bit array I generate does not divide evenly into 8-bit segments.
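Concatenating the codes by hand reproduces both outputs above; here is a dependency-free Python 3 sketch that also mimics the right-zero-padding tobytes() applies to reach a byte boundary:

```python
codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"

# Concatenate the per-symbol codes into one bit string.
bits = ''.join(codeDict[c] for c in sequence)

# Pad with zeros on the right to a whole number of bytes, as tobytes() does.
padded = bits + '0' * (-len(bits) % 8)
packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

print(len(bits))    # 84 bits of actual payload
print(len(packed))  # 11 bytes after padding
```

The 84-bit payload becomes 11 bytes, which is where the `!I\254\363...` output comes from.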
When I calculate the storage efficiency of the BitArray (the ratio of the size of the bit array to the size of the input string), I get worse performance than if I had left the input string unencoded:

    >>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
    1.2972972973
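One thing worth noting about that measurement: sys.getsizeof reports the size of the whole Python object, including the interpreter's fixed per-object overhead, not just the payload bytes, so it will always overstate what would actually be written to disk (a quick check in CPython 3 here; exact figures vary by version):

```python
import sys

payload = b'\x21\x49\xac\xf3'  # a few packed bytes, for illustration
print(sys.getsizeof(payload))  # object header + payload (well over 4)
print(len(payload))            # payload only: 4
```

For a short input, that fixed overhead dominates the ratio, which would explain a value above 1.0.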
Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, appearing to approach an asymptotic limit of about 0.28. I would like to confirm whether this is the correct way to measure it.)
Edit
The following two approaches give different answers:
    >>> print len(bitArray.tobytes()) / float(len(mergedSequence))
    0.297297297297
    >>> print bitArray.len / (8.*len(mergedSequence))
    0.283783783784
I'm not sure which to believe. But since I will be writing the data to storage as bytes, I'm inclined to go with the first result.
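The gap between the two numbers is consistent with tobytes() zero-padding the bit stream up to a whole byte (a quick Python 3 check of the arithmetic, assuming mergedSequence here is the same 37-character sequence as above, since the numbers match):

```python
import math

bit_len = 84   # bits in the Huffman-coded stream
seq_len = 37   # characters in the input sequence

byte_len = math.ceil(bit_len / 8)   # tobytes() pads 84 bits to 11 bytes
print(byte_len / seq_len)           # 11/37  = 0.297297...
print(bit_len / (8.0 * seq_len))    # 84/296 = 0.283783...
```

So the first ratio includes up to 7 bits of padding per stream; the second is the true bits-per-input-bit figure, and the two converge as the input grows.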