I have the following sequence that I would like to Huffman-encode and store efficiently in a bit array:

    >>> print sequence
    GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|
The symbol frequencies in the sequence are:
    >>> print freqTuples
    [(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]
I translated this into a Huffman code dictionary:
    >>> print codeDict
    {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
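For reference, a table like this can be produced with the textbook heap-based Huffman construction (a Python 3 sketch; tie-breaking among equal counts may yield different bit patterns than mine, but the code lengths come out the same):

```python
import heapq
from collections import Counter

def huffman_codes(seq):
    """Standard heap-based Huffman construction over symbol counts."""
    counts = Counter(seq)
    # Each heap entry: (count, tie-break index, partial code table).
    heap = [(n, i, {sym: ''}) for i, (sym, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n0, _, t0 = heapq.heappop(heap)  # two least-frequent subtrees
        n1, _, t1 = heapq.heappop(heap)
        # Prefix the codes of one subtree with '0' and the other with '1'.
        merged = {s: '0' + c for s, c in t0.items()}
        merged.update({s: '1' + c for s, c in t1.items()})
        heapq.heappush(heap, (n0 + n1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"
codes = huffman_codes(sequence)
```

On this input, 'A' gets a 1-bit code, 'T', 'C', 'G' get 3 bits each, and 'N', '|' get 4 bits each, matching the dictionary above.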
Then I used the Python bitstring package to translate the string, character by character, into an instance of the BitArray class (which I call bitArray), containing the bits for each character encoded with its Huffman code:

    >>> print bitArray.bin
    0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
Here is the bit array in bytes:

    >>> print bitArray.tobytes()
    !I\254\363[^D\260^Z\240Ap
I have to use tobytes() instead of the bytes property, since the bit array I generate does not divide evenly into 8-bit segments.
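Concatenating the codes by hand reproduces both outputs above; here is a dependency-free Python 3 sketch that also mimics the right-zero-padding tobytes() applies to reach a byte boundary:

```python
codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"

# Concatenate the per-symbol codes into one bit string.
bits = ''.join(codeDict[c] for c in sequence)

# Pad with zeros on the right to a whole number of bytes, as tobytes() does.
padded = bits + '0' * (-len(bits) % 8)
packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

print(len(bits))    # 84 bits of actual payload
print(len(packed))  # 11 bytes after padding
```

The 84-bit payload becomes 11 bytes, which is where the `!I\254\363...` output comes from.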
When I calculate the storage efficiency of the BitArray (the ratio of the size of the bit array to the size of the input string), I get worse performance than if I had left the input string unencoded:

    >>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
    1.2972972973
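One thing worth noting about that measurement: sys.getsizeof reports the size of the whole Python object, including the interpreter's fixed per-object overhead, not just the payload bytes, so it will always overstate what would actually be written to disk (a quick check in CPython 3 here; exact figures vary by version):

```python
import sys

payload = b'\x21\x49\xac\xf3'  # a few packed bytes, for illustration
print(sys.getsizeof(payload))  # object header + payload (well over 4)
print(len(payload))            # payload only: 4
```

For a short input, that fixed overhead dominates the ratio, which would explain a value above 1.0.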
Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, appearing to approach an asymptotic limit of about 0.28. I would like to confirm whether this is the correct way to measure it.)
Edit
The following two approaches give different answers:
    >>> print len(bitArray.tobytes()) / float(len(mergedSequence))
    0.297297297297
    >>> print bitArray.len / (8.*len(mergedSequence))
    0.283783783784
I'm not sure which to believe. But since I will be writing the data to storage as bytes, I'm inclined to go with the first result.
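The gap between the two numbers is consistent with tobytes() zero-padding the bit stream up to a whole byte (a quick Python 3 check of the arithmetic, assuming mergedSequence here is the same 37-character sequence as above, since the numbers match):

```python
import math

bit_len = 84   # bits in the Huffman-coded stream
seq_len = 37   # characters in the input sequence

byte_len = math.ceil(bit_len / 8)   # tobytes() pads 84 bits to 11 bytes
print(byte_len / seq_len)           # 11/37  = 0.297297...
print(bit_len / (8.0 * seq_len))    # 84/296 = 0.283783...
```

So the first ratio includes up to 7 bits of padding per stream; the second is the true bits-per-input-bit figure, and the two converge as the input grows.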