Choosing a compression algorithm to implement

I was given some coursework to implement a compression algorithm of my choice. It can be in any language; the languages I know best are Java, then C. It will be evaluated on the following criteria:

  • The decompressed output must match the original input, so I can only consider lossless algorithms.

  • The execution time should be proportional to the length of the message.

  • The memory requirement should not depend on the length of the message.

Our implementations will be tested as follows:

  • Standard text file

  • Binary file with byte values from 0 to 255

  • Large file (~10 MB) of unspecified content.

My initial thought is to use adaptive arithmetic coding, but I wonder whether there is an algorithm better suited to the restrictions above. Secondly, is it better to do this in C rather than in Java? I ask because I suspect C will use less memory, but I'm not sure whether that is actually true.

I have spent some time on this question, and several sites mention LZW coding combined with dynamic (adaptive) Huffman coding. Would that be a sensible approach? Our lecturer warned us that about 90% of the people who have attempted dynamic Huffman coding over the years implemented it incorrectly.

That said, I am not afraid to try, but I would appreciate some opinions before I start.

Any feedback would be highly appreciated.

+4
3 answers

Plain LZW, with no other coding on top, is pretty simple and works surprisingly well. Nobody would use LZW today, since there are other algorithms that compress better and faster, but for this assignment you cannot beat its simplicity. No Huffman coding, dynamic or otherwise. No Shannon–Fano. No arithmetic or range coding. And yes, its memory usage is independent of the message length. Mark Nelson wrote a very good explanation.

You can do this in C or Java, although C may be less error prone since it has unsigned types.
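
Just to make the simplicity concrete, here is a minimal Java sketch of table-based LZW over a byte array (a sketch only, assuming my own names LzwSketch/compress/decompress and a 4096-entry dictionary cap; it skips packing the codes into a bit stream, which a real submission would add). Capping the dictionary is what keeps memory usage independent of the input length:

    import java.util.*;

    // Hypothetical LZW sketch: compresses a byte array to a list of integer codes
    // and back. The 4096-entry cap keeps memory independent of the input length.
    public class LzwSketch {
        private static final int MAX_CODES = 4096; // 12-bit codes

        public static List<Integer> compress(byte[] input) {
            Map<String, Integer> dict = new HashMap<>();
            for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
            int nextCode = 256;

            List<Integer> out = new ArrayList<>();
            String w = "";
            for (byte b : input) {
                char c = (char) (b & 0xFF);              // treat bytes as unsigned
                String wc = w + c;
                if (dict.containsKey(wc)) {
                    w = wc;                              // keep extending the current phrase
                } else {
                    out.add(dict.get(w));                // emit the code for the known prefix
                    if (nextCode < MAX_CODES) dict.put(wc, nextCode++);
                    w = "" + c;
                }
            }
            if (!w.isEmpty()) out.add(dict.get(w));
            return out;
        }

        public static byte[] decompress(List<Integer> codes) {
            if (codes.isEmpty()) return new byte[0];
            Map<Integer, String> dict = new HashMap<>();
            for (int i = 0; i < 256; i++) dict.put(i, "" + (char) i);
            int nextCode = 256;

            StringBuilder out = new StringBuilder();
            String w = dict.get(codes.get(0));
            out.append(w);
            for (int i = 1; i < codes.size(); i++) {
                int code = codes.get(i);
                String entry = dict.containsKey(code)
                        ? dict.get(code)
                        : w + w.charAt(0);               // the classic "code not yet in table" case
                out.append(entry);
                if (nextCode < MAX_CODES) dict.put(nextCode++, w + entry.charAt(0));
                w = entry;
            }
            byte[] result = new byte[out.length()];
            for (int i = 0; i < result.length; i++) result[i] = (byte) out.charAt(i);
            return result;
        }

        public static void main(String[] args) {
            byte[] original = "TOBEORNOTTOBEORTOBEORNOT".getBytes();
            List<Integer> codes = compress(original);
            System.out.println(Arrays.equals(original, decompress(codes)));  // prints true
        }
    }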

+4

To answer your question: Shannon–Fano should be “good enough” for a task like this. If you have never done anything in data compression before, I would suggest avoiding Huffman coding (or the specialized versions of arithmetic coding).

According to this, Shannon–Fano meets your space/time requirements. I would suggest implementing something like it first. The pseudocode is roughly as follows:

    begin
        count source units
        sort source units into non-decreasing order
        SF-Split(S)
        output(count of symbols, encoded tree, symbols)
        write output
    end

    procedure SF-Split(S)
    begin
        if |S| > 1 then
        begin
            divide S into S1 and S2 with roughly equal total counts
            add 1 to the codes in S1
            add 0 to the codes in S2
            SF-Split(S1)
            SF-Split(S2)
        end
    end
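
To make the splitting step a bit more concrete, here is a hypothetical Java sketch of the counting, sorting, and SF-Split parts of that pseudocode (the ShannonFanoSketch and Symbol names and the exact split rule are my own illustration, not from the answer). It only assigns the code strings; writing the encoded tree and the actual compressed output are left out:

    import java.util.*;

    // Hypothetical sketch of the SF-Split step: counts byte frequencies, sorts
    // the symbols, and recursively assigns Shannon–Fano code strings.
    public class ShannonFanoSketch {

        static class Symbol {
            final int value;     // byte value 0..255
            final long count;    // frequency in the input
            String code = "";    // code bits accumulated by sfSplit
            Symbol(int value, long count) { this.value = value; this.count = count; }
        }

        // Split the sorted list so both halves carry roughly half of the total
        // frequency, give one half a '1' bit and the other a '0' bit, then recurse.
        static void sfSplit(List<Symbol> s) {
            if (s.size() <= 1) return;
            long total = 0;
            for (Symbol sym : s) total += sym.count;

            long running = 0;
            int split = 0;
            while (split < s.size() - 1 && running + s.get(split).count <= total / 2) {
                running += s.get(split).count;
                split++;
            }
            if (split == 0) split = 1;                   // both halves must be non-empty

            List<Symbol> s1 = s.subList(0, split);
            List<Symbol> s2 = s.subList(split, s.size());
            for (Symbol sym : s1) sym.code += '1';
            for (Symbol sym : s2) sym.code += '0';
            sfSplit(s1);
            sfSplit(s2);
        }

        public static void main(String[] args) {
            byte[] data = "ABRACADABRA".getBytes();
            long[] counts = new long[256];
            for (byte b : data) counts[b & 0xFF]++;

            List<Symbol> symbols = new ArrayList<>();
            for (int v = 0; v < 256; v++) {
                if (counts[v] > 0) symbols.add(new Symbol(v, counts[v]));
            }
            symbols.sort((a, b) -> Long.compare(a.count, b.count));  // non-decreasing, as in the pseudocode

            sfSplit(symbols);
            for (Symbol sym : symbols) System.out.println((char) sym.value + " -> " + sym.code);
        }
    }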

Only once you fully understand Shannon–Fano (or have used similar algorithms before) would I suggest moving on to the more demanding arithmetic coding methods. I recently implemented SF for an information theory class, and parts of it seemed unintuitive and strange until I finally got my head around it. On paper it looks simple, but (like many other algorithms) it can be deceptive.

Unless you can earn extra “style points” for something fancier, I would personally go with Shannon–Fano.

+1

The memory constraint points towards some kind of adaptive encoding, and arithmetic coding would be a nice fit. But you have not said anything about compression performance: does the algorithm have to hit any compression targets on particular classes of files? An algorithm that simply copies the file satisfies the requirements above (although it would not teach you much).

As for the language, use whichever is more comfortable for you. There will be a lot of bit manipulation, so either C or Java will do fine. You will need to write code that turns files into bit streams and back again, and it is worth doing that as a separate module. I could see that working in either C or Java.
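
For what it is worth, here is one possible shape such a bit-stream module could take in Java; the BitWriter/BitReader names and the MSB-first bit order are my own assumptions, not something the answer specifies:

    import java.io.*;

    // Hypothetical sketch of a "bits to bytes and back" module. BitWriter packs
    // individual bits into bytes (MSB first); BitReader does the reverse.
    public class BitStreams {

        public static final class BitWriter implements Closeable {
            private final OutputStream out;
            private int current = 0;   // bits accumulated so far
            private int filled = 0;    // how many bits of 'current' are in use

            public BitWriter(OutputStream out) { this.out = out; }

            public void writeBit(int bit) throws IOException {
                current = (current << 1) | (bit & 1);
                if (++filled == 8) { out.write(current); current = 0; filled = 0; }
            }

            public void writeBits(int value, int count) throws IOException {
                for (int i = count - 1; i >= 0; i--) writeBit((value >>> i) & 1);
            }

            @Override public void close() throws IOException {
                // Zero-pad the final byte; a real format would also record how many bits are valid.
                while (filled != 0) writeBit(0);
                out.close();
            }
        }

        public static final class BitReader implements Closeable {
            private final InputStream in;
            private int current = 0;
            private int remaining = 0;

            public BitReader(InputStream in) { this.in = in; }

            public int readBit() throws IOException {
                if (remaining == 0) {
                    current = in.read();
                    if (current < 0) return -1;    // end of stream
                    remaining = 8;
                }
                remaining--;
                return (current >>> remaining) & 1;
            }

            @Override public void close() throws IOException { in.close(); }
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (BitWriter w = new BitWriter(buffer)) {
                w.writeBits(0b101101, 6);
            }
            try (BitReader r = new BitReader(new ByteArrayInputStream(buffer.toByteArray()))) {
                for (int bit; (bit = r.readBit()) >= 0; ) System.out.print(bit);  // prints 10110100 (zero-padded)
            }
        }
    }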

+1
