The boundaries of practical compression and Kolmogorov complexity

I consider the use of compression as a way to measure the relationship of a document to the body of documents. However, I found a strange result when using bzip2; len (compress (corpus))> len (compress (corpus + new_document)). Should this be the case with a practical compression algorithm, and theoretically this is possible if you look at the complexity of Kolmogorov's data? (the idea is to use a compression algorithm to approximate data complexity)

+3
source share
2 answers

Yes, this should be the case with a practical compression algorithm, and theoretically possible with Kolmogorov complexity . The easiest way to explain why this is an example.

Assume the following:

  • your document separator ,
  • corpus are documents abc, def, abc, def, abc, def, abc,
  • new document - def
  • your compression algorithm (or the kolmogorov description language) simply allows you to repeat using a prefix with repeat count followed by |(this is a run-length encoding option )

Then:

  • compress (corpus) = "3 | abc, def, 1 | abc"
  • compress (corpus + new_document) = "4 | abc, def,"

, compress(corpus) compress(corpus+new_document). , , , , . , bzip2, , .

, . , , , , , .

+4

, .

, , , .

, : x y, x - y. , ,

x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"

y = "asdfasdfasdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"

, D - y. (I.e. K (y) = | D |)

x | ", D 46 " |, D, ( ).

x, , K (x) <= K (y) + log (| y | - | x |)

, , .

(N.b.: RLE , RLE Turing, .)

+1

Source: https://habr.com/ru/post/1785167/


All Articles