What is the best compression algorithm that allows you to arbitrarily read / write to a file?

I know that any adaptive compression algorithm is out of the question.

And I know that Huffman coding is out of the question.

Does anyone have a better compression algorithm that would allow random reads/writes?

I think you could use any compression algorithm if you wrote it in blocks, but ideally I would not want to decompress a whole block at a time. But if you have suggestions for an easy way to do this, and for how to know the block boundaries, please let me know. If that is part of your solution, please also tell me what you do when the data you want to read crosses a block boundary.

In the context of your answers, please assume the file in question is 100 GB, and that sometimes I want to read the first 10 bytes, sometimes the last 19 bytes, and sometimes 17 bytes from the middle.

+19
compression huffman-code random-access
Oct. 25 '08 at 13:38
7 answers

I am stunned by the number of answers that imply that such a thing is impossible.

Have these people never heard of "compressed file systems", which have been around since before Stac Electronics sued Microsoft over compressed file system technology in 1993?

I hear that LZS and LZJB are popular algorithms with people implementing compressed file systems, which of course require both random-access reads and random-access writes.

Perhaps the simplest and best thing to do is to turn on file system compression for that file and let the OS handle the details. But if you insist on handling it manually, you might pick up some tips by reading about transparent NTFS file compression.

Also check: Stack Overflow

+16
Aug 08 '10 at 5:20

The razip format supports random-access reads, with better performance than gzip/bzip2, which would have to be tweaked to provide such support:

http://sourceforge.net/projects/razip/

+4
Aug 23

A dictionary-based compression scheme in which every dictionary code is encoded with the same size would let you start reading at any multiple of the code size, and writes and updates are easy as long as the codes make no use of their context/neighbours.

If the encoding includes a way to recognise the start or end of a code, you do not need the codes to all be the same length, and you can start reading anywhere in the middle of the file. That technique is more useful if you are reading from an unknown position in the stream.
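To make the fixed-size-code variant concrete, here is a minimal Python sketch; the 2-byte code width, the function names, and the in-memory dictionary are my own assumptions, not part of any particular format.

```python
import struct

CODE_SIZE = 2  # assumed fixed width: every dictionary code is 2 bytes

def read_symbol(f, index, dictionary):
    """Read the index-th symbol: because codes are fixed-size, it always
    sits at byte offset index * CODE_SIZE and can be fetched directly."""
    f.seek(index * CODE_SIZE)
    (code,) = struct.unpack(">H", f.read(CODE_SIZE))
    return dictionary[code]          # dictionary maps code -> byte sequence

def write_symbol(f, index, code):
    """Overwrite a single code in place; neighbouring codes are untouched
    because no code depends on its context."""
    f.seek(index * CODE_SIZE)
    f.write(struct.pack(">H", code))
```

The point is that the seek position is a pure function of the symbol index (with the file opened in binary mode), so both reads and in-place updates cost a single seek.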

+3
Nov 06 '08 at 10:03

I think Stephen Denne might be onto something here. Imagine:

  • zip-like compression of sequences into codes
  • a dictionary mapping code → sequence
  • the file would behave like a file system
    • each write generates a new "file" (a sequence of bytes, compressed according to the dictionary)
    • the "file system" keeps track of which "file" owns which bytes (start, end)
    • each "file" is compressed according to the dictionary
    • reads work per "file", decompressing and extracting bytes according to the "file system"
    • writes invalidate "files"; new "files" are appended to replace the invalidated ones
  • this system would need:
    • a file-system defragmentation mechanism
    • compacting the dictionary from time to time (removing unused codes)
  • done properly, the housekeeping could happen when nobody is looking (idle time), or by building a new file and eventually "switching over"

One positive effect would be that the dictionary applies to the whole file. If you can spare the CPU cycles, you could periodically check for sequences that overlap "file" boundaries and then regroup them.

This idea is aimed at truly random reads. If you are only ever going to read fixed-size records, some parts of this idea could get simpler.
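A rough sketch of that design, using zlib as the per-"file" codec; the class name, the extent-table layout, and the simplified overlap handling are my own choices, not a reference implementation (defragmentation and dictionary compaction are omitted entirely).

```python
import zlib

class CompressedExtentStore:
    def __init__(self):
        self.blobs = []      # compressed records ("files")
        self.extents = []    # (start, end, blob_index) for live records

    def write(self, offset, data):
        """Append a new compressed "file" covering [offset, offset + len(data))."""
        start, end = offset, offset + len(data)
        # Drop records the new one fully shadows; partially overlapped
        # records stay live but lose to the newer record on reads,
        # because newer records come later in the extent list.
        self.extents = [e for e in self.extents
                        if not (e[0] >= start and e[1] <= end)]
        self.blobs.append(zlib.compress(data))
        self.extents.append((start, end, len(self.blobs) - 1))

    def read(self, offset, length):
        """Decompress only the "files" that overlap the requested range."""
        out = bytearray(length)
        for start, end, idx in self.extents:
            if end <= offset or start >= offset + length:
                continue                      # no overlap with this record
            data = zlib.decompress(self.blobs[idx])
            lo, hi = max(start, offset), min(end, offset + length)
            out[lo - offset:hi - offset] = data[lo - start:hi - start]
        return bytes(out)

store = CompressedExtentStore()
store.write(0, b"hello world " * 10)
store.write(6, b"WORLD")
print(store.read(0, 12))   # b'hello WORLD '
```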

+2
Jul 30

I do not know of any compression algorithm that allows random reads, never mind random writes. If you need that ability, your best bet is to compress the file in chunks rather than as a whole.

eg.
First, let's look at the read-only case. Say you split your file into 8K chunks. You compress each chunk and store the compressed chunks sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O: you work out which chunk that falls in (O / 8K), decompress that chunk and grab those bytes. The data you need may span several chunks, so you have to deal with that scenario too.
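For instance, here is a minimal sketch of that read-only scheme, using zlib as a stand-in codec and an in-memory index; the 8 KB chunk size matches the example above, while the names and layout are assumptions of mine.

```python
import zlib

CHUNK = 8 * 1024

def compress_chunks(data):
    """Return (compressed blob, index of per-chunk (start, size) entries)."""
    blob, index = bytearray(), []
    for i in range(0, len(data), CHUNK):
        c = zlib.compress(data[i:i + CHUNK])
        index.append((len(blob), len(c)))   # where this compressed chunk lives
        blob += c
    return bytes(blob), index

def read_range(blob, index, offset, length):
    """Read `length` bytes starting at uncompressed `offset`."""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    out = bytearray()
    for n in range(first, last + 1):        # the range may span several chunks
        start, size = index[n]
        out += zlib.decompress(blob[start:start + size])
    skip = offset - first * CHUNK           # discard bytes before the range
    return bytes(out[skip:skip + length])

data = bytes(range(256)) * 1024             # 256 KB of sample data
blob, index = compress_chunks(data)
assert read_range(blob, index, 10_000, 17) == data[10_000:10_017]
```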

Things get more complicated when you want to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some padding to each chunk in case it expands (the chunks are all the same size uncompressed, but different data compresses to different sizes). You may even have to move chunks when the compressed data is too big to fit back into the space it was originally given.

This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just reading/writing them normally.

+1
Oct. 25 '08 at 13:51

Compression is all about removing redundancy from the data. Unfortunately, the redundancy is unlikely to be distributed with monotonous uniformity throughout the file, and that is about the only scenario in which you could expect both compression and fine-grained random access.

However, you can get close to random-access reads by maintaining an external list, built during compression, that maps selected points in the uncompressed data stream to their locations in the compressed data stream. You would obviously have to choose an encoding in which the translation between the original stream and its compressed version does not depend on the position in the stream (i.e. no LZ77 or LZ78; you would probably want Huffman or byte-pair encoding instead). Obviously this incurs a lot of overhead, and you would have to decide how to trade off the storage needed for the "bookmark points" against the CPU time needed to decompress the stream from a bookmark point onward to reach the data you are actually looking for.
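As one illustration of the bookmark-point idea (built here on deflate full-flush restart points rather than the Huffman/byte-pair coders suggested above, simply because zlib ships with Python), the sketch below keeps a sorted list of (uncompressed offset, compressed offset) bookmarks and starts each read at the nearest bookmark at or before the requested offset; all names and the 8 KB bookmark spacing are my own choices.

```python
import bisect
import zlib

def compress_with_bookmarks(data, every=8 * 1024):
    """Raw-deflate `data`, forcing a full flush every `every` uncompressed
    bytes so decompression can restart there; return the compressed bytes
    plus a sorted list of (uncompressed_offset, compressed_offset) bookmarks."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # raw deflate, no header
    out, bookmarks = bytearray(), [(0, 0)]
    for i in range(0, len(data), every):
        out += comp.compress(data[i:i + every])
        if i + every < len(data):
            out += comp.flush(zlib.Z_FULL_FLUSH)     # restartable point
            bookmarks.append((i + every, len(out)))
    out += comp.flush()
    return bytes(out), bookmarks

def read_at(blob, bookmarks, offset, length):
    """Decompress forward from the nearest bookmark at or before `offset`
    and keep only the requested bytes (the discarded prefix is the CPU-time
    cost described above)."""
    i = bisect.bisect_right([u for u, _ in bookmarks], offset) - 1
    u_off, c_off = bookmarks[i]
    d = zlib.decompressobj(-15)
    skip = offset - u_off
    out = d.decompress(blob[c_off:], skip + length)
    return out[skip:skip + length]

data = b"The quick brown fox jumps over the lazy dog. " * 4000
blob, marks = compress_with_bookmarks(data)
assert read_at(blob, marks, 100_000, 17) == data[100_000:100_017]
```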

As for random-access writes... that is all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be compressed because it was redundant with data that does not have the same redundancy, it simply is not going to work.

However, depending on how much random-access writing you are going to do, you could simulate it by maintaining a sparse matrix representing all the data written to the file after compression. On every read, you check the matrix to see whether you are reading an area that was written after compression; if not, you go to the compressed file for the data.
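A minimal sketch of that read-side check, with the "sparse matrix" reduced to a plain dict of written ranges; the class name and the callback for reading the compressed base are mine, not from any library.

```python
class OverlayReader:
    def __init__(self, read_compressed):
        # read_compressed(offset, length) fetches bytes from the compressed base
        self.read_compressed = read_compressed
        self.overlay = {}                     # offset -> bytes written post-compression

    def write(self, offset, data):
        self.overlay[offset] = bytes(data)    # remember only the overwrite

    def read(self, offset, length):
        out = bytearray(self.read_compressed(offset, length))
        # Patch in any overlay chunks that intersect the requested range.
        for o, data in self.overlay.items():
            lo, hi = max(o, offset), min(o + len(data), offset + length)
            if lo < hi:
                out[lo - offset:hi - offset] = data[lo - o:hi - o]
        return bytes(out)

base = b"0123456789" * 10
r = OverlayReader(lambda off, n: base[off:off + n])
r.write(12, b"XY")
print(r.read(10, 6))   # b'01XY45'
```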

+1
Nov 04 '08 at 4:35

No compression scheme allows fine-grained random access for two reasons:

  • you cannot know exactly how far into the compressed file your desired piece of data is, and therefore
  • there is no way to know where a symbol starts (at which bit position for Huffman; it is worse for arithmetic coding).

I can only suggest treating the file like a broadcast stream and inserting frequent synchronization/position markers, with the obvious overhead (the sync marks not only take up space themselves, they also complicate the encoding, because it must avoid "accidental" sync marks!). Alternatively, to keep seeking from becoming something like a binary search (with the optimisation that you can guess where to start better than the middle), you could include a table of contents at the start or end of the file.
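To show what "avoiding accidental sync marks" involves, here is a small byte-stuffing sketch in the spirit of HDLC/PPP framing; the marker and escape values are arbitrary choices of mine, not taken from any particular format.

```python
MARK = 0xFF      # sync-mark byte (assumed)
ESC = 0xFE       # escape byte (assumed)

def stuff(payload):
    """Escape marker/escape bytes so MARK never appears 'by accident'."""
    out = bytearray()
    for b in payload:
        if b in (MARK, ESC):
            out += bytes([ESC, b ^ 0x20])   # escape and mangle the byte
        else:
            out.append(b)
    return bytes(out)

def unstuff(stuffed):
    """Reverse the escaping between two sync marks."""
    out, i = bytearray(), 0
    while i < len(stuffed):
        if stuffed[i] == ESC:
            out.append(stuffed[i + 1] ^ 0x20)
            i += 2
        else:
            out.append(stuffed[i])
            i += 1
    return bytes(out)

sample = bytes(range(256))
assert unstuff(stuff(sample)) == sample
assert MARK not in stuff(sample)   # a reader can trust every 0xFF it scans past
```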

As for random-access writes... I cannot think of any neat solution. :(

-1
Oct. 25 '08 at 14:03
