Compression formats with good support for random access in archives?

This is similar to a previous question, but the answers there do not satisfy my needs, and my question is slightly different:

I am currently using gzip compression on some very large files that contain sorted data. When the files are not compressed, binary search is a handy and efficient way to seek to a location in the sorted data.

But when the files are compressed, things get trickier. I recently learned about zlib's Z_FULL_FLUSH parameter, which can be used during compression to insert "sync points" into the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, although the files I already have would have to be recompressed to add this feature (and oddly gzip does not have an option for it, but I am willing to write my own compression program if I must).
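For what it's worth, here is the kind of thing I have in mind - a minimal, untested sketch along the lines of zlib's zpipe.c example, writing gzip-wrapped output with deflateInit2() and issuing Z_FULL_FLUSH roughly every megabyte of input (the interval and the function name are my own choices):

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384
#define SYNC_INTERVAL (1L << 20)   /* emit a sync point roughly every 1 MB of input */

/* Compress `in` to `out` in gzip format, inserting Z_FULL_FLUSH sync points. */
int compress_with_sync_points(FILE *in, FILE *out)
{
    z_stream strm;
    unsigned char inbuf[CHUNK], outbuf[CHUNK];
    long since_flush = 0;

    memset(&strm, 0, sizeof(strm));
    /* windowBits = 15 + 16 asks zlib for a gzip wrapper instead of a zlib one */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    for (;;) {
        size_t n = fread(inbuf, 1, CHUNK, in);
        if (ferror(in)) { deflateEnd(&strm); return -1; }

        int flush = feof(in) ? Z_FINISH
                  : (since_flush + (long)n >= SYNC_INTERVAL) ? Z_FULL_FLUSH
                  : Z_NO_FLUSH;
        since_flush = (flush == Z_FULL_FLUSH) ? 0 : since_flush + (long)n;

        strm.next_in = inbuf;
        strm.avail_in = (uInt)n;
        do {                       /* drain all pending output for this chunk */
            strm.next_out = outbuf;
            strm.avail_out = CHUNK;
            deflate(&strm, flush);
            fwrite(outbuf, 1, CHUNK - strm.avail_out, out);
        } while (strm.avail_out == 0);

        if (flush == Z_FINISH)
            break;
    }
    deflateEnd(&strm);
    return 0;
}
```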

From one source it sounds like even Z_FULL_FLUSH is not a perfect solution... not only is it not supported by all gzip archives, but the very idea of detecting sync points in an archive can yield false positives (either by coincidence with the magic number for a sync point, or because Z_SYNC_FLUSH also produces sync points, but those are not usable for random access).

Is there a better solution? I would like to avoid having to maintain index files if possible, and explicit, default support for quasi-random access would be helpful (even if it is coarse-grained - say, being able to start reading at every 10 MB interval). Is there another compression format with better random-read support than gzip?

Edit: As I said, I want to do binary search in the compressed data. I do not need to seek to a specific (uncompressed) position - only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."

+52
gzip compression archive zlib random-access
Jan 09 '09 at 22:29
13 answers

I don't know of any compressed file format that supports random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can roll your own.

For example, bzip2 compressed files consist of independently compressed blocks of less than 1 MB (uncompressed) each, delimited by sequences of magic bytes, so you could parse the bzip2 file, find the block boundaries, and then decompress just the right block. This would need some indexing to remember where the blocks start.
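As a rough illustration of that indexing step, here is a hedged, untested sketch (my own, not an existing tool) that scans a .bz2 file for the 48-bit block-header magic 0x314159265359. Block headers are not byte-aligned, so the scan slides over every bit offset, and a hit can still be a false positive that a real tool would verify by attempting decompression from it:

```c
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file.bz2\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    const uint64_t MAGIC = 0x314159265359ULL;   /* bzip2 block-header magic (48 bits) */
    const uint64_t MASK  = (1ULL << 48) - 1;
    uint64_t window = 0, bitpos = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        for (int b = 7; b >= 0; b--) {          /* feed the window one bit at a time */
            window = ((window << 1) | (uint64_t)((c >> b) & 1)) & MASK;
            bitpos++;
            if (bitpos >= 48 && window == MAGIC)
                printf("candidate block at bit offset %llu (byte ~%llu)\n",
                       (unsigned long long)(bitpos - 48),
                       (unsigned long long)((bitpos - 48) / 8));
        }
    }
    fclose(f);
    return 0;
}
```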

However, I believe the best solution would be to split your file into chunks of your choice and then compress them with some archiver, such as zip or rar, that supports random access to individual files in the archive.

+18
Jan 09 '09 at 23:19

Take a look at dictzip. It is compatible with gzip and allows coarse random access.

An excerpt from its man page:

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in section 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, [dictzcat --start] will make use of this data to perform pseudo-random access on the file.

There is a dictzip package in Ubuntu, and its source code is in dictd-*.tar.gz. Its license is the GPL. You can study it.

Update:

I improved dictzip to not have a file size limit. My implementation is under the MIT license.
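For the curious, here is a hedged sketch of how the dictzip 'R','A' extra field could be read (the layout follows the dictzip documentation of the RFC 1952 Extra Field; error handling is minimal and the function names are mine):

```c
#include <stdio.h>

static unsigned read_le16(FILE *f)          /* gzip header fields are little-endian */
{
    int lo = fgetc(f), hi = fgetc(f);
    return (unsigned)(lo | (hi << 8));
}

int dump_dictzip_chunks(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char hdr[10];
    if (fread(hdr, 1, 10, f) != 10 || hdr[0] != 0x1f || hdr[1] != 0x8b
        || !(hdr[3] & 0x04)) {              /* FLG.FEXTRA must be set */
        fclose(f);
        return -1;
    }

    unsigned xlen = read_le16(f);
    long extra_end = ftell(f) + (long)xlen;

    while (ftell(f) < extra_end) {          /* walk the Extra Field subfields */
        int si1 = fgetc(f), si2 = fgetc(f);
        unsigned len = read_le16(f);
        if (si1 == 'R' && si2 == 'A') {     /* the dictzip random-access subfield */
            unsigned ver   = read_le16(f);
            unsigned chlen = read_le16(f);  /* uncompressed bytes per chunk */
            unsigned chcnt = read_le16(f);  /* number of chunks */
            printf("dictzip v%u: %u chunks of %u uncompressed bytes\n",
                   ver, chcnt, chlen);
            for (unsigned i = 0; i < chcnt; i++)
                printf("chunk %u: %u compressed bytes\n", i, read_le16(f));
            fclose(f);
            return 0;
        }
        fseek(f, (long)len, SEEK_CUR);      /* skip unrelated subfields */
    }
    fclose(f);
    return -1;
}
```

Summing the compressed chunk sizes gives the file offset of any chunk, and decompressing that single chunk yields the data around a known uncompressed position.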

+32
Oct 24 '10 at 19:48

The .xz file format (which uses LZMA compression) seems to support this:

Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of blocks, which makes limited random-access reading possible when the block size is small enough.

That should be enough for your purpose. The drawback is that the liblzma API (for interacting with these containers) does not seem that well documented, so it may take some effort to figure out how to access the blocks randomly.

+9
May 03 '14 at 11:53

There are solutions that provide random access to gzip and bzip2 archives:

(I'm looking for something for 7zip)

+7
Dec 17 '10 at 1:42

bgzip can compress files in a gzip variant that is indexable (and can be decompressed by gzip). It is used in some bioinformatics applications, together with the tabix indexer.

See the explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html and here: http://www.htslib.org/doc/tabix.html.

I don't know to what extent it is adaptable to other applications.
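A minimal sketch of the idea, assuming htslib is available (BGZF files consist of small independent gzip blocks, and bgzf_tell()/bgzf_seek() work with "virtual offsets" that combine the compressed block position with the offset inside the decompressed block):

```c
#include <stdio.h>
#include <htslib/bgzf.h>

int demo(const char *path)
{
    BGZF *fp = bgzf_open(path, "r");
    if (fp == NULL)
        return -1;

    char buf[4096];

    /* Remember where we are (a virtual offset), read a bit... */
    int64_t voffset = bgzf_tell(fp);
    bgzf_read(fp, buf, sizeof buf);

    /* ...and later jump straight back to that point without
     * decompressing everything before it. */
    if (bgzf_seek(fp, voffset, SEEK_SET) < 0) {
        bgzf_close(fp);
        return -1;
    }
    bgzf_read(fp, buf, sizeof buf);

    bgzf_close(fp);
    return 0;
}
```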

+4
Feb 04 '16 at 13:11

I'm not sure whether this would be practical in your exact situation, but couldn't you just gzip each big file into smaller files, say 10 MB each? You would end up with a bunch of files: file0.gz, file1.gz, file2.gz, etc. Based on a given offset within the original large file, you could search in the file named "file" + (offset / 10485760) + ".gz". The offset within the uncompressed chunk would be offset % 10485760.
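In code the lookup is just integer arithmetic; here is a tiny sketch (the 10 MB chunk size and the fileN.gz naming scheme are only the examples used above, not an existing tool):

```c
#include <stdio.h>

#define CHUNK_SIZE 10485760L   /* 10 MB of uncompressed data per chunk */

int main(void)
{
    long offset = 123456789L;            /* desired uncompressed offset */
    long chunk  = offset / CHUNK_SIZE;   /* which compressed file to open */
    long local  = offset % CHUNK_SIZE;   /* offset inside that chunk once decompressed */

    char name[64];
    snprintf(name, sizeof(name), "file%ld.gz", chunk);
    printf("open %s, then skip %ld uncompressed bytes\n", name, local);
    return 0;
}
```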

+3
Jan 09 '09 at 22:41

Because lossless compression works better on some areas than others, if you store the compressed data in blocks of a convenient length BLOCKSIZE, even though each block has exactly the same number of compressed bytes, some compressed blocks will expand to a much longer piece of plaintext than others.

You might look at "Compression: The Key for Next Generation Search Engines" by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates, IEEE Computer, November 2000: http://doi.ieeecomputersociety.org/10.1109/2.881693

Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses (using a word list) into a whole word. You can search the compressed text for words or phrases directly, which turns out to be even faster than searching uncompressed text.

Their decompressor lets you point to any word in the text with an ordinary (byte) pointer and start decompressing immediately from that point.

You could give each word a unique two-byte code, since you probably have fewer than 65,000 unique words in your text. (The KJV Bible has nearly 13,000 unique words.) Even if there are more than 65,000 words, it is fairly simple to assign the first 256 two-byte-code "words" to all possible bytes, so that you can spell out words that are not in the lexicon of the 65,000 or so "most frequent words and phrases". (The compression gained by packing frequent words and phrases into two bytes is usually well worth the "expansion" of occasionally spelling out a word with two bytes per letter.)

There are many ways to pick a lexicon of "frequent words and phrases" that will give adequate compression. For example, you could tweak an LZW compressor to dump the "phrases" it uses more than once into a lexicon file, one line per phrase, and run it over all your data. Or you could arbitrarily chop your uncompressed data into 5-byte phrases in a lexicon file, one line per phrase. Or you could chop your uncompressed data into actual English words and put each word, including the space at the beginning of the word, into the lexicon file. Then use "sort --unique" to eliminate duplicate words in that lexicon file. (Is picking the perfect "optimal" lexicon word list still considered NP-hard?)

Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text - a series of two-byte "words" - from there to the end of the file. Presumably the searcher will read the lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two-byte code" into "variable-length phrase". My first draft would start with a simple one-line-per-phrase list, but you could later switch to storing the lexicon in a more compressed form, using some kind of incremental encoding or zlib.

You can then pick any even byte offset into the compressed text and start decompressing from there. I do not think it is possible to make a random-access compressed file format with much finer granularity than that.
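Here is a hedged sketch of the decoding side of such a scheme (all names are invented for illustration; the lexicon would be loaded from the header block described above):

```c
#include <stdio.h>
#include <stdint.h>

#define LEXICON_SIZE 65536

/* lexicon[code] -> the word or phrase the two-byte code stands for;
 * in a real implementation this is loaded once from the file header. */
static const char *lexicon[LEXICON_SIZE];

/* Decode `count` codes starting at an arbitrary (even) byte offset
 * into the body of the compressed file. */
void decode_from(FILE *compressed, long offset, int count)
{
    if (fseek(compressed, offset & ~1L, SEEK_SET) != 0)   /* round down to an even offset */
        return;

    unsigned char pair[2];
    while (count-- > 0 && fread(pair, 1, 2, compressed) == 2) {
        uint16_t code = (uint16_t)((pair[0] << 8) | pair[1]);
        const char *phrase = lexicon[code];
        fputs(phrase ? phrase : "?", stdout);
    }
    putchar('\n');
}
```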

+3
Aug 07 '10 at 20:52

Two possible solutions:

  • Let the OS deal with compression: create and mount a compressed file system (SquashFS, clicfs, cloop, cramfs, e2compr, or whatever) containing all your text files, and don't do anything about compression in your application.

  • Use clicfs directly on each text file (one clicfs per text file) instead of compressing a file-system image. Think of "mkclicfs mytextfile mycompressedfile" as "gzip < mytextfile > mycompressedfile" and of "clicfs mycompressedfile directory" as a way to get random access to the data via the file "directory/mytextfile".

+3
Feb 10

I do not know whether it has been mentioned yet, but the Kiwix project has done great work in this regard. Through their Kiwix program, they offer random access to ZIM file archives. Good compression, too. The project originated when there was a demand for offline copies of Wikipedia (which has reached 100 GB in uncompressed form, with all media included). They have successfully taken a 25 GB file (a single-file version of Wikipedia without most of the media) and compressed it to a measly 8 GB ZIM archive. And through the Kiwix program, you can call up any page of Wikipedia, with all associated data, faster than you can surf the net.

Even though Kiwix is a technology built around the Wikipedia database structure, it proves that you can have excellent compression ratios and random access at the same time.

+1
Apr 08 '13 at 3:14

This is a very old question, but it looks like zindex could be a good solution (although I do not have much experience with it).

+1
Sep 04 '15 at 7:26

razip supports random access with better performance than gzip/bzip2, which have to be tweaked for such support - trading compression for "ok" random access:

http://sourceforge.net/projects/razip/

0
Aug 23 '11 at 15:07

I am the author of an open-source tool for compressing a particular type of biological data. The tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.

Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either the bzip2 or the gzip algorithm. The offsets, metadata, and compressed genomic data are concatenated into one file.

Source code is available on our github site. We compiled it under Linux and Mac OS X.

In your case, you could store the (10 MB, or whatever) offsets in a header in a custom archive format. You parse the header, retrieve the offsets, and incrementally fseek through the file by current_offset_sum + header_size.
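A hedged sketch of that custom-header approach (the on-disk layout here - a block count followed by per-block compressed sizes, in native byte order - is an assumption made up for illustration, not the starch format itself):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Assumed header layout:
 *   uint32_t block_count;
 *   uint32_t compressed_size[block_count];   (each block holds ~10 MB uncompressed)
 * followed immediately by the concatenated compressed blocks. */
int seek_to_block(FILE *archive, uint32_t wanted_block)
{
    uint32_t block_count;
    rewind(archive);
    if (fread(&block_count, sizeof block_count, 1, archive) != 1)
        return -1;
    if (wanted_block >= block_count)
        return -1;

    uint32_t *sizes = malloc(block_count * sizeof *sizes);
    if (!sizes || fread(sizes, sizeof *sizes, block_count, archive) != block_count) {
        free(sizes);
        return -1;
    }

    long header_size = (long)sizeof block_count + (long)block_count * (long)sizeof *sizes;
    long offset_sum = 0;
    for (uint32_t i = 0; i < wanted_block; i++)    /* current_offset_sum */
        offset_sum += sizes[i];
    free(sizes);

    /* Position the stream at the start of the wanted compressed block;
     * the caller then hands it to zlib or bzip2 for decompression. */
    return fseek(archive, header_size + offset_sum, SEEK_SET);
}
```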

0
Oct 26

The gzip format can be accessed randomly provided an index has been created beforehand, as demonstrated in zlib's zran.c example source code.
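Here is a hedged sketch modeled on the demo in zran.c (the function names and signatures follow the older examples/zran.c, where they are declared static, so in practice you would copy and adapt that file; newer zlib releases rename these entry points):

```c
#include <stdio.h>
#include <sys/types.h>
#include "zran.h"   /* assumed local copy exposing build_index()/extract()/free_index() */

#define SPAN  1048576L   /* desired distance between index access points */
#define CHUNK 16384

int read_at(const char *path, off_t offset)
{
    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;

    struct access *index = NULL;
    unsigned char buf[CHUNK];

    /* One sequential pass over the gzip file builds the index of access points. */
    int len = build_index(in, SPAN, &index);
    if (len < 0) { fclose(in); return len; }

    /* Afterwards, extraction can start near any uncompressed offset. */
    int got = extract(in, index, offset, buf, CHUNK);
    if (got > 0)
        fwrite(buf, 1, got, stdout);

    free_index(index);
    fclose(in);
    return 0;
}
```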

I have developed a command-line tool built on zlib's zran.c that creates indexes for gzip files: https://github.com/circulosmeos/gztool

It can even create an index for a still-growing gzip file (for example, a log created by rsyslog directly in gzip format), which in practice reduces the index creation time. See the -S (supervise) option.

0
Jul 24 '19 at 21:10


