Random access in a 7z file

Is it possible to do random access (many lookups) into a very large file compressed with 7-Zip?

The source file is very large (999 GB of XML), and I cannot store it uncompressed (I do not have enough free space). So if the 7z format allows access to a block in the middle without unpacking all the blocks before it, I could build an index that maps each block's start to the corresponding offset in the original file.

The header of my 7z archive:

37 7A BC AF 27 1C 00 02 28 99 F1 9D 4A 46 D7 EA   // 7z signature; archive version; CRC; next header offset
00 00 00 00 44 00 00 00 00 00 00 00 F4 56 CF 92   // next header offset (cont.); next header size = 0x44; CRC
00 1E 1B 48 A6 5B 0A 5A 5D DF 57 D8 58 1E E1 5F
71 BB C0 2D BD BF 5A 7C A2 B1 C7 AA B8 D0 F5 26
FD 09 33 6C 05 1E DF 71 C6 C5 BD C0 04 3A B6 29
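For reference, a minimal Python sketch that decodes these fields, assuming the standard 7z signature-header layout (6-byte magic, 2-byte version, 4-byte header CRC, 8-byte next-header offset, 8-byte next-header size, 4-byte next-header CRC); the file name is just an example:

    import struct

    def read_7z_signature_header(path):
        # The first 32 bytes of a .7z file form the signature header.
        with open(path, "rb") as f:
            raw = f.read(32)
        magic, major, minor, start_crc = struct.unpack("<6sBBI", raw[:12])
        next_off, next_size, next_crc = struct.unpack("<QQI", raw[12:32])
        if magic != b"7z\xbc\xaf\x27\x1c":
            raise ValueError("not a 7z archive")
        # The packed archive header starts 32 + next_off bytes into the file
        # and is next_size bytes long; that is where block/folder metadata lives.
        return {"version": (major, minor),
                "next_header_offset": next_off,
                "next_header_size": next_size,
                "next_header_crc": next_crc}

    print(read_7z_signature_header("myfile_xml.7z"))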

UPDATE: the 7z archiver reports that this file contains a single block of data compressed with the LZMA algorithm. Decompression speed during testing is 600 MB/s (measured on the unpacked data), using only one CPU core.

+2
4 answers

This is technically possible, but if your question is "does the currently available 7-Zip command-line tool make this possible", the answer is unfortunately no. The best it offers is to compress each file in an archive independently, which allows individual files to be extracted directly. But since what you want to compress is a single (huge) file, this trick will not work.

I am afraid the only way is to split your file into small blocks and feed them to the LZMA encoder (included in the LZMA SDK). Unfortunately, this requires some programming skills.

Note: here you can find a simple demonstration of this scheme. The main program does exactly what you are looking for: it cuts the source file into small blocks and feeds them one by one to the compressor (LZ4 in this case). The decoder then performs the inverse operation; it can easily skip over compressed blocks and go straight to the one you want: http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c
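A minimal sketch of this block-splitting idea, using Python's standard lzma module rather than the C LZMA SDK (the block size, file names and the in-memory index are illustrative assumptions, not part of the original answer):

    import bisect
    import lzma

    BLOCK_SIZE = 16 * 1024 * 1024   # 16 MiB of uncompressed data per block (arbitrary)

    def compress_in_blocks(src_path, dst_path):
        # Compress src_path as a sequence of independent LZMA (.xz) streams and
        # return an index of (compressed_offset, uncompressed_offset) per block.
        index = []
        uncompressed_pos = 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                chunk = src.read(BLOCK_SIZE)
                if not chunk:
                    break
                index.append((dst.tell(), uncompressed_pos))
                dst.write(lzma.compress(chunk))     # each block is a standalone stream
                uncompressed_pos += len(chunk)
        return index

    def read_block(dst_path, compressed_offset):
        # Decompress exactly one block: a fresh decompressor stops at the end
        # of the stream it started in, so later blocks are never touched.
        dec = lzma.LZMADecompressor()
        out = bytearray()
        with open(dst_path, "rb") as f:
            f.seek(compressed_offset)
            while not dec.eof:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                out.extend(dec.decompress(chunk))
        return bytes(out)

    def read_at(dst_path, index, uncompressed_offset, size):
        # Find the block covering uncompressed_offset and slice the wanted bytes
        # out of it (reads spanning two blocks are left out of this sketch).
        starts = [u for _, u in index]
        i = bisect.bisect_right(starts, uncompressed_offset) - 1
        compressed_off, block_start = index[i]
        data = read_block(dst_path, compressed_off)
        return data[uncompressed_offset - block_start:uncompressed_offset - block_start + size]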

+2

How about this:

Concept: since you basically only ever read this one file, index the .7z block by block.

Read the compressed file block by block, give each block a number and, possibly, its offset within the large file. Scan the data stream for anchors of the target items (for example, Wikipedia article titles). For each anchor, record the number of the block in which the item started (it may have started in the previous block).

Write the index to some kind of O(log n) storage. To access an item, look up its block number and offset, extract that block, and search for the item inside it. The cost is bounded by extracting one block (or very few) plus a string search within that block.

To build this index you have to read the file once, but you can stream it and discard the data after processing, so nothing uncompressed ever hits the disk.
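A rough sketch of that indexing pass in Python, assuming the dump has already been re-compressed into independent blocks (as in the previous answer) and that articles are announced by <title>...</title> anchors; SQLite plays the role of the O(log n) storage. All names here are illustrative:

    import re
    import sqlite3

    TITLE_RE = re.compile(rb"<title>(.*?)</title>")

    def build_title_index(blocks, db_path):
        # blocks: iterable of (block_number, uncompressed_bytes) pairs, produced by
        # decompressing the archive block by block in one streaming pass.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS idx (title TEXT PRIMARY KEY, block INTEGER)")
        carry = b""                     # tail of the previous block, for anchors that span blocks
        for block_no, data in blocks:
            window = carry + data
            for m in TITLE_RE.finditer(window):
                # An anchor found inside the carried-over tail started in the previous block.
                block = block_no - 1 if m.start() < len(carry) else block_no
                con.execute("INSERT OR IGNORE INTO idx VALUES (?, ?)",
                            (m.group(1).decode("utf-8", "replace"), block))
            carry = window[-1024:]      # keep a short tail; 1024 bytes is arbitrary
        con.commit()
        con.close()

    def lookup_block(db_path, title):
        # B-tree lookup: returns the block number to extract and then search in.
        con = sqlite3.connect(db_path)
        row = con.execute("SELECT block FROM idx WHERE title = ?", (title,)).fetchone()
        con.close()
        return row[0] if row else None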

DARN: you basically proposed exactly this in your question already... it seems it pays to read the question before answering...

+1
Architect

7z says that this file has one block of data compressed using the LZMA algorithm.

Which 7z / xz command did you use to check whether it is a single compressed block or not? And will 7z create a multi-block (multi-threaded) archive when run with several threads?

The source file is very large (999 GB of XML)

The good news: Wikipedia has switched to multi-stream archives for its dumps (at least for enwiki): http://dumps.wikimedia.org/enwiki/

For example, the most recent dump, http://dumps.wikimedia.org/enwiki/20140502/, has a multi-stream bzip2 version (with a separate "offset:export_article_id:article_name" index), and the 7z version is stored as many sub-GB archives with roughly 3k (?) articles per archive:

Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream

 enwiki-20140502-pages-articles-multistream.xml.bz2        10.8 GB
 enwiki-20140502-pages-articles-multistream-index.txt.bz2  150.3 MB

All pages with full edit history (.7z)

 enwiki-20140502-pages-meta-history1.xml-p000000010p000003263.7z   213.3 MB
 enwiki-20140502-pages-meta-history1.xml-p000003264p000005405.7z   194.5 MB
 enwiki-20140502-pages-meta-history1.xml-p000005406p000008209.7z   216.1 MB
 enwiki-20140502-pages-meta-history1.xml-p000008210p000010000.7z   158.3 MB
 enwiki-20140502-pages-meta-history2.xml-p000010001p000012717.7z   211.7 MB
 .....
 enwiki-20140502-pages-meta-history27.xml-p041211418p042648840.7z  808.6 MB

I think we can use the bzip2 index to work out the article id even for the 7z dumps, and then we only need the one 7z archive that covers the right range (..p<first_id>p<last_id>.7z). stub-meta-history.xml can also help.
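A rough sketch of random access into the multistream bz2 dump with Python's standard library, using the index format quoted above (offset:article_id:title); the article title is just an example:

    import bz2

    def find_offset(index_path, wanted_title):
        # The index is bz2-compressed text; each line is "offset:article_id:title",
        # where offset is the byte position of the bz2 stream holding that article.
        with bz2.open(index_path, "rt", encoding="utf-8") as f:
            for line in f:
                offset, _article_id, title = line.rstrip("\n").split(":", 2)
                if title == wanted_title:
                    return int(offset)
        return None

    def read_stream(dump_path, offset):
        # The multistream dump is a concatenation of independent bz2 streams, so a
        # fresh decompressor started at `offset` stops at the end of that stream.
        dec = bz2.BZ2Decompressor()
        out = bytearray()
        with open(dump_path, "rb") as f:
            f.seek(offset)
            while not dec.eof:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                out.extend(dec.decompress(chunk))
        return out.decode("utf-8")

    off = find_offset("enwiki-20140502-pages-articles-multistream-index.txt.bz2", "Anarchism")
    if off is not None:
        pages_xml = read_stream("enwiki-20140502-pages-articles-multistream.xml.bz2", off)
        # pages_xml now holds the ~100 <page> elements of the stream containing the article.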

Dump FAQ: http://meta.wikimedia.org/wiki/Data_dumps/FAQ

0

Just use:

 7z e myfile_xml.7z -so | sed [something] 

Example of getting line 7:

 7z e myfile_xml.7z -so | sed -n 7p 
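The same thing can be driven from Python by streaming 7z's stdout through a pipe (a sketch; it assumes the 7z binary is on PATH and reuses the command above):

    import subprocess

    def get_line(archive, line_number):
        # Stream the decompressed XML from `7z e <archive> -so` and stop at the wanted line.
        proc = subprocess.Popen(["7z", "e", archive, "-so"], stdout=subprocess.PIPE)
        try:
            for i, line in enumerate(proc.stdout, start=1):
                if i == line_number:
                    return line
        finally:
            proc.kill()
        return None

    print(get_line("myfile_xml.7z", 7))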

0

Source: https://habr.com/ru/post/1332315/

