Merge sorted gzipped files

I have 40 files of 2 GB each, stored on NFS. Each file contains two columns: a numeric identifier and a text field. Each file is already sorted and gzipped.

How can I merge all these files so that the result is also sorted?

I know that sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do this directly with compressed files.

PS: I do not want the naive solution of unpacking the files to disk, merging them, and compressing again, because I do not have enough disk space for that.

+6
4 answers

This is a use case for process substitution. Say you have two files to merge, sorta.gz and sortb.gz. You can feed the output of gunzip -c FILE.gz to sort for both of these files using the shell's <(...) operator:

 sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted 

Process substitution replaces a command with a filename that represents that command's output; it is usually implemented with either a named pipe or a special /dev/fd/... file.
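
As a quick illustration (a sketch; the exact descriptor number varies by system and shell), you can make the shell print what it substitutes:

 echo <(true)
 # prints something like: /dev/fd/63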

For 40 files, you will need to build the command dynamically, with one process substitution per file, and execute it with eval:

 cmd="sort -m -k1 " for input in file1.gz file2.gz file3.gz ...; do cmd="$cmd <(gunzip -c '$input')" done eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz 
+12
 #!/bin/bash
 FILES=file*.gz        # list of your 40 gzip files (e.g. file1.gz ... file40.gz)
 WORK1="merged.gz"     # first temp file and the final file
 WORK2="tempfile.gz"   # second temp file

 > "$WORK1"            # create empty final file
 > "$WORK2"            # create empty temp file
 gzip -qc "$WORK2" > "$WORK1"   # compress the content of the empty second
                                # file into the first temp file
 for I in $FILES; do
     echo current file: "$I"
     sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
     mv "$WORK2" "$WORK1"
 done

The easiest ways to fill $FILES are bash globbing (file*.gz) or a whitespace-separated list of the 40 file names. The files in $FILES remain unchanged.

In the end, the 80 GB of data end up compressed in $WORK1. While this script runs, no uncompressed data is ever written to disk.
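
A simple sanity check afterwards (a sketch, assuming the inputs match file*.gz as above) is to compare line counts, since a merge must preserve them:

 gunzip -c file*.gz | wc -l    # total lines across all inputs
 gunzip -c merged.gz | wc -l   # should print the same number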

+2

Adding a merge of several files built as a single pipeline: it takes all the (pre-sorted) files in $OUT/uniques, merge-sorts them, and compresses the output; lz4 is used because of its speed:

 find $OUT/uniques -name '*.lz4' \
   | awk '{print "<( <" $0 " lz4cat )"}' \
   | tr "\n" " " \
   | (echo -n sort -m -k3b -k2 " "; cat -; echo) \
   | bash \
   | lz4 > $OUT/uniques-merged.tsv.lz4
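
For two hypothetical inputs a.lz4 and b.lz4, the command string this pipeline generates and feeds to bash would look roughly like this (<( <FILE lz4cat ) is a process substitution that runs lz4cat with FILE on its stdin):

 sort -m -k3b -k2 <( <a.lz4 lz4cat ) <( <b.lz4 lz4cat )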
+1

True, there are zgrep and other common utilities that work with compressed files, but in this case you have to sort/merge the uncompressed data and then compress the result.
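
For completeness, a sketch of that approach on two hypothetical inputs, using zcat (equivalent to gunzip -c) together with process substitution as in the accepted answer:

 sort -m -k1 <(zcat a.gz) <(zcat b.gz) | gzip -c > merged.gz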

-1
