GNU Parallel: split file into children

Purpose

Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child must contain at most N lines, where N = 104,214,420. Children must be in .gz format.

Input file

  • name: file1.fastq.gz
  • size: 39 GB
  • number of lines: 1,667,430,708 (uncompressed)
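
The value of N appears to be the uncompressed line count divided by 16 and rounded up. That derivation is an assumption, since the question does not say how N was chosen. A minimal sketch of the arithmetic, which also checks that N is a multiple of 4 (the usual number of lines in a FASTQ record):

```shell
total_lines=1667430708   # uncompressed line count of file1.fastq.gz
children=16              # one child per CPU

# Ceiling division: smallest N such that 16 children of N lines cover the file
lines_per_child=$(( (total_lines + children - 1) / children ))
echo "lines per child: ${lines_per_child}"

# If N were not divisible by 4, a child boundary would cut a 4-line record in half
echo "N mod 4: $(( lines_per_child % 4 ))"
```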

Equipment

  • 36 GB of memory
  • 16 CPUs
  • HPCC environment (I'm not an administrator)

Code

Version 1

zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"

After three days, this job had not finished. split_log.txt was empty, and no children were visible in the output directory. The log files indicated that Parallel had increased --block-size from 1 MB (the default) to more than 2 GB. This prompted me to change my code to Version 2.

Version 2

# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.

zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"

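After ~2 hours, this job had not finished. split_log.txt was empty. No children were visible in the output directory yet. So far, the log files show the following warning: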

parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.

  • How can my code be improved?
  • Is there a faster way to accomplish this goal?
Answer 1

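Let us assume that the file is a fastq file, and that the record size is therefore 4 lines.

You tell GNU Parallel this with -L 4.

In a fastq file the order does not matter, so you want to pass blocks of n * 4 lines to the children.

To do that efficiently you would use --pipe-part, except that --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.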

zcat file1.fastq.gz | parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"

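This passes blocks to 16 children at a time. A block defaults to 1 MB and is chopped at a record boundary (i.e. 4 lines). It runs one job per block. What you really want, though, is to pass the input to only 16 jobs in total, and you can do that round-robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work: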

zcat file1.fastq.gz | parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"

parallel will struggle to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
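
Whichever variant is used, it may be worth confirming afterwards that no child ends in the middle of a record. The helper below is a sketch, not part of GNU Parallel. The name check_children is hypothetical, and it assumes every record is exactly 4 lines:

```shell
# check_children FILE.gz... (hypothetical helper)
# Prints the total line count, or fails if a child holds a partial 4-line record.
check_children() {
  local total=0 n f
  for f in "$@"; do
    n=$(gzip -dc "$f" | wc -l)
    if [ $(( n % 4 )) -ne 0 ]; then
      echo "partial record in $f ($n lines)" >&2
      return 1
    fi
    total=$(( total + n ))
  done
  echo "$total"
}
```

It could be called as check_children "${input_file}"_child_*.gz, and the printed total compared with the line count of the input.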

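Now, if you had the fastq file uncompressed, we could do it even faster, but we would have to cheat a little: in fastq files the seqname lines often all start with the same string: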

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

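Here it is @EAS54_6_R. Unfortunately this is also a valid string in a quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.

We can use that to our advantage: you can now use \n followed by @EAS54_6_R as the record separator, and then we can use --pipe-part. The added benefit is that the order stays the same. Here you would have to set the block size to 1/16 of the size of file1.fastq: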

parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
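
The placeholder block size can be computed from the file size in bytes. A minimal sketch follows. The helper name block_size_for is hypothetical, and rounding up is a choice made here so that 16 blocks always cover the whole file:

```shell
# block_size_for FILE JOBS (hypothetical helper)
# Prints the file size in bytes divided by JOBS, rounded up.
block_size_for() {
  local file=$1 jobs=$2 size
  size=$(wc -c < "$file")
  echo $(( (size + jobs - 1) / jobs ))
}

# Possible use in the command above (not executed here):
#   parallel -a file1.fastq --block "$(block_size_for file1.fastq 16)" -j16 --pipe-part ...
```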

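If you use GNU Parallel 20161222, GNU Parallel can do that computation for you. --block -1 means: choose a block size so that each of the 16 jobslots gets one block.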

parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"

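Here GNU Parallel will not be the limiting factor: it can easily transfer 20 GB/s.

It is annoying to have to open the file to see what the recstart value should be, so this will work in most cases: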

parallel -a file1.fastq --pipe-part --block -1 -j16 \
  --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
  my_command

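Here we assume that the lines of each record look like this:

@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>

Even if you have a few quality lines starting with "@", they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.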
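
The heuristic can be tried on a small file with standard tools before running it on real data. The sketch below re-implements the same rule in awk (a line starting with @ followed by a line that is empty or starts with a letter, "." or "~"). It is an approximation of the --recstart regexp above, not GNU Parallel's own matching, and the name count_record_starts is hypothetical:

```shell
# count_record_starts FILE (hypothetical helper)
# Counts lines that follow an "@" line and look like a sequence line.
count_record_starts() {
  awk '
    prev_at && (/^[A-Za-z.~]/ || /^$/) { n++ }   # "@..." line followed by a sequence-like line
    { prev_at = ($0 ~ /^@/) }                    # remember whether this line starts with @
    END { print n + 0 }
  ' "$1"
}

# Toy file: 3 records, the second has a quality line that starts with "@"
toy=$(mktemp)
printf '@read_1\nACGT\n+\n;;;;\n@read_2\nGGCC\n+\n@;;;\n@read_3\nTTAA\n+\n;;;;\n' > "$toy"

echo "lines starting with @: $(grep -c '^@' "$toy")"
echo "record starts by the heuristic: $(count_record_starts "$toy")"
```

Counting lines that start with @ overcounts because of the quality line, while the two-line rule does not.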


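You could also use a block size so big that it corresponds to 1/16 of the uncompressed file, but that would be a bad idea:

  • You would have to be able to keep the full uncompressed file in RAM.
  • The last gzip would only be started after the last byte had been read (and the first gzip would probably be done by then).

By setting the number of records to 104214420 (using -N), this is basically what you are doing, and your server is probably struggling to keep the 150 GB of uncompressed data in its 36 GB of RAM.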

Answer 2

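Paired-end data poses a restriction: the order does not matter, but it must be predictable across files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.

split -n r/16 is very efficient for simple round-robin. It does not, however, support multiline records. So we insert \0 as a record separator every 4 lines and remove it again after splitting. --filter runs a command on each output chunk, so we do not need to save the uncompressed data: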

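# Filter that split runs for each chunk ($FILE is set by split): strip the \0 separators, then compress.
doit() { perl -pe 's/\0//' | gzip > "$FILE.gz"; }
export -f doit
# Put \0 in front of every 4-line record, then deal the records round-robin into 16 chunks.
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.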

The output files will be named big.aa.gz .. big.ap.gz.


Source: https://habr.com/ru/post/1667049/

