The script in your question will be very fast, since all it does is a hash lookup of the current line number in the array h. The following will be faster still, unless the last line number you want is the last line of reads.fastq, since it exits as soon as the last requested line has been printed instead of continuing to read the rest of reads.fastq:
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
You could add delete h[FNR]; after the print; to reduce the size of the array and so MAYBE speed up the lookup time, but I don't know if that would really improve performance: array access is a hash lookup and so is already very fast, so adding the delete may end up slowing down the script overall.
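To see the early exit and the optional delete in action on a small input, here is a minimal sketch. The 20-line sample file, the temporary directory, and the END rule that reports where awk stopped are assumptions added for illustration; they are not part of the script above.

```shell
# Toy data (hypothetical): a 20-line stand-in for reads.fastq and three line numbers.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20; i++) print "line " i }' > "$tmp/reads.fastq"
printf '%s\n' 3 7 12 > "$tmp/takeThese.txt"

# Same logic as the one-liner above, plus "delete h[FNR]" after the print.
# The END rule still runs after "exit" and reports the last record read
# from the second file.
out=$(awk '
    FNR == NR { h[$1]; c++; next }
    FNR in h  { print; delete h[FNR]; if (!--c) exit }
    END       { print "stopped at line " FNR }
' "$tmp/takeThese.txt" "$tmp/reads.fastq")

printf '%s\n' "$out"
rm -rf "$tmp"
```

If the reported line is the last requested number rather than the last line of the file, the exit is doing its job. Whether the delete helps or hurts is something only timing on your real data can tell.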
Actually, this will be faster still, since it avoids testing NR==FNR for every line of both files:
awk -v nums='takeThese.txt' '
BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
NR in h{print; if (!--c) exit}
' reads.fastq
Whether that is faster than the script @glennjackman posted depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's reads the whole of reads.fastq no matter what takeThese.txt contains, it will run in approximately constant time, while mine will be significantly faster the farther from the end of reads.fastq the last line number in takeThese.txt occurs. For example:
$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq
.
$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
function next_index() {
    ("sort -n " nums) | getline i
    return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null

real    0m50.060s
user    0m47.564s
sys     0m0.405s
.
$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
function next_index() {
    ("sort -n " nums) | getline i
    return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.015s
sys     0m0.000s
But you can have the best of both worlds, reading the sorted line numbers one at a time and exiting once they run out:
$ time awk -v nums=takeThese.txt '
function next_index() {
    if ( ( ("sort -n " nums) | getline i) > 0 ) {
        return i
    }
    else {
        exit
    }
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.057s
user    0m26.675s
sys     0m0.498s

$ time awk -v nums=takeThat.txt '
function next_index() {
    if ( ( ("sort -n " nums) | getline i) > 0 ) {
        return i
    }
    else {
        exit
    }
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.030s
sys     0m0.062s
which, if we assume that takeThese.txt is already sorted, can be reduced to simply:
$ time awk -v nums=takeThese.txt '
BEGIN { getline linenum < nums }
NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
BEGIN { getline linenum < nums }
NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m0.047s
user    0m0.030s
sys     0m0.016s
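Before trusting the timings, it is worth checking that the hash-based and the sorted-input variants select the same lines. The sketch below does that on a small sample; the 1000-line file, the five line numbers, the temporary directory, and the variable names hash_out and sorted_out are assumptions made up for the check, not part of the benchmark above.

```shell
# Small hypothetical sample: 1000 lines, and the line numbers 100, 200, ..., 500.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "read " i }' > "$tmp/reads.fastq"
awk 'BEGIN { for (i = 1; i <= 5; i++) print i * 100 }' > "$tmp/takeThat.txt"

# Hash-based variant: accepts the line numbers in any order.
hash_out=$(awk -v nums="$tmp/takeThat.txt" '
    BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
    NR in h { print; if (!--c) exit }
' "$tmp/reads.fastq")

# Sorted-input variant: reads one number at a time, so it assumes
# the numbers are already in ascending order.
sorted_out=$(awk -v nums="$tmp/takeThat.txt" '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' "$tmp/reads.fastq")

if [ "$hash_out" = "$sorted_out" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi
rm -rf "$tmp"
```

Note that the sorted-input variant only compares against the next pending number, so an unsorted list would make it skip any number smaller than one it has already passed; the hash-based variant does not have that restriction.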