Print many specific lines from a text file using an index file

I have a large text file called reads.fastq with over 100 million lines. I also have another file called takeThese.txt that contains the numbers of the lines from reads.fastq that should be printed (one per line).

I am currently using

awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq

Apparently, this takes a very long time. Is there a better way to extract lines from a text file using line numbers stored in another file? Would it speed things up if the takeThese.txt file were sorted?

EDIT:

A few sample lines from the files that I have:

reads.fastq:

    @HWI-1KL157:36:C2468ACXX
    TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
    +
    @@@DDDDDFF>FFGGC@F?HDHIHIFIGG
    @HWI-1KL157:36:C2468ACXX
    CGAGGCGGTGACGGAGAGGGGGGAGACGC
    +
    BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
    @HWI-1KL157:36:C2468ACXX
    TCATATTTTCTGATTTCTCCGTCACTCAA

takeThese.txt:

    5
    6
    7
    8

The result should look like this:

    @HWI-1KL157:36:C2468ACXX
    CGAGGCGGTGACGGAGAGGGGGGAGACGC
    +
    BCCFFFFFHHHHHIGHHIHIJJDDBBDDD

EDIT: Comparison of the suggested solutions:

    $ time perl AndreasWederbrand.pl takeThese.txt reads.fastq > /dev/null
    real    0m1.928s
    user    0m0.819s
    sys     0m1.100s

    $ time ./karakfa takeThese_numbered.txt reads_numbered.fastq > /dev/null
    real    0m8.334s
    user    0m9.973s
    sys     0m0.226s

    $ time ./EdMorton takeThese.txt reads.fastq > /dev/null
    real    0m0.695s
    user    0m0.553s
    sys     0m0.130s

    $ time ./ABrothers takeThese.txt reads.fastq > /dev/null
    real    0m1.870s
    user    0m1.676s
    sys     0m0.186s

    $ time ./GlenJackman takeThese.txt reads.fastq > /dev/null
    real    0m1.414s
    user    0m1.277s
    sys     0m0.147s

    $ time ./DanielFischer takeThese.txt reads.fastq > /dev/null
    real    0m1.893s
    user    0m1.744s
    sys     0m0.138s

Thanks for all the suggestions and efforts!

+6
7 answers

The script in your question will already be very fast, since all it does is a hash lookup of the current line number in the array h. This will be faster still, though (unless the last line number you want is the very last line of reads.fastq), since it exits after the last desired line number has been printed instead of continuing through the rest of reads.fastq:

 awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq 

You could add delete h[FNR]; after print; to shrink the array and so MAYBE speed up the lookups, but I don't know if that would really improve performance: the array access is a hash lookup and therefore already very fast, so adding the delete might slow the script down overall.
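For concreteness, a sketch of that delete variant (just the change described above, with no claim that it benchmarks faster):

    awk 'FNR==NR{h[$1]; c++; next}
         FNR in h{print; delete h[FNR]; if (!--c) exit}' takeThese.txt reads.fastq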

This, in turn, will be faster still, since it avoids testing FNR==NR for every line of both files:

    awk -v nums='takeThese.txt' '
        BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
        NR in h{print; if (!--c) exit}
    ' reads.fastq

Whether that or @glennjackman's script is faster depends on how many lines are in takeThese.txt and how close they lie to the end of reads.fastq. Since Glenn's reads all of reads.fastq regardless of the contents of takeThese.txt, it will run in roughly constant time, while mine will be significantly faster the farther from the end of reads.fastq the last line number in takeThese.txt occurs. For example:

 $ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq 


    $ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

    $ time awk -v nums=takeThese.txt '
        function next_index() {
            ("sort -n " nums) | getline i
            return i
        }
        BEGIN { linenum = next_index() }
        NR == linenum { print; linenum = next_index() }
    ' reads.fastq > /dev/null

    real    0m28.720s
    user    0m27.876s
    sys     0m0.450s

    $ time awk -v nums=takeThese.txt '
        BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
        NR in h{print; if (!--c) exit}
    ' reads.fastq > /dev/null

    real    0m50.060s
    user    0m47.564s
    sys     0m0.405s


    $ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

    $ time awk -v nums=takeThat.txt '
        function next_index() {
            ("sort -n " nums) | getline i
            return i
        }
        BEGIN { linenum = next_index() }
        NR == linenum { print; linenum = next_index() }
    ' reads.fastq > /dev/null

    real    0m26.738s
    user    0m23.556s
    sys     0m0.310s

    $ time awk -v nums=takeThat.txt '
        BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
        NR in h{print; if (!--c) exit}
    ' reads.fastq > /dev/null

    real    0m0.094s
    user    0m0.015s
    sys     0m0.000s

but you can have the best of both worlds:

    $ time awk -v nums=takeThese.txt '
        function next_index() {
            if ( ( ("sort -n " nums) | getline i) > 0 ) {
                return i
            } else {
                exit
            }
        }
        BEGIN { linenum = next_index() }
        NR == linenum { print; linenum = next_index() }
    ' reads.fastq > /dev/null

    real    0m28.057s
    user    0m26.675s
    sys     0m0.498s

    $ time awk -v nums=takeThat.txt '
        function next_index() {
            if ( ( ("sort -n " nums) | getline i) > 0 ) {
                return i
            } else {
                exit
            }
        }
        BEGIN { linenum = next_index() }
        NR == linenum { print; linenum = next_index() }
    ' reads.fastq > /dev/null

    real    0m0.094s
    user    0m0.030s
    sys     0m0.062s

which, if we assume that takeThese.txt is already sorted, can be reduced to simply:

    $ time awk -v nums=takeThese.txt '
        BEGIN { getline linenum < nums }
        NR == linenum { print; if ((getline linenum < nums) < 1) exit }
    ' reads.fastq > /dev/null

    real    0m27.362s
    user    0m25.599s
    sys     0m0.280s

    $ time awk -v nums=takeThat.txt '
        BEGIN { getline linenum < nums }
        NR == linenum { print; if ((getline linenum < nums) < 1) exit }
    ' reads.fastq > /dev/null

    real    0m0.047s
    user    0m0.030s
    sys     0m0.016s
+5

I think the solution in the question stores all the line numbers from takeThese.txt in the array h[], and then for each line in reads.fastq does a linear search in h[] for that line number.

There are simple implementations of this in several languages. I would try perl if you don't like java.

Basically, you make sure the takeThese.txt file is sorted, then walk through reads.fastq one line at a time, checking whether the current line number matches the next line number from takeThese.txt; if it does, print the line and continue.

Since the lines have different lengths, you have no choice but to scan for newline characters (the basic foreach-line construct in most languages).

Perl example, quick and dirty, but it works:

 open(F1,"reads.fastq"); open(F2,"takeThese.txt"); $f1_pos = 1; foreach $index (<F2>) { while ($f1_pos <= $index) { $out = <F1>; $f1_pos++; } print $out; } 
+2

The problem I see with your awk is that you load all the line numbers you want to extract into an array, and then for each individual line you have to consult that array.

I suspect that the in keyword does something along the lines of looping over each element of the array and comparing each index with the value of FNR...

So, if you have 1,000,000 lines to extract, then for every single line of reads.fastq you would loop over the 1,000,000 entries of the lookup array! 100,000,000 (reads.fastq lines) X 1,000,000 (lookup array length) = 1e+14. That is a lot of searching.

Again, awk's in keyword can do all kinds of fancy tricks and efficient things, but in the end this is why the approach might not perform.

One approach is to use a variable that holds the current line number we want, an index variable to track where we are in the lookup array, and a max variable so we can tell when to stop processing the file! That way we only perform N array lookups, one for each line we want; the rest of the time we compare FNR against a variable, which should be faster. In addition, we stop execution right after printing the last line we need.

Obviously, this requires that we have a sorted list of the lines that we want to extract.

Here readthese is your takeThese.txt, and list.txt is a file whose lines are numbered 1 to 1,000,000.

 awk 'BEGIN{i=1; max=1;} FNR==NR{ if($1 != ""){h[max]=$1; max++; next}} { if(!l){l=h[i]; i++; } if( FNR == l ){ print $0; l=h[i]; i++; if(i == max){ exit; } } }' readthese list.txt

In a more readable format

    awk '
    BEGIN{ i=1; max=1; }
    FNR==NR{
        if($1 != ""){
            h[max]=$1; max++; next
        }
    }
    {
        if(!l){
            l=h[i]; i++;
        }
        if( FNR == l ){
            print $0;
            l=h[i]; i++;
            if(i == max){ exit; }
        }
    }' readthese list.txt

i is our current position in the array h, which holds the line numbers we want to extract. max is essentially the length of h; when i == max we know we can stop. l is the number of the next line we want to extract.

EDIT: you can replace readthese with <(sort -n readthese) if you need the file of line numbers sorted first.

+2
source

I would try one of these

  • may lead to false positives:

     cat -n reads.fastq | grep -Fwf takeThese.txt | cut -d$'\t' -f2-
  • one of {bash, ksh, zsh} is required:

     sed -n -f <(sed 's/$/p/' takeThese.txt) reads.fastq 
  • this is similar to Andreas Wederbrand's perl answer, implemented in awk:

      awk -v nums=takeThese.txt '
          function next_index() {
              ("sort -n " nums) | getline i
              return i
          }
          BEGIN { linenum = next_index() }
          NR == linenum { print; linenum = next_index() }
      ' reads.fastq

But if you are dealing with a lot of data, text-processing tools will take their time. Another option is to import the data into a proper database and use SQL to extract it: this is the kind of task databases are made for.
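For example, a minimal sketch of the database route using the sqlite3 command-line shell. The database, table, and column names are made up for illustration; it assumes the reads contain no tab characters, and like the sed variant above the $'\t' needs one of {bash, ksh, zsh}:

    # one table for the numbered reads, one for the wanted line numbers
    sqlite3 reads.db 'CREATE TABLE reads(lineno INTEGER PRIMARY KEY, line TEXT);
                      CREATE TABLE wanted(lineno INTEGER PRIMARY KEY);'

    # load reads.fastq with its line numbers attached, then the pick list
    awk '{print NR "\t" $0}' reads.fastq |
        sqlite3 -separator $'\t' reads.db '.import /dev/stdin reads'
    sqlite3 reads.db '.import takeThese.txt wanted'

    # the extraction itself is then an indexed join on the primary keys
    sqlite3 reads.db 'SELECT line FROM reads JOIN wanted USING (lineno)
                      ORDER BY lineno;' > subsample.fastq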

+2

Since I have the flu and I'm bored, I tested some approaches to try to speed up the original solution. Test Files:

 $ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq 

and

 $ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt 

The first file is just the numbers from 1 to 100000000, one per line. It doesn't represent real data, but I was curious about the relative runtimes of the awk solutions, so I assumed real data would only scale the resulting times by a constant factor (until memory starts to run out).

The second file represents a one-percent, evenly distributed set of hits into the first file:

    100
    200
    300
    ...

First up is the original OP script:

    $ time awk 'FNR==NR {h[$1]; next} (FNR in h)' takeThese.txt reads.fastq > /dev/null

    real    0m52.901s
    user    0m52.596s
    sys     0m0.284s

My solution:

    BEGIN {
        j=1
        while((getline a[++i] < "takeThese.txt") > 0);  # row numbers to a
    }
    NR<a[j] { next }   # skip rows before next match
    j++                # print and iterate j
    j==i { exit }      # exit after last hit

Running it:

    $ time awk -f program.awk reads.fastq > /dev/null

    real    0m25.894s
    user    0m25.676s
    sys     0m0.208s

This requires takeThese.txt to be in numerical order.

+2

Late to the party, but this can be a fast alternative. It uses the raw speed of join, but you need to convert the join fields to lexicographic order.

 $ join <(awk '{printf "%09d\n", $1}' pick.list) <(nl -w9 -ba -nrz big.file) | cut -d' ' -f2- 

Pre-process your pick list to add leading zeros, and add line numbers with leading zeros of the same width to your big file. This assumes your pick list is in numerical order; otherwise sort it first, as sketched below.
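If the pick list might not be sorted, the sort can ride along in the same pipeline: zero-padding makes numeric and lexicographic order coincide, so a plain sort after the padding is enough. A sketch of the same command with that added:

    join <(awk '{printf "%09d\n", $1}' pick.list | sort) <(nl -w9 -ba -nrz big.file) | cut -d' ' -f2-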

Replace "pick.list" and "big.file" with your own file names. Also, if the big file has more than 999,999,999 lines, adjust the widths accordingly ("%09d" and -w9).

If you try this, post your timings. I guess this will be much faster than the awk alternatives.

nl options:

    -w9    width of the line number is 9
    -ba    number all lines, including empty ones, not just the body text
    -nrz   format numbers right-justified with leading zeros, i.e. 000000001

+2

Are the lines in reads.fastq all the same length?

If so, a simple program in Java or any other language could take each line number from takeThese.txt and find the position at which that line starts in reads.fastq by multiplying the line number by the line length, then jump straight to it.
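A rough sketch of that idea in perl, which can seek to a byte offset; the fixed record length of 30 bytes (trailing newline included) is an assumption for illustration only:

    # jump straight to each wanted line; assumes every line of reads.fastq
    # is exactly $len bytes long, including the trailing newline
    my $len = 30;                          # assumed fixed record length
    open(my $in,  '<', 'reads.fastq')   or die $!;
    open(my $idx, '<', 'takeThese.txt') or die $!;
    while (my $n = <$idx>) {
        chomp $n;
        seek($in, ($n - 1) * $len, 0);     # byte offset where line $n starts
        print scalar <$in>;                # read and print that single line
    }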

If not, then the only way to find the right line is to count newlines, which means reading every character. It can still be faster than awk, and it definitely helps to sort the line numbers.

0
