I am trying to sort a very large text file, but have run into difficulties:
I have a huge text file. Each line has the format "AGTCCCGGAT filename", where the first part is a DNA sequence.
Our professor suggested splitting this huge file into many temporary files and using heapq.merge() to merge them in sorted order. The goal is to end up with a single file that contains every line of the source file, sorted.
My first attempt was to write each line into its own temporary file. The problem is that heapq.merge() then fails because there are too many files to open at once.
My second attempt was to split it into temporary files of 50,000 lines each. The problem is that the result does not seem to be sorted by line, but only grouped by file. For example, we get something like:
ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename
where the first two lines come from one temporary file, and the last two lines come from a second one.
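To show the symptom in isolation, here is a minimal sketch using two small in-memory lists in place of the temporary files. The names chunk_a and chunk_b and their contents are made up for illustration. The documentation for heapq.merge() states that it merges inputs that are each already sorted, so the sketch compares merging unsorted chunks with merging sorted ones.

```python
import heapq

# Two hypothetical chunks, each kept in its original (unsorted) order.
chunk_a = ["CGTACGTA filename\n", "ACGTACGT filename\n"]
chunk_b = ["CGTAAAAA filename\n", "ACGTCCGT filename\n"]

# Merging unsorted inputs: heapq.merge only compares the current head
# of each input, so the overall result is not guaranteed to be sorted.
unsorted_merge = list(heapq.merge(chunk_a, chunk_b))

# Merging inputs that were sorted first gives a fully sorted result.
sorted_merge = list(heapq.merge(sorted(chunk_a), sorted(chunk_b)))

print(unsorted_merge == sorted(unsorted_merge))
print(sorted_merge == sorted(sorted_merge))
```

If this is what is happening with the real files, sorting each 50,000-line chunk before writing it out would be the thing to try.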
My code for sorting them is as follows:
    import heapq
    import os

    # result is the output file, opened for writing earlier in the script
    for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/'+str(f),'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
        result.write(line)
    result.close()
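For comparison, the split-then-merge approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name external_sort and the parameters src_path, dst_path and chunk_lines are hypothetical, each chunk is assumed to fit in memory, and the number of chunks is assumed to stay below the operating system's limit on open files. The key difference from the code above is that every chunk is sorted before it is written to its temporary file.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(src_path, dst_path, chunk_lines=50000):
    """Sort a large text file line by line via sorted temporary chunks."""
    chunk_paths = []
    with open(src_path) as src:
        while True:
            # Read at most chunk_lines lines into memory.
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            # Ensure the final line ends with a newline so that lines
            # do not run together after merging.
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"
            chunk.sort()  # each chunk must be sorted before merging
            fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    files = [open(p) for p in chunk_paths]
    try:
        with open(dst_path, "w") as dst:
            # heapq.merge lazily interleaves the already sorted chunks.
            dst.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

It would be called as, for example, external_sort('input.txt', 'sorted.txt'), where both file names are placeholders. Lines are compared as whole strings, so the DNA sequence at the start of each line determines the order and the file name only breaks ties.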