Sort a large file using Python heapq.merge

Question

I am trying to do the following, but have run into difficulties:

I have a huge text file. Each line has the format "AGTCCCGGAT filename", where the first part is a DNA sequence.

The professor suggested that we split this huge file into many temporary files and use heapq.merge() to sort them. The goal is to end up with a single file that contains every line of the source file, in sorted order.

My first attempt was to write each line to a separate temporary file. The problem is that heapq.merge() complains that there are too many files to sort.

My second attempt was to split it into temporary files of 50,000 lines each. The problem is that the result does not seem to be sorted by line, but by file. For example, we get something like:

ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename

where the first two lines come from one temporary file and the last two lines come from a second one.

My code for sorting them is as follows:

for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/'+str(f),'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
     result.write(line)
result.close()

python sorting

Destino May 03 '14 at 21:42

1 answer

Antti Haapala · Answer 1 · 2016-09-06T13:19:47+0000

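Your approach is almost right. However, heapq.merge() assumes that each of its inputs is already sorted, so every chunk must be sorted before it is written to disk. Here is a two-pass approach: first, read the file in chunks of 50k lines, sort the lines of each chunk, and write the sorted chunk to its own file. In the second pass, open all of these files and merge them into the output file.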

from heapq import merge
from itertools import count, islice
from contextlib import ExitStack  # not available on Python 2
                                  # need to care for closing files otherwise

chunk_names = []

# chunk and sort
with open('input.txt') as input_file:
    for chunk_number in count(1):
        # read in next 50k lines and sort them
        sorted_chunk = sorted(islice(input_file, 50000))
        if not sorted_chunk:
            # end of input
            break

        chunk_name = 'chunk_{}.chk'.format(chunk_number)
        chunk_names.append(chunk_name)
        with open(chunk_name, 'w') as chunk_file:
            chunk_file.writelines(sorted_chunk)

with ExitStack() as stack, open('output.txt', 'w') as output_file:
    files = [stack.enter_context(open(chunk)) for chunk in chunk_names]
    output_file.writelines(merge(*files))
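
To see why the per-chunk sort matters, here is a minimal sketch using small in-memory lists in place of the temporary files. The chunk contents are made up for illustration and are not taken from the real data; the point is only that merge() compares the current head of each input and nothing else, so it cannot repair inputs that are not sorted themselves.

```python
from heapq import merge

# Hypothetical chunk contents, made up for illustration; each list stands in
# for the lines of one temporary file.
unsorted_a = ["CGTACGTA filename\n", "ACGTACGT filename\n"]
unsorted_b = ["CGTAAAAA filename\n", "ACGTCCGT filename\n"]

expected = sorted(unsorted_a + unsorted_b)

# merge() only ever compares the current head of each input, so unsorted
# inputs give an output that is not globally sorted.
merged_unsorted = list(merge(unsorted_a, unsorted_b))

# Sorting each chunk first (what the first pass above does) fixes this.
merged_sorted = list(merge(sorted(unsorted_a), sorted(unsorted_b)))

print(merged_unsorted == expected)  # False
print(merged_sorted == expected)    # True
```

The same reasoning applies to file objects: merge() pulls lines lazily from each open file, so only one line per chunk needs to be in memory at a time, provided every chunk file was written in sorted order.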
