Tangled loop problem (python)

This is similar to my earlier question about merge sort in Python; I'm asking again because I don't think I explained the problem well the first time.

Basically, I have about 1000 files containing domain names. In total the data is > 1 GB, so I'm trying not to load it all into RAM. Each individual file was sorted using .sort(get_tld), which orders the data by TLD (not by domain name): all the .coms together, all the .orgs together, etc.

a typical file might look like

something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org

but obviously much longer.

Now I want to merge all the files into one, preserving the sort so that in the end one large file still has all .coms together, .orgs together, etc.

What I want to do basically is

open all the files
loop:
    read 1 line from each open file
    put them all in a list and sort with .sort(get_tld)
    write each item from the list to a new file

What I can't figure out is the mechanics: can I really open() 1000 files at once, and how do I read 1 line from each file per iteration while keeping track of which files are finished? That loop is where I'm tangled up.


+3

Opening 1000 files at once may exceed your OS limit on open file descriptors; if it does, merge the files N at a time into intermediate files (the value of N depends on the limit; with N around 32, 1000 files need only a couple of passes). The "merge N sorted files" operation is the same either way (see below).

The key tool here (hint: heapq ;-) is a heap of "current lines", one per input file, each keyed by its TLD: repeatedly pop the entry with the smallest key, write its line to the output, and push the next line from the same file in its place (dropping a file from the heap once it runs out of lines). Something along these lines ;-):

import heapq

def merge(inputfiles, outputfile, key):
  """inputfiles: list of input, sorted files open for reading.
     outputfile: output file open for writing.
     key: callable supplying the "key" to use for each line.
  """
  # prepare the heap: items are lists with [thekey, k, theline, thefile]
  # where k is an arbitrary int guaranteed to be different for all items,
  # theline is the last line read from thefile and not yet written out,
  # (guaranteed to be a non-empty string), thekey is key(theline), and
  # thefile is the open file
  h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
  h = [[key(s), k, s, i] for k, s, i in h if s]
  heapq.heapify(h)

  while h:
    # get and output the lowest available item (==available item w/lowest key)
    item = heapq.heappop(h)
    outputfile.write(item[2])

    # replenish the item with the _next_ line from its file (if any)
    item[2] = item[3].readline()
    if not item[2]: continue  # don't reinsert finished files

    # compute the key, and re-insert the item appropriately
    item[0] = key(item[2])
    heapq.heappush(h, item)

Of course, you also need a key callable that extracts the TLD from each line; you could reach for urlparse, but for bare domain names a simple split on the last dot will do. For example,

def tld(domain):
  return domain.rsplit('.', 1)[-1].strip()

i.e. whatever follows the last dot, with the trailing newline stripped.

Python 2.6 added heapq.merge, which does most of this work for you, but it takes no key callable (it compares the lines themselves), so you would have to "decorate/undecorate" each line with its TLD yourself.
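The decorate/undecorate idiom can be sketched like this (merge_by_key and get_tld are illustrative names, not from the original answer): each line is wrapped in a (key, source_index, line) tuple so that heapq.merge compares keys first, with the source index as a tie-breaker so lines themselves are never compared.

```python
import heapq

def get_tld(line):
    # key helper: everything after the last dot, newline stripped
    return line.rsplit('.', 1)[-1].strip()

def merge_by_key(iterables, key):
    """Merge already-sorted iterables of lines by key(line)."""
    # decorate: (key, source index, line) -- the index breaks ties
    decorated = (
        ((key(line), i, line) for line in it)
        for i, it in enumerate(iterables)
    )
    # merge the decorated streams, then undecorate
    for _, _, line in heapq.merge(*decorated):
        yield line
```

On Python 3.5 and later you can skip the decoration entirely and pass key=get_tld directly to heapq.merge.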

+6

You want a merge of already-sorted files, which is exactly what heapq.merge does. Opening 1000 files at once is usually fine, and since only one pending line per file is held at a time, memory use stays at roughly 1000 lines rather than the whole multi-gigabyte data set.

+3

Since every TLD starts with one of 26 letters, make a single pass over the input and append each line to one of 26 bucket files named after the first letter of its TLD: domains-a.dat, domains-b.dat, and so on.

Advantage: you stream the data with at most 26+1 files open at once, so you never need all 1000 input files open or the whole data set in memory.

If 26 buckets are too coarse, split on the first two letters instead: domains-aa.dat, ... domains-ab.dat, and so on. Then sort each bucket in memory (Python's sort is fast) and concatenate the sorted buckets into the final file.
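A minimal sketch of the bucketing pass, using an in-memory dict in place of the domains-*.dat files (bucket_by_tld_letter and concatenate are illustrative names):

```python
from collections import defaultdict

def bucket_by_tld_letter(lines):
    # group lines by the first letter of their TLD -- the in-memory
    # analogue of appending to domains-a.dat, domains-b.dat, ...
    buckets = defaultdict(list)
    for line in lines:
        tld = line.rstrip().rsplit('.', 1)[-1]
        buckets[tld[0]].append(line)
    return buckets

def concatenate(buckets):
    # sort each bucket by TLD, then emit buckets in letter order
    out = []
    for letter in sorted(buckets):
        out.extend(sorted(buckets[letter],
                          key=lambda l: l.rstrip().rsplit('.', 1)[-1]))
    return out
```

With real files, each bucket would be read back, sorted, and appended to the output in the same letter order.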

+2

Your merging algorithm for sorted files is incorrect. What you should do is read one line from each file, find the line with the lowest key among those you've read, write only that line to the output, and then read the next line from the file it came from. Repeat this process (skipping any files that have reached EOF) until every file is exhausted.

+1
#! /usr/bin/env python

"""Usage: unconfuse.py file1 file2 ... fileN

Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""

import sys, os

spools = {}

for name in sys.argv[1:]:
    for line in open(name):
        if (line == "\n"): continue
        tld = line[line.rindex(".")+1:-1]
        spool = spools.get(tld)
        if spool is None:
            # one temporary spool file per TLD, opened for read/write
            spool = open(tld + ".spool", "w+")
            spools[tld] = spool
        spool.write(line)

for tld in sorted(spools):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)

Source: https://habr.com/ru/post/1761406/

