Combining 2 very large text files, updating each line, without running out of memory

Say I have 2 text files with approximately 2 million lines each (file size 50-80 MB each). The structure of both files is the same:

Column1 Column2 Column3 ... 

Column 1 never changes. Column 2: the same value may or may not appear in both files, and the rows are not in the same order in the two files. Column 3 is a number and differs between the files.

I need to merge them into a single file, matching on column 2. If a column 2 value exists in both files, update column 3 by adding the column 3 values from both files together.

If the files were not so huge, I could easily do this in PHP by reading each line of both files into arrays and working from there, but that quickly exhausts the available memory.

Is there a way to do this without loading every line into memory? I am mostly familiar with PHP, but I am open to Python, Java or shell scripts if they are not too complicated to understand.

+6
5 answers

I would go with the sort(1) command-line tool to merge and sort the files. After that, all you need is a short script to calculate the sums. I do not know PHP, so I will give my example in Python:

sort -k2 <file1> <file2> | python -c "
import itertools, sys
allLines = (x.strip().split(' ') for x in sys.stdin)
groups = itertools.groupby(allLines, lambda x: x[1])
for k, lines in groups:
    lines = list(lines)
    firstLine = lines[0]
    print firstLine[0], firstLine[1], sum(int(x[2]) for x in lines)
"
+2

Well, so if I'm reading this correctly, you would have:

file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87 <-- common column 2
abc 12 67 <-- common column 2
abc 23 1  <-- unique column 2

the output should be:

abc 12 101
abc 90 99

If that's the case, then something like this should work (assuming the files are in CSV format):

$f1 = fopen('file1.txt', 'rb');
$f2 = fopen('file2.txt', 'rb');
$fout = fopen('output.txt', 'wb');

$data = array();

while (1) {
    if (feof($f1) || feof($f2)) {
        break; // quit if we hit the end of either file
    }

    $line1 = fgetcsv($f1);
    if (isset($data[$line1[1]])) {
        // saw the col2 value earlier, so do the math for the output file:
        $col3 = $line1[2] + $data[$line1[1]][2];
        $output = array($line1[0], $line1[1], $col3);
        fputcsv($fout, $output);
        unset($data[$line1[1]]); // remove line from cache
    } else {
        $data[$line1[1]] = $line1; // cache the line, if the col2 value wasn't seen already
    }

    $line2 = fgetcsv($f2);
    if (isset($data[$line2[1]])) {
        $col3 = $data[$line2[1]][2] + $line2[2];
        $newdata = array($line2[0], $line2[1], $col3);
        fputcsv($fout, $newdata);
        unset($data[$line2[1]]); // remove line from cache
    } else {
        $data[$line2[1]] = $line2;
    }
}

fclose($f1);
fclose($f2);
fclose($fout);

This is off the top of my head, untested, may well not work, YMMV, etc.

It would make things easier if you pre-sorted the two input files so that column 2 is the sort key. That keeps the cache small, because you would know right away whether a matching value had already been seen and when previously cached data could be flushed.
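For illustration, here is a rough, untested sketch of that pre-sorted variant: both files are walked in lockstep (a classic merge join), so nothing needs to be cached at all. The file names, the space delimiter and the "only emit values present in both files" behaviour are assumptions, not something from the original answer.

<?php
// Merge-join two inputs that are both pre-sorted on column 2.
// Only column-2 values present in both files are written out,
// with their column-3 values added together.
$f1   = fopen('file1_sorted.txt', 'rb');
$f2   = fopen('file2_sorted.txt', 'rb');
$fout = fopen('output.txt', 'wb');

$a = fgetcsv($f1, 0, ' ');
$b = fgetcsv($f2, 0, ' ');

while ($a !== false && $b !== false) {
    $cmp = strcmp($a[1], $b[1]); // inputs assumed sorted lexicographically on column 2
    if ($cmp < 0) {
        $a = fgetcsv($f1, 0, ' ');   // file1 is behind, advance it
    } elseif ($cmp > 0) {
        $b = fgetcsv($f2, 0, ' ');   // file2 is behind, advance it
    } else {
        // same column-2 value in both files: sum column 3 and advance both
        fputcsv($fout, array($a[0], $a[1], $a[2] + $b[2]), ' ');
        $a = fgetcsv($f1, 0, ' ');
        $b = fgetcsv($f2, 0, ' ');
    }
}

fclose($f1);
fclose($f2);
fclose($fout);

Since each file is read exactly once and only the current line from each is held, memory use stays constant no matter how large the files are.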

+1

What may be tripping you up is that you are looking at two files. That isn't necessary. To reuse Mark's excellent example, file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87
abc 12 67
abc 23 1

then

 sort file1 file2 > file3 

gives file3:

abc 12 34
abc 12 67
abc 23 1
abc 56 78
abc 90 12
abc 90 87

Reducing that to its final form is second-week CS-101 material.
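For completeness, a minimal sketch of that reduction step (the answer leaves it as an exercise): walk the sorted file3 once, sum column 3 for consecutive lines that share a column-2 value, and, to match the expected output shown earlier, print only the values that appeared on more than one line. The file name and the space-separated format are assumptions.

<?php
// One pass over file3, which `sort` has already grouped by column 2
// (column 1 is constant, so whole-line sorting groups column 2 together).
$f = fopen('file3', 'rb');
$prevCol1 = null;
$prevCol2 = null;
$sum = 0;
$count = 0;

while (($line = fgets($f)) !== false) {
    $parts = preg_split('/\s+/', trim($line));
    if (count($parts) < 3) {
        continue;                                // skip blank or malformed lines
    }
    if ($parts[1] !== $prevCol2) {
        if ($count > 1) {
            echo "$prevCol1 $prevCol2 $sum\n";   // flush the finished group
        }
        $prevCol1 = $parts[0];
        $prevCol2 = $parts[1];
        $sum = 0;
        $count = 0;
    }
    $sum += (int) $parts[2];
    $count++;
}
if ($count > 1) {
    echo "$prevCol1 $prevCol2 $sum\n";           // flush the last group
}
fclose($f);

Only one line of state is kept at a time, so this also runs in constant memory.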

+1

You can easily solve this with the Python sqlite3 module without consuming much memory (about 13 MB with 1 million lines):

import sqlite3

files = ("f1.txt", "f2.txt")  # Files to compare

# # Create test data
# for file_ in files:
#     f = open(file_, "w")
#     fld2 = 0
#     for fld1 in "abc def ghi jkl".split():
#         for fld3 in range(1000000 / 4):
#             fld2 += 1
#             f.write("%s %s %s\n" % (fld1, fld2, 1))
#     f.close()

sqlite_file = "./join.tmp"  # or :memory: if you don't want to create a file
cnx = sqlite3.connect(sqlite_file)

# Create & load tables
for file_ in range(len(files)):
    table = "file%d" % (file_ + 1)
    cnx.execute("drop table if exists %s" % table)
    cnx.execute("create table %s (fld1 text, fld2 int primary key, fld3 int)" % table)
    for line in open(files[file_], "r"):
        cnx.execute("insert into %s values (?,?,?)" % table, line.split())

# Join & print the result
cur = cnx.execute("select f1.fld1, f1.fld2, (f1.fld3 + f2.fld3) "
                  "from file1 f1 join file2 f2 on f1.fld2 == f2.fld2")
while True:
    row = cur.fetchone()
    if not row:
        break
    print row[0], row[1], row[2]

cnx.close()
+1

PHP's memory_limit is suited to its primary task of scripting a web server. It is extremely ill-suited to batch data processing, which is the kind of work you are trying to do. The problem is the configured memory_limit, not that you are trying to do something that requires "too much" memory. My phone has enough memory to simply load two 80 MB files into memory and do this the quick and easy way, let alone any kind of real computer, which should be able to load gigabytes (or at least 1 GB) of data without breaking a sweat.

Apparently, you can set PHP's memory_limit (which is arbitrary and very small by today's standards) at runtime with ini_set, just for that script. Do you know how much memory you have available on the server? I know that many hosting providers give you very little memory by today's standards, because they do not expect you to do much more than handle web page requests. But you can probably just do it directly in PHP the way you want, without jumping through hoops (and significantly slowing down the process) to avoid loading both files into memory at once.
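As a concrete sketch of what that could look like (the limit value, file names and output format here are assumptions, not part of the answer): raise memory_limit for this one script with ini_set, then do the whole merge in a single associative array keyed on column 2.

<?php
// Raise the limit for this batch script only (assumes the host permits it).
ini_set('memory_limit', '512M');

$totals = array(); // column-2 value => array(column 1, running column-3 total)

foreach (array('file1.txt', 'file2.txt') as $file) {
    foreach (file($file) as $line) {           // file() reads the whole file into memory
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) < 3) {
            continue;                          // skip blank or malformed lines
        }
        if (!isset($totals[$parts[1]])) {
            $totals[$parts[1]] = array($parts[0], 0);
        }
        $totals[$parts[1]][1] += (int) $parts[2];
    }
}

$fout = fopen('merged.txt', 'wb');
foreach ($totals as $col2 => $info) {
    fwrite($fout, $info[0] . ' ' . $col2 . ' ' . $info[1] . "\n");
}
fclose($fout);

Whether 512M (or whatever value you choose) is enough depends on the host, but holding roughly four million short rows in an array is exactly the kind of load this answer argues a real machine should handle.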

0

Source: https://habr.com/ru/post/896167/

