Process large file in chunks

I have a large file with two numbers per row, sorted by the second column. I am building a dictionary of lists keyed on the first column.

My code looks like this:

    from collections import defaultdict

    d = defaultdict(list)
    for line in fin:                  # iterating the file yields lines; fin.readline() would loop over one line's characters
        vals = line.split()
        d[vals[0]].append(vals[1])
    process(d)

However, the input file is too large, so d will not fit into memory.

To get around this, I can read the file one piece at a time, but the pieces need to overlap so that process(d) doesn't miss anything.

In pseudocode, I could do the following (a rough Python sketch follows the list).

  • Read 100 lines, building the dictionary d .
  • Process dictionary d .
  • Remove everything from d that is not within 10 of the maximum value read so far.
  • Repeat, making sure d holds no more than 100 rows of data at any time.
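
A minimal sketch of that pseudocode, assuming process and an open file handle fin as in the snippet above; the 100-row chunk and the within-10 window are the numbers from the list:

    from collections import defaultdict

    CHUNK_ROWS = 100   # most rows to hold in memory at once
    WINDOW = 10        # keep values within this distance of the running maximum

    d = defaultdict(list)
    rows_in_d = 0
    for line in fin:
        key, val = line.split()
        val = int(val)
        d[key].append(val)
        max_val = val              # valid because the file is sorted by the second column
        rows_in_d += 1
        if rows_in_d >= CHUNK_ROWS:
            process(d)
            # prune: drop every value more than WINDOW below the maximum
            pruned = defaultdict(list)
            for k, vs in d.items():
                kept = [v for v in vs if max_val - v <= WINDOW]
                if kept:
                    pruned[k] = kept
            d = pruned
            rows_in_d = sum(len(vs) for vs in d.values())
    process(d)                     # final, possibly partial chunk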

Is there a good way to do this in Python?

Update: more about the problem. I will use d while reading pairs from a second file, outputting a pair depending on how many of the values in the list associated with its first value in d are within 10 of its second value. The second file is also sorted by the second column.
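
A rough sketch of that lookup, assuming the second file has the same two-column format; the filename and the printed output are illustrative guesses:

    with open('second_file.txt') as fin2:   # hypothetical filename
        for line in fin2:
            key, val = line.split()
            val = int(val)
            # count the values collected for this key that are within 10 of val
            matches = sum(1 for v in d.get(key, []) if abs(val - v) <= 10)
            if matches:
                print(key, val, matches)    # "output a pair" -- exact format not specified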

Fake data: let's say we can fit 5 rows of data in memory, and we need the overlap in values to be 5.

    1 1
    2 1
    1 6
    7 6
    1 16

So now d is {1: [1,6,16], 2: [1], 7: [6]}.

For the next chunk, we only need to keep the last value (since 16 - 6 > 5). Therefore we set

d to {1: [16]} and continue reading the next 4 lines.
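
A quick check of that pruning rule on the fake data, as a standalone snippet:

    d = {1: [1, 6, 16], 2: [1], 7: [6]}
    max_val = 16   # largest second-column value read so far
    window = 5
    pruned = {k: [v for v in vs if max_val - v <= window] for k, vs in d.items()}
    pruned = {k: vs for k, vs in pruned.items() if vs}   # drop emptied keys
    print(pruned)  # {1: [16]}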

3 answers

Have you tried the Pandas library, in particular reading your data into a DataFrame and then using groupby on the first column?

Pandas will allow you to efficiently perform bulk operations on your data, and you can read it lazily if you want.
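
A minimal sketch of that idea, assuming whitespace-delimited input; the filename 'pairs.txt' and the column names are made up for illustration:

    import pandas as pd

    # read the file lazily, 100 rows at a time
    reader = pd.read_csv('pairs.txt', sep=r'\s+', header=None,
                         names=['first', 'second'], chunksize=100)
    for chunk in reader:
        # build key -> list-of-values for this chunk, then hand it off
        d = chunk.groupby('first')['second'].apply(list).to_dict()
        process(d)

Note that this sketch does not carry the overlap between chunks; rows within 10 of a chunk boundary would still need to be carried over into the next chunk, as in the other answers.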


You do not need a defaultdict unless something strange is happening with the file, and you did not mention anything like that. Instead, use a list, which keeps your data in line order, so you can process it in overlapping slices, like this:

    d = []
    for line in fin:               # iterate over lines, not fin.readline()
        vals = line.split()
        d.append(vals)
        if not len(d) % 100:       # every 100 rows...
            process(d)
            d = d[90:]             # ...keep the last 10 as the overlap
    process(d)                     # final, possibly partial chunk

You could do something like this:

    n_process = 100   # chunk size
    n_overlap = 10    # rows carried over between chunks

    data_chunk = []
    for line in fin:                             # iterate over lines, not fin.readline()
        vals = line.split()
        data_chunk.append(vals)
        if len(data_chunk) == n_process:
            process(data_chunk)
            data_chunk = data_chunk[-n_overlap:] # keep the overlap
    process(data_chunk)                          # final, possibly partial chunk

When using a dictionary, data can be overwritten if the same first-column number occurs several times within a chunk. Note also that you would need an OrderedDict , since a plain dict had no guaranteed iteration order before Python 3.7. In most cases, however, needing an OrderedDict is a sign of bad code.
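
For reference, a tiny illustration of the ordering point (on Python 3.7+ a plain dict also preserves insertion order, but OrderedDict makes the intent explicit):

    from collections import OrderedDict

    d = OrderedDict()
    for key, val in [('1', '1'), ('2', '1'), ('1', '6')]:
        d.setdefault(key, []).append(val)
    print(d)   # OrderedDict([('1', ['1', '6']), ('2', ['1'])])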

And by the way: we still don't know why you are trying to do it this way...


Source: https://habr.com/ru/post/1493512/

