I have a large file with two numbers per line, sorted by the second column. I am building a dictionary of lists keyed by the first column.
My code looks like this:

    from collections import defaultdict

    d = defaultdict(list)
    for line in fin:          # iterate over the lines of the input file
        vals = line.split()
        d[vals[0]].append(vals[1])
    process(d)
However, the input file is too large, so d will not fit into memory.
To get around this, I can read the file in pieces, but the pieces need to overlap so that process(d) doesn't miss anything.
In pseudocode, I could do the following:
- Read 100 lines, building the dictionary d.
- Process the dictionary d.
- Remove everything from d that is not within 10 of the maximum value seen.
- Repeat, but make sure d holds no more than 100 rows of data at any time (rough sketch below).
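Here is a rough, untested sketch of what I have in mind (process stands in for the real work; the 100 and 10 from the pseudocode are hardcoded as defaults):

    from collections import defaultdict

    def chunked_process(fin, process, chunk_size=100, overlap=10):
        d = defaultdict(list)
        max_val = 0
        done = False
        while not done:
            # read the next chunk_size lines into d (d briefly holds
            # the retained overlap plus the new chunk)
            for _ in range(chunk_size):
                line = fin.readline()
                if not line:           # end of file
                    done = True
                    break
                key, val = line.split()
                val = int(val)
                d[key].append(val)
                max_val = val          # the file is sorted by this column
            process(d)
            # keep only values within `overlap` of the largest value seen,
            # so the next chunk overlaps with this one
            for key in list(d):
                d[key] = [v for v in d[key] if max_val - v <= overlap]
                if not d[key]:
                    del d[key]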
Is there a good way to do this in Python?
Update: more about the problem. I will use d while reading pairs from a second file, outputting a pair depending on how many values in the list associated with the pair's first value in d are within 10 of its second value. The second file is also sorted by the second column.
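Roughly, the second pass would be something like this (fin2 is the second file, and emit_pair is a stand-in for whatever I actually output):

    # rough shape of the second pass over the second file
    for line in fin2:
        key, val = line.split()
        val = int(val)
        # how many stored values for this key are within 10 of val?
        close = sum(1 for v in d.get(key, []) if abs(v - val) <= 10)
        if close:
            emit_pair(key, val, close)   # stand-in for the real output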
Fake data: let's say we can fit 5 rows of data in memory, and we need the overlap in values to be 5.
    1 1
    2 1
    1 6
    7 6
    1 16
So now d is {1: [1,6,16], 2: [1], 7: [6]}.
For the next chunk, we only need to keep the last value (since 16 - 6 > 5). So we set d to {1: [16]} and continue reading the next 4 lines.
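In code, the trimming step for this example would be something like this (assuming the values are stored as ints):

    # drop everything more than 5 below the largest value seen (16)
    max_val, overlap = 16, 5
    for key in list(d):
        d[key] = [v for v in d[key] if max_val - v <= overlap]
        if not d[key]:
            del d[key]
    # d is now {1: [16]}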