Process large file in chunks

I have a large file with two numbers per row, sorted by the second column. I am building a dictionary of lists keyed on the first column.

My code looks like this:

    from collections import defaultdict

    d = defaultdict(list)
    for line in fin:                  # iterating the file yields lines; fin.readline() would loop over one line's characters
        vals = line.split()
        d[vals[0]].append(vals[1])
    process(d)

However, the input file is too large, so d will not fit into memory.

To get around this, I can read the file one piece at a time, but the pieces need to overlap so that process(d) doesn't miss anything.

In pseudocode, I could do the following (a rough Python sketch follows the list).

  • Read 100 lines, building the dictionary d .
  • Process dictionary d .
  • Remove everything from d that is not within 10 of the maximum value read so far.
  • Repeat, making sure d holds no more than 100 rows of data at any time.
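
A minimal sketch of that pseudocode, assuming process and an open file handle fin as in the snippet above; the 100-row chunk and the within-10 window are the numbers from the list:

    from collections import defaultdict

    CHUNK_ROWS = 100   # most rows to hold in memory at once
    WINDOW = 10        # keep values within this distance of the running maximum

    d = defaultdict(list)
    rows_in_d = 0
    for line in fin:
        key, val = line.split()
        val = int(val)
        d[key].append(val)
        max_val = val              # valid because the file is sorted by the second column
        rows_in_d += 1
        if rows_in_d >= CHUNK_ROWS:
            process(d)
            # prune: drop every value more than WINDOW below the maximum
            pruned = defaultdict(list)
            for k, vs in d.items():
                kept = [v for v in vs if max_val - v <= WINDOW]
                if kept:
                    pruned[k] = kept
            d = pruned
            rows_in_d = sum(len(vs) for vs in d.values())
    process(d)                     # final, possibly partial chunk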

Is there a good way to do this in Python?

Update: more about the problem. I will use d while reading pairs from a second file, outputting a pair depending on how many of the values in the list associated with its first value in d are within 10 of its second value. The second file is also sorted by the second column.
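
A rough sketch of that lookup, assuming the second file has the same two-column format; the filename and the printed output are illustrative guesses:

    with open('second_file.txt') as fin2:   # hypothetical filename
        for line in fin2:
            key, val = line.split()
            val = int(val)
            # count the values collected for this key that are within 10 of val
            matches = sum(1 for v in d.get(key, []) if abs(val - v) <= 10)
            if matches:
                print(key, val, matches)    # "output a pair" -- exact format not specified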

Fake data: let's say we can fit 5 rows of data in memory, and we need the overlap in values to be 5.

    1 1
    2 1
    1 6
    7 6
    1 16

So now d is {1: [1,6,16], 2: [1], 7: [6]}.

For the next chunk, we only need to keep the last value (since 16 - 6 > 5). Therefore we set

d to {1: [16]} and continue reading the next 4 lines.
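
A quick check of that pruning rule on the fake data, as a standalone snippet:

    d = {1: [1, 6, 16], 2: [1], 7: [6]}
    max_val = 16   # largest second-column value read so far
    window = 5
    pruned = {k: [v for v in vs if max_val - v <= window] for k, vs in d.items()}
    pruned = {k: vs for k, vs in pruned.items() if vs}   # drop emptied keys
    print(pruned)  # {1: [16]}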

3 answers

Have you tried the Pandas library, in particular reading your data into a DataFrame and then using groupby on the first column?

Pandas will allow you to efficiently perform bulk operations on your data, and you can read it lazily if you want.
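
A minimal sketch of that idea, assuming whitespace-delimited input; the filename 'pairs.txt' and the column names are made up for illustration:

    import pandas as pd

    # read the file lazily, 100 rows at a time
    reader = pd.read_csv('pairs.txt', sep=r'\s+', header=None,
                         names=['first', 'second'], chunksize=100)
    for chunk in reader:
        # build key -> list-of-values for this chunk, then hand it off
        d = chunk.groupby('first')['second'].apply(list).to_dict()
        process(d)

Note that this sketch does not carry the overlap between chunks; rows within 10 of a chunk boundary would still need to be carried over into the next chunk, as in the other answers.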


You do not need a defaultdict unless something strange is happening with the file, and you did not mention anything like that. Instead, use a list, which keeps your data in line order, so you can process it in overlapping slices, like this:

    d = []
    for line in fin:               # iterate over lines, not fin.readline()
        vals = line.split()
        d.append(vals)
        if not len(d) % 100:       # every 100 rows...
            process(d)
            d = d[90:]             # ...keep the last 10 as the overlap
    process(d)                     # final, possibly partial chunk

You could do something like this:

    n_process = 100   # chunk size
    n_overlap = 10    # rows carried over between chunks

    data_chunk = []
    for line in fin:                             # iterate over lines, not fin.readline()
        vals = line.split()
        data_chunk.append(vals)
        if len(data_chunk) == n_process:
            process(data_chunk)
            data_chunk = data_chunk[-n_overlap:] # keep the overlap
    process(data_chunk)                          # final, possibly partial chunk

When using a dictionary, data can be overwritten if the same first-column number occurs several times within a chunk. Note also that you would need an OrderedDict , since a plain dict had no guaranteed iteration order before Python 3.7. In most cases, however, needing an OrderedDict is a sign of bad code.
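
For reference, a tiny illustration of the ordering point (on Python 3.7+ a plain dict also preserves insertion order, but OrderedDict makes the intent explicit):

    from collections import OrderedDict

    d = OrderedDict()
    for key, val in [('1', '1'), ('2', '1'), ('1', '6')]:
        d.setdefault(key, []).append(val)
    print(d)   # OrderedDict([('1', ['1', '6']), ('2', ['1'])])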

And by the way: we still don't know why you are trying to do it this way...


Source: https://habr.com/ru/post/1493512/

