Read multiple files using multiprocessing

I need to read very large text files (100+ MB), process every line with a regular expression and store the data in a structure. My structure inherits from defaultdict and has a read(self) method that reads the file named by self.file_name.

Take a look at this very simple (but not real) example: I am not using a regular expression here, just splitting the lines:

import multiprocessing
from collections import defaultdict

def SingleContainer():
    return list()

class Container(defaultdict):
    """
    This class stores odd lines in self["odd"] and even lines in self["even"].
    It is only an example; in the real case the class has additional methods
    that do computations on the read data.
    """
    def __init__(self, file_name):
        if type(file_name) != str:
            raise AttributeError, "%s is not a string" % file_name
        defaultdict.__init__(self, SingleContainer)
        self.file_name = file_name
        self.readen_lines = 0

    def read(self):
        f = open(self.file_name)
        print "start reading file %s" % self.file_name
        for line in f:
            self.readen_lines += 1
            values = line.split()
            key = {0: "even", 1: "odd"}[self.readen_lines % 2]
            self[key].append(values)
        print "read %d lines from file %s" % (self.readen_lines, self.file_name)

def do(file_name):
    container = Container(file_name)
    container.read()
    return container.items()

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    pool = multiprocessing.Pool(len(file_names))
    result = pool.map(do, file_names)
    pool.close()
    pool.join()
    print "Finish"

In the end I need to combine all the results into one container, and it is important that line order is maintained. My approach is too slow when the workers return their values. What is the best solution? I am using Python 2.6 on Linux.

+4
3 answers

You are probably facing two problems.

One of them has already been mentioned: you are reading several files at once. Those reads will be interleaved, causing disk thrashing. You want to read whole files at once, and then only multithread the computation on the data.
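Roughly, that ordering looks like this (a sketch, not your real parser; note that shipping the raw text to the workers still pays the serialization cost described below):

import multiprocessing

def parse_text(args):
    # Parse lines that were already read in the parent process.
    file_name, text = args
    result = {"even": [], "odd": []}
    for number, line in enumerate(text.splitlines(), 1):
        key = {0: "even", 1: "odd"}[number % 2]
        result[key].append(line.split())
    return file_name, result

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    # Read each file completely, one after another, so the disk head
    # is not forced to jump back and forth between files.
    texts = [(name, open(name).read()) for name in file_names]
    pool = multiprocessing.Pool(len(file_names))
    results = pool.map(parse_text, texts)  # only the parsing runs in parallel
    pool.close()
    pool.join()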

Secondly, you are hitting the overhead of Python's multiprocessing module. It does not actually use threads; instead it starts several processes and serializes the results through a pipe. That is very slow for bulk data - in fact, it seems to be slower than the work you are doing in the worker (at least in this example). This is the real-world problem caused by the GIL.

If I modify do() to return None instead of container.items(), so that the extra copy of the data is never made, this example is faster than the single-process version, as long as the files are already cached:

Two processes: 0:00.36 elapsed, 16% CPU

One process (replacing pool.map with the built-in map): 0:00.52 elapsed, 98% CPU
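The change being timed is just this much (a sketch against the do() from your question; returning None throws the parsed data away, so it is only useful for measuring the cost of sending results back):

def do(file_name):
    container = Container(file_name)
    container.read()
    return None  # nothing gets pickled back through the pipe to the parent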

Unfortunately, the GIL problem is fundamental and cannot be worked around from within Python.

+4

Multiprocessing is better suited to CPU-bound or memory-bound processing, because the seek time of rotating drives kills performance when you switch between files. Either move your log files onto a fast flash drive or some kind of RAM disk (physical or virtual), or give up on multiprocessing.
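Without multiprocessing the whole thing collapses to a plain loop (a sketch reusing the Container class from your question), which also keeps the files in order for free:

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    containers = []
    for file_name in file_names:
        container = Container(file_name)
        container.read()              # files are read one at a time, sequentially
        containers.append(container)  # results stay in file_names order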

0

You are creating a pool with as many workers as there are files. That may be too many. Usually I try to have about as many workers as cores.
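For example (a sketch; it assumes the file_names list from your question, and multiprocessing.cpu_count() can raise NotImplementedError on unusual platforms):

import multiprocessing

workers = min(len(file_names), multiprocessing.cpu_count())
pool = multiprocessing.Pool(workers)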

The simple fact is that your final step is a single process that brings all the results together. That cannot be avoided, given your description of the problem. It is known as barrier synchronization: all tasks must reach the same point before any of them can continue.
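The good news is that pool.map() returns the per-file results in the same order as the input list, so the merge step can preserve line order. A sketch of that final step, reusing the Container class and the result list from your question:

# result comes from pool.map(do, file_names) and is already in file_names order
merged = Container("merged")   # reusing your class just as a collector
for items in result:
    for key, values in items:
        merged[key].extend(values)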

You should probably run this program several times, or in a loop, passing a different value to multiprocessing.Pool() each time, starting at 1 and going up to the number of cores. Time each run and see which worker count works best.
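Something along these lines (a sketch; do() is the function from your question and the timing is rough wall-clock time):

import multiprocessing
import time

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    for workers in range(1, multiprocessing.cpu_count() + 1):
        start = time.time()
        pool = multiprocessing.Pool(workers)
        pool.map(do, file_names)
        pool.close()
        pool.join()
        print "%d workers: %.2f seconds" % (workers, time.time() - start)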

The result will depend on how CPU-intensive your task is compared to how disk-intensive it is. I would not be surprised if 2 turned out to be best, if your task is about half CPU and half disk, even on an 8-core machine.

0

Source: https://habr.com/ru/post/1271949/

