Python: best I/O on a file using os.fork?

My problem is pretty simple: I have a 400 megabyte file containing 10,000,000 rows of data. I need to iterate over each line, do something with it, and then drop it from memory so RAM usage stays low.

Since my machine has multiple processors, my initial idea for speeding this up was to create two separate processes. One would read the file several lines at a time and gradually fill a list (one list item per line of the file). The other would have access to the same list, pop elements off it, and process them. The list would effectively grow on one side and shrink from the other, roughly as in the sketch below.
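Roughly what I have in mind (a minimal sketch only; process_line is a placeholder name of mine, and I use multiprocessing.Queue as the shared buffer rather than a plain list):

# Sketch: a reader process fills a bounded queue, the main process consumes it.
from multiprocessing import Process, Queue

def process_line(line):
    pass  # placeholder for the real per-line work

def reader(path, queue):
    with open(path, 'r') as f:
        for line in f:
            queue.put(line)       # blocks when the buffer is full
    queue.put(None)               # sentinel: no more data

if __name__ == '__main__':
    q = Queue(maxsize=10000)      # bounded buffer keeps memory use small
    p = Process(target=reader, args=('/data/workfile', q))
    p.start()
    while True:
        line = q.get()
        if line is None:
            break
        process_line(line)
    p.join()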

In other words, this mechanism would implement a buffer that is constantly refilled with lines to feed the second process. But perhaps this is no faster than simply using:

for line in open('/data/workfile', 'r'):
+3
4 answers

You are probably limited by the speed of your disk. Python is already buffering, so reading line by line is efficient.
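One way to check whether the disk really is the bottleneck (an illustrative sketch of mine, not part of the answer) is to time a plain sequential pass over the file:

# Sketch: measure throughput of a plain sequential pass over the file.
# If the MB/s figure is close to what the disk can deliver, extra
# processes will not make the read itself any faster.
import os
import time

path = '/data/workfile'
start = time.time()
line_count = 0
with open(path, 'r') as f:
    for line in f:
        line_count += 1           # stand-in for the real per-line work
elapsed = time.time() - start
size_mb = os.path.getsize(path) / (1024 * 1024)
print('%d lines, %.1f MB in %.1f s (%.1f MB/s)'
      % (line_count, size_mb, elapsed, size_mb / elapsed))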

+2

Your suggestion, for line in open('/data/workfile', 'r'):, iterates over the file lazily (the file object is its own iterator), so the whole file is never read into memory. I would go with this until it proves too slow.
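The same idea written with a context manager, so the file handle is closed deterministically (a small sketch; process_line is a placeholder):

# Sketch: lazy, line-by-line iteration inside an explicit with-block,
# so the file is closed even if the per-line work raises an exception.
def process_line(line):
    pass  # placeholder for the real per-line work

with open('/data/workfile', 'r') as f:
    for line in f:                # reads one buffered chunk at a time, never the whole file
        process_line(line)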

+4


The data structure you want is a queue (it already has the appropriate locking, e.g. for concurrent writes), and it is available in the multiprocessing module.

If there are no dependencies between lines, you can map a line iterator onto a process pool using the Pool functions of that module and get multicore processing in just a few lines, as in the sketch below.
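A hedged sketch of that approach, assuming each line can be processed independently (process_line is a placeholder name of mine):

# Sketch: fan lines out to a pool of worker processes with imap,
# which keeps the input lazy instead of loading the whole file.
from multiprocessing import Pool

def process_line(line):
    return len(line)              # placeholder for the real per-line work

if __name__ == '__main__':
    with open('/data/workfile', 'r') as f, Pool() as pool:
        # chunksize batches lines per task to cut inter-process overhead
        for result in pool.imap(process_line, f, chunksize=1000):
            pass                  # collect or aggregate results here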

See also MapReduce approaches (though that may be overkill here).

0

Source: https://habr.com/ru/post/1735622/

