Understanding CPU core usage with the multiprocessing module

I have a simple main() function that processes a huge amount of data. Since I have an 8-core machine with plenty of RAM, I was asked to use Python's multiprocessing module to speed up the processing. Each subprocess will run for about 18 hours.

In short, I'm not sure I have correctly understood the behavior of the multiprocessing module.

I start several subprocesses roughly like this:

    import multiprocessing

    def main():
        data = huge_amount_of_data()
        pool = multiprocessing.Pool(processes=cpu_cores)  # cpu_cores is set to 8, since my CPU has 8 cores
        pool.map(start_process, data_chunk)  # data_chunk is a subset of data

I understand that the script itself runs as its own process, namely the main process, which ends after all subprocesses are completed. Obviously, the main process does not consume many resources, since it only prepares the data and spawns the subprocesses. But will it occupy a core for itself too? That is, would it be better to start only 7 subprocesses instead of the 8 I started above?

The main question: can I create 8 subprocesses and be sure that they will run properly in parallel with each other?

By the way, the subprocesses do not interact with each other in any way, and when each one finishes, it generates its own sqlite database file in which it stores its results. So even result storage is handled separately.
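Since each worker writes its results to its own sqlite file, the worker function might look like the following minimal sketch. Note that `start_process` is the asker's name for the function, the `results_<pid>.db` filename scheme is my own invention, and the squaring is only a stand-in for the real computation:

```python
import os
import sqlite3

def start_process(chunk):
    # Each worker writes to its own database file (named after its PID
    # here), so the workers never need to communicate with each other.
    db = sqlite3.connect(f"results_{os.getpid()}.db")
    db.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
    # Squaring is only a placeholder for the real long-running computation.
    db.executemany("INSERT INTO results (value) VALUES (?)",
                   [(v * v,) for v in chunk])
    db.commit()
    db.close()
```

Because every worker opens a distinct file, there is no lock contention between the databases.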

What I want to avoid is creating a process that prevents the others from running at full speed. I need the code to finish in approximately 16 hours, not twice that because I started more processes than there are cores. :-)

2 answers

As an aside: if you create a Pool without arguments, it automatically uses the number of available cores, as returned by cpu_count().
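A quick sketch of that default (`square` is just a demo function):

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == "__main__":
    print(multiprocessing.cpu_count())    # the number of workers Pool() will use
    with multiprocessing.Pool() as pool:  # same as Pool(processes=cpu_count())
        print(pool.map(square, [1, 2, 3]))  # -> [1, 4, 9]
```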

In any modern multitasking OS, no program can, as a rule, monopolize a core and prevent other programs from running on it.

How many workers you should start depends on the characteristics of your start_process function. The number of cores is not the only consideration.

If each worker process uses, for example, 1/4 of the available memory, starting more than 3 will lead to heavy swapping and an overall slowdown. This condition is called being "memory bound".

If the workers do things other than pure computation (for example, reading from or writing to disk), they will spend a lot of time waiting (since the disk is much slower than RAM; this is called being "IO bound"). In that case it may make sense to run more than one worker per core.

If the workers are neither memory bound nor IO bound, they are limited by the number of cores.


The OS controls which processes are assigned to which core, and since other applications' processes are also running, you cannot guarantee that all 8 cores are available to your application.

The main thread runs in its own process, but since the map() call blocks, that process will simply sit blocked and will not consume a CPU core.
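This is easy to see by comparing map() with map_async(): map() blocks the main process until all workers finish (while consuming essentially no CPU), whereas map_async() returns immediately so the main process can do other work before collecting the results (`square` below is a stand-in for the real worker function):

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        # map() blocks here until every chunk is processed.
        print(pool.map(square, range(4)))       # -> [0, 1, 4, 9]

        # map_async() returns immediately; the main process is free
        # to do other work and only blocks when it calls .get().
        async_result = pool.map_async(square, range(4))
        # ... other work could happen here ...
        print(async_result.get())               # -> [0, 1, 4, 9]
```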


Source: https://habr.com/ru/post/1398484/

