In the world of distribution, there is one thing you must remember above all:
Premature optimization is the root of all evil. (D. Knuth)
I know this seems obvious, but before distributing, double-check that you are using the best algorithm (if one exists ...). Having said that, optimizing distribution is a balancing act between three things:
- Writing/reading data to and from persistent storage (1),
- Moving data from environment A to environment B (2),
- Processing data (3).
Computers are built so that the closer you get to your processing unit (3), the faster (1) and (2) become. The order in a classic cluster will be: network hard drive, local hard drive, RAM, inside the processing unit ... Nowadays processors have become complex enough to be considered an ensemble of independent hardware processing units, commonly called cores; these cores process data (3) through threads (2). Imagine that a core is so fast that when you send data with one thread you use 50% of the processor; if the core handles 2 threads, you will use 100%. Two threads per core is called hyper-threading, and your OS will see 2 logical processors per hyper-threaded core.
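You can check how many of these logical processors your OS ends up seeing; a quick sketch using the standard library (note that os.cpu_count() reports logical processors, which with hyper-threading is typically twice the physical core count, though that is hardware-dependent):

```python
import os

# Logical processors as seen by the OS; with hyper-threading this is
# usually 2x the number of physical cores (varies by CPU).
print(os.cpu_count())
```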
Managing threads within a processor is commonly called multithreading. Managing processors from the OS is commonly called multiprocessing. Managing concurrent tasks in a cluster is commonly called concurrent programming. Managing dependent tasks in a cluster is commonly called distributed programming.
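As a minimal illustration of multiprocessing on a single machine, Python's standard multiprocessing.Pool hands CPU-bound work to one OS process per worker (the square() helper and the pool size here are illustrative, not part of the answer):

```python
from multiprocessing import Pool

def square(n):
    # CPU-bound work, executed in a separate OS process
    return n * n

if __name__ == "__main__":
    # One OS process per worker; the OS schedules them across cores
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```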
So where is your bottleneck?
- In (1): try to save and stream from the upper level (the one closer to your processing unit; for example, if the network hard drive is slow, save to the local hard drive first).
- In (2): this is the most common one. Try to avoid communication packets that are not needed for the distribution, or compress packets "on the fly" (for example, if the hard drive is slow, save only a "batch computed" message and keep the intermediate results in RAM).
- In (3): you are done! You are using all the computing power at your disposal.
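The advice for (2) can be sketched with on-the-fly compression: the payload below is hypothetical, but it shows how shrinking a message before it crosses a slow link cuts the communication cost.

```python
import json
import zlib

# Hypothetical batch of intermediate results that must cross a slow link (2)
payload = json.dumps({"batch": list(range(1000))}).encode()

# Compress "on the fly" before sending
packed = zlib.compress(payload)
print(len(payload), ">", len(packed))  # the wire message is much smaller

# The receiving side restores the original data
restored = json.loads(zlib.decompress(packed))
assert restored["batch"][-1] == 999
```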
So what about Celery?
Celery is a messaging framework for distributed programming that uses a broker module for communication (2) and a backend module for persistence (1). This means that, by changing the configuration, you can avoid most bottlenecks (if possible) on your network, and only on your network. First tune your code to achieve the best performance on a single computer. Then use Celery in your cluster with the default configuration and set CELERY_RESULT_PERSISTENT=True :
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//', backend='redis://localhost')

@app.task
def process_id(all_the_data_parameters_needed_to_process_in_this_computer):
    ...  # your single-computer processing goes here
At runtime, open your favorite monitoring tools; I use the default RabbitMQ console, Flower for Celery, and top for the processors. Your results will be saved in your backend. An example of a network bottleneck: the task queues grow so large that they delay execution. You can then change modules or the Celery configuration; otherwise, your bottleneck is somewhere else.