The entire DataFrame must be pickled and unpickled for each process created by joblib. In practice this is very slow, and it also multiplies the memory footprint, since each worker holds its own copy of the data.
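A minimal sketch of the cost involved, using the standard library's `pickle` directly (this is essentially what joblib does under the hood when a task closes over the frame; the frame size here is made up for illustration):

```python
import pickle

import numpy as np
import pandas as pd

# A moderately sized frame: 100k rows x 4 float64 columns (~3.2 MB raw).
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list("abcd"))

# Every joblib task that references df triggers a serialization like this,
# and every worker process deserializes its own full copy.
payload = pickle.dumps(df)
roundtrip = pickle.loads(payload)

print(len(payload))  # several megabytes, paid once per dispatched task
```

With n workers you pay this serialization n times and hold roughly n extra copies of the data in memory, which is where both the slowness and the memory blow-up come from.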
One solution is to save your data in HDF format ( df.to_hdf ) using the table format. You can then use select to pull subsets of the data for processing. In practice, even this may be too slow for interactive use. It is also fairly fiddly: your workers have to save their results so they can be consolidated in a final step.
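A sketch of that pattern, assuming PyTables is installed (the file path, column names, and the `group == 'a'` predicate are invented for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"group": np.repeat(["a", "b", "c", "d"], 25),
     "value": np.random.randn(100)}
)

path = os.path.join(tempfile.mkdtemp(), "data.h5")

# format="table" stores the frame in a queryable layout; data_columns
# makes "group" usable in a where= predicate.
df.to_hdf(path, key="df", format="table", data_columns=["group"])

# Each worker can then read only the slice it needs instead of the
# whole frame:
subset = pd.read_hdf(path, key="df", where="group == 'a'")
```

Each worker would write its partial result back to disk (e.g. to its own key in the store), and a final step would read and concatenate them.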
An alternative is to look at numba.vectorize with target='parallel' . This requires working with raw NumPy arrays rather than pandas objects, so it brings its own difficulties.
In the long run, dask aims to parallelize pandas, but that should not be expected in the near future.