Parallel data processing with a big Pandas DataFrame

I have a very large Pandas DataFrame that I reference as a global variable. The variable is accessed in parallel via joblib.

For example:

    from joblib import Parallel, delayed

    df = db.query("select id, a_lot_of_data from table")

    def process(id):
        temp_df = df.loc[id]
        temp_df.apply(another_function)

    Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df this way appears to copy the data to every process. Is that expected, given that the original df is not modified in any of the subprocesses?

2 answers

The entire DataFrame has to be pickled and unpickled for each process created by joblib. In practice this is very slow, and it also multiplies the memory requirement, since each process holds its own copy.
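
To get a feel for the cost being described, one quick check is the size of the pickled frame. A rough sketch, with a made-up toy frame standing in for the query result:

    import pickle
    import numpy as np
    import pandas as pd

    # Toy frame standing in for the large query result (hypothetical sizes).
    df = pd.DataFrame({'id': np.arange(1_000_000),
                       'a_lot_of_data': np.random.rand(1_000_000)})

    # Roughly what joblib has to serialize and ship to every worker process.
    payload = pickle.dumps(df)
    print(f"pickled size: {len(payload) / 1e6:.1f} MB")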

One solution is to store your data in HDF format ( df.to_hdf ) using the table format. You can then use select to pull out a subset of the data for further processing. In practice, this will be too slow for interactive use. It is also quite complex, and your workers would need to save their results so they can be consolidated in a final step.
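
A minimal sketch of that approach, assuming a file named data.h5, a key of 'df', and per-id selection; db and another_function are carried over from the question:

    import pandas as pd
    from joblib import Parallel, delayed

    # Write the frame once in the queryable "table" format, indexing the id column.
    df.to_hdf('data.h5', key='df', format='table', data_columns=['id'])

    def process(one_id):
        # Each worker reads only the rows it needs instead of receiving a pickled copy.
        subset = pd.read_hdf('data.h5', key='df', where=f'id == {one_id}')
        return subset.apply(another_function)

    # Workers return their results so they can be consolidated at the end.
    results = Parallel(n_jobs=8)(delayed(process)(i) for i in df['id'].unique())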

An alternative would be to look into numba.vectorize with target='parallel' . This requires working with NumPy arrays rather than Pandas objects, so it also comes with some complexity costs.
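
A rough sketch of what that could look like; the element-wise transform below is a hypothetical stand-in for another_function:

    import numpy as np
    from numba import vectorize

    # Hypothetical element-wise function; numba compiles it and spreads the work across cores.
    @vectorize(['float64(float64)'], target='parallel')
    def transform(x):
        return x * 2.0 + 1.0

    # Operates on plain NumPy arrays, so pull the column out of the DataFrame first.
    values = df['a_lot_of_data'].to_numpy()
    result = transform(values)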

In the longer term, dask aims to provide a parallelized Pandas, but don't expect it any time soon.
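
For reference, if and when that lands, the same pattern might look roughly like this in dask.dataframe (a sketch only; df and another_function are carried over from the question):

    import dask.dataframe as dd

    # Partition the in-memory frame and apply the per-element work partition by partition.
    ddf = dd.from_pandas(df, npartitions=8)
    result = ddf['a_lot_of_data'].map_partitions(
        lambda part: part.apply(another_function)).compute()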


Python multiprocessing is typically done with separate processes, as you noted, which means the processes do not share memory. There is a potential workaround if you can get things to work with np.memmap , as mentioned a bit further down in the joblib docs, although dumping to disk will obviously add some overhead: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
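
A minimal sketch of that workaround, assuming the heavy column can be pulled out as a NumPy array; the file path and another_function are placeholders:

    import os
    import tempfile
    import numpy as np
    from joblib import Parallel, delayed, dump, load

    # Dump the numeric payload to disk once, then reopen it memory-mapped (read-only),
    # so worker processes share the same pages instead of each receiving a pickled copy.
    data = df['a_lot_of_data'].to_numpy()
    mmap_path = os.path.join(tempfile.mkdtemp(), 'data.mmap')
    dump(data, mmap_path)
    shared = load(mmap_path, mmap_mode='r')

    def process(i):
        return another_function(shared[i])

    results = Parallel(n_jobs=8)(delayed(process)(i) for i in range(len(shared)))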


Source: https://habr.com/ru/post/1235557/
