Yes, dask.dataframe can read in parallel. However, you are running into two problems:
pandas.read_csv only partially releases the GIL

By default, dask.dataframe parallelizes with threads, because most of pandas can run in parallel across multiple threads (it releases the GIL). pandas.read_csv is an exception, especially if your resulting dataframes use object dtypes for text.
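If the GIL is the bottleneck, you can ask dask to schedule work on processes instead of threads. This is a minimal sketch, assuming a reasonably recent dask where dask.config controls the default scheduler; the file pattern here is a made-up placeholder:

    import dask
    import dask.dataframe as dd

    # Use the process-based scheduler instead of the default threaded one,
    # so each pandas.read_csv call runs in its own process with its own GIL
    dask.config.set(scheduler="processes")

    df = dd.read_csv("data/*.csv")  # hypothetical file pattern
    print(len(df))                  # triggers the parallel read and counts rows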
dask.dataframe.to_hdf(filename) forces sequential computation

Writing to a single HDF file forces the computation to run sequentially (it is very hard to write to a single file in parallel).
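If you do want to stay with HDF, one way around the single-file bottleneck is to write one file per partition. A minimal sketch, assuming dask.dataframe.to_hdf's '*' filename placeholder (expanded to the partition number); the paths and key below are made up for illustration:

    import dask.dataframe as dd

    df = dd.read_csv("data/*.csv")  # hypothetical input pattern

    # df.to_hdf("measurements.hdf", "/data")   # single file: writes are serialized
    df.to_hdf("measurements-*.hdf", "/data")   # one file per partition: writes can proceed independently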
Edit: new solution
Today I would avoid HDF and use Parquet instead. I would probably use the multiprocessing or dask.distributed schedulers to avoid GIL problems on a single machine. The combination of these two should give you full linear scaling.
    from dask.distributed import Client
    import dask.dataframe

    client = Client()

    df = dask.dataframe.read_csv(...)
    df.to_parquet(...)
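A nice side effect of Parquet is that it also reads back in parallel, either as a dask dataframe or directly into pandas. A minimal sketch, assuming a hypothetical output path and a Parquet engine (pyarrow or fastparquet) installed:

    import dask.dataframe as dd
    import pandas as pd

    # Read the Parquet dataset back lazily, in parallel
    df = dd.read_parquet("measurements.parquet")

    # Or load it straight into an in-memory pandas DataFrame
    pdf = pd.read_parquet("measurements.parquet")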
Solution
Because your dataset likely fits in memory, use dask.dataframe.read_csv to load it in parallel with multiple processes, then switch immediately to pandas.
    import dask.dataframe as ddf
    import dask.multiprocessing

    df = ddf.read_csv("data/Measurements*.csv",
                      # ... remaining read_csv arguments elided in the original
                      )

    # Convert to an in-memory pandas DataFrame, computing with processes
    df = df.compute(get=dask.multiprocessing.get)
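Note that the get= keyword above is the scheduler API from older dask releases; in current dask the same thing is expressed with the scheduler= keyword. A minimal equivalent sketch under that assumption:

    import dask.dataframe as ddf

    df = ddf.read_csv("data/Measurements*.csv")  # other read_csv arguments omitted, as above

    # Compute with the process-based scheduler and get back a pandas DataFrame
    df = df.compute(scheduler="processes")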