Storing pandas DataFrame with mixed data and category in hdf5

I want to store a data frame with columns of different types in an hdf5 file (see the excerpt with the data types below).

    In [1]: mydf.dtypes
    Out[1]:
    endTime          uint32
    distance        float16
    signature      category
    anchorName     category
    stationList      object

Before converting some columns to category (signature and anchorName in my excerpt above), I used the following code to save it (which works very well):

    path = 'tmp4.hdf5'
    key = 'journeys'
    mydf.to_hdf(path, key, mode='w', complevel=9, complib='bzip2')

But this does not work with category columns, so I tried the following:

    path = 'tmp4.hdf5'
    key = 'journeys'
    mydf.to_hdf(path, key, mode='w', format='t', complevel=9, complib='bzip2')

It works fine if I delete the column stationList, where each entry is a list of strings. But with this column, I get the following exception:

    Cannot serialize the column [stationList] because its data contents are [mixed] object dtype

How can I change the code so that the data frame gets saved?

pandas version: 0.17.1
python version: 2.7.6 (cannot change it due to convenience reasons)


edit1 (some sample code):

    import pandas as pd

    mydf = pd.DataFrame({
        'endTime': pd.Series([1443525810, 1443540836, 1443609470]),
        'distance': pd.Series([454.75, 477.25, 242.12]),
        'signature': pd.Series(['ab', 'cd', 'ab']),
        'anchorName': pd.Series(['tec', 'ing', 'pol']),
        'stationList': pd.Series([['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']]),
    })

    # this works fine (no category)
    mydf.to_hdf('tmp_without_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

    for col in ['anchorName', 'signature']:
        mydf[col] = mydf[col].astype('category')

    # this crashes now because of category data
    # mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

    # switching to format='t'
    # this caused problems because of "mixed data" in column stationList
    mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

    mydf.pop('stationList')

    # this again works fine
    mydf.to_hdf('tmp_with_cat_without_stationList.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

edit2: Meanwhile, I tried different things to get rid of this problem. One of them was to convert the entries of the stationList column to tuples (reasonable anyway, since they should not be modified) and then convert the column to a category as well. But that didn't change anything. Here are the lines I added after the conversion loop (for completeness only):

    mydf.stationList = [tuple(x) for x in mydf.stationList.values]
    mydf.stationList.astype('category')
1 answer

You have two problems:

  • You want to save categorical data in an HDF5 file;
  • You are trying to save arbitrary objects (i.e. stationList) in an HDF5 file.

As you have discovered, categorical data is (currently?) only supported in the table format for HDF5.

However, storing arbitrary objects (lists of strings, etc.) is not really something the HDF5 format supports. Pandas works around this by serializing these objects with pickle and then storing the pickle as an arbitrary-length string (which, I believe, is not supported by every HDF5 format). But this will be slow and inefficient, and it will never be well supported by HDF5.
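To see why the table format rejects the column, it can help to look at how pandas classifies its contents. The sketch below assumes a newer pandas than the 0.17.1 from the question (one that provides `pandas.api.types.infer_dtype`); the variable names are only illustrative. A column of plain strings is reported as a string column, while a column whose entries are Python lists falls into the "mixed" bucket named in the exception.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# A column of plain strings, like 'signature' before the category conversion.
signature = pd.Series(['ab', 'cd', 'ab'])

# A column whose entries are Python lists, like 'stationList' in the question.
station_list = pd.Series([['t1', 't2', 't3'],
                          ['4', 't2', 't3'],
                          ['t3', 't2', 't4']])

# Both columns have dtype object, so the dtype alone does not tell them apart.
signature_dtype = signature.dtype
station_list_dtype = station_list.dtype

# infer_dtype inspects the actual values stored in the column.
signature_kind = infer_dtype(signature)
station_list_kind = infer_dtype(station_list)

print(signature_kind, station_list_kind)
```

Converting the lists to tuples does not change this classification, which is consistent with what edit2 of the question observed.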

In my opinion, you have two options:

  • Pivot your data so that there is one row per station name. Then you can store everything in HDF5 using the table format. (This is good practice in general; see Hadley Wickham on Tidy Data.)
  • If you really want to keep this layout, you can instead save the entire data frame with to_pickle(). It will have no problem with any object (for example, a list of strings) that you throw at it.
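The second option can be sketched as a simple round trip. This is only an illustration on the sample data from the question; the temporary directory and the file name `journeys.pkl` are assumptions made so the example does not leave files behind.

```python
import os
import tempfile

import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})
for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# Hypothetical location, chosen only for this example.
pkl_path = os.path.join(tempfile.mkdtemp(), 'journeys.pkl')

# Pickle stores the frame as-is, including the list-valued column.
mydf.to_pickle(pkl_path)
restored = pd.read_pickle(pkl_path)

print(restored.dtypes)
```

The trade-off is that a pickle is Python-specific and has to be loaded in full, so it gives up the querying and compression features that motivated HDF5 in the first place.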

Personally, I would recommend option 1. It lets you use a fast, binary file format, and the reshaped layout will also make other operations on your data easier.
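A minimal sketch of option 1 on the sample data from the question follows. The helper names `to_long_format` and `save_long`, and the new column names `journey`, `position` and `station`, are made up for this illustration. The reshape deliberately sticks to basic indexing so that it does not depend on newer conveniences such as `DataFrame.explode`; I have not tested it on pandas 0.17.1 specifically.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})
for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')


def to_long_format(df, list_col='stationList', value_col='station'):
    """Return one row per (journey, station) instead of one list per journey."""
    lengths = [len(stations) for stations in df[list_col]]
    # Row number of the original journey, repeated once per station in its list.
    rows = np.repeat(np.arange(len(df)), lengths)
    out = df.drop(list_col, axis=1).iloc[rows].reset_index(drop=True)
    out['journey'] = df.index.values[rows]
    # Position of the station within its original list, to keep the order.
    out['position'] = [i for n in lengths for i in range(n)]
    out[value_col] = pd.Series(
        [s for stations in df[list_col] for s in stations]
    ).astype('category')
    return out


def save_long(df_long, path, key='journeys'):
    """Write the long frame in table format (needs PyTables installed)."""
    df_long.to_hdf(path, key=key, mode='w', format='table',
                   complevel=9, complib='bzip2')


long_df = to_long_format(mydf)
print(long_df)
```

With PyTables available, `save_long(long_df, 'tmp_long.hdf5')` should then go through the same table-format path that already worked in the question once stationList was removed, since the frame no longer contains list-valued cells. The original lists can be rebuilt by grouping on `journey` and sorting by `position`.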


Source: https://habr.com/ru/post/1242198/

