Can I update an HDFStore?

Consider the following HDFStore and the dataframes df and df2:

    import pandas as pd
    store = pd.HDFStore('test.h5')

    midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
    df = pd.DataFrame(dict(C=range(6)), midx)
    df

         C
    A B
    0 X  0
      Y  1
      Z  2
    1 X  3
      Y  4
      Z  5

    midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
    df2 = pd.DataFrame(dict(C=range(6)), midx2)
    df2

         C
    A B
    0 V  0
      W  1
      X  2
    1 V  3
      W  4
      X  5

I want to write df to the store first.

    store.append('df', df)
    store.get('df')

         C
    A B
    0 X  0
      Y  1
      Z  2
    1 X  3
      Y  4
      Z  5

At a later point in time, I will have another dataframe with which I want to update the store. I want to overwrite the rows whose index values also appear in my new dataframe, while keeping the remaining old rows.

When I do

    store.append('df', df2)
    store.get('df')

         C
    A B
    0 X  0
      Y  1
      Z  2
    1 X  3
      Y  4
      Z  5
    0 V  0
      W  1
      X  2
    1 V  3
      W  4
      X  5

This is not at all what I want. Note that (0, 'X') and (1, 'X') are repeated. I could load everything, manipulate the combined dataframe and overwrite the store, but I expect to be working with much more data, where that may not be feasible.

How do I update the store to get this?

         C
    A B
    0 V  0
      W  1
      X  2
      Y  1
      Z  2
    1 V  3
      W  4
      X  5
      Y  4
      Z  5

You will see that for each value of level 'A', 'Y' and 'Z' are unchanged, 'V' and 'W' are new, and 'X' is updated.

What is the right way to do this?

1 answer

Idea: first delete the matching rows (those with the corresponding index values) from the HDFStore, and then append df2 to the HDFStore.
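To make the idea concrete, here is a minimal sketch of the same delete-then-append logic applied to the in-memory frames from the question rather than to the store. It only illustrates the intended result; the names kept and updated are made up for this example.

```python
import pandas as pd

# The two frames from the question.
midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)

# Step 1: drop the old rows whose index also appears in df2.
kept = df[~df.index.isin(df2.index)]

# Step 2: append the new rows; sorting is only for readability.
updated = pd.concat([kept, df2]).sort_index()
print(updated)
```

This loads everything into memory, which is what the question wants to avoid for large data; the rest of this answer does the same two steps directly against the HDFStore.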

Problem: I could not find a way to use where="index in df2.index" with a MultiIndex.

Solution: first convert the MultiIndexes to regular (flat) indexes:

    df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)
    df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)

This gives:

    In [348]: df
    Out[348]:
         C
    0_X  0
    0_Y  1
    0_Z  2
    1_X  3
    1_Y  4
    1_Z  5

    In [349]: df2
    Out[349]:
         C
    0_V  0
    0_W  1
    0_X  2
    1_V  3
    1_W  4
    1_X  5
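If the flattening has to be applied to several frames, it could be wrapped in a small helper. This is only a sketch: flatten_index and its sep parameter are hypothetical names, and it assumes the separator never occurs inside the level values (otherwise the flat keys become ambiguous).

```python
import pandas as pd

def flatten_index(frame, sep='_'):
    """Return a copy of `frame` whose MultiIndex keys are joined into strings."""
    out = frame.copy()
    # Iterating a MultiIndex yields one tuple per row, e.g. (0, 'X').
    out.index = [sep.join(map(str, key)) for key in frame.index]
    return out

midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)

flat = flatten_index(df)
print(flat)
```

Unlike the in-place assignment above, the helper leaves the original frame untouched, so the MultiIndex version is still available afterwards.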

Make sure you use format='t' and data_columns=True when you create / append to the HDF5 file (this indexes the index and all the columns in the HDF5 file, which allows us to use them in the where clause):

    store = pd.HDFStore('d:/temp/test1.h5')
    store.append('df', df, format='t', data_columns=True)
    store.close()

Now we can first remove the rows with matching index values from the HDFStore:

    store = pd.HDFStore('d:/temp/test1.h5')

    In [345]: store.remove('df', where="index in df2.index")
    Out[345]: 2

and then append df2:

    In [346]: store.append('df', df2, format='t', data_columns=True, append=True)

Result:

    In [347]: store.get('df')
    Out[347]:
         C
    0_Y  1
    0_Z  2
    1_Y  4
    1_Z  5
    0_V  0
    0_W  1
    0_X  2
    1_V  3
    1_W  4
    1_X  5
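The stored frame now has flat string keys in append order, not the MultiIndex layout asked for in the question. A possible way to get that layout back after reading is sketched below. The helper restore_index is hypothetical, the frame is typed in by hand to match the output above, and the sketch assumes level 'A' holds integers and that '_' does not occur in the level values.

```python
import pandas as pd

# Stand-in for the result of store.get('df') shown above.
flat = pd.DataFrame(
    {'C': [1, 2, 4, 5, 0, 1, 2, 3, 4, 5]},
    index=['0_Y', '0_Z', '1_Y', '1_Z', '0_V', '0_W', '0_X', '1_V', '1_W', '1_X'],
)

def restore_index(frame, names=('A', 'B'), sep='_'):
    """Split 'A_B' string keys back into a sorted two-level MultiIndex."""
    out = frame.copy()
    pairs = [label.split(sep, 1) for label in frame.index]
    out.index = pd.MultiIndex.from_tuples(
        [(int(a), b) for a, b in pairs],  # assumption: level 'A' is integer
        names=list(names),
    )
    return out.sort_index()

restored = restore_index(flat)
print(restored)
```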

Source: https://habr.com/ru/post/1270796/

