Save additional attributes in a pandas DataFrame

I remember from my MATLAB days using structured arrays, in which you could store different data as attributes of the main structure. Something like:

    a = struct;
    a.A = magic(10);
    a.B = magic(50);
    % etc.

where a.A and a.B are completely separate from each other, letting you store different types inside a and work with them as desired. Pandas allows us to do something similar, but not quite the same.

I use pandas and would like to store attributes of a dataframe without actually putting them inside the dataframe itself. This can be done with:

    import pandas as pd
    a = pd.DataFrame(data=pd.np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))
    # now store an attribute of <a>
    a.local_tz = 'US/Eastern'

Now the local time zone is stored in a, but I cannot save this attribute when I save the dataframe (i.e., after saving and reloading a, there is no a.local_tz). Is there any way to keep these attributes?
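A quick way to see the problem (a minimal sketch, not from the original post): the attribute lives only on that one object, and operations that return new frames silently drop it:

```python
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3]})
a.local_tz = 'US/Eastern'        # stored on the instance, not in the data

b = a.copy()                     # most pandas operations return new frames...
print(hasattr(b, 'local_tz'))    # ...that do not carry the attribute: False
```

So even before serialization, anything that creates a new DataFrame already loses the attribute.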

I am currently just creating new columns in the dataframe to store information such as time zone, latitude, longitude, etc., but this seems wasteful. Moreover, when I do data analysis, I run into problems excluding these extra columns.

EDIT:

Following unutbu's recommendations, I now store the data in HDF5 format. As mentioned, loading metadata back as attributes of the dataframe is risky. However, since I am the creator of these files (and the processing algorithms), I can choose what is stored as metadata and what is not. When processing the data that will go into the H5 files, I store the metadata in a dictionary that is initialized as an attribute of my classes. I made a simple I/O class to import the H5 data, and made the metadata attributes of the class. Now I can work with my dataframes without the risk of losing the metadata.

    class IO():

        def __init__(self):
            self.dtfrmt = 'dummy_str'

        def h5load(self, filename, update=False):
            '''h5load loads the stored HDF5 file. Both the dataframe (actual
            data) and the associated metadata are stored in the H5 file.

            NOTE: this does not load *any* H5 file; it loads H5 files
            specifically created to hold dataframe data and metadata.

            When multi-indexed dataframes are stored in the H5 format, the
            date values (previously initialized with timezone information)
            lose their timezone localization. Therefore, <h5load>
            re-localizes the 'DATE' index as UTC.

            Parameters
            ----------
            filename : string/path
                Path and filename of the H5 file to be loaded. The H5 file
                must have been created using <h5store> below.
            update : boolean True/False, default False
                If the stored dataframe is to be updated, it is imported
                slightly differently. If update==True, <metadata> is
                returned as a stand-alone dictionary and <data> as a
                dataframe (i.e., NOT as attributes of the IO() instance).
                Otherwise, if False, <metadata> is returned as an attribute
                of the class instance.

            Output
            ------
            data : pandas dataframe with attributes
                The dataframe contains only the data as collected by the
                instrument. Any metadata (e.g., timezone, scaling factor,
                basically anything that is constant throughout the file) is
                stored as an attribute (e.g., lat is stored as <data.lat>).'''
            with pd.HDFStore(filename, 'r') as store:
                self.data = store['mydata']
                self.metadata = store.get_storer('mydata').attrs.metadata

            # metadata gets stored as attributes of <self>:
            # put metadata into the instance as attributes
            for r in self.metadata:
                setattr(self, r, self.metadata[r])

            # unscale data
            self.data, self.metadata = unscale(self.data, self.metadata,
                                               stringcols=['routine', 'date'])

            # when pandas stores multi-index dataframes as H5 files, the
            # timezone initialization is lost. Remake the index with the
            # timezone initialized: only for multi-indexed dataframes
            if isinstance(self.data.index, pd.core.index.MultiIndex):
                # list index-level names and identify the 'DATE' level
                namen = self.data.index.names
                date_lev = namen.index('DATE')

                # extract index as a list and remake the tuples with the
                # timezone initialized
                new_index = pd.MultiIndex.tolist(self.data.index)
                for r in xrange(len(new_index)):
                    tmp = list(new_index[r])
                    tmp[date_lev] = utc.localize(tmp[date_lev])
                    new_index[r] = tuple(tmp)

                # reset the multi-index
                self.data.index = pd.MultiIndex.from_tuples(new_index,
                                                            names=namen)

            if update:
                return self.metadata, self.data
            else:
                return self

        def h5store(self, data, filename, **kwargs):
            '''h5store stores the dataframe as an HDF5 file. Both the
            dataframe (actual data) and the associated metadata are stored
            in the H5 file.

            Parameters
            ----------
            data : pandas dataframe, NOT a class instance
                Must be a dataframe, not a class instance (i.e., if the
                dataframe is stored under data.data, then the input
                variable must be data.data).
            filename : string/path
                Path and filename of the H5 file to be written.
            **kwargs : dictionary
                Dictionary containing the metadata information.

            Output
            ------
            None : only saves data to file'''
            with pd.HDFStore(filename, 'w') as store:
                store.put('mydata', data)
                store.get_storer('mydata').attrs.metadata = kwargs

H5 files are then loaded via data = IO().h5load('filename.h5'). The dataframe is stored under data.data; I keep the metadata dictionary under data.metadata and also create separate metadata attributes (for example, data.lat is created from data.metadata['lat']).

My index timestamps are localized with pytz.utc. However, when a multi-indexed dataframe is stored to H5, the timezone localization is lost (using pandas 0.15.2), so I correct for this in IO().h5load.
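In newer pandas versions the tuple-by-tuple loop above can be avoided: the whole 'DATE' level can be re-localized at once with MultiIndex.set_levels. A minimal sketch (the frame here is made up for illustration):

```python
import pandas as pd

# hypothetical frame whose naive 'DATE' level lost its timezone on reload
idx = pd.MultiIndex.from_product(
    [['site1'], pd.to_datetime(['2015-03-18 12:00', '2015-03-18 13:00'])],
    names=['SITE', 'DATE'])
df = pd.DataFrame({'value': [1.0, 2.0]}, index=idx)

# re-localize the whole DATE level to UTC in one step
date_lev = df.index.names.index('DATE')
df.index = df.index.set_levels(
    df.index.levels[date_lev].tz_localize('UTC'), level='DATE')

print(df.index.get_level_values('DATE').tz)    # UTC
```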

+9
pandas
Mar 18 '15 at 17:43
2 answers

There is an open issue regarding the storage of custom metadata in NDFrames. But due to the many ways pandas functions can return DataFrames, the _metadata attribute is not (yet) preserved in all situations.

For now, you'll just have to store the metadata in an auxiliary variable.
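That said, the _metadata mechanism can be tried today by subclassing DataFrame, in which case simple operations propagate the attribute via __finalize__ (though, per the caveat above, not every operation does). A minimal sketch, with the class name and attribute made up for illustration:

```python
import pandas as pd

class MetaFrame(pd.DataFrame):
    # tell pandas to carry this attribute through __finalize__
    _metadata = ['local_tz']

    @property
    def _constructor(self):
        # keep selections returning MetaFrame, not plain DataFrame
        return MetaFrame

df = MetaFrame({'A': [1, 2], 'B': [3, 4]})
df.local_tz = 'US/Eastern'

sub = df[['A']]          # a simple selection propagates the attribute
print(sub.local_tz)      # US/Eastern
```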

There are many options for storing DataFrames plus metadata to files, depending on which format you want to use: pickle, JSON, and HDF5 are all possibilities.
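For instance, the JSON route can be as simple as a sidecar file next to the data. A minimal sketch (the helper names save_with_meta/load_with_meta are made up for illustration):

```python
import json
import os
import tempfile

import pandas as pd

def save_with_meta(df, path, **metadata):
    # data in one file, metadata in a small JSON sidecar next to it
    df.to_csv(path, index=False)
    with open(path + '.meta.json', 'w') as f:
        json.dump(metadata, f)

def load_with_meta(path):
    with open(path + '.meta.json') as f:
        metadata = json.load(f)
    return pd.read_csv(path), metadata

path = os.path.join(tempfile.gettempdir(), 'data.csv')
save_with_meta(pd.DataFrame({'A': [1, 2]}), path, local_tz='US/Eastern')
df, meta = load_with_meta(path)
print(meta)     # {'local_tz': 'US/Eastern'}
```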

Here's how you can store and load a DataFrame with metadata using HDF5. The recipe for storing the metadata comes from the Pandas Cookbook.

    import numpy as np
    import pandas as pd

    def h5store(filename, df, **kwargs):
        store = pd.HDFStore(filename)
        store.put('mydata', df)
        store.get_storer('mydata').attrs.metadata = kwargs
        store.close()

    def h5load(store):
        data = store['mydata']
        metadata = store.get_storer('mydata').attrs.metadata
        return data, metadata

    a = pd.DataFrame(
        data=np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))
    filename = '/tmp/data.h5'
    metadata = dict(local_tz='US/Eastern')
    h5store(filename, a, **metadata)
    with pd.HDFStore(filename) as store:
        data, metadata = h5load(store)

    print(data)
    #     A   B   C   E   D
    # 0   9  20  92  43  25
    # 1   2  64  54   0  63
    # 2  22  42   3  83  81
    # 3   3  71  17  64  53
    # 4  52  10  41  22  43
    # 5  48  85  96  72  88
    # 6  10  47   2  10  78
    # 7  30  80   3  59  16
    # 8  13  52  98  79  65
    # 9   6  93  55  40   3



 print(metadata) 

gives

 {'local_tz': 'US/Eastern'} 
+20
Mar 18 '15 at 18:39

The approach I use is to add extra MultiIndex levels to store the extra information I want (I use the columns here, but either axis would work). All of the columns share the same value for each of these extra parameters. This is also useful since I can concatenate multiple dataframes or pull out individual columns, and these values are preserved.

    >>> col = pd.MultiIndex.from_product(
    ...     [['US/Eastern'], ['A', 'B', 'C', 'E', 'D']],
    ...     names=['local_tz', 'name'])
    >>> a = pd.DataFrame(data=pd.np.random.randint(0, 100, (10, 5)), columns=col)
    >>> print(a)
    local_tz US/Eastern
    name              A   B   C   E   D
    0                38  93  63  24  55
    1                21  25  84  98  62
    2                 4  60  78   0   5
    3                26  50  82  89  23
    4                32  70  80  90   1
    5                 6  17   8  60  59
    6                95  98  69  19  76
    7                71  90  45  45  40
    8                94  16  44  60  16
    9                53   8  30   4  72
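To round this out, a sketch of how the stored value can be read back off the column level, and how the metadata level can be dropped when it gets in the way of analysis (droplevel is available in later pandas versions; the frame is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_product(
    [['US/Eastern'], ['A', 'B', 'C', 'E', 'D']],
    names=['local_tz', 'name'])
a = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=col)

# recover the stored value from the column level...
tz = a.columns.get_level_values('local_tz')[0]
print(tz)                                  # US/Eastern

# ...and strip the metadata level off for plain analysis
data_only = a.droplevel('local_tz', axis=1)
```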
+5
Mar 18 '15 at 18:12


