I remember from my MATLAB days using structured arrays, in which you could store different data as attributes of the main structure. Something like:

```matlab
a.A = magic(10);
a.B = magic(50);
```

etc., where `a.A` and `a.B` are completely separate from each other, allowing you to store different types inside `a` and work with them as desired. Pandas allows us to do something similar, but not quite the same.
I use pandas and want to store attributes of a dataframe without actually putting them inside the dataframe itself. This can be done with:

```python
import pandas as pd
import numpy as np

a = pd.DataFrame(data=np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))
a.local_tz = 'UTC'  # attach the timezone as an ad-hoc attribute
```
Now the local timezone is stored in `a`, but I cannot keep this attribute when I save the dataframe (i.e., after saving and reloading `a`, there is no `a.local_tz`). Is there a way to preserve such attributes?
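As a minimal illustration of the problem (using a CSV round trip here just for the sketch; other writers behave the same way), the ad-hoc attribute does not survive serialization:

```python
import io

import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))
a.local_tz = 'UTC'  # ad-hoc attribute, not part of the data

# round-trip through a writer/reader pair
buf = io.StringIO()
a.to_csv(buf)
buf.seek(0)
b = pd.read_csv(buf, index_col=0)

hasattr(a, 'local_tz')  # True
hasattr(b, 'local_tz')  # False: the attribute was silently dropped
```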
I am currently just creating extra columns in the dataframe to store information such as timezone, latitude, longitude, etc., but this seems wasteful, since each of these values is constant for the whole file. It also causes problems during data analysis, where I keep having to exclude these columns.
UPDATE:
Following unutbu's recommendation, I now store the data in HDF5 format. As mentioned, loading metadata back as attributes of the dataframe itself is risky. However, since I am the creator of these files (and of the processing algorithms), I can choose what is stored as metadata and what is not. When processing the data that goes into the H5 files, I keep the metadata in a dictionary that is initialized as an attribute of my classes. I made a simple I/O class to import the H5 data, which makes the metadata attributes of the class. Now I can work with my data files without the risk of losing the metadata.
```python
import pandas as pd

class IO():
    def __init__(self):
        self.dtfrmt = 'dummy_str'

    def h5load(self, filename, update=False):
        '''h5load loads the stored HDF5 file. Both the dataframe (actual
        data) and the associated metadata are stored in the H5 file.

        NOTE: this does not load *any* H5 file; it loads H5 files
        specifically created to hold dataframe data and metadata.

        When multi-indexed dataframes are stored in the H5 format, the
        date values (previously initialized with timezone information)
        lose their timezone localization. Therefore, h5load re-localizes
        the 'DATE' index as UTC.

        Parameters
        ----------
        filename : string/path
            Path and filename of the H5 file to be loaded. The H5 file
            must have been created using h5store below.
        update : boolean, default False
            If the selected dataframe is to be updated, it is imported
            slightly differently. If update==True, metadata is returned
            as a stand-alone dictionary and data as a plain dataframe
            (i.e., NOT as attributes of an IO() instance). If False,
            metadata is returned as an attribute of the class instance.

        Output
        ------
        data : pandas dataframe with attributes
            The dataframe contains only the data as collected by the
            instrument. Any metadata (e.g. timezone, scaling factor;
            basically anything that is constant throughout the file) is
            stored as an attribute (e.g. lat is stored as data.lat).'''
        with pd.HDFStore(filename, 'r') as store:
            self.data = store['mydata']
            self.metadata = store.get_storer('mydata').attrs.metadata
```
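The docstring refers to an `h5store` counterpart that is not shown here. A minimal sketch of what it might look like, consistent with `h5load` (the `'mydata'` key matches the loader; the `**kwargs` metadata interface is my assumption):

```python
import pandas as pd

def h5store(filename, df, **kwargs):
    '''Sketch of the counterpart to h5load: store the dataframe under
    the 'mydata' key and attach a metadata dict to the storer's attrs.'''
    with pd.HDFStore(filename) as store:
        store.put('mydata', df)
        store.get_storer('mydata').attrs.metadata = kwargs
```

Anything passed as a keyword argument (`lat=45.0`, `local_tz='UTC'`, ...) ends up in the metadata dictionary that `h5load` later reads back.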
H5 files are then loaded via `data = IO().h5load('filename.h5')`; the dataframe is stored in `data.data`. I keep the metadata dictionary in `data.metadata` and create separate metadata attributes from it (e.g. `data.lat` created from `data.metadata['lat']`).
My index timestamps are localized to `pytz.utc`. However, when a multi-indexed dataframe is stored to H5, the timezone localization is lost (using pandas 0.15.2), so I correct for this in `IO().h5load`.
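The re-localization step can be sketched as follows (the level names `'DATE'`/`'ID'` and the sample values are assumptions for illustration; the point is simply to re-localize the naive `DATE` level as UTC):

```python
import pandas as pd

# a multi-index whose DATE level came back naive from the H5 round trip
dates = pd.to_datetime(['2015-01-01 00:00', '2015-01-01 01:00'])
idx = pd.MultiIndex.from_product([dates, ['sensor1']], names=['DATE', 'ID'])
df = pd.DataFrame({'val': [1.0, 2.0]}, index=idx)

# re-localize the DATE level as UTC
df.index = df.index.set_levels(
    df.index.levels[0].tz_localize('UTC'), level='DATE')
```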