Propagating pandas Series metadata through joins

I would like to attach metadata to the Series of a DataFrame (specifically, the original file name), so that after merging two DataFrames I can see where each Series came from.

I see GitHub issues regarding _metadata (here, here), including some related to the current _metadata attribute (here), but nothing in the pandas docs.

So far, I can modify the _metadata attribute, which is supposed to allow metadata to be preserved, but I get an AttributeError after the merge.

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
    df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
    df1._metadata.append('filename')
    df1[df1.columns[0]]._metadata.append('filename')

    for c in df1:
        df1[c].filename = 'fname1.csv'
        df2[c].filename = 'fname2.csv'

    df1[0]._metadata      # ['name', 'filename']
    df1[0].filename       # fname1.csv
    df2[0].filename       # fname2.csv
    df1[0][:3].filename   # fname1.csv

    mgd = pd.merge(df1, df2, on=[0])
    mgd['1_x']._metadata  # ['name', 'filename']
    mgd['1_x'].filename   # raises AttributeError

How can I preserve it?

Update: Epilogue

As discussed here, __finalize__ cannot keep track of Series that are members of a DataFrame, only independent Series. So for now I will keep track of the Series-level metadata by maintaining a dictionary of metadata attached to each DataFrame. My code looks like this:

    def cust_merge(d1, d2):
        "Custom merge function for 2 dicts"
        ...

    def finalize_df(self, other, method=None, **kwargs):
        for name in self._metadata:
            if method == 'merge':
                lmeta = getattr(other.left, name, {})
                rmeta = getattr(other.right, name, {})
                newmeta = cust_merge(lmeta, rmeta)
                object.__setattr__(self, name, newmeta)
            else:
                object.__setattr__(self, name, getattr(other, name, None))
        return self

    df1.filenames = {c: 'fname1.csv' for c in df1}
    df2.filenames = {c: 'fname2.csv' for c in df2}
    pd.DataFrame._metadata = ['filenames']
    pd.DataFrame.__finalize__ = finalize_df
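The body of cust_merge is elided above. For illustration only, here is a minimal sketch of what such a dict merge could look like. It assumes each metadata dict maps a column label to its source file name, and that overlapping non-key columns are renamed with the default pd.merge suffixes '_x' and '_y'. The name merge_filenames and its parameters are hypothetical and are not the original implementation.

```python
def merge_filenames(left, right, on=(), suffixes=("_x", "_y")):
    """Combine two {column: filename} dicts, mimicking pd.merge column renaming.

    Hypothetical helper: `on` lists the key columns, `suffixes` are appended
    to non-key column labels that occur in both inputs.
    """
    keys = set(on)
    overlap = (set(left) & set(right)) - keys
    merged = {}
    for meta, suffix in ((left, suffixes[0]), (right, suffixes[1])):
        for col, fname in meta.items():
            if col in keys:
                # A key column appears once in the result; record every source.
                merged.setdefault(col, []).append(fname)
            elif col in overlap:
                # Same label on both sides: rename it the way the merge does.
                merged[f"{col}{suffix}"] = fname
            else:
                merged[col] = fname
    return merged


left = {0: "fname1.csv", 1: "fname1.csv", 2: "fname1.csv"}
right = {0: "fname2.csv", 1: "fname2.csv", 2: "fname2.csv"}
merged = merge_filenames(left, right, on=[0])
```

With this layout, merged['1_x'] gives the file name for the column that the merge calls '1_x', and the key column 0 carries both file names.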
python pandas metadata
Apr 21 '14 at 15:36
1 answer

I think something like this will work (and if not, please file a bug report, because this, while supported, is a bit bleeding edge; that is, it is possible that the join methods do not call this every time. That part is a bit untested).

See this issue for a more detailed example / bug fix.

    from pandas import DataFrame

    DataFrame._metadata = ['name', 'filename']

    def __finalize__(self, other, method=None, **kwargs):
        """
        propagate metadata from other to self

        Parameters
        ----------
        other : the object from which to get the attributes that we are
            going to propagate
        method : optional, a passed method name; possibly to take different
            types of propagation actions based on this
        """

        ### you need to arbitrate when there are conflicts

        for name in self._metadata:
            object.__setattr__(self, name, getattr(other, name, None))
        return self

    DataFrame.__finalize__ = __finalize__

So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code that can reconcile conflicts. That is the reason this is not done by default: for example, if frame1 has the name 'foo' and frame2 has the name 'bar', what do you do when the method is __add__, and what about another method? Let us know what you do and how it works out.

This ONLY replaces the finalizer for DataFrame (and you can simply do the default action if you want), which is to propagate other to self; you can also choose not to set anything except in special cases of method.

This method is meant to be overridden in subclasses; that is why you are monkey-patching here (rather than subclassing, which is a pain most of the time).
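As a rough illustration of the conflict-reconciliation step mentioned above, the sketch below shows one possible policy as a plain function that a custom finalizer could call for each metadata name. The function name reconcile and the policy itself are assumptions made for this example (keep both values on 'merge', refuse to guess otherwise); they are not something pandas provides.

```python
def reconcile(name, left_value, right_value, method=None):
    """Choose a metadata value when two inputs may disagree (hypothetical policy)."""
    # No conflict: both sides agree, or only one side carries a value.
    if left_value == right_value or right_value is None:
        return left_value
    if left_value is None:
        return right_value
    if method == "merge":
        # Keep both sources so nothing is silently dropped.
        return (left_value, right_value)
    # For any other method, raise instead of picking a side arbitrarily.
    raise ValueError(
        f"conflicting {name!r} metadata: {left_value!r} vs {right_value!r}"
    )


both = reconcile("filename", "fname1.csv", "fname2.csv", method="merge")
```

Raising for unknown methods makes a missing rule visible immediately, rather than letting one frame's metadata quietly overwrite the other's.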

Apr 21 '14 at 15:58