I would like to be able to attach metadata to the Series of my DataFrames (specifically, the original filename), so that after joining two DataFrames I can see where each Series came from.
I see GitHub issues regarding _metadata (here, here), including some related to the current _metadata attribute (here), but nothing in the pandas docs.
So far, I can modify the _metadata attribute, which supposedly allows metadata to be preserved, but I get an AttributeError after the merge.
    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
    df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
    df1._metadata.append('filename')
    df1[df1.columns[0]]._metadata.append('filename')

    # Tag every column of both frames with its source file
    for c in df1:
        df1[c].filename = 'fname1.csv'
        df2[c].filename = 'fname2.csv'

    df1[0]._metadata     # ['name', 'filename']
    df1[0].filename      # fname1.csv
    df2[0].filename      # fname2.csv
    df1[0][:3].filename  # fname1.csv

    mgd = pd.merge(df1, df2, on=[0])
    mgd['1_x']._metadata  # ['name', 'filename']
    mgd['1_x'].filename   # raises AttributeError
How can I preserve the metadata through the merge?
Update: Epilogue
As discussed here, __finalize__ cannot keep track of Series that are members of a DataFrame, only independent Series. So for now, I will keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the DataFrame. My code looks like this:
    def cust_merge(d1, d2):
        "Custom merge function for 2 dicts"
        ...

    def finalize_df(self, other, method=None, **kwargs):
        for name in self._metadata:
            if method == 'merge':
                lmeta = getattr(other.left, name, {})
                rmeta = getattr(other.right, name, {})
                newmeta = cust_merge(lmeta, rmeta)
                object.__setattr__(self, name, newmeta)
            else:
                object.__setattr__(self, name, getattr(other, name, None))
        return self

    df1.filenames = {c: 'fname1.csv' for c in df1}
    df2.filenames = {c: 'fname2.csv' for c in df2}
    pd.DataFrame._metadata = ['filenames']
    pd.DataFrame.__finalize__ = finalize_df
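For comparison, here is a minimal sketch of the same "dictionary attached to the DataFrame" idea that does not patch `pd.DataFrame.__finalize__`. It is not the approach above, just one possible alternative: the helper name `merge_with_sources` and its parameters are hypothetical, the `_x` / `_y` suffixes are applied explicitly before merging so the mapping does not depend on how pandas renames overlapping columns, and the result is stored in `DataFrame.attrs`, which the pandas docs mark as experimental.

```python
import pandas as pd


def merge_with_sources(left, right, on, left_src, right_src):
    """Merge two frames and record which file each output column came from.

    Hypothetical helper (not part of pandas). Overlapping non-key columns
    are renamed up front so the provenance map is known exactly.
    """
    keys = set(on)
    overlap = (set(left.columns) & set(right.columns)) - keys
    lnames = {c: f"{c}_x" for c in overlap}
    rnames = {c: f"{c}_y" for c in overlap}

    merged = pd.merge(
        left.rename(columns=lnames),
        right.rename(columns=rnames),
        on=on,
    )

    sources = {}
    for c in left.columns:
        sources[lnames.get(c, c)] = left_src
    for c in right.columns:
        if c in keys:
            # Key columns are shared, so record both inputs
            sources[c] = (left_src, right_src)
        else:
            sources[rnames.get(c, c)] = right_src

    # attrs is a plain dict carried on the DataFrame (experimental in pandas)
    merged.attrs["filenames"] = sources
    return merged


df1 = pd.DataFrame([[0, 1, 2], [1, 3, 0], [2, 2, 1]])
df2 = pd.DataFrame([[0, 3, 3], [1, 0, 2], [3, 1, 1]])
mgd = merge_with_sources(df1, df2, [0], "fname1.csv", "fname2.csv")
filenames = mgd.attrs["filenames"]
```

The mapping only describes the frame returned by the helper. Whether `attrs` survives later operations on `mgd` depends on the pandas version, so it is safest to read it immediately or carry the dictionary alongside the frame.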