Pandas: df_left.merge(df_right) Summary Statistics

Question

For the pandas df.merge() method, is there a convenient way to get summary statistics about the merge (e.g. number of matches, number of non-matches, etc.)? I know these statistics depend on the how='inner' flag, but it would be useful to know how much is being "discarded" when using an inner join, etc. I could simply use:

    df = df_left.merge(df_right, on='common_column', how='inner')
    set1 = set(df_left['common_column'].unique())
    set2 = set(df_right['common_column'].unique())
    set1.issubset(set2)   # True: no further analysis required
    set2.issubset(set1)   # False
    num_shared = len(set2.intersection(set1))
    num_diff = len(set2.difference(set1))
    # and so on ...

But I thought this might already be implemented. Have I missed it? (E.g. something like report=True for merge, which would return the new DataFrame together with a report Series or DataFrame.)

python pandas

sanguineturtle Jun 16 '13 at 23:36

1 answer

sanguineturtle · Answer 1 · 2013-06-17T01:14:30+0000

This is what I am using so far. It is part of a function that matches data from one encoding system to another. In the snippet, data and concord are the two DataFrames and match_on is the name of the key column.

    if report:
        # Summary statistics of the key column on each side
        report_df = data[match_on].describe().to_frame('left')
        report_df = report_df.merge(
            concord[match_on].describe().to_frame('right'),
            left_index=True, right_index=True)

        set_left = set(data[match_on])
        set_right = set(concord[match_on])

        # Is each key set fully contained in the other?
        set_info = pd.DataFrame({'left': set_left.issubset(set_right),
                                 'right': set_right.issubset(set_left)},
                                index=['subset'])
        report_df = pd.concat([report_df, set_info])

        # Number of keys found on one side only
        set_info = pd.DataFrame({'left': len(set_left.difference(set_right)),
                                 'right': len(set_right.difference(set_left))},
                                index=['differences'])
        report_df = pd.concat([report_df, set_info])

        # Up to five example differences per side (set order is arbitrary),
        # padded with NaN when fewer than five exist
        left_diff = (list(set_left.difference(set_right)) + [np.nan] * 5)[:5]
        right_diff = (list(set_right.difference(set_left)) + [np.nan] * 5)[:5]
        set_info = pd.DataFrame({'left': left_diff, 'right': right_diff},
                                index=['diff1', 'diff2', 'diff3', 'diff4', 'diff5'])
        report_df = pd.concat([report_df, set_info])

Report example:

[Image: report sample]
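As a complement to the set-based report above, here is a minimal sketch of how row-level match counts might be obtained from the merge itself. It assumes a pandas version whose merge accepts the indicator argument; the toy frames and the column name 'key' are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical toy frames; 'key' stands in for the common column.
df_left = pd.DataFrame({'key': ['a', 'b', 'c', 'd'], 'x': [1, 2, 3, 4]})
df_right = pd.DataFrame({'key': ['c', 'd', 'e'], 'y': [30, 40, 50]})

# An outer join keeps every row; indicator=True adds a '_merge' column
# whose values are 'left_only', 'right_only' or 'both'.
merged = df_left.merge(df_right, on='key', how='outer', indicator=True)

# Row counts per category: what an inner join would keep ('both')
# and what it would drop ('left_only', 'right_only').
merge_counts = merged['_merge'].value_counts()
print(merge_counts)

# The inner-join result can be recovered from the same frame.
inner = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```

Note that these are counts of rows in the merged result, not of unique keys: if a key is repeated on either side, the matching rows multiply, so the set-based comparison is still useful for key-level differences.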
