I am not sure whether I am doing something wrong or misunderstanding something, or whether this is potentially a bug in the pandas crosstab function or the numpy size function.
I am using the simple example DataFrame from the pandas examples:
>>> import numpy as np
>>> from pandas import DataFrame, crosstab
>>> df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
...                 'B': ['A', 'B', 'C'] * 8,
...                 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
...                 'D': np.random.randn(24),
...                 'E': np.random.randn(24)})
Running a simple crosstab with margins=True to get totals works as expected:
>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True)
C        bar  foo  All
A     B
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24
Passing np.size explicitly as the aggregation function gives the same result:
>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size])
C        bar  foo  All
A     B
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24
Pandas allows you to pass multiple aggregation functions to get the count and the mean in a single crosstab. However, when I do this, the size totals in the All row for both foo and bar are 24, double the 12 reported by the previous calls, while the grand total (24) remains correct.
>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size, np.mean], values=df['D'])
          size                mean
C          bar  foo  All       bar       foo       All
A     B
one   A      2    2    4  0.245998  0.076366  0.161182
      B      2    2    4 -0.739757  0.137780 -0.300988
      C      2    2    4 -1.555759 -1.446554 -1.501157
three A      2  NaN    2  1.216109       NaN  1.216109
      B    NaN    2    2       NaN  0.255482  0.255482
      C      2  NaN    2  0.732448       NaN  0.732448
two   A    NaN    2    2       NaN -0.273747 -0.273747
      B      2  NaN    2 -0.001649       NaN -0.001649
      C    NaN    2    2       NaN  0.685422  0.685422
All         24   24   24 -0.017102 -0.094208 -0.055655
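For reference, the totals I would expect in the All row can be cross-checked independently of the multi-function call. The sketch below is only an illustration under some assumptions: it uses the current pandas keyword names (index/columns rather than the rows/cols used above), a seeded generator so it is repeatable, and variable names (expected_counts, margin_row) that are my own. It counts the rows per value of C with a plain groupby and compares that with the margins of a crosstab that has no aggfunc.

```python
import numpy as np
import pandas as pd

# Same layout as the example frame above; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 6,
    'B': ['A', 'B', 'C'] * 8,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
    'D': rng.standard_normal(24),
})

# Independent per-column totals: number of rows of D for each value of C.
expected_counts = df.groupby('C')['D'].size()

# Plain frequency crosstab with margins; the last row is the "All" margin.
ct = pd.crosstab(index=[df['A'], df['B']], columns=df['C'], margins=True)
margin_row = ct.iloc[-1]

print(expected_counts)
print(margin_row)
```

Both approaches count 12 rows each for bar and foo, which is why the 24 shown per column in the size block above looks wrong to me.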
Am I missing something? Why do the two cases behave differently?