Pandas crosstab double-counting margins when using two aggregate functions?

I am not sure whether this is something I am doing wrong or misunderstanding, or whether it is potentially a bug in pandas' crosstab or NumPy's size function.

I am using a simple frame based on the pandas examples:

>>> df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
...                 'B': ['A', 'B', 'C'] * 8,
...                 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
...                 'D': np.random.randn(24),
...                 'E': np.random.randn(24)})

Running a simple crosstab with margins=True to get the totals works as expected:

 >>> crosstab(rows=[df['A'], df['B']], cols=[df['C']], margins=True)
 C          bar  foo  All
 A     B
 one   A      2    2    4
       B      2    2    4
       C      2    2    4
 three A      2    0    2
       B      0    2    2
       C      2    0    2
 two   A      0    2    2
       B      2    0    2
       C      0    2    2
 All         12   12   24

Using the np.size function directly gives the same result:

 >>> crosstab(rows=[df['A'], df['B']], cols=[df['C']], margins=True, aggfunc=[np.size])
 C          bar  foo  All
 A     B
 one   A      2    2    4
       B      2    2    4
       C      2    2    4
 three A      2    0    2
       B      0    2    2
       C      2    0    2
 two   A      0    2    2
       B      2    0    2
       C      0    2    2
 All         12   12   24

Pandas allows you to pass multiple aggregation functions to get the count and the mean in a single crosstab. However, when I do this, the size margins for both foo and bar come out at double the values from the previous call, while the individual cell counts and the grand total remain correct.

 >>> crosstab(rows=[df['A'], df['B']], cols=[df['C']], margins=True,
 ...          aggfunc=[np.size, np.mean], values=df['D'])
           size            mean
 C          bar  foo  All       bar       foo       All
 A     B
 one   A      2    2    4  0.245998  0.076366  0.161182
       B      2    2    4 -0.739757  0.137780 -0.300988
       C      2    2    4 -1.555759 -1.446554 -1.501157
 three A      2  NaN    2  1.216109       NaN  1.216109
       B    NaN    2    2       NaN  0.255482  0.255482
       C      2  NaN    2  0.732448       NaN  0.732448
 two   A    NaN    2    2       NaN -0.273747 -0.273747
       B      2  NaN    2 -0.001649       NaN -0.001649
       C    NaN    2    2       NaN  0.685422  0.685422
 All         24   24   24 -0.017102 -0.094208 -0.055655

Am I missing something? Why do the two calls behave differently?
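(For reference: the calls above use the old rows=/cols= keyword names. On a newer pandas release the rough equivalent, as a sketch only, would be the following, with pivot_table handling the two-aggregation case since it accepts a list of aggregation functions.)

 >>> import numpy as np
 >>> import pandas as pd
 >>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
 ...                    'B': ['A', 'B', 'C'] * 8,
 ...                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
 ...                    'D': np.random.randn(24),
 ...                    'E': np.random.randn(24)})
 >>> # rows=/cols= became index=/columns= in later releases
 >>> counts = pd.crosstab(index=[df['A'], df['B']], columns=df['C'], margins=True)
 >>> # several aggregations at once via pivot_table's list of aggfuncs
 >>> both = df.pivot_table(index=['A', 'B'], columns='C', values='D',
 ...                       aggfunc=['count', 'mean'], margins=True)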

1 answer

Well, I have an idea of what it is doing.

After digging through the source code in pandas/pandas/tools/pivot.py, it comes down to this line:

 row_margin = data[cols + values].groupby(cols).agg(aggfunc) 

Here cols is df['C'] and values is df['D']. We group those two things by cols and then apply the aggregation function, which in this case is np.size. Each row looks like this:

 In [158]: data[cols + values].groupby(cols).nth(0)
 Out[158]:
      __dummy__
 C
 bar  -1.823026
 foo   0.465117

When we call np.size() on that, we of course get 2. Sum up all those 2s to get the margin and you end up with 24, which is double what you would expect if you just wanted counts of D.
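To see concretely why np.size lands on 2 for such a row, here is a made-up example mirroring the shape of the grouped frame above (not from the original answer, just an illustration):

 >>> import numpy as np
 >>> import pandas as pd
 >>> row = pd.Series({'C': 'bar', '__dummy__': -1.823026})
 >>> np.size(row)  # counts every cell in the row, not the number of rows
 2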

Perhaps someone else can chime in on whether this is expected behaviour. I'm still a little confused by this piece of the source. I'll edit if I find out anything else.
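If the goal is simply correct margins alongside both aggregations, one way to sidestep the doubled size margin is to build each aggregation as its own table, so every margin is computed independently, and then glue the results together. This is only a sketch, not something from the original answer; it reuses the df from the question and assumes a pandas release where crosstab takes index=/columns=:

 >>> # count table with its own (correct) margins
 >>> counts = pd.crosstab(index=[df['A'], df['B']], columns=df['C'], margins=True)
 >>> # mean table with its own margins
 >>> means = df.pivot_table(index=['A', 'B'], columns='C', values='D',
 ...                        aggfunc='mean', margins=True)
 >>> # combine side by side under a 'size'/'mean' column level
 >>> combined = pd.concat({'size': counts, 'mean': means}, axis=1)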
