I am trying to figure out how to aggregate groups in a pandas DataFrame by computing percentages and sums into new columns.
For example, in the data frame below I have columns A, B, C and D. I would like to group by the values in A; C should become a percentage (the frequency of the value 1 divided by the frequency of missing values), and D should be the sum of the non-missing values.
For the group "foo", for instance, the resulting row should be:
A    B  C      D
foo     1.333  4
I can write separate snippets for parts of this, but I am not sure how to combine them into a single coherent script:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1, np.nan, 1, 2, np.nan, 1, 1, 2],
    'D': [2, '', 1, 1, '', 2, 2, 1],
})
print(df)

# Convert D to numeric; the empty strings become NaN
df['D'] = pd.to_numeric(df['D'], errors='coerce')
print(df)

# Sum C and D within each group of A (NaN values are skipped)
grouped = df.groupby('A')
grouped_aggre = grouped[['C', 'D']].sum()
print(grouped_aggre)

# Overall means of D and C, ignoring NaN
print(df['D'].mean())
print(df['C'].mean())
print('//////////////////')
# Number of non-missing values in C, then the frequency of each value
print(df['C'].count())
print(df['C'].value_counts(dropna=True))
Also, how can I compute the same summary statistics for columns C and D when grouping by both A and B?
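One possible reading of this is to pass both column names to groupby. The following is only a minimal sketch, assuming the statistics wanted are the ones that reproduce the "foo" example (mean of the non-missing C values, sum of the non-missing D values). The names `by_ab`, `C_mean` and `D_sum` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1, np.nan, 1, 2, np.nan, 1, 1, 2],
    # Assumption: the empty strings in D have already been turned into NaN
    'D': [2, np.nan, 1, 1, np.nan, 2, 2, 1],
})

# Group by both keys; mean and sum skip NaN by default
by_ab = df.groupby(['A', 'B'], sort=False).agg(
    C_mean=('C', 'mean'),
    D_sum=('D', 'sum'),
)
print(by_ab)
```

The result is indexed by the (A, B) pairs; calling `by_ab.reset_index()` would turn A and B back into ordinary columns.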
Source data frame:
     A      B    C    D
0  foo    one    1    2
1  foo    one  NaN  NaN
2  foo    two    1    1
3  foo  three    2    1
4  bar    two  NaN  NaN
5  bar    two    1    2
6  bar    one    1    2
7  bar  three    2    1
Expected Result:
A    B  C      D
foo     1.333  4
bar     1.333  5
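The expected numbers are consistent with taking, within each group of A, the mean of the non-missing C values and the sum of the non-missing D values (for foo: (1 + 1 + 2) / 3 ≈ 1.333 and 2 + 1 + 1 = 4). A minimal sketch under that assumption, using named aggregation with groupby; the names `result`, `C_mean` and `D_sum` are hypothetical, and this may not be the "percentage" originally intended:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1, np.nan, 1, 2, np.nan, 1, 1, 2],
    # Assumption: the empty strings in D have already been turned into NaN
    'D': [2, np.nan, 1, 1, np.nan, 2, 2, 1],
})

# sort=False keeps the groups in order of first appearance (foo, then bar)
result = df.groupby('A', sort=False).agg(
    C_mean=('C', 'mean'),  # mean of the non-missing C values
    D_sum=('D', 'sum'),    # sum of the non-missing D values
)
print(result.round(3))
```

Column B is left out here because it is not numeric and the expected result shows no value for it.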