Aggregation over aggregated values in pandas returns incorrect result

I have a dataframe in which each transaction can appear in more than one row, and each transaction is associated with a store. I want to find the average transaction cost per store. To do this, I need to sum the sales per transaction and then average those totals:

import pandas as pd

# preparing the dataset
txt_data = pd.read_csv("./TestDataSource/txn.csv", sep=';')
txt_data = txt_data.replace({',': '.'}, regex=True)
txt_data[['SALES']] = txt_data[['SALES']].apply(pd.to_numeric)


len(txt_data.STORE.unique()) shows that only 30 unique STOREs are present.

First, I aggregate the rows by transaction:

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]
a1.head()


Everything seems to be in order. But then I aggregate by store:

a2 = a1.groupby('STORE').mean()

But list(a2.shape) returns [1137, 1], which is really confusing. And len(a1.STORE.unique()) also returns 1137.

What am I doing wrong?

3 answers

The problem is that you sum both STORE and SALES per TXN:

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]

This is the same as:

a1 = txt_data.groupby('TXN')[['STORE', 'SALES']].sum()

Instead, group by both TXN and STORE:

txt_data = pd.read_csv("txn.csv", sep=';', decimal=',')

a1 = txt_data.groupby(['TXN', 'STORE'], as_index=False)['SALES'].sum()

print(txt_data.STORE.nunique())
30

print(a1.STORE.nunique())
30
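
To get from here to the per-store average the question asks for, the two steps could be chained roughly as follows. This is only a sketch: the sample rows are invented for illustration (they are not from txn.csv), and it assumes a transaction never spans more than one store.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as txn.csv (values invented for illustration).
csv_text = """STORE;TXN;SALES
1082;5541359000;60,2
1082;5541359000;15,6
1082;5541359001;10,0
999;5541359002;4,0
999;5541359002;6,0
999;5541359003;20,0
"""

# decimal=',' parses "60,2" as 60.2, so no replace/to_numeric step is needed.
txt_data = pd.read_csv(io.StringIO(csv_text), sep=';', decimal=',')

# Step 1: total per transaction; STORE is a grouping key, so it is not summed.
a1 = txt_data.groupby(['TXN', 'STORE'], as_index=False)['SALES'].sum()

# Step 2: average transaction total per store.
a2 = a1.groupby('STORE')['SALES'].mean()
print(a2)
```

With this layout a2 has one row per store, so its length should match txt_data.STORE.nunique().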

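In this line:

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]

you group by TXN, and pandas does exactly what it was asked to do: it sums every other selected column, STORE included, so the store IDs within a transaction are added together. For example: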

txt_data[txt_data['TXN']==5541359000]  

               DAY  STORE   ART                    TXN      TIME    SALES
1268877 2015-10-01  1082    15294488        5541359000  09:30:22    60.2
1269093 2015-10-01  1082    80439           5541359000  09:30:29    15.6
1269309 2015-10-01  1082    191452          5541359000  09:30:15    4.0
1269525 2015-10-01  1082    15317962        5541359000  09:30:17    103.0

a1.head()
           STORE    SALES
TXN     
5541359000  4328    182.8

#1082 * 4 = 4328
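
One way to avoid this, sketched below, is to give each column its own aggregation function. The rows are copied from the example above; using 'first' for STORE is an assumption that every row of a transaction carries the same store ID.

```python
import pandas as pd

# Rows taken from the example above (only the columns needed here).
df = pd.DataFrame({
    'STORE': [1082, 1082, 1082, 1082],
    'TXN': [5541359000] * 4,
    'SALES': [60.2, 15.6, 4.0, 103.0],
})

# Summing everything adds the store IDs together: 1082 * 4 = 4328.
summed = df.groupby('TXN').sum()

# Per-column aggregation keeps the store ID intact.
# 'first' assumes all rows of a transaction share one STORE.
per_txn = df.groupby('TXN').agg({'STORE': 'first', 'SALES': 'sum'})

print(summed)
print(per_txn)
```

Grouping by both TXN and STORE gives the same totals without relying on that assumption.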

In the line

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()

STORE is summed along with SALES. The original values, as returned by txt_data['STORE'].unique(), are:

array([22691, 20581,  1574,  1602,  1579, 29245, 19009, 21761, 17474,
        1544,  1612,  1534,   958, 17096,  1094,  1596,  1594,  1609,
       24605,   956,   961,  1122, 27220,   974,  1082, 25039,  1530,
         999,  1053,   980])

In the a1 DataFrame, the STORE values no longer match those in txt_data, because groupby('TXN').sum() adds up the STORE values within each TXN group.

For example, STORE = 4328 does not appear in txt_data['STORE'].unique(); it is the sum of the four rows of one transaction:


1082 * 4 = 4328
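
A quick diagnostic along these lines is to list the STORE values in a1 that never occur in the source data. This is a sketch on a hypothetical miniature of txt_data: the first four rows follow the example in this thread, and the fifth row is invented.

```python
import pandas as pd

# Hypothetical miniature of txt_data; the last row is invented for illustration.
txt_data = pd.DataFrame({
    'STORE': [1082, 1082, 1082, 1082, 999],
    'TXN': [5541359000] * 4 + [5541359001],
    'SALES': [60.2, 15.6, 4.0, 103.0, 7.5],
})

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()

# STORE values in a1 that are absent from the source are artefacts of the sum.
bogus = a1.loc[~a1['STORE'].isin(txt_data['STORE'].unique()), 'STORE']
print(bogus.tolist())
```

Single-row transactions pass this check because their STORE is "summed" with nothing, so the check flags a problem but does not prove its absence.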


Source: https://habr.com/ru/post/1695394/

