pandas bug: the mean of an int64 column in a groupby stays int64 in some circumstances

I am seeing very strange behavior (IMHO) with some data loaded into pandas from a CSV file. To protect the innocent, let's just say that the DataFrame is in the variable homes and has, among others, the following columns:

    In [143]: homes[['zipcode', 'sqft', 'price']].dtypes
    Out[143]:
    zipcode    int64
    sqft       int64
    price      int64
    dtype: object

To get the average price in each zip code, I tried:

    In [146]: homes.groupby('zipcode')[['price']].mean().head(n=5)
    Out[146]:
               price
    zipcode
    28001     280804
    28002     234284
    28003     294111
    28004    1355927
    28005     810164

Oddly enough, the mean price is an int64, as shown here:

    In [147]: homes.groupby('zipcode')[['price']].mean().dtypes
    Out[147]:
    price    int64
    dtype: object

I cannot think of any technical reason why the mean of a set of integers would not be promoted to float. Moreover, just adding one more column makes price a float64, as I expected it to be all along:

    In [148]: homes.groupby('zipcode')[['price', 'sqft']].mean().dtypes
    Out[148]:
    price    float64
    sqft     float64
    dtype: object

                      price          sqft
    zipcode
    28001     280804.690608  14937.450276
    28002     234284.035176   7517.633166
    28003     294111.278571  10603.096429
    28004    1355927.097792  13104.220820
    28005     810164.880952  19928.785714
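One way to avoid depending on the dtype the aggregation returns is to cast the column to float before grouping. The following is only a sketch: the data is made up (only the column names zipcode and price come from the question), and whether it sidesteps the behavior on pandas 0.16.2 specifically has not been verified here.

```python
import pandas as pd

# Made-up stand-in for the homes data; only the column names come from the question.
homes = pd.DataFrame({
    "zipcode": [28001, 28001, 28002, 28002, 28002],
    "price": [280000, 281610, 234000, 234500, 234353],
})

# Cast the integer column to float before grouping, so the aggregated
# column is already float64 going into mean().
prices_as_float = homes.assign(price=homes["price"].astype("float64"))
mean_price = prices_as_float.groupby("zipcode")[["price"]].mean()

print(mean_price.dtypes)
print(mean_price)
```

Checking `.dtypes` on the result, as in the session above, is a quick way to confirm which dtype a given pandas version actually produces.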

To make sure I was not missing something very obvious, I created another, very simple DataFrame (df), but the problem does not appear with it:

    In [161]: df[['J','K']].dtypes
    Out[161]:
    J    int64
    K    int64
    dtype: object

    In [164]: df[['J','K']].head(n=10)
    Out[164]:
       J   K
    0  0  -9
    1  0 -14
    2  0   8
    3  0 -11
    4  0  -7
    5 -1   7
    6  0   2
    7  0   0
    8  0   5
    9  0   3

    In [165]: df.groupby('J')[['K']].mean()
    Out[165]:
               K
    J
    -2 -2.333333
    -1  0.466667
     0 -1.030303
     1 -1.750000
     2 -3.000000

Note that here a single int64 column, K, grouped by J (another int64), gives a float mean directly. The homes DataFrame was read from a CSV file that was given to me; df was created in pandas, written to CSV and then read back.
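The round trip described above (create in pandas, write to CSV, read back) can be sketched as follows. The values are a made-up subset in the spirit of df, and an in-memory buffer stands in for the file on disk; both are assumptions made for illustration.

```python
import io

import pandas as pd

# Small made-up frame with two integer columns, like J and K above.
df = pd.DataFrame({"J": [0, 0, -1, 1], "K": [-9, -14, 7, 3]})

# Write to CSV and read it back (in-memory buffer instead of a real file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)

# Both columns should come back as int64 ...
print(df_roundtrip.dtypes)

# ... and the grouped mean of K can then be inspected for its dtype.
result = df_roundtrip.groupby("J")[["K"]].mean()
print(result.dtypes)
print(result)
```

Running the same dtype checks on both the original and the round-tripped frame helps rule out the CSV step itself as the source of the difference.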

Last but not least, I use pandas 0.16.2.

1 answer

As some of you suggested in the comments, this is a bug in pandas. I have just reported it here.

At the moment, it has been accepted by the pandas team.

Thanks.


Source: https://habr.com/ru/post/1232400/

