.sum() in pandas gives inconsistent results

I have a big DataFrame (about 4e+07 rows).

When summing, I get two significantly different results depending on whether I compute the sum before or after selecting the columns.
In addition, the dtype changes from float32 to float64, although the totals are below 2**31.

df[[col1, col2, col3]].sum()
Out[1]:
col1         9.36e+07
col2         1.39e+09
col3         6.37e+08
dtype: float32

df.sum()[[col1, col2, col3]]
Out[2]:
col1         1.21e+08
col2         1.70e+09
col3         7.32e+08
dtype: float64

I am obviously missing something. Has anyone had the same problem?

Thank you for your help.

2 answers

You can lose precision with np.float32 relative to np.float64:

np.finfo(np.float32)

finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)

and

np.finfo(np.float64)

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
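To make the resolution figures above concrete, here is a small sketch that prints the gap between adjacent representable values near the magnitudes seen in the question (about 1e8 and 1e9). The variable names are illustrative only. A float32 has a 24-bit significand, so staying below 2**31 does not guarantee exact results.

```python
import numpy as np

# Gap between a value and the next representable float, per dtype.
magnitudes = (1e8, 1e9)
gaps32 = {v: float(np.spacing(np.float32(v))) for v in magnitudes}
gaps64 = {v: float(np.spacing(np.float64(v))) for v in magnitudes}

for v in magnitudes:
    print(f"near {v:.0e}: float32 gap = {gaps32[v]}, float64 gap = {gaps64[v]}")
```

Near 1e8 the float32 gap is 8, so adding a value much smaller than that to a running total of this size is partly or wholly lost to rounding.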

A contrived example

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x=[-60499999.315, 60500002.685] * int(2e7),
    y=[-60499999.315, 60500002.685] * int(2e7),
    z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))

print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')

y    80000000.0
z    80000000.0
dtype: float64

y    67108864.0
z    67108864.0
dtype: float32

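To understand what is going on here, you need to understand what Pandas is doing under the hood. I am going to simplify a bit, since there are many special cases to consider, but roughly it looks like this: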

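Suppose you have a Pandas DataFrame df with various numeric columns (ignoring datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:

1. extracts the values of the DataFrame into a two-dimensional NumPy array;
2. applies the NumPy sum function to that 2d array along axis 0 to compute the column sums.

It is the first step that matters here. The columns of a DataFrame can have different dtypes, but a 2d NumPy array has a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose one dtype that is suitable for all the columns at once, and in this case it chooses float64. So the sum is computed on double-precision values, using double-precision arithmetic. This is what happens in your second example.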
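The dtype choice described above can be checked on a tiny frame. This is a minimal sketch: the frame and its column names are made up for illustration, and it only shows which dtype the 2d array gets, not how a particular Pandas version carries out the reduction internally.

```python
import numpy as np
import pandas as pd

# NumPy's promotion rule for the two dtypes mentioned above.
promoted = np.result_type(np.float32, np.int32)
print(promoted)

# Hypothetical toy frame with one float32 and one int32 column.
mixed = pd.DataFrame({
    "A": np.ones(3, dtype=np.float32),
    "C": np.ones(3, dtype=np.int32),
})
floats_only = mixed[["A"]]

# The 2d array needs one dtype that can hold every column.
print(mixed.to_numpy().dtype)        # mixed columns are upcast
print(floats_only.to_numpy().dtype)  # float32-only selection stays float32
```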

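On the other hand, if you first cut the DataFrame down to just the float32 columns, Pandas can and does use the float32 dtype for the 2d NumPy array, so the sum is computed in single precision. This is what happens in your first example.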

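Here is a simple example showing this in action: we set up a DataFrame with 100 million rows and three columns, with dtypes float32, float32 and int32 respectively. All the values are ones: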

>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
     A    B  C
0  1.0  1.0  1
1  1.0  1.0  1
2  1.0  1.0  1
3  1.0  1.0  1
4  1.0  1.0  1
>>> df.dtypes
A    float32
B    float32
C      int32
dtype: object

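When we compute the sums directly on the DataFrame, Pandas first converts everything to float64s. The computation uses float64 accumulators for the three columns, and we get an accurate answer.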

>>> df.sum()
A    100000000.0
B    100000000.0
C    100000000.0
dtype: float64

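But if we first cut the dataframe down to just the float32 columns, float32 accumulators are used for the sum, and we get very poor answers.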

>>> df[['A', 'B']].sum()
A    16777216.0
B    16777216.0
dtype: float32

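The inaccuracy is, of course, due to using a dtype that does not have enough precision for the task: at some point in the summation we end up repeatedly adding 1.0 to 16777216.0 and getting 16777216.0 back each time, thanks to the usual floating-point rounding. The solution is to convert to float64 explicitly before doing the computation.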
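The stall at 16777216.0 (which is 2**24) can be reproduced with a few scalar additions. This is only an illustrative sketch of the rounding behaviour, with made-up variable names, and not the code path Pandas or NumPy actually runs.

```python
import numpy as np

# Start just below 2**24 and keep adding 1.0 in single precision.
acc32 = np.float32(2**24 - 3)
one = np.float32(1.0)
history = []
for _ in range(6):
    acc32 = acc32 + one          # float32 + float32 stays float32
    history.append(float(acc32))
print(history)  # the running total stops growing once it reaches 2**24

# The same additions with a double-precision accumulator.
acc64 = np.float64(2**24 - 3)
for _ in range(6):
    acc64 = acc64 + 1.0
print(acc64)
```

With float32 the total reaches 16777216.0 and stays there, because 16777217 is not representable and the sum rounds back down. The float64 accumulator ends at 16777219.0.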

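However, this is not quite the end of the surprises Pandas has in store for us. With the same dataframe as above, let us try computing the sum for column "A" only: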

>>> df[['A']].sum()
A    100000000.0
dtype: float32

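Suddenly we get full accuracy again! So what is going on? This has little to do with dtypes: float32 is still used for the summation. It is now the second step (the NumPy summation) that is responsible for the difference. NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with the float32 dtype and arrays of this size that extra accuracy makes a huge difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; there is a related discussion in the NumPy issue tracker. When we compute the sums of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we are summing along axis 0, so the naive summation algorithm is used and we get poor results. But if we ask only for the sum of column "A", we get the accurate result, computed with pairwise summation.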
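The effect of the summation order can be seen in NumPy alone. In this sketch, np.cumsum serves only as a stand-in for strictly left-to-right float32 accumulation; it is an assumption made for illustration, not the routine NumPy uses for axis-0 reductions. The array length is chosen just above 2**24 to keep memory use modest.

```python
import numpy as np

n = 2**24 + 1024
ones = np.ones(n, dtype=np.float32)

# Contiguous 1d reduction: NumPy combines partial sums, so no single
# addition has to add 1.0 to a total of 2**24 or more.
pairwise_total = ones.sum()

# Strictly sequential float32 accumulation: each step adds 1.0 to the
# running total, which stops growing at 2**24.
sequential_total = np.cumsum(ones)[-1]

print(n, pairwise_total, sequential_total)
```

Both results have dtype float32; only the order of the additions differs.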

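In sum, when working with DataFrames of this size, you should (a) work in double precision rather than single precision whenever possible, and (b) be prepared for differences in results caused by NumPy choosing different summation algorithms.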
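As a minimal sketch of recommendation (a), the columns can be upcast with astype before summing. The toy frame below is hypothetical and tiny; it pairs one large value with a few small ones so that single precision is already at its limit. The float32 result is printed without a stated expectation, since it can depend on the library versions in use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: float32 columns whose exact total is 2**24 + 3.
df = pd.DataFrame({
    "A": np.array([2**24, 1, 1, 1], dtype=np.float32),
    "B": np.array([2**24, 1, 1, 1], dtype=np.float32),
})

single = df.sum()                      # result dtype stays float32
double = df.astype(np.float64).sum()   # upcast first, then sum

print(single)
print(double)
```

The upcast temporarily doubles the memory taken by those columns, which is worth keeping in mind for a frame with tens of millions of rows.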


Source: https://habr.com/ru/post/1688266/

