In pandas, how can I get a DataFrame as output when I sum a DataFrame?

When I sum a DataFrame, it returns a Series:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

    In [3]: df
    Out[3]:
       a  b  c
    0  1  2  3
    1  2  3  3

    In [4]: s = df.sum()

    In [5]: type(s)
    Out[5]: pandas.core.series.Series

I know that I can build a new DataFrame from this Series. But is there a more "pandastic" way?

3 answers

I am going to go ahead and say "No": I don't think there is a direct way to do this. The pandastic way (and the pythonic one too) is to be explicit:

 pd.DataFrame(df.sum(), columns=['sum']) 

or, more elegantly, using a dictionary (be aware that this copies the summed array):

 pd.DataFrame({'sum': df.sum()}) 
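
As a quick check that the two explicit constructions give the same frame, here is a minimal sketch using the question's example data. The names `via_columns` and `via_dict` are illustrative only and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

# Wrap the summed (unnamed) Series in a one-column DataFrame,
# naming the column explicitly.
via_columns = pd.DataFrame(df.sum(), columns=['sum'])

# Build the same frame from a dict mapping the column name to the Series.
via_dict = pd.DataFrame({'sum': df.sum()})

# Both keep the original column labels a, b, c as the row index.
print(via_columns)
print(via_dict)
```

In both cases the original column labels end up as the row index of the result.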

As @root notes, it is faster to use NumPy directly (with numpy imported as np):

 pd.DataFrame(np.sum(df.values, axis=0), columns=['sum']) 

(The Zen of Python says "practicality beats purity", so if you care about the time, use this.)
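
One caveat with the NumPy variant: `np.sum(df.values, axis=0)` returns a plain array, so the resulting frame gets a default integer index instead of the labels a, b, c. The sketch below shows the difference. Passing `index=df.columns` is an addition of mine, not part of the original answer, and the names `unlabeled` and `labeled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

# Summing the raw array discards the column labels,
# so the row index of the result is 0, 1, 2.
unlabeled = pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])

# Assumption: supplying the original column labels as the index
# restores them while still summing in NumPy.
labeled = pd.DataFrame(np.sum(df.values, axis=0),
                       index=df.columns, columns=['sum'])

print(unlabeled)
print(labeled)
```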

However, perhaps the most pandastic way is to simply use the Series! :)


A bit of %timeit for your tiny example:

    In [11]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
    1000 loops, best of 3: 356 us per loop

    In [12]: %timeit pd.DataFrame({'sum': df.sum()})
    1000 loops, best of 3: 462 us per loop

    In [13]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
    1000 loops, best of 3: 205 us per loop

and for a larger one:

    In [21]: df = pd.DataFrame(np.random.randn(100000, 3), columns=list('abc'))

    In [22]: %timeit pd.DataFrame(df.sum(), columns=['sum'])
    100 loops, best of 3: 7.99 ms per loop

    In [23]: %timeit pd.DataFrame({'sum': df.sum()})
    100 loops, best of 3: 8.3 ms per loop

    In [24]: %timeit pd.DataFrame(np.sum(df.values, axis=0), columns=['sum'])
    100 loops, best of 3: 2.47 ms per loop

I'm not sure about earlier versions, but with pandas 0.18.1 you can use the pandas.Series.to_frame method.

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])
    s = df.sum().to_frame(name='sum')
    type(s)  # pandas.core.frame.DataFrame

The name argument is optional and specifies the column name.
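
To illustrate what `name` changes, here is a small sketch with the same example data. Transposing with `.T` to get a single-row frame is an extra that the answer does not mention, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 3]], columns=['a', 'b', 'c'])

# With name given, the single column is labeled 'sum'.
named = df.sum().to_frame(name='sum')

# Without name, an unnamed Series yields a column labeled 0.
default_frame = df.sum().to_frame()

# Assumption beyond the answer: transposing gives one row labeled 'sum'
# with the original columns a, b, c.
one_row = named.T

print(named)
print(default_frame)
print(one_row)
```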


Source: https://habr.com/ru/post/1479883/

