Performance of numpy vs pandas

Question

I work with fairly large data frames in pandas. Although I am not necessarily trying to squeeze every bit of performance out of my code, I am very surprised by the significant performance gap I observe between different versions of the same task: pandas only, a hybrid version that combines pandas + dict, and numpy only. I am less fluent in pandas than in numpy, so I would like to know whether pandas is inherently slower than numpy for certain tasks (but more elegant to write), or whether the gap is partly caused by my sub-optimal use of pandas.

General problem

I have NxM matrices of floats at a monthly frequency (N and M are about 1000 and 5000, respectively). For exogenous reasons, these matrices are stored as pandas dataframes, one per timestamp. For each date, I need to subtract the previous date's matrix (date-1) from the current one, element-wise. Then I need to binarize the result: 1 where the element-wise difference is positive, 0 otherwise. Finally, for each date, I need to sum this binarized matrix along axis=1, i.e. along the rows. In what follows, dict_MVAdj is a dictionary whose keys are dates and whose values are NxM dataframes. The dataframes may contain NaN values:

dict_MVAdj -> {datetime.datetime(2003, 2, 1, 0, 0):
         s1     s2       s3
f1  10000.0  192.0   9421.0
f2   4000.0    NaN   1340.0,

datetime.datetime(2003, 3, 1, 0, 0):
         s1     s2       s3
f1  24400.0  102.0   9121.0
f2  14000.0    3.3   1340.0
}
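
For experimentation, the sample above can be rebuilt as real objects. This is a minimal sketch: the values are the ones shown in the question, while keySort is not defined anywhere in the post and is assumed here to be the chronologically sorted list of keys. The last lines walk through the task once (difference, binarize, row sums) for the March 2003 frame.

```python
import datetime

import numpy as np
import pandas as pd

# The two sample frames from the question, keyed by month.
dict_MVAdj = {
    datetime.datetime(2003, 2, 1): pd.DataFrame(
        [[10000.0, 192.0, 9421.0], [4000.0, np.nan, 1340.0]],
        index=["f1", "f2"], columns=["s1", "s2", "s3"],
    ),
    datetime.datetime(2003, 3, 1): pd.DataFrame(
        [[24400.0, 102.0, 9121.0], [14000.0, 3.3, 1340.0]],
        index=["f1", "f2"], columns=["s1", "s2", "s3"],
    ),
}

# Assumption: keySort is the chronologically sorted list of dates.
keySort = sorted(dict_MVAdj)

# Worked example for March 2003: difference, binarize, row sums.
delta = dict_MVAdj[keySort[1]] - dict_MVAdj[keySort[0]]
row_counts = (delta > 0).astype(int).sum(axis=1)
print(row_counts)
```

In row f2, the s2 difference is NaN (3.3 minus NaN) and the s3 difference is 0, so only s1 counts as a positive change.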

STRATEGY 1: numpy only

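import numpy as np

dictDelta, dictBinPos = {}, {}
for i, key in enumerate(keySort):  # keySort: sorted dates (assumed)
    if i:  # the first date has no predecessor
        # .values gives the underlying ndarray (as_matrix() in older pandas)
        dictDelta[key] = dict_MVAdj[keySort[i]].values - dict_MVAdj[keySort[i-1]].values
        dictBinPos[key] = np.where(dictDelta[key] > 0, 1, 0).sum(axis=1)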
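
One detail worth checking in the numpy version is how NaN entries are treated. The sketch below uses made-up values (not data from the question) to show that a comparison with NaN evaluates to False, so a NaN difference is binarized to 0 rather than propagating into the row sum.

```python
import numpy as np

# Made-up values: one positive, one NaN, one negative and one zero difference.
prev = np.array([[1.0, np.nan, 5.0, 2.0]])
curr = np.array([[3.0, 4.0, 1.0, 2.0]])

delta = curr - prev                  # [[2., nan, -4., 0.]]
binary = np.where(delta > 0, 1, 0)   # NaN > 0 is False, so NaN maps to 0
counts = binary.sum(axis=1)
print(binary, counts)
```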

STRATEGY 2: dict + pandas

import numpy as np
import pandas as pd

for i, key in enumerate(keySort):
    if i:  # the first date has no predecessor
        dictDelta[key] = dict_MVAdj[keySort[i]].subtract(dict_MVAdj[keySort[i-1]])
        dictBinPos[key] = pd.DataFrame((dictDelta[key] > 0).astype(int).sum(axis=1))

        # or alternatively this second way to binarize, but the performance is similar
        dictBinPos[key] = dictDelta[key].where(dictDelta[key] > 0, 0).divide(np.abs(dictDelta[key])).sum(axis=1)
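
The two binarization variants in strategy 2 can be compared on a tiny frame. This is only a sketch with made-up differences (not data from the question), chosen to cover the positive, negative, zero and NaN cases; the names counts_mask and counts_div are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up differences covering the positive, negative, zero and NaN cases.
delta = pd.DataFrame(
    [[2.0, -1.0, 0.0], [np.nan, 5.0, 3.0]],
    index=["f1", "f2"], columns=["s1", "s2", "s3"],
)

# Variant 1: boolean mask cast to int (NaN > 0 is False).
counts_mask = (delta > 0).astype(int).sum(axis=1)

# Variant 2: keep positive values, then divide by |delta|.
# 0/0 and 0/NaN give NaN, which sum() skips by default (skipna=True).
counts_div = delta.where(delta > 0, 0).divide(np.abs(delta)).sum(axis=1)

print(counts_mask.tolist(), counts_div.tolist())
```

Both variants give the same counts here, although the second returns floats and does an extra element-wise division over the whole frame.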

STRATEGY 3: pandas only

Concatenating the dataframes into a single DataFrame:

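import pandas as pd

# Stack all dates into one frame with a (date, row label) MultiIndex
MVAdj = pd.concat(dict_MVAdj)
# Difference against the previous date within each row label
D_MVAdj = MVAdj.groupby(level=1).diff(1)
D_binPos = pd.DataFrame((D_MVAdj > 0).astype(int).sum(axis=1))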
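
Before comparing timings, it may help to confirm that the numpy and pandas-only versions produce the same counts. The sketch below does this on small random frames. The sizes, seed, labels and variable names are arbitrary choices for illustration, and the frames are concatenated in date order so that diff(1) compares each date with its predecessor.

```python
import datetime

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = [datetime.datetime(2003, m, 1) for m in (1, 2, 3)]
frames = {}
for d in dates:
    values = rng.normal(size=(4, 5))
    values[rng.random((4, 5)) < 0.1] = np.nan   # sprinkle a few NaNs
    frames[d] = pd.DataFrame(values, index=list("abcd"), columns=list("vwxyz"))

# Strategy 1 (numpy): one result per date, skipping the first date.
np_counts = {}
for i in range(1, len(dates)):
    diff = frames[dates[i]].to_numpy() - frames[dates[i - 1]].to_numpy()
    np_counts[dates[i]] = np.where(diff > 0, 1, 0).sum(axis=1)

# Strategy 3 (pandas): stack in date order, diff within each row label.
stacked = pd.concat({d: frames[d] for d in dates})
pd_counts = (stacked.groupby(level=1).diff(1) > 0).astype(int).sum(axis=1)

# Compare date by date; xs() selects all rows of one date from the MultiIndex.
agree = all(
    np.array_equal(pd_counts.xs(d, level=0).to_numpy(), np_counts[d])
    for d in dates[1:]
)
# The pandas result also has rows for the first date; diff is NaN there,
# so those counts are all zero.
first_date_counts = pd_counts.xs(dates[0], level=0)
print(agree, first_date_counts.tolist())
```

Unlike strategies 1 and 2, the pandas-only result keeps an all-zero block for the first date, which should be dropped before comparing outputs.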

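Strategies 1, 2 and 3 [...] ([...]). [...], strategy 1 [...], strategy 2 ([...]) 4-5 [...], strategy 3 8-10 [...]. [...] strategy 3, [...], where ([...], where) [...] pandas.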

python numpy pandas

SirC 02 . '17 16:35
