Performance of numpy vs pandas

I work with fairly large data frames in pandas. Although I am not necessarily interested in squeezing every last bit of performance out of my code, I am extremely surprised by the significant performance gap I observe between different versions of the same task: pandas only, a hybrid version combining pandas + dict, and numpy only. I am not as fluent in pandas as I am in numpy, so I would like to know whether pandas is inherently slower than numpy for certain tasks (but more elegant when writing code), or whether the gap is partly caused by my poor use of pandas.

General problem

I have NxM matrices of floats at monthly frequency (where N and M are about 1000 and 5000, respectively). For exogenous reasons, these matrices are given as pandas DataFrames, one for each timestamp. For each date, I need to subtract, elementwise, the matrix at date-1 from the matrix at that date. Then I need to binarize the result so that an element is 1 if the elementwise difference is positive and 0 otherwise. Finally, for each date, I need to sum this binarized matrix along axis=1, i.e. along the rows. Henceforth, dict_MVAdj is a dictionary whose keys are dates and whose values are NxM DataFrames. The DataFrames may contain NaN values:

dict_MVAdj  -> {datetime.datetime(2003, 2, 1, 0, 0):
         s1     s2       s3
f1  10000.0  192.0   9421.0
f2   4000.0    NaN   1340.0,

datetime.datetime(2003, 3, 1, 0, 0):
         s1     s2       s3
f1  24400.0  102.0   9121.0
f2  14000.0    3.3   1340.0
}
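For anyone who wants to reproduce this, the two sample frames above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the sample, and `keySort` is assumed to be the chronologically sorted list of the dictionary keys (the post does not show how it is built).

```python
import datetime

import numpy as np
import pandas as pd

# Values copied from the sample above; np.nan marks the missing entry.
dict_MVAdj = {
    datetime.datetime(2003, 2, 1): pd.DataFrame(
        [[10000.0, 192.0, 9421.0], [4000.0, np.nan, 1340.0]],
        index=["f1", "f2"], columns=["s1", "s2", "s3"],
    ),
    datetime.datetime(2003, 3, 1): pd.DataFrame(
        [[24400.0, 102.0, 9121.0], [14000.0, 3.3, 1340.0]],
        index=["f1", "f2"], columns=["s1", "s2", "s3"],
    ),
}

# Assumption: keySort is the list of dates in chronological order.
keySort = sorted(dict_MVAdj)

# Elementwise difference between the second date and the first.
delta = dict_MVAdj[keySort[1]] - dict_MVAdj[keySort[0]]

# Comparisons against NaN evaluate to False, so the missing entry counts as 0.
bin_pos = (delta > 0).astype(int).sum(axis=1)
print(bin_pos)
```

Note that a NaN in either month simply drops out of the count, because `NaN > 0` is False.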

STRATEGY 1: numpy only

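import numpy as np

for i, key in enumerate(keySort):
    if i:  # skip the first date, which has no predecessor
        # .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
        dictDelta[key] = dict_MVAdj[keySort[i]].to_numpy() - dict_MVAdj[keySort[i-1]].to_numpy()
        dictBinPos[key] = np.where(dictDelta[key] > 0, 1, 0).sum(axis=1)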

STRATEGY 2: dict + pandas

for i, key in enumerate(keySort):
    if i:
        dictDelta[key] = dict_MVAdj[keySort[i]].subtract(dict_MVAdj[keySort[i-1]])
        dictBinPos[key] = pd.DataFrame((dictDelta[key] > 0).astype(int).sum(axis=1))

        # or alternatively this second way to binarize, but the performance is similar
        dictBinPos[key] = dictDelta[key].where(dictDelta[key] > 0, 0).divide(np.abs(dictDelta[key])).sum(axis=1)

STRATEGY 3: pandas only

First I concatenate the dictionary of DataFrames into a single DataFrame with a multi-index,

MVAdj = pd.concat(dict_MVAdj)               # index levels: (date, row label)
D_MVAdj = MVAdj.groupby(level=1).diff(1)    # month-over-month difference per row label
D_binPos = pd.DataFrame((D_MVAdj > 0).astype(int).sum(axis=1))
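Before comparing speed, it may be worth confirming that the numpy-only and pandas-only versions really compute the same thing. The sketch below does that on small random data; the frame sizes, the labels, the 10% NaN rate and the names `frames`, `numpy_counts` and `pandas_counts` are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2003-01-01", periods=4, freq="MS")
rows = ["f0", "f1", "f2", "f3", "f4"]
cols = ["s0", "s1", "s2", "s3", "s4", "s5"]

# Small random frames with roughly 10% NaN entries (illustrative data only).
frames = {}
for d in dates:
    values = rng.normal(size=(len(rows), len(cols)))
    values[rng.random(values.shape) < 0.1] = np.nan
    frames[d] = pd.DataFrame(values, index=rows, columns=cols)

keys = sorted(frames)

# Strategy 1: numpy only, one array of row counts per date (first date skipped).
numpy_counts = {}
for prev, cur in zip(keys, keys[1:]):
    delta = frames[cur].to_numpy() - frames[prev].to_numpy()
    numpy_counts[cur] = np.where(delta > 0, 1, 0).sum(axis=1)

# Strategy 3: pandas only, one Series indexed by (date, row label).
stacked = pd.concat(frames)
diffed = stacked.groupby(level=1).diff(1)
pandas_counts = (diffed > 0).astype(int).sum(axis=1)

# Both versions should agree on every date that has a predecessor.
for key in keys[1:]:
    assert np.array_equal(numpy_counts[key], pandas_counts.loc[key].to_numpy())
print("strategies agree on", len(keys) - 1, "dates")
```

One difference to keep in mind: the pandas-only version also returns rows for the first date, where the difference is NaN and the count is therefore 0, whereas the loop versions skip that date.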

Strategy 1 outperforms strategies 2 and 3 (by far). In units of time, strategy 1 takes 1, strategy 2 (both versions) takes 4-5, and strategy 3 takes 8-10.
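To check the relative timings on your own machine, a small harness along these lines could be used. It is a sketch under assumptions: the matrices are scaled down from the 1000x5000 case so it runs quickly, the helper names (`make_frames`, `numpy_only`, `dict_plus_pandas`, `pandas_only`) are hypothetical, and the ratios it prints will depend on the pandas and numpy versions and on the hardware.

```python
import timeit
from functools import partial

import numpy as np
import pandas as pd


def make_frames(n_dates=6, n_rows=100, n_cols=200, seed=0):
    """Build a dict of random DataFrames keyed by month start (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2003-01-01", periods=n_dates, freq="MS")
    return {d: pd.DataFrame(rng.normal(size=(n_rows, n_cols))) for d in dates}


def numpy_only(frames):
    """Strategy 1: convert to arrays, subtract, binarize, sum along rows."""
    keys = sorted(frames)
    out = {}
    for prev, cur in zip(keys, keys[1:]):
        delta = frames[cur].to_numpy() - frames[prev].to_numpy()
        out[cur] = np.where(delta > 0, 1, 0).sum(axis=1)
    return out


def dict_plus_pandas(frames):
    """Strategy 2: loop over the dict, but stay in pandas for the arithmetic."""
    keys = sorted(frames)
    out = {}
    for prev, cur in zip(keys, keys[1:]):
        delta = frames[cur].subtract(frames[prev])
        out[cur] = (delta > 0).astype(int).sum(axis=1)
    return out


def pandas_only(frames):
    """Strategy 3: concatenate into a multi-index frame and diff per row label."""
    diffed = pd.concat(frames).groupby(level=1).diff(1)
    return (diffed > 0).astype(int).sum(axis=1)


frames = make_frames()
timings = {}
for fn in (numpy_only, dict_plus_pandas, pandas_only):
    # partial binds the argument up front, so timeit only measures the call.
    timings[fn.__name__] = timeit.timeit(partial(fn, frames), number=5)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 5 runs")
```

Increasing `n_rows` and `n_cols` towards the real sizes should give a better picture of where the time goes, since fixed per-call overhead matters less on large frames.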


Source: https://habr.com/ru/post/1682877/

