Assignment to a Pandas DataFrame with float32 and float64 is slow

Assignment to a Pandas DataFrame is rather slow for some combinations of float32 and float64 dtypes, as I show below.

The code below sets up a DataFrame, runs a NumPy/SciPy computation on part of the data, creates a new DataFrame by copying the old one, and assigns the result of the computation to the new DataFrame:

    import pandas as pd
    import numpy as np
    from scipy.signal import lfilter

    N = 1000
    M = 1000

    def f(dtype1, dtype2):
        coi = [str(m) for m in range(M)]
        df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World']
                           for n in range(N)],
                          columns=coi + ['A', 'B'], dtype=dtype1)
        Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
        Y = Y.astype(dtype2)
        new = pd.DataFrame(df, copy=True)
        print(new.iloc[0, 0].dtype)
        print(Y.dtype)
        new.ix[:, coi] = Y    # This statement is considerably slow
        print(new.iloc[0, 0].dtype)

    from time import time

    dtypes = [np.float32, np.float64]
    for dtype1 in dtypes:
        for dtype2 in dtypes:
            print('-' * 10)
            start_time = time()
            f(dtype1, dtype2)
            print(time() - start_time)

Timing results:

    ----------
    float32
    float32
    float64
    10.1998147964
    ----------
    float32
    float64
    float64
    10.2371120453
    ----------
    float64
    float32
    float64
    0.864870071411
    ----------
    float64
    float64
    float64
    0.866265058517

The critical line here is new.ix[:, coi] = Y : for some combinations it is ten times slower than for others.
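Not from the original post: a minimal sketch for isolating the cost of that one statement. It assumes a recent pandas where .ix has been removed, so it uses .loc, the closest non-deprecated way to write the same bulk assignment; the variable names mirror the question's code.

    # Sketch (not from the original question): time only the bulk assignment.
    import time

    import numpy as np
    import pandas as pd
    from scipy.signal import lfilter

    N, M = 1000, 1000
    coi = [str(m) for m in range(M)]

    # Same shape of numeric data as in the question, built directly as float32.
    df = pd.DataFrame(np.random.rand(N, M).astype(np.float32), columns=coi)
    Y = lfilter([1], [0.5, 0.5], df[coi]).astype(np.float32)
    new = df.copy()

    start = time.perf_counter()
    new.loc[:, coi] = Y               # the suspect statement, via .loc
    print('assignment took', time.perf_counter() - start, 's')
    print(new.dtypes.value_counts())  # see which dtype the columns ended up with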

I can understand that a recast requires some overhead when float64 data is assigned into a float32 DataFrame, but why is the overhead so dramatic?

In addition, the float32-to-float32 assignment is also slow, and the result ends up as float64, which also bothers me.
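One way to sidestep the block assignment entirely, not mentioned in the question, is to build the numeric part as its own DataFrame with the dtype you want and concatenate the non-numeric columns back on. A minimal sketch, assuming a recent pandas (so no .ix) and the same column layout as above:

    # Sketch of a possible workaround (not from the original post).
    import numpy as np
    import pandas as pd
    from scipy.signal import lfilter

    N, M = 1000, 1000
    coi = [str(m) for m in range(M)]
    df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                      columns=coi + ['A', 'B'])

    # Filter the numeric columns and keep the result in the desired dtype.
    Y = lfilter([1], [0.5, 0.5], df[coi].to_numpy(dtype=np.float32)).astype(np.float32)

    # Assemble the result from parts instead of assigning Y into a copy of df,
    # so no existing float block has to be rewritten or upcast.
    new = pd.concat(
        [pd.DataFrame(Y, columns=coi, index=df.index), df[['A', 'B']]],
        axis=1,
    )
    print(new.dtypes.value_counts())  # the coi columns keep Y's dtype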

1 answer

Single-column assignments do not change the dtype, and looping over the columns with a for-loop is fast enough for assignments that do not change the dtype, for both float32 and float64. For dtype-changing assignments, performance is roughly twice as bad as the worst case of the multi-column assignment.

    import pandas as pd
    import numpy as np
    from scipy.signal import lfilter

    N = 1000
    M = 1000

    def f(dtype1, dtype2):
        coi = [str(m) for m in range(M)]
        df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World']
                           for n in range(N)],
                          columns=coi + ['A', 'B'], dtype=dtype1)
        Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
        Y = Y.astype(dtype2)
        new = df.copy()
        print(new.iloc[0, 0].dtype)
        print(Y.dtype)
        for n, column in enumerate(coi):   # For-loop over the columns is new!
            new.ix[:, column] = Y[:, n]
        print(new.iloc[0, 0].dtype)

    from time import time

    dtypes = [np.float32, np.float64]
    for dtype1 in dtypes:
        for dtype2 in dtypes:
            print('-' * 10)
            start_time = time()
            f(dtype1, dtype2)
            print(time() - start_time)

Result:

    ----------
    float32
    float32
    float32
    0.809890985489
    ----------
    float32
    float64
    float64
    21.4767119884
    ----------
    float64
    float32
    float32
    20.5611870289
    ----------
    float64
    float64
    float64
    0.765362977982
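A side note on portability, not part of the original answer: .ix was deprecated in pandas 0.20 and removed in 1.0, so on a current pandas the column-wise loop has to be written with .loc. Whether each per-column assignment preserves float32 or upcasts it can depend on the pandas version, so it is worth checking the dtypes afterwards. A small self-contained sketch:

    # Sketch (not from the original answer): the column-wise loop with .loc.
    import numpy as np
    import pandas as pd

    N, M = 1000, 10
    coi = [str(m) for m in range(M)]
    new = pd.DataFrame(np.zeros((N, M), dtype=np.float32), columns=coi)
    Y = np.random.rand(N, M)              # float64 values to write back

    for n, column in enumerate(coi):      # column-wise assignment, as in the answer
        new.loc[:, column] = Y[:, n]

    print(new.dtypes.value_counts())      # check whether float32 survived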

Source: https://habr.com/ru/post/1242403/

