I want to calculate quantiles/percentiles on a pandas DataFrame. However, the function is very slow. I repeated the calculation with NumPy and found that it takes almost 10,000 times longer in pandas!
Does anyone know why this is? Should I calculate the quantiles with NumPy and then build a new DataFrame from the result, instead of using pandas?
See my code below:
import time
import numpy as np
import pandas as pd

# Quantile levels and a 10000 x 4 array of random data
q = np.array([0.1, 0.4, 0.6, 0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

# Time the row-wise quantiles in pandas
time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2 - time1) * 1000.0))

# Time the same calculation in NumPy (percentiles are quantiles * 100)
time1 = time.time()
numpy_quantiles = np.percentile(data, q * 100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2 - time1) * 1000.0))

# Check that both approaches give the same values
print((pandas_quantiles.values == numpy_quantiles).all())
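For reference, the NumPy workaround I have in mind would look roughly like the sketch below. The helper name `rowwise_quantiles` is made up for illustration. It assumes that all columns are numeric and contain no NaNs (np.percentile propagates NaNs rather than skipping them), and that `DataFrame.to_numpy` is available (pandas 0.24 or later).

```python
import numpy as np
import pandas as pd


def rowwise_quantiles(df, q):
    """Compute per-row quantiles with NumPy and wrap them in a DataFrame.

    Hypothetical helper, not a pandas API. The result is laid out like
    df.quantile(q, axis=1): one row per quantile level, one column per
    row label of the input.
    """
    q = np.asarray(q, dtype=float)
    # np.percentile expects levels in the range 0..100, hence q * 100
    values = np.percentile(df.to_numpy(), q * 100, axis=1)
    return pd.DataFrame(values, index=q, columns=df.index)


df = pd.DataFrame(np.random.randn(10000, 4), columns=['a', 'b', 'c', 'd'])
result = rowwise_quantiles(df, [0.1, 0.4, 0.6, 0.9])
print(result.shape)  # (4, 10000)
```

This keeps the quantile levels as the index and the original row labels as the columns, so it should be usable wherever the output of df.quantile(q, axis=1) was used.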
Johan