I did some initial testing, and it looks like most of your time is spent constructing the interpolation functions themselves. Vectorization alone doesn't seem likely to speed that up by a ton, but the problem does lend itself to parallelization, which does help. Here is an example.
    import numpy as np
    from scipy import interpolate
    import timeit
    import multiprocessing

    def myfunc(arg):
        x, y = arg
        return interpolate.PchipInterpolator(x, y, extrapolate=True)

    p = multiprocessing.Pool(processes=8)

    # Synthetic test data: 2 x 72 curves, each sampled on its own slightly
    # shifted 2001-point axis.
    axes_have = np.ones((2, 72, 2001))
    axes_have *= np.linspace(0, 100, 2001)[None, None, :]
    axes_have += np.linspace(-0.3, 0.3, 144).reshape((2, 72))[:, :, None]

    arr = np.sin(axes_have)
    arr *= np.random.random((2, 72))[:, :, None]

    axis_want = np.linspace(0, 100, 201)
    arr_ip = np.zeros((2, 72, 201))

    # Serial version: build and evaluate each interpolator in a double loop.
    s_time1 = timeit.default_timer()
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            ip_func = interpolate.PchipInterpolator(axes_have[i, j, :],
                                                    arr[i, j, :],
                                                    extrapolate=True)
            arr_ip[i, j, :] = ip_func(axis_want)
    elapsed1 = timeit.default_timer() - s_time1

    # Parallel version: build the interpolators in a process pool, then
    # evaluate them in the parent.
    s_time2 = timeit.default_timer()
    flatten_have = [y for x in axes_have for y in x]
    flatten_arr = [y for x in arr for y in x]
    interp_time = timeit.default_timer()
    funcs = p.map(myfunc, zip(flatten_have, flatten_arr))
    interp_elapsed = timeit.default_timer() - interp_time
    arr_ip = np.asarray([func(axis_want) for func in funcs]).reshape(2, 72, 201)
    elapsed2 = timeit.default_timer() - s_time2

    print("Elapsed 1: {}".format(elapsed1))
    print("Elapsed 2: {}".format(elapsed2))
    print("Elapsed interp: {}".format(interp_elapsed))
Typical results (note that building the interpolators accounts for almost the entire parallelized run time, and that I have 2 cores, so the total should come out to roughly the original time divided by the number of cores):
    Elapsed 1: 10.4025919437
    Elapsed 2: 5.11409401894
    Elapsed interp: 5.10987687111
Don't get me wrong, there may be an algorithmic way to do this faster, but this seems like the easiest way to get an immediate speedup, since the problem is embarrassingly parallel.
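If you don't need the interpolator objects afterwards, one further tweak is to evaluate inside the worker as well, so that only the small result arrays get pickled back to the parent instead of the full PchipInterpolator objects. A minimal sketch along those lines (the `interp_one` helper, the chunksize, and the simplified synthetic data are my own choices, not part of the code above):

```python
import numpy as np
from multiprocessing import Pool
from scipy import interpolate


def interp_one(args):
    # Build the interpolator AND evaluate it in the worker process, so only
    # the (small) interpolated array is sent back to the parent.
    x, y, x_new = args
    f = interpolate.PchipInterpolator(x, y, extrapolate=True)
    return f(x_new)


if __name__ == "__main__":
    # Simplified synthetic data: 2 x 72 curves on a shared 2001-point axis.
    axes_have = np.ones((2, 72, 2001)) * np.linspace(0, 100, 2001)
    arr = np.sin(axes_have)
    axis_want = np.linspace(0, 100, 201)

    # One task per curve: (x, y, evaluation axis).
    tasks = [(x, y, axis_want)
             for x, y in zip(axes_have.reshape(-1, 2001),
                             arr.reshape(-1, 2001))]

    with Pool(processes=2) as pool:
        rows = pool.map(interp_one, tasks, chunksize=16)

    arr_ip = np.asarray(rows).reshape(2, 72, 201)
    print(arr_ip.shape)  # → (2, 72, 201)
```

The chunksize is worth tuning: with 144 cheap-ish tasks, batching them reduces the per-task IPC overhead.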