Speeding up MSD computing in Python

Question

I'm asking the community for ideas on how to speed up this MSD (mean squared displacement) calculation. It is based on the implementation in this blog post: http://damcb.com/mean-square-disp.html

The current implementation takes about 9 seconds for a 2D trajectory of 5,000 points. That is far too slow if you need to process many trajectories.

I did not try to parallelize it (using multiprocess or joblib), but I suspect that creating new processes would be too heavy for this kind of algorithm.

Here is the code:

    import os
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    # Parameters
    N = 5000
    max_time = 100
    dt = max_time / N

    # Generate 2D brownian motion
    t = np.linspace(0, max_time, N)
    xy = np.cumsum(np.random.choice([-1, 0, 1], size=(N, 2)), axis=0)
    traj = pd.DataFrame({'t': t, 'x': xy[:,0], 'y': xy[:,1]})
    print(traj.head())

    # Draw motion
    ax = traj.plot(x='x', y='y', alpha=0.6, legend=False)

    # Set limits
    ax.set_xlim(traj['x'].min(), traj['x'].max())
    ax.set_ylim(traj['y'].min(), traj['y'].max())

And the output:

              t  x  y
    0  0.000000 -1 -1
    1  0.020004 -1  0
    2  0.040008 -1 -1
    3  0.060012 -2 -2
    4  0.080016 -2 -2

    def compute_msd(trajectory, t_step, coords=['x', 'y']):
        tau = trajectory['t'].copy()
        # Lag expressed as a number of rows; the np.int alias was removed
        # from recent NumPy releases, so the builtin int is used instead
        shifts = np.floor(tau / t_step).astype(int)
        msds = np.zeros(shifts.size)
        msds_std = np.zeros(shifts.size)

        for i, shift in enumerate(shifts):
            diffs = trajectory[coords] - trajectory[coords].shift(-shift)
            sqdist = np.square(diffs).sum(axis=1)
            msds[i] = sqdist.mean()
            msds_std[i] = sqdist.std()

        msds = pd.DataFrame({'msds': msds, 'tau': tau, 'msds_std': msds_std})
        return msds

    # Compute MSD
    msd = compute_msd(traj, t_step=dt, coords=['x', 'y'])
    print(msd.head())

    # Plot MSD
    ax = msd.plot(x="tau", y="msds", logx=True, logy=True, legend=False)
    ax.fill_between(msd['tau'], msd['msds'] - msd['msds_std'],
                    msd['msds'] + msd['msds_std'], alpha=0.2)

And the output:

           msds  msds_std       tau
    0  0.000000  0.000000  0.000000
    1  1.316463  0.668169  0.020004
    2  2.607243  2.078604  0.040008
    3  3.891935  3.368651  0.060012
    4  5.200761  4.685497  0.080016

And some profiling:

    %timeit msd = compute_msd(traj, t_step=dt, coords=['x', 'y'])

This gives:

    1 loops, best of 3: 8.53 s per loop

Any ideas?

+5

python-3.x numpy pandas physics

Hadim Oct 7 '15 at 9:12


5 answers

I did some line-by-line profiling, and it seems that pandas is what makes it slow. This pure numpy version is about 14 times faster:

    import numpy as np
    import pandas as pd

    def compute_msd_np(xy, t, t_step):
        # The np.int alias was removed from recent NumPy releases; use int
        shifts = np.floor(t / t_step).astype(int)
        msds = np.zeros(shifts.size)
        msds_std = np.zeros(shifts.size)

        for i, shift in enumerate(shifts):
            # Slicing replaces DataFrame.shift; a zero shift keeps the full array
            diffs = xy[:-shift if shift else None] - xy[shift:]
            sqdist = np.square(diffs).sum(axis=1)
            msds[i] = sqdist.mean()
            msds_std[i] = sqdist.std(ddof=1)

        msds = pd.DataFrame({'msds': msds, 'tau': t, 'msds_std': msds_std})
        return msds

+3

user2379410 Oct 7 '15 at 11:36


Adding to moarningsun's answer above:

you can speed things up by using numexpr

if you plan to plot the MSD on a log scale anyway, you don't need to compute it for every lag time

    import numpy as np
    import pandas as pd
    import numexpr

    def logSpaced(L, pointsPerDecade=15):
        """Generate an array of log spaced integers smaller than L"""
        nbdecades = np.log10(L)
        return np.unique(np.logspace(
            start=0, stop=nbdecades,
            num=int(nbdecades * pointsPerDecade),  # num must be an integer
            base=10, endpoint=False
        ).astype(int))

    def compute_msd(xy, pointsPerDecade=15):
        dts = logSpaced(len(xy), pointsPerDecade)
        msd = np.zeros(len(dts))
        msd_std = np.zeros(len(dts))
        for i, dt in enumerate(dts):
            # Squared coordinate differences for lag dt, evaluated by numexpr
            sqdist = numexpr.evaluate(
                '(a-b)**2',
                {'a': xy[:-dt], 'b': xy[dt:]}
            ).sum(axis=-1)
            msd[i] = sqdist.mean()
            msd_std[i] = sqdist.std(ddof=1)
        # tau is the lag expressed in number of time steps
        msds = pd.DataFrame({'msds': msd, 'tau': dts, 'msds_std': msd_std})
        return msds

+3

user2087764 Oct 7 '15 at 13:49


Based on the comments, I came up with this function:

    def get_msd(traj, dt, with_nan=True):
        shifts = np.arange(1, len(traj), dtype='int')
        msd = np.empty((len(shifts), 2), dtype='float')
        msd[:] = np.nan
        msd[:, 1] = shifts * dt

        for i, shift in enumerate(shifts):
            diffs = traj[:-shift] - traj[shift:]
            if with_nan:
                # Drop pairs where any coordinate is NaN
                diffs = diffs[~np.isnan(diffs).any(axis=1)]
            diffs = np.square(diffs).sum(axis=1)

            if len(diffs) > 0:
                msd[i, 0] = np.mean(diffs)

        msd = pd.DataFrame(msd)
        msd.columns = ["msd", "delay"]
        msd.set_index('delay', drop=True, inplace=True)
        msd.dropna(inplace=True)
        return msd

It has the following features:

It takes a numpy array as the input trajectory.
It returns a pandas.DataFrame with almost no overhead.
with_nan lets the function handle a trajectory containing NaN values, but it adds a lot of overhead (over 100%), so I made it a function parameter.
It can deal with multi-dimensional trajectories (1D, 2D, 3D, etc.)
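
The NaN handling described above can be illustrated on a tiny array. This is only a sketch of the masking step for a single lag, with a made-up five-point trajectory (`traj`, `shift` and `valid` are illustrative names, not part of the function's interface):

```python
import numpy as np

# Hypothetical 2D trajectory with one missing position (row 2)
traj = np.array([[0., 0.], [1., 0.], [np.nan, np.nan], [3., 0.], [4., 0.]])
shift = 1

# Displacements between points separated by `shift` steps
diffs = traj[:-shift] - traj[shift:]

# A pair is valid only if none of its coordinates is NaN
valid = ~np.isnan(diffs).any(axis=1)

# Squared displacement of the valid pairs, then their mean
sqdist = np.square(diffs[valid]).sum(axis=1)
msd_lag1 = sqdist.mean()
print(valid, msd_lag1)
```

Every pair that touches the missing point is discarded, so the mean is taken over the remaining pairs only. The boolean mask and the extra copy it triggers are a plausible source of the overhead mentioned above.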

Some profiling:

    $ print(traj.shape)
    (2108, 2)

    $ %timeit get_msd(traj, with_nan=True, dt=0.1)
    10 loops, best of 3: 143 ms per loop

    $ %timeit get_msd(traj, with_nan=False, dt=0.1)
    10 loops, best of 3: 68 ms per loop

+1

Hadim Oct 21 '15 at 10:21


This may be off topic, but the MSD should not be calculated as a plain mean, as on line 37:

    msds[i] = sqdist.mean()

The mean divides by N.

Instead, you should divide by the number of pairs available at each lag:

    msds[i] = sqdist.sum() / (N - 1)  # for lag 1

Then:

    msds[i] = sqdist.sum() / (N - 2)  # for lag 2
    ...
    msds[i] = sqdist.sum() / (N - n)  # for lag n

And so on.

This way you do not get a standard deviation, just the MSD of a single trajectory.
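
Whether the plain mean really divides by N depends on how the NaN rows produced by `shift` are summed, and that behaviour differs between pandas versions. The sketch below uses a made-up five-point trajectory and assumes a pandas version in which `DataFrame.sum` accepts `min_count` and turns an all-NaN row into 0 by default:

```python
import numpy as np
import pandas as pd

# Hypothetical 1D motion stored in two columns, N = 5 points
xy = pd.DataFrame({"x": [0., 1., 3., 6., 10.], "y": [0., 0., 0., 0., 0.]})
n_points = len(xy)
lag = 2

# The last `lag` rows have no partner and become NaN
diffs = xy - xy.shift(-lag)

# Default sum: all-NaN rows are summed to 0.0 and stay in the mean
sqdist_default = np.square(diffs).sum(axis=1)
mean_over_n = sqdist_default.mean()

# min_count=1 keeps those rows as NaN, and mean() then skips them
sqdist_strict = np.square(diffs).sum(axis=1, min_count=1)
mean_over_pairs = sqdist_strict.mean()

print(mean_over_n, mean_over_pairs)
```

Under that assumption the first value equals the summed squared displacements divided by N, and the second divides by N - lag. The pure numpy versions in the other answers slice the arrays instead of shifting, so their `mean()` already runs over the N - lag valid pairs.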

0

Jonathan pacheco Apr 13 '17 at 15:11


thomasfermi · Accepted Answer · 2015-12-11T14:10:11+0000

The MSD calculations mentioned so far are all O(N**2), where N is the number of time steps. Using an FFT, this can be reduced to O(N*log(N)). See this question and answer for an explanation and an implementation in Python.

EDIT: A small benchmark (I also added this benchmark to this answer): generate a trajectory using

    r = np.cumsum(np.random.choice([-1., 0., 1.], size=(N, 3)), axis=0)

For N = 100,000 we get

    $ %timeit msd_straight_forward(r)
    1 loops, best of 3: 2min 1s per loop

    $ %timeit msd_fft(r)
    10 loops, best of 3: 253 ms per loop
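
Since the linked answer is not reproduced here, the following is only a sketch of how such an FFT-based computation can look, not necessarily the code behind the link. It assumes the usual decomposition MSD(m) = S1(m) - 2*S2(m), where S2 is the position autocorrelation (computed with a zero-padded FFT) and S1 collects the squared positions. The helper name `autocorr_fft` is an assumption, and `msd_straight_forward` is included only as an O(N**2) reference:

```python
import numpy as np

def autocorr_fft(x):
    """Autocorrelation of a 1D signal, normalised by the number of pairs."""
    n = len(x)
    # Zero-pad to 2n so the circular correlation equals the linear one
    f = np.fft.fft(x, n=2 * n)
    res = np.fft.ifft(f * f.conjugate())[:n].real
    return res / (n - np.arange(n))

def msd_fft(r):
    """MSD for all lags 0..N-1 of an (N, dim) trajectory in O(N log N)."""
    n = len(r)
    # Squared norm of each position, with a trailing 0 to simplify indexing
    d = np.append(np.square(r).sum(axis=1), 0.0)
    # S2: sum of the per-coordinate autocorrelations
    s2 = sum(autocorr_fft(r[:, i]) for i in range(r.shape[1]))
    # S1: running sum of squared norms over the pairs available at lag m
    q = 2.0 * d.sum()
    s1 = np.zeros(n)
    for m in range(n):
        q = q - d[m - 1] - d[n - m]
        s1[m] = q / (n - m)
    return s1 - 2.0 * s2

def msd_straight_forward(r):
    """Reference O(N**2) computation, used here only for checking."""
    n = len(r)
    msd = np.zeros(n)
    for m in range(1, n):
        diffs = r[m:] - r[:-m]
        msd[m] = np.square(diffs).sum(axis=1).mean()
    return msd

# Small check on a random walk: both versions should agree
rng = np.random.default_rng(0)
r = np.cumsum(rng.choice([-1., 0., 1.], size=(500, 3)), axis=0)
print(np.allclose(msd_fft(r), msd_straight_forward(r)))
```

Both functions return one value per lag, including lag 0, so the FFT version can be checked against the direct one on a short trajectory before being trusted on long ones.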
