Speeding up MSD computing in Python

This is a call to the community to find out if anyone has an idea to improve the speed of this MSD calculation implementation. This is mainly related to the implementation of this blog: http://damcb.com/mean-square-disp.html

Currently, the current implementation takes about 9 seconds for a 2D path of 5,000 points. This is really too much if you need to calculate many trajectories ...

I did not try to parallelize it (using multiprocess or joblib ), but I feel that creating new processes will be too difficult for such an algorithm.

Here is the code:

 import os import matplotlib import matplotlib.pyplot as plt import pandas as pd import numpy as np # Parameters N = 5000 max_time = 100 dt = max_time / N # Generate 2D brownian motion t = np.linspace(0, max_time, N) xy = np.cumsum(np.random.choice([-1, 0, 1], size=(N, 2)), axis=0) traj = pd.DataFrame({'t': t, 'x': xy[:,0], 'y': xy[:,1]}) print(traj.head()) # Draw motion ax = traj.plot(x='x', y='y', alpha=0.6, legend=False) # Set limits ax.set_xlim(traj['x'].min(), traj['x'].max()) ax.set_ylim(traj['y'].min(), traj['y'].max()) 

And the conclusion:

  txy 0 0.000000 -1 -1 1 0.020004 -1 0 2 0.040008 -1 -1 3 0.060012 -2 -2 4 0.080016 -2 -2 

enter image description here

 def compute_msd(trajectory, t_step, coords=['x', 'y']): tau = trajectory['t'].copy() shifts = np.floor(tau / t_step).astype(np.int) msds = np.zeros(shifts.size) msds_std = np.zeros(shifts.size) for i, shift in enumerate(shifts): diffs = trajectory[coords] - trajectory[coords].shift(-shift) sqdist = np.square(diffs).sum(axis=1) msds[i] = sqdist.mean() msds_std[i] = sqdist.std() msds = pd.DataFrame({'msds': msds, 'tau': tau, 'msds_std': msds_std}) return msds # Compute MSD msd = compute_msd(traj, t_step=dt, coords=['x', 'y']) print(msd.head()) # Plot MSD ax = msd.plot(x="tau", y="msds", logx=True, logy=True, legend=False) ax.fill_between(msd['tau'], msd['msds'] - msd['msds_std'], msd['msds'] + msd['msds_std'], alpha=0.2) 

And the conclusion:

  msds msds_std tau 0 0.000000 0.000000 0.000000 1 1.316463 0.668169 0.020004 2 2.607243 2.078604 0.040008 3 3.891935 3.368651 0.060012 4 5.200761 4.685497 0.080016 

enter image description here

And some profiling:

 %timeit msd = compute_msd(traj, t_step=dt, coords=['x', 'y']) 

Give this:

 1 loops, best of 3: 8.53 s per loop 

Any idea?

+5
source share
5 answers

The MSD calculations mentioned so far are all O (N ** 2), where N is the number of time steps. Using FFT, this can be reduced to O (N * log (N)). See this question and answer for an explanation and implementation in python.

EDIT: A small landmark (I also added this test to this answer ): Create a path using

 r = np.cumsum(np.random.choice([-1., 0., 1.], size=(N, 3)), axis=0) 

At N = 100.000 we get

 $ %timeit msd_straight_forward(r) 1 loops, best of 3: 2min 1s per loop $ %timeit msd_fft(r) 10 loops, best of 3: 253 ms per loop 
+2
source

He did some line-by-line profiling, and it seems that pandas is making it slow. This pure numpy version is about 14 times faster:

 def compute_msd_np(xy, t, t_step): shifts = np.floor(t / t_step).astype(np.int) msds = np.zeros(shifts.size) msds_std = np.zeros(shifts.size) for i, shift in enumerate(shifts): diffs = xy[:-shift if shift else None] - xy[shift:] sqdist = np.square(diffs).sum(axis=1) msds[i] = sqdist.mean() msds_std[i] = sqdist.std(ddof=1) msds = pd.DataFrame({'msds': msds, 'tau': t, 'msds_std': msds_std}) return msds 
+3
source

Adding to moarningsun answer above:

  • you can speed up the use of numexpr
  • if you plan on logging the MSD anyway, you don't need to calculate it every time

     import numpy as np import numexpr def logSpaced(L, pointsPerDecade=15): """Generate an array of log spaced integers smaller than L""" nbdecades = np.log10(L) return np.unique(np.logspace( start=0, stop=nbdecades, num=nbdecades * pointsPerDecade, base=10, endpoint=False ).astype(int)) def compute_msd(xy, pointsPerDecade=15): dts = logSpaced(len(xy), pointsPerDecade) msd = np.zeros(len(idts)) msd_std = np.zeros(len(idts)) for i, dt in enumerate(dts): sqdist = numexpr.evaluate( '(ab)**2', {'a': xy[:-dt], 'b':xy[dt:]} ).sum(axis=-1) msd[i] = sqdist.mean() msd_std[i] = sqdist.std(ddof=1) msds = pd.DataFrame({'msds': msd, 'tau': dt, 'msds_std': msd_std}) return msds 
+3
source

With comments, I developed this function:

 def get_msd(traj, dt, with_nan=True): shifts = np.arange(1, len(traj), dtype='int') msd = np.empty((len(shifts), 2), dtype='float') msd[:] = np.nan msd[:, 1] = shifts * dt for i, shift in enumerate(shifts): diffs = traj[:-shift] - traj[shift:] if with_nan: diffs = diffs[~np.isnan(diffs).any(axis=1)] diffs = np.square(diffs).sum(axis=1) if len(diffs) > 0: msd[i, 0] = np.mean(diffs) msd = pd.DataFrame(msd) msd.columns = ["msd", "delay"] msd.set_index('delay', drop=True, inplace=True) msd.dropna(inplace=True) return msd 

With the following features:

  • A numpy array is required as input to the trajectory.
  • It returns pandas.DataFrame with almost no overlay.
  • with_nan allows with_nan to handle a trajectory containing NaN values, but it adds a lot of overhead (over 100%), so I put it as a function parameter.
  • It can deal with multi-dimensional trajectories (1D, 2D, 3D, etc.)

Some profiling:

 $ print(traj.shape) (2108, 2) $ %timeit get_msd(traj, with_nan=True, dt=0.1) 10 loops, best of 3: 143 ms per loop $ %timeit get_msd(traj, with_nan=False, dt=0.1) 10 loops, best of 3: 68 ms per loop 
+1
source

This may not be the topic, but the MSD should not be calculated as an average, as on line 37:

 msds[i] = sqdist.mean() 

Taking as mean=N

You should divide by:

 msds[i] = sqdist/N-1 // for lag1 

Then:

 msds[i] = sqdist/N-2 // for lag2 .... msds[i] = sqdist/Nn // for lag n 

And so on.

As a result, you will not get the standard deviation, just MSD for one path

0
source

Source: https://habr.com/ru/post/1233128/


All Articles