Effective 1D linear regression for each element of a 3D numpy array

Question

Effective 1D linear regression for each element of a 3D numpy array

I have three-dimensional stacks of masked arrays. I would like to perform linear regression for the values in each row, col (spatial index) along the 0 axis (time). The sizes of these stacks vary, but a typical shape may be (50, 2000, 2000). My spatially limited, but temporary tight test case has the following dimensions:

stack.ma_stack.shape

(1461, 390, 327)

I did a quick test loop on each line, col:

 from scipy.stats.mstats import linregress #Ordinal dates x = stack.date_list_o #Note: idx should be row, col def sample_lstsq(idx): b = stack.ma_stack[:, idx[0], idx[1]] #Note, this is masked stats version slope, intercept, r_value, p_value, std_err = linregress(x, b) return slope out = np.zeros_like(stack.ma_stack[0]) for row in np.arange(stack.ma_stack.shape[1]): for col in np.arange(stack.ma_stack.shape[2]): out[row, col] = sample_lstsq((row, col))

It works (slowly). I know that there should be a more effective approach.

I started playing with index arrays and np.vectorize, but I don't think I'm really going to offer any real improvement. I decided to drop everything on Pandas or try connecting to Cython, but I hope I can use NumPy / SciPy. Or maybe a parallel solution is my best option for increasing productivity?

Also, has anyone compared the linear regression parameters of NumPy / SciPy? I came across the following options, but did not test myself:

scipy.stats.linregress
numpy.linalg.leastsq
numpy.polyfit (city = 1)

I hope that there is an existing approach that offers significant performance improvements without much work to implement. Thanks.

Edited 12/3/13 @ 02: 29

The approach suggested by @HYRY works fine (~ 15 s) for the sample dataset described above, which is unmasked in all dimensions (space and time). However, when transmitting a masked array that passes missing data to np.linalg.leastsq, all masked values are filled with the fill_value value (defualt 1E20), which leads to dummy linear interference.

Fortunately, in the module with the mask of the hidden array, there is np.ma.polyfit (deg = 1), which can handle a 2D y array, such as np.linalg.leastsq. Looking at the source ( https://github.com/numpy/numpy/blob/v1.8.0/numpy/ma/extras.py#L1852 ), ma polyfit is just a shell for np.polyfit that uses a combo mask from x and y masks. This works well for 2D y when the locations of the missing data in y are constant.

Unfortunately, my data has variables missing data locations in space and time. Here is an example from another stack:

 In [146]: stack.ma_stack.shape Out [146]: (57, 1889, 1566)

Sampling a single index returns a timeseries with 6 unspoken values:

 In [147]: stack.ma_stack[:,0,0] Out [147]: masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 519.7779541015625 -- -- -- 518.9047241210938 -- -- -- -- -- -- -- 516.6539306640625 516.0836181640625 515.9403686523438 -- -- -- -- 514.85205078125 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --], mask = [ True True True True True True True True True True True True True True True True True True True True True False True True True False True True True True True True True False False False True True True True False True True True True True True True True True True True True True True True True], fill_value = 1e+20)

Sampling elsewhere returns a different amount of unsaturated values from different time intervals:

 In [148]: stack.ma_stack[:,1888,1565] Out[148]: masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 729.0936889648438 -- -- -- 724.7155151367188 -- -- -- -- -- -- -- 722.076171875 720.9276733398438 721.9603881835938 -- 720.3294067382812 -- -- 713.9591064453125 709.8037719726562 707.756103515625 -- -- -- 703.662353515625 -- -- -- -- 708.6276245117188 -- -- -- -- --], mask = [ True True True True True True True True True True True True True True True True True True True True True False True True True False True True True True True True True False False False True False True True False False False True True True False True True True True False True True True True True], fill_value = 1e+20)

The minimum number of unoiled values for each index is 6, and the maximum is 45. Thus, each place has at least some masked values.

For reference, my x values (temporary ordinals) are taken without a mask:

 In [150]: stack.date_list_o Out[150]: masked_array(data = [ 733197.64375 733962.64861111 733964.65694444 733996.62361111 733999.64236111 734001.63541667 734033.64305556 734071.64722222 734214.675 734215.65694444 734216.625 734226.64722222 734229.63819444 734232.65694444 734233.67847222 734238.63055556 734238.63055556 734245.65277778 734245.65277778 734255.63125 734255.63125 734307.85 734326.65138889 734348.63888889 734348.63958333 734351.85 734363.70763889 734364.65486111 734390.64722222 734391.63194444 734394.65138889 734407.64652778 734407.64722222 734494.85 734527.85 734582.85 734602.65486111 734664.85555556 734692.64027778 734741.63541667 734747.85 734807.85555556 734884.85555556 734911.65763889 734913.64375 734917.64236111 734928.85555556 734944.71388889 734961.62777778 735016.04583333 735016.62777778 735016.85555556 735036.65347222 735054.04583333 735102.63125 735119.61180556 735140.63263889], mask = False, fill_value = 1e+20)

So, I modify stack.ma_stack and run polyfit:

 newshape = (stack.ma_stack.shape[0], stack.ma_stack.shape[1]*stack.ma_stack.shape[2]) print newshape #(57, 2958174) y = stack.ma_stack.reshape(newshape) p = np.ma.polyfit(x, y, deg=1)

But in the ~ 1500 column, each row in y is “cumulatively” masked, and I get complaints and empty output:

 RankWarning: Polyfit may be poorly conditioned ** On entry to DLASCL, parameter number 4 had an illegal value ...

So it seems that using 2D y with missing data in different places will not work. I need a minimum that uses all available unoiled data in each y column. Perhaps this can be done by carefully compressing x and y and tracking unoiled indexes.

Any other ideas? Pandas is starting to look like it might be a good solution.

Edited 12/3/13 @ 20: 29

@HYRY proposed a solution that works for missing values in time (axis = 0). I had to change a bit to handle missing values in spatial (axes = 1,2). If a particular spatial index has only one unobserved record over time, we certainly do not want to try linear regression. Here is my implementation:

 def linreg(self): #Only compute where we have n_min unmasked values in time n_min = 3 valid_idx = self.ma_stack.count(axis=0).filled(0) >= n_min #Returns 2D array of unmasked columns y = self.ma_stack[:, valid_idx] #Extract mask for axis 0 - invert, True where data is available mask = ~y.mask #Remove masks, fills with fill_value y = y.data #Independent variable is time ordinal x = self.date_list_o x = x.data #Prepare matrices and solve X = np.c_[x, np.ones_like(x)] a = np.swapaxes(np.dot(XT, (X[None, :, :] * mask.T[:, :, None])), 0, 1) b = np.dot(XT, (mask*y)) r = np.linalg.solve(a, bT) #Create output grid with original dimensions out = np.ma.masked_all_like(self.ma_stack[0]) #Fill in the valid indices out[valid_idx] = r[:,0]

Runtime is very fast - only ~ 5-10 seconds for the array sizes discussed here.

+6

python numpy linear-regression

David shean Dec 03 '13 at 5:22

source share

1 answer

Hyry · Accepted Answer · 2013-12-03T07:05:11+0000

If I understand correctly, you want to do a lot of linear regression y = k * x + b , there is only one x , but a lot of y , for each y you want to calculate k and b .

If the form x is 50, y is (50, 1000), you can use numpy.linalg.lstsq , here are some demos:

 import numpy as np x = np.random.rand(50) k = np.random.rand(1000) b = np.random.rand(1000) y = np.outer(x, k) + b + np.random.normal(size=(50, 1000), scale=1e-10) r = np.linalg.lstsq(np.c_[x, np.ones_like(x)], y)[0] print np.allclose(r[0], k) print np.allclose(r[1], b)

for y with the form (50, 2000, 2000), you can change it to (50, 2000 * 2000).

Edit

Here is my solution for a masked array. Formula:

Prepare Y as a 2-dimensional matrix with shape (1889 * 1566, 57), X as a 2-dimensional matrix with shape (57, 2). mask as a bool array with the same shape as Y, True means that the value in Y is available.

Compute an array a with form (1889 * 1566, 2, 2), b with form (1889 * 1566, 2), then call numpy.linalg.solve(a, b) , here is some demo code:

 import numpy as np N = 50 M = 1000 x = np.random.rand(N) X = np.c_[x, np.ones_like(x)] beta = np.random.rand(M, 2) Y = np.dot(beta, XT) Y += np.random.normal(scale=0.1, size=Y.shape) mask = np.random.randint(0, 2, size=Y.shape).astype(np.bool) a = np.swapaxes(np.dot(XT, (X[None, :, :] * mask[:, :, None])), 0, 1) b = np.dot(XT, (mask*Y).T) beta2 = np.linalg.solve(a, bT) i = 123 print "real:", beta[i] print "by solve:", beta2[i] m = mask[i] x2 = X[m] y2 = Y[i, m] print "by lstsq:", np.linalg.lstsq(x2, y2)[0]

output of the 123rd coefficient:

 real: [ 0.35813131 0.29736779] by solve: [ 0.38088499 0.30382547] by lstsq: [ 0.38088499 0.30382547]

You can also calculate array a by following the code, I think it will use less memory than the method above:

 a2 = np.empty((M, 2, 2)) xm = mask * x a2[:, 0, 0] = (xm*xm).sum(1) a2[:, 1, 0] = (xm*mask).sum(1) a2[:, 0, 1] = a2[:, 1, 0] a2[:, 1, 1] = (mask).sum(1) print np.allclose(a2, a)

Effective 1D linear regression for each element of a 3D numpy array

More articles: