Remove data points below curve with python

Question

Remove data points below curve with python

I need to compare some theoretical data with real data in python. The theoretical data come from solving the equation. To improve the comparative, I would like to remove data points that are far from theoretical. I mean, I want to remove the points below and above the red dashed lines in the figure (made with matplotlib). Data points and theoretical curves

Both theoretical curves and data points are arrays of different lengths.

I can try to delete the points in a rough way, for example: the first top point can be detected with:

data2[(data2.redshift<0.4)&data2.dmodulus>1] rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])

But I would like to use a less rude image.

So can anyone help me find an easy way to remove the problem points?

Thanks!

+6

python scipy matplotlib scientific-computing

Illa rivro losada Oct 31 '11 at 19:52

source share

4 answers

Yann · Answer 1 · 2011-11-01T13:03:06+0000

This may be redundant and based on your comment.

Both theoretical curves and data points are arrays of different lengths.

I would do the following:

Truncate the data set so that its x values are within the maximum and minimum values of the theoretical set.
Interpolate the theoretical curve using scipy.interpolate.interp1d and the above x data values. The reason for step (1) is that interp1d constraints are interp1d .
Use numpy.where to find y data values that are outside the range of acceptable theory values.
DONT discard these values as suggested in the comments and other answers. If you want for clarity, specify them by building "inliners" one color and "outliers" of another color.

Here is a script that is close to what you are looking for, I think. He hopefully helps you accomplish what you want:

 import numpy as np import scipy.interpolate as interpolate import matplotlib.pyplot as plt # make up data def makeUpData(): '''Make many more data points (x,y,yerr) than theory (x,y), with theory yerr corresponding to a constant "sigma" in y, about x,y value''' NX= 150 dataX = (np.random.rand(NX)*1.1)**2 dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX dataErr = np.random.rand(NX)*dataX*1.3 theoryX = np.arange(0,1,0.1) theoryY = theoryX*theoryX*1.5 theoryErr = 0.5 return dataX,dataY,dataErr,theoryX,theoryY,theoryErr def makeSameXrange(theoryX,dataX,dataY): ''' Truncate the dataX and dataY ranges so that dataX min and max are with in the max and min of theoryX. ''' minT,maxT = theoryX.min(),theoryX.max() goodIdxMax = np.where(dataX<maxT) goodIdxMin = np.where(dataX[goodIdxMax]>minT) return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin] # take 'theory' and get values at every 'data' x point def theoryYatDataX(theoryX,theoryY,dataX): '''For every dataX point, find interpolated thoeryY value. theoryx needed for interpolation.''' f = interpolate.interp1d(theoryX,theoryY) return f(dataX[np.where(dataX<np.max(theoryX))]) # collect valid points def findInlierSet(dataX,dataY,interpTheoryY,thoeryErr): '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return valid indicies.''' withinUpper = np.where(dataY<(interpTheoryY+theoryErr)) withinLower = np.where(dataY[withinUpper] >(interpTheoryY[withinUpper]-theoryErr)) return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower] def findOutlierSet(dataX,dataY,interpTheoryY,thoeryErr): '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return valid indicies.''' withinUpper = np.where(dataY>(interpTheoryY+theoryErr)) withinLower = np.where(dataY<(interpTheoryY-theoryErr)) return (dataX[withinUpper],dataY[withinUpper], dataX[withinLower],dataY[withinLower]) if __name__ == "__main__": dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData() TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY) interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX) inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY, theoryErr) outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX, TruncDataY, interpTheoryY, theoryErr) #print inlierIndex fig = plt.figure() ax = fig.add_subplot(211) ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k') ax.plot(theoryX,theoryY,'r-') ax.plot(theoryX,theoryY+theoryErr,'r--') ax.plot(theoryX,theoryY-theoryErr,'r--') ax.set_xlim(0,1.4) ax.set_ylim(-.5,3) ax = fig.add_subplot(212) ax.plot(inDataX,inDataY,'ko') ax.plot(outUpX,outUpY,'bo') ax.plot(outDownX,outDownY,'ro') ax.plot(theoryX,theoryY,'r-') ax.plot(theoryX,theoryY+theoryErr,'r--') ax.plot(theoryX,theoryY-theoryErr,'r--') ax.set_xlim(0,1.4) ax.set_ylim(-.5,3) fig.savefig('findInliers.png')

This figure is the result of:

Illa rivro losada · Answer 2 · 2011-11-08T14:52:58+0000

In the end, I use the Yann code:

 def theoryYatDataX(theoryX,theoryY,dataX): '''For every dataX point, find interpolated theoryY value. theoryx needed for interpolation.''' f = interpolate.interp1d(theoryX,theoryY) return f(dataX[np.where(dataX<np.max(theoryX))]) def findOutlierSet(data,interpTheoryY,theoryErr): '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return valid indicies.''' up = np.where(data.dmodulus > (interpTheoryY+theoryErr)) low = np.where(data.dmodulus < (interpTheoryY-theoryErr)) # join all the index together in a flat array out = np.hstack([up,low]).ravel() index = np.array(np.ones(len(data),dtype=bool)) index[out]=False datain = data[index] dataout = data[out] return datain, dataout def selectdata(data,theoryX,theoryY): """ Data selection: z<1 and +-0.5 LFLRW separation """ # Select data with redshift z<1 data1 = data[data.redshift < 1] # From modulus to light distance: data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error) # redshift data order data1.sort(order='redshift') # Outliers: distance to LFLRW curve bigger than +-0.5 theoryErr = 0.5 # Theory curve Interpolation to get the same points as data interpy = theoryYatDataX(theoryX,theoryY,data1.redshift) datain, dataout = findOutlierSet(data1,interpy,theoryErr) return datain, dataout

Using these functions, I can finally get:

Thank you all for your help.

tillsten · Answer 3 · 2011-10-31T21:15:24+0000

Just look at the difference between the red curve and the dots, if it is larger than the difference between the red curve and the dashed red curve, delete it.

 diff=np.abs(points-red_curve) index= (diff>(dashed_curve-redcurve)) filtered=points[index]

But please take a serious comment from NickLH. Your data looks good without any filtering, your "outlieres" have a very big error and will not greatly affect the size.

Dan · Answer 4 · 2011-11-02T15:08:23+0000

Either you can use numpy.where () to determine which xy pairs match your build criteria, or perhaps enumerate to do almost the same thing. Example:

 x_list = [ 1, 2, 3, 4, 5, 6 ] y_list = ['f','o','o','b','a','r'] result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5] print result

I am sure that you can change the conditions so that in the above example, "2" and "5" are functions of your curves

Remove data points below curve with python

More articles: