Extrapolation - awk

Question

Extrapolation - awk

I need help with the following: I have a data file (columns are separated by the table "\ t"), for example, data.dat

  # y1 y2 y3 y4 17.1685 21.6875 20.2393 26.3158

These are x values of 4 points for linear correspondence. Four values of y are constant: 0, 200, 400, 600 .

I can create a linear correspondence of pairs of points (x,y) : (x1,y1)=(17.1685,0), (x2,y2)=(21.6875,200), (x3,y3)=(20.2393,400), (x4,y4)=(26.3158,600) .

Now I would like to make a linear landing on three of these points paris, (x1,y1), (x2,y2), (x3,y3) and (x2,y2), (x3,y3), (x4,y4) and (x1,y1), (x3,y3), (x4,y4) and (x1,y1), (x2,y2), (x4,y4).

If I have three points with linear fit, I would like to know the x value of the extrapolated point from these three set points.

I have this awk code:

 #!/usr/bin/awk -f BEGIN{ z[1] = 0; z[2] = 200; z[3] = 400; z[4] = 600; } { split($0,str,"\t"); n = 0.0; for(i=1; i<=NF; i++) { centr[i] = str[i]; n += 1.0; # printf("%d\t%f\t%.1f\t",i,centr[i],z[i]); } # print ""; if (n > 2) { lsq(n,z,centr); } } function lsq(n,x,y) { sx = 0.0 sy = 0.0 sxx = 0.0 syy = 0.0 sxy = 0.0 eps = 0.0 for (i=1;i<=n;i++) { sx += x[i] sy += y[i] sxx += x[i]*x[i] sxy += x[i]*y[i] syy += y[i]*y[i] } if ( (n==0) || ((n*sxx-sx*sx)==0) ) { next; } # print "number of data points = " n; a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx) b = (n*sxy-sx*sy)/(n*sxx-sx*sx) for(i=1;i<=n;i++) { ycalc[i] = a+b*x[i] dy[i] = y[i]-ycalc[i] eps += dy[i]*dy[i] } print "# Intercept =\t"a" print "# Slope =\t"b" for (i=1;i<=n;i++) { printf("%8g %8g %8g \n",x[i],y[i],ycalc[i]) } } # function lsq()

So,

  If we extrapolate to the place of 4th 0 17.1685 <--(x1,y1) 200 21.6875 <--(x2,y2) 400 20.2393 <--(x3,y3) 600 22.7692 <<< (x4 = 600,y1 = 22.7692) If we extrapolate to the place of 3th 0 17.1685 <--(x1,y1) 200 21.6875 <--(x2,y2) 400 23.6867 <<< (x3 = 400,y3 = 23.6867) 600 26.3158 <--(x4,y4) 0 17.1685 200 19.35266 <<< 400 20.2393 600 26.3158 0 18.1192 <<< 200 21.6875 400 20.2393 600 26.3158

My current output is as follows:

 $> ./prog.awk data.dat # Intercept = 17.4537 # Slope = 0.0129968 0 17.1685 17.4537 200 21.6875 20.0531 400 20.2393 22.6525 600 26.3158 25.2518

+6

linux shell awk

user1116360 Jan 18 '12 at 13:54

source share

1 answer

Jonathan leffler · Answer 1 · 2012-01-18T16:57:44+0000

Assuming that the kernel calculation in the lsq function is fine (it looks correct, but I haven’t studied it), this gives you the slope and intercept for the smallest sum of squares that are best suited for the input data set (parameters x, y, n). I'm not sure I understand the end of the function.

In order to “take three points and calculate the fourth” problem, the easiest way is to create 4 subsets (logically, by removing one point from a set of four on each of the four calls) and recalculating.

You need to call another function that takes line data (slope, intercept) from lsq and interpolates (extrapolates) the value with a different y value. This is a direct calculation ( x = m * y + c ), but you need to determine what y value is not in the set of 3 that you pass.

You can “optimize” (which means “complicate”) this scheme by discarding one value at a time from the values of “sum of squares” and “sum” and “sum of products”, recalculation of the slope, interception, and then again calculating the missing point.

(I will also note that usually these will be x coordinates with fixed values of 0, 200, 400, 600, and the y-coordinates will be read values. However, this is only a matter of orientation, so this is not critical.)

Here is at least plausible working code. Since awk automatically breaks into empty space, you do not need to divide into tabs specifically; The reading cycle takes this into account.

The code needs serious refactoring; there is a ton of repetition in it - however I also have work that I have to do.

 #!/usr/bin/awk -f BEGIN{ z[1] = 0; z[2] = 200; z[3] = 400; z[4] = 600; } { for (i = 1; i <= NF; i++) { centr[i] = $i } if (NF > 2) { lsq(NF, z, centr); } } function lsq(n, x, y) { if (n == 0) return sx = 0.0 sy = 0.0 sxx = 0.0 syy = 0.0 sxy = 0.0 for (i = 1; i <= n; i++) { print "x[" i "] = " x[i] ", y[" i "] = " y[i] sx += x[i] sy += y[i] sxx += x[i]*x[i] sxy += x[i]*y[i] syy += y[i]*y[i] } if ((n*sxx - sx*sx) == 0) return # print "number of data points = " n; a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx) b = (n*sxy-sx*sy)/(n*sxx-sx*sx) for (i = 1; i <= n; i++) { ycalc[i] = a+b*x[i] } print "# Intercept = " a print "# Slope = " b print "Line: x = " a " + " b " * y" for (i = 1; i <= n; i++) { printf("x = %8g, yo = %8g, yc = %8g\n", x[i], y[i], ycalc[i]) } print "" print "Different subsets\n" for (drop = 1; drop <= n; drop++) { print "Subset " drop sx = sy = sxx = sxy = syy = 0 j = 1 for (i = 1; i <= n; i++) { if (i == drop) continue print "x[" j "] = " x[i] ", y[" j "] = " y[i] sx += x[i] sy += y[i] sxx += x[i]*x[i] sxy += x[i]*y[i] syy += y[i]*y[i] j++ } if (((n-1)*sxx - sx*sx) == 0) continue a = (sxx*sy-sxy*sx)/((n-1)*sxx-sx*sx) b = ((n-1)*sxy-sx*sy)/((n-1)*sxx-sx*sx) print "Line: x = " a " + " b " * y" xt = x[drop] yt = a + b * xt; print "Interpolate: x = " xt ", y = " yt } }

Since awk does not provide an easy way to pass multiple values from a function and does not provide structures other than arrays (sometimes associative), this is not the best language for this task. On the other hand, it can be done to do the job. You may be able to calculate the least squares calculation in a function that returns an array containing the slope and intercept, and then use this. It's your turn to explore options.

Given the lsq.awk script and the lsq.awk input file lsq.data , I get the output:

 $ cat lsq.data 17.1685 21.6875 20.2393 26.3158 $ awk -f lsq.awk lsq.data x[1] = 0, y[1] = 17.1685 x[2] = 200, y[2] = 21.6875 x[3] = 400, y[3] = 20.2393 x[4] = 600, y[4] = 26.3158 # Intercept = 17.4537 # Slope = 0.0129968 Line: x = 17.4537 + 0.0129968 * y x = 0, yo = 17.1685, yc = 17.4537 x = 200, yo = 21.6875, yc = 20.0531 x = 400, yo = 20.2393, yc = 22.6525 x = 600, yo = 26.3158, yc = 25.2518 Different subsets Subset 1 x[1] = 200, y[1] = 21.6875 x[2] = 400, y[2] = 20.2393 x[3] = 600, y[3] = 26.3158 Line: x = 18.1192 + 0.0115708 * y Interpolate: x = 0, y = 18.1192 Subset 2 x[1] = 0, y[1] = 17.1685 x[2] = 400, y[2] = 20.2393 x[3] = 600, y[3] = 26.3158 Line: x = 16.5198 + 0.0141643 * y Interpolate: x = 200, y = 19.3526 Subset 3 x[1] = 0, y[1] = 17.1685 x[2] = 200, y[2] = 21.6875 x[3] = 600, y[3] = 26.3158 Line: x = 17.7985 + 0.0147205 * y Interpolate: x = 400, y = 23.6867 Subset 4 x[1] = 0, y[1] = 17.1685 x[2] = 200, y[2] = 21.6875 x[3] = 400, y[3] = 20.2393 Line: x = 18.163 + 0.007677 * y Interpolate: x = 600, y = 22.7692 $

Edit: in the previous version of the answer, the subsets were multiplied by n instead of (n-1) . The values in the revised release seem to be consistent with what you expect. The remaining problems are presentation, not computational.

Extrapolation - awk

More articles: