smooth.spline(): the fitted model does not match the user-specified degrees of freedom

Here is the code I ran:

    fun <- function(x) {1 + 3*sin(4*pi*x-pi)}
    set.seed(1)
    num.samples <- 1000
    x <- runif(num.samples)
    y <- fun(x) + rnorm(num.samples) * 1.5
    fit <- smooth.spline(x, y, all.knots=TRUE, df=3)

Despite df = 3, when I checked the fitted model, the output was:

    Call:
    smooth.spline(x = x, y = y, df = 3, all.knots = TRUE)

    Smoothing Parameter  spar= 1.499954  lambda= 0.002508571 (26 iterations)
    Equivalent Degrees of Freedom (Df): 9.86422

Can anyone help? Thanks!

1 answer

Note that since R 3.4.0 (2017-04-21), smooth.spline() accepts a direct specification of λ through the newly added lambda argument. During fitting, however, λ is still converted to the internal spar, so the answer below is unaffected.


The smoothing parameter λ / spar lies at the heart of smoothness control

Smoothness is controlled by the smoothing parameter λ. smooth.spline() works with an internal smoothing parameter spar rather than with λ itself:

 spar = s0 + 0.0601 * log(λ) 

This logarithmic transform is needed so that criteria such as GCV / CV can be minimized without constraints (λ must be positive, while spar can be any real number). The user can specify spar to set λ indirectly. When spar grows linearly, λ grows exponentially, so a large spar value is rarely needed.
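The log-linear link between spar and λ can be checked numerically. The sketch below assumes the relation documented in ?smooth.spline, λ = r * 256^(3*spar - 1), where r depends only on the data and the knots; under that assumption the coefficient 0.0601 is simply 1 / (3 * log(256)). It refits the question's data at two fixed spar values and compares the empirical slope with that constant.

```r
# Recreate the data from the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Fit at two fixed spar values; no search over spar is run when spar is given.
fit_a <- smooth.spline(x, y, all.knots = TRUE, spar = 1.0)
fit_b <- smooth.spline(x, y, all.knots = TRUE, spar = 1.5)

# Empirical slope of spar against log(lambda) between the two fits.
slope <- (fit_b$spar - fit_a$spar) / (log(fit_b$lambda) - log(fit_a$lambda))

slope               # slope recovered from the two fits
1 / (3 * log(256))  # slope implied by the documented relation (assumption above)
```

Both fits must use the same data and the same knot setting; otherwise r, and with it the intercept s0, changes between them.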

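The degrees of freedom df are also defined in terms of λ:

$$\mathrm{edf} = \operatorname{tr}\left[ X \left( X^\top X + \lambda S \right)^{-1} X^\top \right]$$

where X is the model matrix with a B-spline basis, and S is the penalty matrix.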
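The equivalent degrees of freedom are the trace of the smoother (hat) matrix, so they should equal the sum of the leverages. As a small sanity check, and assuming the lev component of the fitted object holds the diagonal of the smoother matrix as described in ?smooth.spline, the two quantities can be compared directly. The spar value 1.2 is an arbitrary choice for illustration.

```r
# Recreate the data from the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Fit at an arbitrary fixed spar (illustrative value, not from the original post).
fit <- smooth.spline(x, y, all.knots = TRUE, spar = 1.2)

# Trace of the smoother matrix = sum of its diagonal (the leverages).
edf_from_leverage <- sum(fit$lev)

edf_from_leverage  # should agree with the reported df
fit$df
```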

You can check these relationships on your dataset:

    spar <- seq(1, 2.5, by = 0.1)
    a <- sapply(spar, function (spar_i) unlist(smooth.spline(x, y, all.knots=TRUE, spar = spar_i)[c("df","lambda")]))

Let's sketch df ~ spar, λ ~ spar and log(λ) ~ spar:

    par(mfrow = c(1,3))
    plot(spar, a[1, ], type = "b", main = "df ~ spar", xlab = "spar", ylab = "df")
    plot(spar, a[2, ], type = "b", main = "lambda ~ spar", xlab = "spar", ylab = "lambda")
    plot(spar, log(a[2,]), type = "b", main = "log(lambda) ~ spar", xlab = "spar", ylab = "log(lambda)")

[Figure: three panels showing df ~ spar, lambda ~ spar and log(lambda) ~ spar]

Note the rapid growth of λ with spar, the linear relationship between log(λ) and spar, and the relatively smooth relationship between df and spar.


smooth.spline() fitting iterations for spar

If we specify spar manually, as we did in sapply(), no iterations are run to select spar; otherwise, smooth.spline() has to iterate over a sequence of spar values. If we

  • specify cv = TRUE / FALSE, the fitting iterations aim to minimize the CV / GCV score;
  • specify df = mydf, the fitting iterations aim to minimize (df(spar) - mydf) ^ 2.

Minimizing GCV / CV is easy to follow. We do not care about the GCV score itself, only about the corresponding spar. In contrast, when minimizing (df(spar) - mydf)^2 we usually care about the df value at the end of the iterations, not about spar! But since this is a minimization problem, there is never a guarantee that the final df matches our target value mydf.
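To make the df-matching idea concrete, here is a minimal sketch that treats it as a one-dimensional minimization over spar using base R's optimize(). This is not the search that smooth.spline() runs internally; df_at_spar and objective are hypothetical helpers introduced only for illustration, and the lower bound 0 is an arbitrary choice to keep the sketch cheap (only the upper bound matters for the point being made).

```r
# Recreate the data from the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Hypothetical helper: equivalent df of the fit at a fixed spar.
df_at_spar <- function(spar) {
  smooth.spline(x, y, all.knots = TRUE, spar = spar)$df
}

# Squared distance between the achieved df and the target df.
target <- 3
objective <- function(spar) (df_at_spar(spar) - target)^2

# Search with the upper bound at 1.5, then with a wider upper bound.
narrow <- optimize(objective, interval = c(0, 1.5))
wide   <- optimize(objective, interval = c(0, 2.5))

df_narrow <- df_at_spar(narrow$minimum)
df_wide   <- df_at_spar(wide$minimum)

c(spar = narrow$minimum, df = df_narrow)  # stuck near the upper bound
c(spar = wide$minimum,   df = df_wide)    # free to reach the target df
```

With the narrow interval the minimizer ends up next to the upper bound and the target is missed; widening the interval lets the same objective reach it.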


Why do you specify df = 3 but get df = 9.864?

The iterations can stop for one of three reasons: a minimum was found, the search boundary was reached, or the maximum number of iterations was reached.

We are far from the iteration limit (500 by default), yet we did not hit the minimum. So we may well have reached the boundary.

Do not focus on df , think about spar .

 smooth.spline(x, y, all.knots=TRUE, df=3)$spar # 1.4999 

According to ?smooth.spline, by default smooth.spline() searches for spar in [-1.5, 1.5]. That is, when you specify df = 3, the minimization stops at the search boundary instead of reaching df = 3.
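A quick way to diagnose this situation is to compare the returned spar with the upper search bound. The sketch below hard-codes 1.5 as that bound, taken from the default range quoted above; the tolerance 1e-3 is an arbitrary choice.

```r
# Recreate the data from the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

fit <- smooth.spline(x, y, all.knots = TRUE, df = 3)

# Default upper search bound for spar, as quoted from ?smooth.spline above.
upper <- 1.5

# TRUE when the search stopped right at the boundary (tolerance is arbitrary).
at_boundary <- abs(fit$spar - upper) < 1e-3

at_boundary
fit$df  # noticeably above the requested df = 3
```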

Look again at our plot of the relationship between df and spar. From the figure, it looks like we need a spar value of about 2 to reach df = 3.

Use the control.spar argument:

    fit <- smooth.spline(x, y, all.knots=TRUE, df=3, control.spar = list(high = 2.5))
    # Smoothing Parameter spar= 1.859066 lambda= 0.9855336 (14 iterations)
    # Equivalent Degrees of Freedom (Df): 3.000305

Now you end up with df = 3, and you can see that we need a spar of about 1.86.


A better suggestion: do not use all.knots = TRUE

Look, you have 1000 data points. With all.knots = TRUE you will use 1000 parameters. Asking for df = 3 means that 997 out of 1000 parameters have to be suppressed. Imagine how large a λ, and hence how large a spar, you need!

Try a penalized regression spline instead. Suppressing 200 parameters down to 3 is certainly much easier:

    fit <- smooth.spline(x, y, nknots = 200, df=3) ## using 200 knots
    # Smoothing Parameter spar= 1.317883 lambda= 0.9853648 (16 iterations)
    # Equivalent Degrees of Freedom (Df): 3.000386

Now you end up with df = 3 without having to adjust control.spar.
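To see how the number of knots affects the spar needed for df = 3, one can repeat the fit for a few knot counts. This is only a sketch: the counts 50 and 100 are arbitrary additions for comparison (only 200 comes from the answer above), and the expectation that spar stays inside the default range for them is an assumption, not a documented guarantee.

```r
# Recreate the data from the question.
fun <- function(x) 1 + 3 * sin(4 * pi * x - pi)
set.seed(1)
x <- runif(1000)
y <- fun(x) + rnorm(1000) * 1.5

# Knot counts to compare; 50 and 100 are illustrative choices.
knots <- c(50, 100, 200)

# For each knot count, request df = 3 and record the spar that was selected.
res <- t(sapply(knots, function(k) {
  fit <- smooth.spline(x, y, nknots = k, df = 3)
  c(nknots = k, spar = fit$spar, df = fit$df)
}))

res  # one row per knot count: nknots, selected spar, achieved df
```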


Source: https://habr.com/ru/post/1247655/

