7.4 Conceptual Exercises

  1. A cubic regression spline with a knot at \(x = \xi\) can be represented using the basis functions

\[x, x^{2}, x^{3}, (x-\xi)_{+}^{3}\] where \[(x-\xi)_{+}^{3} = \begin{cases} (x-\xi)^{3} & x > \xi \\ 0 & \text{otherwise} \\ \end{cases}\]

We will show that a function of the form

\[f(x) = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \beta_{3}x^{3} + \beta_{4}(x-\xi)_{+}^{3}\]

is indeed a cubic regression spline.
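
As a concrete illustration of what fitting such a spline means in practice, the four basis functions can be assembled into a design matrix and fit by ordinary least squares. This is a minimal sketch assuming NumPy; the simulated data, the knot value, and the helper name `truncated_power_basis` are arbitrary choices made for the example.

```python
import numpy as np

def truncated_power_basis(x, knot):
    """Columns x, x^2, x^3, (x - knot)_+^3 for a cubic spline with one knot."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x**2, x**3, np.clip(x - knot, 0.0, None)**3])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-2.0, 2.0, 100))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # made-up data for the sketch

xi = 0.5                                             # the knot
X = np.column_stack([np.ones_like(x), truncated_power_basis(x, xi)])

# Least-squares estimates of beta_0, ..., beta_4 in f(x) above
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```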

  1. Find a cubic polynomial

\[f_{1}(x) = a_{1} + b_{1}x + c_{1}x^{2} + d_{1}x^{3}\]

such that \(f_{1}(x) = f(x)\) for all \(x \leq \xi\).

Answer

For \(x \leq \xi\), the truncated term \((x-\xi)_{+}^{3}\) is zero, so \(f(x) = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \beta_{3}x^{3}\), and we can take

\[\begin{array}{rcl} a_{1} & = & \beta_{0} \\ b_{1} & = & \beta_{1} \\ c_{1} & = & \beta_{2} \\ d_{1} & = & \beta_{3} \\ \end{array}\]

  1. Find a cubic polynomial

\[f_{2}(x) = a_{2} + b_{2}x + c_{2}x^{2} + d_{2}x^{3}\]

such that \(f_{2}(x) = f(x)\) for all \(x > \xi\).

Answer

\[\begin{array}{rcl} a_{2} & = & \beta_{0} - \beta_{4}\xi^{3} \\ b_{2} & = & \beta_{1} + 3\beta_{4}\xi^{2} \\ c_{2} & = & \beta_{2} - 3\beta_{4}\xi \\ d_{2} & = & \beta_{3} + \beta_{4} \\ \end{array}\]
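
These coefficients come from expanding the truncated term: for \(x > \xi\), \((x-\xi)_{+}^{3} = (x-\xi)^{3}\), so

\[\begin{array}{rcl} f(x) & = & \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \beta_{3}x^{3} + \beta_{4}\left(x^{3} - 3\xi x^{2} + 3\xi^{2}x - \xi^{3}\right) \\ & = & (\beta_{0} - \beta_{4}\xi^{3}) + (\beta_{1} + 3\beta_{4}\xi^{2})x + (\beta_{2} - 3\beta_{4}\xi)x^{2} + (\beta_{3} + \beta_{4})x^{3} \\ \end{array}\]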

We have now shown that \(f(x)\) is a piecewise polynomial.

  1. Show that \(f(x)\) is continuous at \(\xi\)
Answer

Substituting \(x = \xi\) into \(f_{2}\), the \(\beta_{4}\) terms cancel:

\[\begin{array}{rcl} f_{2}(\xi) & = & (\beta_{0} - \beta_{4}\xi^{3}) + (\beta_{1} + 3\beta_{4}\xi^{2})\xi + (\beta_{2} - 3\beta_{4}\xi)\xi^{2} + (\beta_{3} + \beta_{4})\xi^{3} \\ & = & \beta_{0} + \beta_{1}\xi + \beta_{2}\xi^{2} + \beta_{3}\xi^{3} \\ & = & f_{1}(\xi) \\ \end{array}\]

  1. Show that \(f'(x)\) is continuous at \(\xi\)
Answer

Differentiating each piece and substituting \(x = \xi\), the \(\beta_{4}\) terms again cancel:

\[\begin{array}{rcl} f_{2}'(\xi) & = & (\beta_{1} + 3\beta_{4}\xi^{2}) + 2(\beta_{2} - 3\beta_{4}\xi)\xi + 3(\beta_{3} + \beta_{4})\xi^{2} \\ & = & \beta_{1} + 2\beta_{2}\xi + 3\beta_{3}\xi^{2} \\ & = & f_{1}'(\xi) \\ \end{array}\]

  1. Show that \(f''(x)\) is continuous at \(\xi\)
Answer

Taking second derivatives and substituting \(x = \xi\):

\[\begin{array}{rcl} f_{2}''(\xi) & = & 2(\beta_{2} - 3\beta_{4}\xi) + 6(\beta_{3} + \beta_{4})\xi \\ & = & 2\beta_{2} + 6\beta_{3}\xi \\ & = & f_{1}''(\xi) \\ \end{array}\]

Since \(f\), \(f'\), and \(f''\) are all continuous at \(\xi\), \(f(x)\) is indeed a cubic spline with a knot at \(\xi\).
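
For readers who want to double-check the algebra in parts (a) through (e), here is a minimal symbolic sketch, assuming SymPy is available; the symbols mirror the \(\beta_{j}\) and \(\xi\) above.

```python
import sympy as sp

x, xi, b0, b1, b2, b3, b4 = sp.symbols("x xi beta_0 beta_1 beta_2 beta_3 beta_4")

# f on each side of the knot: the truncated term is 0 for x <= xi
# and equals (x - xi)^3 for x > xi.
f1 = b0 + b1*x + b2*x**2 + b3*x**3
f2 = f1 + b4*(x - xi)**3

# Part (b): coefficients of f2 as a cubic in x (highest degree first),
# i.e. d2, c2, b2, a2.
print(sp.Poly(sp.expand(f2), x).all_coeffs())

# Parts (c)-(e): f, f', f'' agree at the knot, so each difference is 0 there.
gaps = [f1 - f2, sp.diff(f1 - f2, x), sp.diff(f1 - f2, x, 2)]
for k, g in enumerate(gaps):
    print(k, sp.simplify(g.subs(x, xi)))   # prints 0 for k = 0, 1, 2
```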

  1. Suppose that a curve \(\hat{g}\) is computed to smoothly fit a set of \(n\) points using the following formula:

\[\hat{g} = \mathop{\mathrm{arg\,min}}_{g} \left( \displaystyle\sum_{i=1}^{n} (y_{i}-g(x_{i}))^{2} + \lambda\displaystyle\int \left[g^{(m)}(x)\right]^{2} \, dx \right)\]

where \(g^{(m)}\) is the \(m^{\text{th}}\) derivative of \(g\) (and \(g^{(0)} = g\)). Describe \(\hat{g}\) in each of the following situations.

  1. \(\lambda = \infty, \quad m = 0\)
Answer The penalty \(\lambda\int g(x)^{2}\,dx\) becomes infinite for any function that is not identically zero, so \(\hat{g} = 0\).
  1. \(\lambda = \infty, \quad m = 1\)
Answer The penalty forces \(\hat{g}' = 0\), so \(\hat{g}\) is a constant (a horizontal line); minimizing the RSS over constants gives \(\hat{g} = \bar{y}\).
  1. \(\lambda = \infty, \quad m = 2\)
Answer The penalty forces \(\hat{g}'' = 0\), so \(\hat{g}\) is the least-squares line \(\hat{g} = a + bx\).
  1. \(\lambda = \infty, \quad m = 3\)
Answer The penalty forces \(\hat{g}''' = 0\), so \(\hat{g}\) is the least-squares quadratic \(\hat{g} = a + bx + cx^{2}\).
  1. \(\lambda = 0, \quad m = 2\)
Answer With no penalty, \(\hat{g}\) simply minimizes the training RSS, so it can be any curve that passes through all \(n\) training points exactly (a perfect fit to the training data); see the sketch following this list.
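
As a numerical illustration of parts (c) and (e), the sketch below fits cubic smoothing splines (the \(m = 2\) penalty) with a very small and a very large \(\lambda\). It assumes SciPy 1.10 or later for `make_smoothing_spline`; the simulated data and the `lam` values are arbitrary.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, 40))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

rough = make_smoothing_spline(x, y, lam=1e-8)   # lambda ~ 0: nearly interpolates the points
stiff = make_smoothing_spline(x, y, lam=1e8)    # huge lambda: essentially the least-squares line

print("training RSS, lambda ~ 0:", np.sum((y - rough(x))**2))   # close to 0
print("training RSS, huge lambda:", np.sum((y - stiff(x))**2))
```

Plotting `rough` and `stiff` against the data shows the wiggly near-interpolating curve versus the straight-line limit.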

Answer \[f(x) = \begin{cases} 1 + x, & -2 \leq x \leq 1 \\ 1 + x - 2(x-1)^{2}, & 1 \leq x \leq 2 \\ \end{cases}\]
Answer \[f(x) = \begin{cases} 0, & -2 \leq x < 0 \\ 1, & 0 \leq x \leq 1 \\ x, & 1 \leq x \leq 2 \\ 0, & 2 < x < 3 \\ 3x-3, & 3 \leq x \leq 4 \\ 1, & 4 < x \leq 5 \\ \end{cases}\]
  1. Consider two curves \(\hat{g}_{1}\) and \(\hat{g}_{2}\), defined by

\[\hat{g}_{1} = \mathop{\mathrm{arg\,min}}_{g} \left( \displaystyle\sum_{i=1}^{n} (y_{i}-g(x_{i}))^{2} + \lambda\displaystyle\int \left[g^{(3)}(x)\right]^{2} \, dx \right)\] \[\hat{g}_{2} = \mathop{\mathrm{arg\,min}}_{g} \left( \displaystyle\sum_{i=1}^{n} (y_{i}-g(x_{i}))^{2} + \lambda\displaystyle\int \left[g^{(4)}(x)\right]^{2} \, dx \right)\] where \(g^{(m)}\) is the \(m^{\text{th}}\) derivative of \(g\)

  1. As \(\lambda \rightarrow\infty\), will \(\hat{g}_{1}\) or \(\hat{g}_{2}\) have the smaller training RSS?
Answer As \(\lambda \rightarrow \infty\), the penalty forces \(\hat{g}_{1}^{(3)} = 0\) and \(\hat{g}_{2}^{(4)} = 0\), so \(\hat{g}_{1}\) approaches the best-fitting quadratic while \(\hat{g}_{2}\) approaches the best-fitting cubic. Since every quadratic is also a cubic, \(\hat{g}_{2}\) is the more flexible fit and will have the smaller (or at worst equal) training RSS; a numerical sketch follows after this list.
  1. As \(\lambda \rightarrow\infty\), will \(\hat{g}_{1}\) or \(\hat{g}_{2}\) have the smaller testing RSS?
Answer It depends on the true underlying function. If the truth is well approximated by a quadratic, \(\hat{g}_{1}\) will tend to have the smaller test RSS; if it requires a cubic (or more wiggly) shape, \(\hat{g}_{2}\) will. The extra flexibility of \(\hat{g}_{2}\) helps only when the data support it; otherwise it overfits.
  1. For \(\lambda = 0\), will \(\hat{g}_{1}\) or \(\hat{g}_{2}\) have the smaller training RSS?
Answer With \(\lambda = 0\) the penalty vanishes from both criteria, so both reduce to the same unpenalized least-squares problem: \(\hat{g}_{1} = \hat{g}_{2}\), and their training RSS values (and test RSS values) are identical.
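
A small numerical check of part (a), assuming NumPy: rather than solving the penalized problems, the \(\lambda \rightarrow \infty\) limits are fit directly as the least-squares quadratic and cubic; the simulated data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3.0, 3.0, 60))
y = x**3 - 2*x + rng.normal(scale=1.0, size=x.size)

quad = np.polyval(np.polyfit(x, y, 2), x)    # limit of g1_hat as lambda -> infinity
cubic = np.polyval(np.polyfit(x, y, 3), x)   # limit of g2_hat as lambda -> infinity

print("training RSS, quadratic limit:", np.sum((y - quad)**2))
print("training RSS, cubic limit:", np.sum((y - cubic)**2))   # never larger than the quadratic's
```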