Why are additional constraint and penalty term equivalent in ridge regression?

Tikhonov regularization (or ridge regression) adds a constraint that $\|\beta\|^2$, the $L^2$-norm of the parameter vector, is not greater than a given value (say $c$). Equivalently, it may solve an unconstrained minimization of the least-squares penalty with $\alpha\|\beta\|^2$ added, where $\alpha$ is a constant (this is the Lagrangian form of the constrained problem).

The above is from Wikipedia. Why is the unconstrained LS with $\alpha\|\beta\|^2$ added to the cost equivalent to the LS problem with an additional constraint that $\|\beta\|^2 \leq c$?

What is the relation between $\alpha$ and $c$?

Thanks!


Let us first define the two problems:

  • Problem 1: \begin{equation} \min_{\beta} ~f_\alpha(\beta):=\frac{1}{2}\Vert y-X\beta\Vert^2 +\alpha\Vert \beta\Vert^2\end{equation}
  • Problem 2: \begin{align} \min_{\beta} ~&\frac{1}{2}\Vert y-X\beta\Vert^2\\ s.t.~&\Vert \beta\Vert^2-c\leq 0\end{align}
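For concreteness, the two problems can also be sketched in code; everything below (the synthetic data, dimensions, and function names) is an illustrative assumption chosen for the example, not part of either problem statement:

```python
# Illustrative sketch of the two problems on synthetic data
# (data, dimensions and names are assumptions chosen for the example).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # design matrix
y = rng.normal(size=50)        # response vector

def f_alpha(beta, alpha):
    """Problem 1 objective: 0.5*||y - X beta||^2 + alpha*||beta||^2."""
    r = y - X @ beta
    return 0.5 * r @ r + alpha * (beta @ beta)

def least_squares(beta):
    """Problem 2 objective: 0.5*||y - X beta||^2 ..."""
    r = y - X @ beta
    return 0.5 * r @ r

def feasible(beta, c):
    """... minimized subject to the constraint ||beta||^2 - c <= 0."""
    return beta @ beta - c <= 0
```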

The Lagrangian for Problem 2 reads:
\begin{equation}
\mathcal{L}(\beta,\lambda)=\frac{1}{2}\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2-c)
\end{equation}
and you probably already see the resemblance with Problem 1 (identical except for the constant term $-\lambda c$).

Now let us look at the necessary conditions for optimality. For Problem 1, these read:
\begin{equation}
\nabla_\beta f_\alpha(\beta^*(\alpha))=0
\end{equation}
where we deliberately write $\beta^*(\alpha)$ to emphasize that this is the optimal solution for a given $\alpha$.
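For the quadratic objective above, this condition can be written out explicitly and solved in closed form:
\begin{equation}
\nabla_\beta f_\alpha(\beta)=-X^\mathrm{T}(y-X\beta)+2\alpha\beta=0
\quad\Longrightarrow\quad
\beta^*(\alpha)=(X^\mathrm{T}X+2\alpha I)^{-1}X^\mathrm{T}y,
\end{equation}
where the factor $2\alpha$ (rather than $\alpha$) appears because the residual term carries a $\frac{1}{2}$ while the penalty $\alpha\Vert\beta\Vert^2$ does not.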

For Problem 2, the KKT conditions imply that we have:
\begin{align*}
\nabla_\beta \mathcal{L}(\beta^*,\lambda^*)&=\nabla_\beta f_{\lambda^*}(\beta^*)=0\\
\lambda^* (\Vert \beta^*\Vert^2-c)&=0
\end{align*}
The first line says that the gradient of the Lagrangian with respect to $\beta$ must vanish, and the second is the complementary slackness condition. (We also need $\lambda^* \geq 0$, but this is less important for our discussion.) Also observe that, since the constant term $-\lambda^* c$ has zero gradient, the gradient of the Lagrangian is equal to the gradient of $f_{\lambda^*}$ (the objective function of Problem 1 but with $\lambda^*$ instead of $\alpha$).

Now suppose we solve Problem 1 for a given $\alpha\geq 0$ and obtain its solution $\beta^*(\alpha)$. Let $c=\Vert \beta^*(\alpha)\Vert^2$, the squared norm of the solution to Problem 1. Then $\lambda^*=\alpha$ and $\beta^*=\beta^*(\alpha)$ satisfy the KKT conditions for Problem 2: stationarity holds by construction, complementary slackness holds because $\Vert\beta^*\Vert^2-c=0$, and dual feasibility holds because $\lambda^*=\alpha\geq 0$. This shows that both problems have the same solution. Conversely, if you solve Problem 2 and obtain the optimal multiplier $\lambda^*$, you can set $\alpha=\lambda^*$ to retrieve the same solution by solving Problem 1.

To sum up, the two problems are equivalent when $c=\Vert \beta^*(\alpha)\Vert^2$; this is precisely the relation between $\alpha$ and $c$ asked about in the question.
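As a quick sanity check of this equivalence, here is a minimal numerical sketch (the synthetic data, the choice $\alpha=1$, the solver, and the tolerance are all assumptions made for the example): solve Problem 1 in closed form, set $c=\Vert\beta^*(\alpha)\Vert^2$, solve Problem 2 with a generic constrained solver, and compare.

```python
# Numerical check that Problems 1 and 2 share a solution when c = ||beta*(alpha)||^2
# (illustrative sketch; data, alpha, solver and tolerance are assumptions).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
alpha = 1.0

# Problem 1: 0.5*||y - X b||^2 + alpha*||b||^2 has the closed form
# beta*(alpha) = (X^T X + 2 alpha I)^{-1} X^T y  (the 2 comes from the 1/2).
beta_pen = np.linalg.solve(X.T @ X + 2 * alpha * np.eye(X.shape[1]), X.T @ y)
c = beta_pen @ beta_pen                 # pick c = ||beta*(alpha)||^2

# Problem 2: least squares subject to ||b||^2 <= c, solved generically.
obj = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
con = {"type": "ineq", "fun": lambda b: c - b @ b}   # requires c - ||b||^2 >= 0
res = minimize(obj, np.zeros(X.shape[1]), method="SLSQP", constraints=[con])

print(np.allclose(res.x, beta_pen, atol=1e-4))       # should print True
```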

Joe’s answer looks good, but if you’re also looking for a citation, this paper covers it as well in Theorem 1: http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning (Note: The meat of the proof is actually in the supplemental materials).

Kloft et al., "Efficient and Accurate Lp-Norm Multiple Kernel Learning," NIPS 2009.

You can do this directly if you want to. To solve the optimization problem
\begin{align}
\min_{\beta} ~&\Vert y-X\beta\Vert^2\\
\mathrm{s.t.}~&\Vert \beta\Vert^2\le c\ ,
\end{align}
as in the standard primal-dual procedure, first let
\begin{align}
g(\lambda)=&\inf_\beta\mathcal{L}(\beta,\lambda)\\
=&\inf_\beta\left\{\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2- c)\right\}\\
=& \Vert y-X(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2 + \lambda (\Vert(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2-c)\ ,
\end{align}
then solve $\max_{\lambda\ge 0} g(\lambda)$. You will find that
$$
\frac{\partial g}{\partial\lambda}=y^\mathrm{T}X(X^\mathrm{T}X+\lambda I)^{-2}X^\mathrm{T}y-c=0\iff c=\Vert\beta^*_{\mathrm{ridge}}(\lambda)\Vert^2\ .
$$

The matrix derivatives (taken with respect to the scalar variable $x$; here $x=\lambda$)
\begin{align}
\frac{\partial AU(x)B}{\partial x} = & A\frac{\partial U(x)}{\partial x}B\\
\frac{\partial U(x)^{-1}}{\partial x} = &-U(x)^{-1} \frac{\partial U(x)}{\partial x}U(x)^{-1}
\end{align}
will be helpful.
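To double-check the derivative formula numerically, here is a small sketch (the synthetic data, the values of $c$, $\lambda$, and the step size are assumptions made for the example) comparing a central finite difference of $g$ with $\Vert\beta^*_{\mathrm{ridge}}(\lambda)\Vert^2-c$:

```python
# Finite-difference check that dg/dlambda = ||beta_ridge(lambda)||^2 - c
# (illustrative sketch; data, c, lambda and step size h are assumptions).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
c = 0.5

def beta_ridge(lam):
    """Minimizer of ||y - X b||^2 + lam*||b||^2 (no 1/2 factor here)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def g(lam):
    """Dual function: the Lagrangian evaluated at its minimizer in beta."""
    b = beta_ridge(lam)
    return np.sum((y - X @ b) ** 2) + lam * (b @ b - c)

lam, h = 2.0, 1e-5
fd = (g(lam + h) - g(lam - h)) / (2 * h)             # central difference
analytic = beta_ridge(lam) @ beta_ridge(lam) - c     # ||beta(lam)||^2 - c
print(np.isclose(fd, analytic, atol=1e-6))           # should print True
```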

Update:

By the way, you can prove that $c$ does not increase when $\lambda$ increases. More generally, let $L(x;\lambda)=f(x)+\lambda g(x)$ and $x_i^*=\mathrm{arg\,min}_x L(x;\lambda_i)$ for $i=1,2$. Suppose, for the sake of contradiction, that $\lambda_2>\lambda_1$ and $g(x_2^*)>g(x_1^*)$. Then we have
\begin{align}
&(\lambda_2-\lambda_1)(g(x_2^*)-g(x_1^*))>0\\
\Longrightarrow & \lambda_1g(x_1^*)+\lambda_2g(x_2^*)>\lambda_1g(x_2^*)+\lambda_2g(x_1^*)\\
\Longrightarrow & [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]\\
>& [f(x_2^*)+\lambda_1g(x_2^*)]+[f(x_1^*)+\lambda_2g(x_1^*)]\\
\ge& [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]
\end{align}
where the final inequality uses the optimality of $x_1^*$ for $L(\cdot\,;\lambda_1)$ and of $x_2^*$ for $L(\cdot\,;\lambda_2)$, i.e. $f(x_2^*)+\lambda_1 g(x_2^*)\ge f(x_1^*)+\lambda_1 g(x_1^*)$ and $f(x_1^*)+\lambda_2 g(x_1^*)\ge f(x_2^*)+\lambda_2 g(x_2^*)$. This is a contradiction, so $g(x^*)$ does not increase when $\lambda$ increases. In the context of the OP's problem, $c=\Vert\beta^*(\lambda)\Vert^2$ does not increase when $\lambda$ increases.
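A quick numerical illustration of this monotonicity for the ridge problem itself (the synthetic data and the grid of $\lambda$ values are assumptions made for the example): $\Vert\beta^*(\lambda)\Vert^2$ should be non-increasing along an increasing grid of $\lambda$.

```python
# Check that ||beta_ridge(lambda)||^2 is non-increasing in lambda
# (illustrative sketch; the data and the lambda grid are assumptions).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

def sq_norm_ridge(lam):
    """||beta(lam)||^2 with beta(lam) = (X^T X + lam I)^{-1} X^T y."""
    b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return b @ b

lams = np.linspace(0.01, 10.0, 100)
norms = np.array([sq_norm_ridge(l) for l in lams])
print(np.all(np.diff(norms) <= 0))   # should print True (non-increasing)
```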