
> Tikhonov regularization (or ridge regression) adds a constraint that $\|\beta\|^2$, the $L^2$-norm of the parameter vector, is not greater than a given value (say $c$). Equivalently, it may solve an unconstrained minimization of the least-squares penalty with $\alpha\|\beta\|^2$ added, where $\alpha$ is a constant (this is the Lagrangian form of the constrained problem).

The above is from Wikipedia. Why is the unconstrained LS with $\alpha\|\beta\|^2$ added to the cost equivalent to the LS problem with an additional constraint that $\|\beta\|^2 \leq c$?

What is the relation between $\alpha$ and $c$?


Thanks!


Let us first define the two problems:

- Problem 1: \begin{equation} \min_{\beta} ~f_\alpha(\beta):=\frac{1}{2}\Vert y-X\beta\Vert^2 +\alpha\Vert \beta\Vert^2\end{equation}
- Problem 2: \begin{align} \min_{\beta} ~&\frac{1}{2}\Vert y-X\beta\Vert^2\\ s.t.~&\Vert \beta\Vert^2-c\leq 0\end{align}

The Lagrangian for Problem 2 reads:

\begin{equation}
\mathcal{L}(\beta,\lambda)=\frac{1}{2}\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2-c)
\end{equation}

and you probably already see the resemblance with Problem 1 (identical except for the constant term $-\lambda c$).

Now let us look at the necessary conditions for optimality. For Problem 1, these read:

\begin{equation}
\nabla_\beta f_\alpha(\beta^*(\alpha))=0
\end{equation}

where we deliberately write $\beta^*(\alpha)$ to emphasize that this is the optimal solution for a given $\alpha$.

For Problem 2, the KKT conditions imply that we have:

\begin{align*}
\nabla_\beta \mathcal{L}(\beta^*,\lambda^*)&=\nabla_\beta f_\lambda(\beta^*)=0\\
\lambda^* (\Vert \beta^*\Vert^2-c)&=0
\end{align*}

The first line says that the gradient of the Lagrangian with respect to $\beta$ must vanish, and the second is the complementary slackness condition. (We also need $\lambda^* \geq 0$, but this is less important for our discussion.) Also observe that the gradient of the Lagrangian equals the gradient of $f_\lambda$ (the objective of Problem 1, but with $\lambda$ in place of $\alpha$).

Now suppose we solve Problem 1 for a given $\alpha$ and obtain its solution $\beta^*(\alpha)$. Let $c=\Vert \beta^*(\alpha)\Vert^2$, the squared norm of the solution to Problem 1. Then $\lambda^*=\alpha$ and $\beta^*=\beta^*(\alpha)$ satisfy the KKT conditions for Problem 2, showing that both problems have the same solution. Conversely, if you solved Problem 2, you could set $\alpha=\lambda^*$ and retrieve the same solution by solving Problem 1.

To sum it up, both problems are equivalent when $c=\Vert \beta^*(\alpha)\Vert^2$.
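The equivalence above can be checked numerically. The following is a minimal sketch with synthetic data (not part of the original answer): solve Problem 1 for a fixed $\alpha$, set $c=\Vert\beta^*(\alpha)\Vert^2$, and verify that $\lambda^*=\alpha$, $\beta^*=\beta^*(\alpha)$ satisfy the KKT conditions of Problem 2.

```python
import numpy as np

# Synthetic data (arbitrary; for illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
alpha = 2.0

# Minimizer of (1/2)||y - X b||^2 + alpha ||b||^2:
# stationarity gives (X^T X + 2 alpha I) b = X^T y
beta = np.linalg.solve(X.T @ X + 2 * alpha * np.eye(5), X.T @ y)
c = beta @ beta  # choose c = ||beta*(alpha)||^2

# KKT stationarity for Problem 2 with lambda* = alpha:
# -X^T (y - X beta) + 2 lambda beta = 0
grad = -X.T @ (y - X @ beta) + 2 * alpha * beta
assert np.allclose(grad, 0)

# Complementary slackness: lambda* (||beta*||^2 - c) = 0 (by construction here)
assert np.isclose(alpha * (beta @ beta - c), 0)
print("KKT conditions hold; c =", c)
```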

Joe’s answer looks good, but if you’re also looking for a citation, this paper covers it as well in Theorem 1: http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning (Note: The meat of the proof is actually in the supplemental materials).

Kloft et al, “Efficient and Accurate Lp-Norm Multiple Kernel Learning”. NIPS 2009.

You can do this directly if you want to. To solve the optimization problem

\begin{align}
\min_{\beta} ~&\Vert y-X\beta\Vert^2\\
\mathrm{s.t.}~&\Vert \beta\Vert^2\le c\ ,
\end{align}

as in the standard primal-dual procedure, first let

\begin{align}
g(\lambda)=&\inf_\beta\mathcal{L}(\beta,\lambda)\\
=&\inf_\beta\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2- c)\\
=& \Vert y-X(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2 + \lambda (\Vert(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2-c)\ ,
\end{align}

then solve $\max_{\lambda\ge 0} g(\lambda)$. You will find that

$$
\frac{\partial g}{\partial\lambda}=y^\mathrm{T}X(X^\mathrm{T}X+\lambda I)^{-2}X^\mathrm{T}y-c=0\iff c=\Vert\beta^*_{\mathrm{ridge}}(\lambda)\Vert^2\ .
$$
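This derivative is an instance of the envelope theorem: $\partial g/\partial\lambda = \Vert\beta^*(\lambda)\Vert^2 - c$, since $\beta^*(\lambda)=(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y$. A quick finite-difference check with made-up data (a sketch, not part of the original answer):

```python
import numpy as np

# Synthetic data and an arbitrary constraint level c
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
c = 0.5

def g(lam):
    """Dual function g(lambda) = min_beta ||y - X b||^2 + lam (||b||^2 - c)."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    return np.sum((y - X @ beta) ** 2) + lam * (beta @ beta - c)

lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
analytic = beta @ beta - c  # envelope-theorem derivative: ||beta*(lam)||^2 - c

# Central finite difference of g at lam
h = 1e-6
numeric = (g(lam + h) - g(lam - h)) / (2 * h)
assert np.isclose(analytic, numeric, atol=1e-4)
print("dg/dlambda =", analytic)
```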

The matrix derivatives

\begin{align}
\frac{\partial AU(x)B}{\partial x} = & A\frac{\partial U(x)}{\partial x}B\\
\frac{\partial U(x)^{-1}}{\partial x} = &-U(x)^{-1} \frac{\partial U(x)}{\partial x}U(x)^{-1}
\end{align}

will be helpful.

**Update**:

As an aside, one can prove that $c$ does not increase when $\lambda$ increases. More generally, let $L(x;\lambda)=f(x)+\lambda g(x)$ and $x_i^*=\mathrm{arg\,min}_x L(x;\lambda_i)$ for $i=1,2$. Suppose, for contradiction, that $\lambda_2>\lambda_1$ and $g(x_2^*)>g(x_1^*)$. Then

\begin{align}
&(\lambda_2-\lambda_1)(g(x_2^*)-g(x_1^*))>0\\
\Longrightarrow & \lambda_1g(x_1^*)+\lambda_2g(x_2^*)>\lambda_1g(x_2^*)+\lambda_2g(x_1^*)\\
\Longrightarrow & [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]>[f(x_2^*)+\lambda_1g(x_2^*)]+[f(x_1^*)+\lambda_2g(x_1^*)] \ge [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]
\end{align}

(the last inequality uses the optimality of $x_1^*$ and $x_2^*$ for their respective $\lambda_i$), which is a contradiction. So $g(x^*)$ does not increase when $\lambda$ increases. In the context of the OP's problem, $c$ does not increase when $\lambda$ increases.
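For ridge regression this monotonicity is easy to observe numerically. A minimal illustration with arbitrary synthetic data (an assumption of this sketch, not taken from the answer): $\Vert\beta^*(\lambda)\Vert^2$ is non-increasing as $\lambda$ grows.

```python
import numpy as np

# Synthetic problem (illustrative only)
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
y = rng.standard_normal(40)

def norm_sq(lam):
    """||beta*(lam)||^2 for the ridge solution (X^T X + lam I)^{-1} X^T y."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
    return beta @ beta

lambdas = [0.1, 0.5, 1.0, 5.0, 10.0]
norms = [norm_sq(l) for l in lambdas]

# The squared norm of the solution shrinks as lambda increases
assert all(a >= b for a, b in zip(norms, norms[1:]))
print(norms)
```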
