Tikhonov regularization (or ridge regression) adds a constraint that $\|\beta\|^2$, the $L^2$-norm of the parameter vector, is not greater than a given value (say $c$). Equivalently, it may solve an unconstrained minimization of the least-squares penalty with $\alpha\|\beta\|^2$ added, where $\alpha$ is a constant (this is the Lagrangian form of the constrained problem).

The above is from Wikipedia. Why is the unconstrained least-squares problem with $\alpha\|\beta\|^2$ added to the cost equivalent to the least-squares problem with the additional constraint $\|\beta\|^2 \leq c$?

What is the relation between $\alpha$ and $c$?

Thanks!

Let us first define the two problems:

- Problem 1: \begin{equation} \min_{\beta} ~f_\alpha(\beta):=\frac{1}{2}\Vert y-X\beta\Vert^2 +\alpha\Vert \beta\Vert^2\end{equation}
- Problem 2: \begin{align} \min_{\beta} ~&\frac{1}{2}\Vert y-X\beta\Vert^2\\ s.t.~&\Vert \beta\Vert^2-c\leq 0\end{align}

The Lagrangian for Problem 2 reads:

\begin{equation}

\mathcal{L}(\beta,\lambda)=\frac{1}{2}\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2-c)

\end{equation}

and you can probably already see its resemblance to Problem 1: the two are identical except for the constant term $-\lambda c$, which does not affect the minimizer.

Now let us look at the necessary conditions for optimality. For Problem 1, these read:

\begin{equation}

\nabla_\beta f_\alpha(\beta^*(\alpha))=0

\end{equation}

where we deliberately write $\beta^*(\alpha)$ to emphasize that this is the optimal solution for a given $\alpha$.

For Problem 2, the KKT conditions imply that we have:

\begin{align*}

\nabla_\beta \mathcal{L}(\beta^*,\lambda^*)&=\nabla_\beta f_\lambda(\beta^*)=0\\

\lambda^* (\Vert \beta^*\Vert^2-c)&=0

\end{align*}

The first line says that the gradient of the Lagrangian with respect to $\beta$ must vanish, and the second is the complementary slackness condition. (We also need $\lambda^* \geq 0$, but this is less important for our discussion.) Also observe that the gradient of the Lagrangian equals the gradient of $f_\lambda$, the objective function of Problem 1 but with $\lambda$ in place of $\alpha$.

Now suppose we solve Problem 1 for a given $\alpha$ and obtain its solution $\beta^*(\alpha)$. Let $c=\Vert \beta^*(\alpha)\Vert^2$, the squared norm of the solution to Problem 1. Then $\lambda^*=\alpha$ and $\beta^*=\beta^*(\alpha)$ satisfy the KKT conditions for Problem 2 (complementary slackness holds because $\Vert \beta^*\Vert^2-c=0$ by construction), showing that both problems have the same solution. Conversely, if you solved Problem 2, you could set $\alpha=\lambda^*$ to retrieve the same solution by solving Problem 1.

To sum it up, both problems are equivalent when $c=\Vert \beta^*(\alpha)\Vert^2$.
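The correspondence above is easy to check numerically. A minimal sketch with NumPy, on a hypothetical random instance (the factor of $2$ in the linear system comes from the $\frac{1}{2}$ on the least-squares term in Problem 1):

```python
import numpy as np

# Hypothetical small instance to illustrate the equivalence numerically.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
alpha = 0.5

# Problem 1: minimize (1/2)||y - X b||^2 + alpha ||b||^2.
# Setting the gradient to zero gives (X^T X + 2*alpha*I) b = X^T y.
beta_p1 = np.linalg.solve(X.T @ X + 2 * alpha * np.eye(5), X.T @ y)

# Choose the budget c as the squared norm of that solution.
c = beta_p1 @ beta_p1

# Problem 2 with lambda* = alpha: KKT stationarity is the same linear
# system, and complementary slackness holds because ||beta||^2 - c = 0
# by construction.
lam = alpha
beta_p2 = np.linalg.solve(X.T @ X + 2 * lam * np.eye(5), X.T @ y)

assert np.allclose(beta_p1, beta_p2)
assert np.isclose(beta_p2 @ beta_p2, c)
```

The two solves are the same linear system precisely because we set $\lambda^*=\alpha$; the only thing that differs between the problems is how the regularization strength is specified.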

Joe’s answer looks good, but if you’re also looking for a citation, this paper covers it as well in Theorem 1: http://papers.nips.cc/paper/3675-efficient-and-accurate-lp-norm-multiple-kernel-learning (Note: The meat of the proof is actually in the supplemental materials).

Kloft et al, “Efficient and Accurate Lp-Norm Multiple Kernel Learning”. NIPS 2009.

You can do this directly if you want to. To solve the optimization problem

\begin{align}

\min_{\beta} ~&\Vert y-X\beta\Vert^2\\

\mathrm{s.t.}~&\Vert \beta\Vert^2\le c\ ,

\end{align}

following the standard primal-dual procedure, first form the dual function

\begin{align}

g(\lambda)=&\inf_\beta\mathcal{L}(\beta,\lambda)\\

=&\inf_\beta\left[\Vert y-X\beta\Vert^2+\lambda (\Vert \beta\Vert^2- c)\right]\\

=& \Vert y-X(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2 + \lambda (\Vert(X^\mathrm{T}X+\lambda I)^{-1}X^\mathrm{T}y\Vert^2-c)\ ,

\end{align}

then solve $\max_{\lambda\ge 0} g(\lambda)$. You will find that

$$

\frac{\partial g}{\partial\lambda}=y^\mathrm{T}X(X^\mathrm{T}X+\lambda I)^{-2}X^\mathrm{T}y-c=0\iff c=\Vert\beta^*_{\mathrm{ridge}}(\lambda)\Vert^2\ .

$$
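Going in the other direction, given a budget $c$ one can find the matching $\lambda$ by solving $\partial g/\partial\lambda = \Vert\beta^*_{\mathrm{ridge}}(\lambda)\Vert^2 - c = 0$. Since the squared norm of the ridge solution is decreasing in $\lambda$, simple bisection suffices; a sketch on a hypothetical random instance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)

def beta_ridge(lam):
    # beta*(lambda) = (X^T X + lambda I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def dg_dlambda(lam, c):
    # dg/dlambda = ||beta*(lambda)||^2 - c
    b = beta_ridge(lam)
    return b @ b - c

# Pick a budget strictly between 0 and the unconstrained norm,
# so the constraint is active at the optimum.
c = 0.25 * (beta_ridge(0.0) @ beta_ridge(0.0))

# ||beta*(lambda)||^2 is decreasing in lambda, so bisect on dg/dlambda.
lo, hi = 0.0, 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if dg_dlambda(mid, c) > 0:
        lo = mid          # norm still too large: increase lambda
    else:
        hi = mid          # norm too small: decrease lambda
lam_star = 0.5 * (lo + hi)

assert np.isclose(beta_ridge(lam_star) @ beta_ridge(lam_star), c, rtol=1e-4)
```

The upper bracket `1e6` is an assumption chosen large enough that the ridge norm has shrunk below $c$; any bracket containing the root works.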

The matrix derivatives

\begin{align}

\frac{\partial AU(x)B}{\partial x} = & A\frac{\partial U(x)}{\partial x}B\\

\frac{\partial U(x)^{-1}}{\partial x} = &-U(x)^{-1} \frac{\partial U(x)}{\partial x}U(x)^{-1}

\end{align}

will be helpful.
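If you want to convince yourself of the second identity, a quick finite-difference check is enough. A sketch for the specific $U(x) = X^\mathrm{T}X + xI$ used above, whose derivative $\partial U/\partial x$ is the identity:

```python
import numpy as np

# Finite-difference check of d(U^{-1})/dx = -U^{-1} (dU/dx) U^{-1}
# for U(x) = X^T X + x I, so dU/dx = I.
rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))
A = X.T @ X

def U_inv(x):
    return np.linalg.inv(A + x * np.eye(3))

x, h = 1.0, 1e-6
numeric = (U_inv(x + h) - U_inv(x - h)) / (2 * h)   # central difference
analytic = -U_inv(x) @ np.eye(3) @ U_inv(x)

assert np.allclose(numeric, analytic, atol=1e-5)
```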

**Update**:

By the way, you can prove that $c$ does not increase as $\lambda$ increases. More generally, let $L(x;\lambda)=f(x)+\lambda g(x)$ and $x_i^*=\mathop{\mathrm{arg\,min}}_x L(x;\lambda_i)$ for $i=1,2$. Suppose, for contradiction, that $\lambda_2>\lambda_1$ but $g(x_2^*)>g(x_1^*)$. Then

\begin{align}

&(\lambda_2-\lambda_1)(g(x_2^*)-g(x_1^*))>0\\

\Longrightarrow & \lambda_1g(x_1^*)+\lambda_2g(x_2^*)>\lambda_1g(x_2^*)+\lambda_2g(x_1^*)\\

\Longrightarrow & [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]>[f(x_2^*)+\lambda_1g(x_2^*)]+[f(x_1^*)+\lambda_2g(x_1^*)] \ge [f(x_1^*)+\lambda_1g(x_1^*)]+[f(x_2^*)+\lambda_2g(x_2^*)]

\end{align}

where the final inequality holds because $x_1^*$ minimizes $L(x;\lambda_1)$ and $x_2^*$ minimizes $L(x;\lambda_2)$. This is a contradiction, so $g(x^*)$ does not increase as $\lambda$ increases. In the context of the OP's problem, $c$ does not increase as $\lambda$ increases.
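This monotonicity is easy to observe numerically on the ridge problem, where $g(\beta)=\Vert\beta\Vert^2$. A sketch on a hypothetical random instance:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 6))
y = rng.standard_normal(25)

def sq_norm(lam):
    # Squared norm of the ridge minimizer for a given lambda.
    b = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
    return b @ b

# g(x*) = ||beta*(lambda)||^2 should not increase as lambda grows.
norms = [sq_norm(lam) for lam in [0.0, 0.1, 1.0, 10.0, 100.0]]
assert all(a >= b for a, b in zip(norms, norms[1:]))
```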
