
The nuclear norm is defined in the following way

$$\|X\|_*=\mathrm{tr}(\sqrt{X^T X})$$

I’m trying to take the derivative of the nuclear norm with respect to its argument:

$$\frac{\partial \|X\|_*}{\partial X}$$

Note that $\|X\|_*$ is a norm and is convex. I’m using this in a coordinate-descent optimization algorithm. Thank you for your help.

As I said in my comment, in a convex optimization setting, one would normally *not* use the derivative/subgradient of the nuclear norm function. It is, after all, nondifferentiable, and as such cannot be used in standard descent approaches (though I suspect some people have probably applied semismooth methods to it).

Here are two alternative approaches for “handling” the nuclear norm.

*Semidefinite programming*. We can use the following identity: the nuclear norm inequality $\|X\|_*\leq y$ is satisfied if and only if there exist symmetric matrices $W_1$, $W_2$ satisfying

$$\begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0, ~ \mathop{\textrm{Tr}}W_1 + \mathop{\textrm{Tr}}W_2 \leq 2 y$$

Here, $\succeq 0$ should be interpreted to mean that the $2\times 2$ block matrix is positive semidefinite. Because of this transformation, you can handle nuclear norm minimization or upper bounds on the nuclear norm in any semidefinite programming setting.
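To see the identity in action: if $X=U\Sigma V^T$ is an SVD, the choices $W_1=U\Sigma U^T$ and $W_2=V\Sigma V^T$ attain the bound. A quick NumPy check of this construction (my own sketch, not tied to any particular solver):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
X = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
nuc = s.sum()                                # nuclear norm = sum of singular values

# Witnesses attaining the SDP bound: W1 = U diag(s) U^T, W2 = V diag(s) V^T
W1 = U @ np.diag(s) @ U.T
W2 = Vt.T @ np.diag(s) @ Vt

M = np.block([[W1, X], [X.T, W2]])           # the 2x2 block matrix from the LMI
assert np.linalg.eigvalsh(M).min() > -1e-10  # positive semidefinite
assert np.isclose(np.trace(W1) + np.trace(W2), 2 * nuc)
```

The block matrix factors as $P\,\mathrm{diag}(s)\,P^T$ with $P$ the stack of $U$ over $V$, which is why it is positive semidefinite.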

For instance, given some equality constraints $\mathcal{A}(X)=b$ where $\mathcal{A}$ is a linear operator, you could do this:

$$\begin{array}{ll}

\text{minimize} & \|X\|_* \\

\text{subject to} & \mathcal{A}(X)=b \end{array}

\quad\Longleftrightarrow\quad

\begin{array}{ll}

\text{minimize} & \tfrac{1}{2}\left( \mathop{\textrm{Tr}}W_1 + \mathop{\textrm{Tr}}W_2 \right) \\

\text{subject to} & \begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0 \\ & \mathcal{A}(X)=b \end{array}

$$

My software CVX uses this transformation to implement the function `norm_nuc`, but any semidefinite programming software can handle this. One downside to this method is that semidefinite programming can be expensive; and if $m\ll n$ or $n\ll m$, that expense is exacerbated, since the size of the linear matrix inequality is $(m+n)\times (m+n)$.

*Projected/proximal gradients*. Consider the following related problems:

$$\begin{array}{ll}

\text{minimize} & \|\mathcal{A}(X)-b\|_2^2 \\

\text{subject to} & \|X\|_*\leq \delta

\end{array} \quad

$$ $$\text{minimize} ~~ \|\mathcal{A}(X)-b\|_2^2+\lambda\|X\|_*$$

Both of these problems trace out *tradeoff curves*: as $\delta$ or $\lambda$ is varied, you generate a tradeoff between $\|\mathcal{A}(X)-b\|$ and $\|X\|_*$. In a very real sense, these problems are *equivalent*: for a fixed value of $\delta$, there is going to be a corresponding value of $\lambda$ that yields the *exact* same value of $X$ (at least on the interior of the tradeoff curve). So it is worth considering these problems together.

The first of these problems can be solved using a *projected gradient* approach. This approach alternates between gradient steps on the smooth objective and *projections* back onto the feasible set $\|X\|_*\leq \delta$. The projection step requires being able to compute

$$\mathop{\textrm{Proj}}(Y) = \mathop{\textrm{arg min}}_{\{X\,|\,\|X\|_*\leq\delta\}} \| X - Y \|$$

which can be done at about the cost of a single SVD plus some $O(n)$ operations.
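A minimal NumPy sketch of that projection (the helper names `proj_simplex` and `proj_nuclear_ball` are mine; I assume the standard sort-based simplex projection for the spectrum):

```python
import numpy as np

def proj_simplex(v, radius):
    """Project a nonnegative vector v onto the simplex {x >= 0, sum(x) = radius}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def proj_nuclear_ball(Y, delta):
    """Project Y onto {X : ||X||_* <= delta}: one SVD plus cheap work on the spectrum."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if s.sum() <= delta:
        return Y                              # already feasible
    return U @ np.diag(proj_simplex(s, delta)) @ Vt
```

By unitary invariance of the nuclear norm, projecting the matrix reduces to projecting its singular values onto the $\ell_1$ ball of radius $\delta$.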

The second model can be solved using a *proximal gradient* approach, which is very closely related to projected gradients. In this case, you alternate between taking gradient steps on the smooth portion, followed by an evaluation of the *proximal function*

$$\mathop{\textrm{Prox}}(Y) = \mathop{\textrm{arg min}}_X \|X\|_* + \tfrac{1}{2}t^{-1}\|X-Y\|^2$$

where $t$ is a step size. This function can also be computed with a single SVD and some thresholding. It’s actually easier to implement than the projection. For that reason, the proximal model is preferable to the projection model. When you have the choice, solve the easier model!
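Concretely, the prox step is singular-value soft-thresholding. A minimal sketch under the definitions above (the name `prox_nuclear` echoes the TFOCS routine mentioned in this answer, but this implementation is my own):

```python
import numpy as np

def prox_nuclear(Y, t):
    """argmin_X ||X||_* + 1/(2t) ||X - Y||_F^2: soft-threshold the singular values by t."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```

Note that, unlike the projection, no auxiliary search over the spectrum is needed: each singular value is simply shrunk toward zero by $t$.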

I would encourage you to do a literature search on proximal gradient methods, and nuclear norm problems in particular. There is actually quite a bit of work out there on this. For example, these lecture notes by Laurent El Ghaoui at Berkeley talk about the proximal gradient method and introduce the prox function for nuclear norms. My software TFOCS includes both the nuclear norm projection and the prox function. You do not have to use this software, but you could look at the implementations of `prox_nuclear` and `proj_nuclear` for some hints.

Start with the SVD of $x$:

$$x=U\Sigma V^T$$

Then $$\|x\|_*=tr(\sqrt{x^Tx})=tr(\sqrt{(U\Sigma V^T)^T(U\Sigma V^T)})$$

$$\Rightarrow \|x\|_*=tr(\sqrt{V\Sigma U^T U\Sigma V^T})=tr(\sqrt{V\Sigma^2V^T})$$

By circularity of trace:

$$\Rightarrow \|x\|_*=tr(\sqrt{V^TV\Sigma^2})=tr(\sqrt{\Sigma^2})=tr(\Sigma)$$

since the diagonal entries of $\Sigma$ are non-negative.

Therefore the nuclear norm can also be defined as the sum of the absolute values of the singular values of the input matrix.

Now, note that the absolute value function is not differentiable on every point in its domain, but you can find a subgradient.

$$\frac{\partial \|x\|_*}{\partial x}=\frac{\partial tr(\Sigma)}{\partial x}=\frac{ tr(\partial\Sigma)}{\partial x}$$

We need to find $\partial\Sigma$. Since $\Sigma$ is diagonal (and invertible, assuming full rank), we can write $\partial\Sigma=\Sigma\Sigma^{-1}\partial\Sigma$, so we have:

$$\frac{\partial \|x\|_*}{\partial x}=\frac{ tr(\Sigma\Sigma^{-1}\partial\Sigma)}{\partial x}$$ (I)

So we should find $\partial\Sigma$.

$x=U\Sigma V^T$, therefore:

$$\partial x=\partial U\Sigma V^T+U\partial\Sigma V^T+U\Sigma\partial V^T$$

Therefore:

$$U\partial\Sigma V^T=\partial x-\partial U\Sigma V^T-U\Sigma\partial V^T$$

$$\Rightarrow U^TU\partial\Sigma V^TV=U^T\partial xV-U^T\partial U\Sigma V^TV-U^TU\Sigma\partial V^TV$$

$$\Rightarrow \partial\Sigma =U^T\partial xV-U^T\partial U\Sigma - \Sigma\partial V^TV$$

You can show that $-U^T\partial U\Sigma - \Sigma\partial V^TV=0$ (hint: $U^T\partial U$ and $\partial V^T V$ are antisymmetric while $\Sigma$ is diagonal; I skip the proof here), therefore:

$$\partial\Sigma =U^T\partial xV$$

By substitution into (I):

$$\frac{\partial \|x\|_*}{\partial x}=\frac{ tr(\Sigma\Sigma^{-1} \partial\Sigma)}{\partial x}=\frac{ tr(U^T\partial xV)}{\partial x}=\frac{ tr(VU^T\partial x)}{\partial x}=(VU^T)^T$$

Therefore you can use $U V^T$ as a subgradient.
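At a square $x$ with distinct nonzero singular values the nuclear norm is actually differentiable, so the $UV^T$ claim can be checked against a finite difference (a quick NumPy sanity check, not part of the original argument):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
X = rng.standard_normal((n, n))             # generic: distinct nonzero singular values
U, s, Vt = np.linalg.svd(X)
G = U @ Vt                                  # the claimed (sub)gradient U V^T

def nuc(A):
    """Nuclear norm: sum of singular values."""
    return np.linalg.svd(A, compute_uv=False).sum()

H = rng.standard_normal((n, n))             # random perturbation direction
eps = 1e-6
fd = (nuc(X + eps * H) - nuc(X - eps * H)) / (2 * eps)   # directional derivative
assert np.isclose(fd, np.sum(G * H), atol=1e-4)          # matches <U V^T, H>
```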

You can use this nice result for the differential of the trace

$$ d\,\mathrm{tr}(f(A)) = f'(A^T):dA $$

to write

$$ \eqalign {
d\,\mathrm{tr}((x^Tx)^{\frac {1} {2}}) &= \frac {1} {2} (x^Tx)^{-\frac {1} {2}}:d(x^Tx) \cr
&= \frac {1} {2} (x^Tx)^{-\frac {1} {2}}:(dx^T x + x^T dx) \cr
&= x(x^Tx)^{-\frac {1} {2}}: dx \cr
} $$

Yielding the derivative as

$$ \frac {\partial\|x\|_*} {\partial x} = x(x^Tx)^{-\frac {1} {2}} $$

Another nice result (this one’s from Higham)

$$ A\,f(B\,A) = f(A\,B)\,A $$

yields an alternative expression with (potentially) smaller dimensions

$$ \frac {\partial\|x\|_*} {\partial x} = (x\,x^T)^{-\frac {1} {2}}x $$

While the square root of $x^Tx$ certainly exists, the inverse may not. So you might need some sort of regularization, e.g.

$$ \frac {\partial\|x\|_*} {\partial x} = x(x^Tx+\varepsilon I)^{-\frac {1} {2}} $$
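For a tall full-rank $x$, this formula agrees with the $UV^T$ expression obtained from the SVD. A quick NumPy check (assuming generic random data so that $x^Tx$ is invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((6, 4))             # tall matrix: x^T x invertible generically

# (x^T x)^{-1/2} via the eigendecomposition of the symmetric matrix x^T x
w, V = np.linalg.eigh(x.T @ x)
inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

U, s, Vt = np.linalg.svd(x, full_matrices=False)
assert np.allclose(x @ inv_sqrt, U @ Vt)    # x (x^T x)^{-1/2} equals U V^T
```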

Of course, $n:x\in M_{n,p}\rightarrow tr(\sqrt{x^Tx})$ is differentiable at any $x$ such that $x^Tx$ is invertible, that is, in the generic case when $n\geq p$ (if $n\leq p$, then consider $tr(\sqrt{xx^T})$ instead). greg’s result is correct; however, his proof is unclear, so I rewrite it for convenience.

If $A$ is symmetric positive definite, then $f:A\rightarrow \sqrt{A}$ is a matrix function (cf. Higham’s book on the subject); if $g$ is a matrix function and $\phi:A\rightarrow tr(g(A))$, then its derivative is $D\phi_A:K\rightarrow tr(g'(A)K)$. Let $A=x^Tx$. Thus $Dn_x:H\rightarrow tr(f'(A)(H^Tx+x^TH))=tr((f'(A)^Tx^T+f'(A)x^T)H)$. Then the gradient of $n$ is $\nabla(n)(x)=x(f'(A)+f'(A)^T)=2xf'(A)=x(x^Tx)^{-1/2}$.

As Alt did, we can use the SVD, and we find $\nabla(n)(x)=U\Sigma (\Sigma^T\Sigma)^{-1/2}V^T$ ($=UV^T$ if $n=p$). Note, in reply to Alt, that the diagonal entries of $\Sigma$ are already $\geq 0$.

Alt’s answer has a fundamental error. First of all, the nuclear norm is the sum of all singular values, not the absolute values of the singular values.

To make it right, we first need to define the matrix square root by $\sqrt{BB}=B$ (for symmetric positive semidefinite $B$). As Alt showed,

$||x||_*=tr(\sqrt{V\Sigma^2V^T})$

But we cannot apply the circularity of trace inside the square root, because the resulting expression is not well defined.

We should do something like this,

$||x||_*=tr(\sqrt{V\Sigma^2V^T})=tr(\sqrt{V\Sigma V^TV\Sigma V^T})=tr(V\Sigma V^T)$,

the last equality follows from the definition of the matrix square root described above. Then, by the circularity of trace, we get

$tr(V\Sigma V^T)=tr(\Sigma V^TV)=tr(\Sigma)=\sum_i \sigma_i$.

What about $\| M \|_{F} = \sqrt{\mathrm{Trace}(M^{T}M)}$?
