I am currently trying to understand the math used in training a neural network, in which gradient descent is used to minimize the error between the target and the extracted output. I am currently following/reading this tutorial. So, as an example: given a network like this, we wish to minimize the error function, which for one training set (x, y) is \begin{align} J(W,b; […]
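The update rule behind that training procedure can be sketched on a toy problem. This is my own minimal example, not the tutorial's network: plain gradient descent on a one-dimensional quadratic error, where backpropagation would normally supply the gradient.

```python
# Minimal sketch (toy example, not the tutorial's J(W,b)): gradient descent
# on the quadratic error J(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# The update w <- w - alpha * dJ/dw is the same rule used to train a network.

def grad_J(w):
    return 2.0 * (w - 3.0)

w = 0.0          # initial weight
alpha = 0.1      # learning rate
for _ in range(100):
    w -= alpha * grad_J(w)

print(w)  # converges toward the minimizer w = 3
```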

A clear article on Nesterov’s Accelerated Gradient Descent (S. Bubeck, April 2013) says: “The intuition behind the algorithm is quite difficult to grasp, and unfortunately the analysis will not be very enlightening either.” This seems odd for such a powerful method (“you do not really understand something unless you can explain it to your grandmother”). […]
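To make the method concrete, here is a hedged sketch of Nesterov's accelerated gradient on a one-dimensional quadratic; the variable names and the momentum schedule are standard choices of mine, not taken from Bubeck's article. The distinctive feature is the "look-ahead": the gradient is evaluated at an extrapolated point rather than at the current iterate.

```python
# Hedged sketch of Nesterov's accelerated gradient for the smooth convex
# function f(x) = x^2 (gradient 2x). The look-ahead step evaluates the
# gradient at y = x + momentum*(x - x_prev) rather than at x itself,
# which is the part the intuition question is about.

def grad_f(x):
    return 2.0 * x

x_prev = 5.0
x = 5.0
alpha = 0.1       # step size (below 1/L for L = 2)
for t in range(1, 200):
    momentum = (t - 1) / (t + 2)          # a standard Nesterov schedule
    y = x + momentum * (x - x_prev)       # look-ahead point
    x_prev, x = x, y - alpha * grad_f(y)  # gradient step from the look-ahead

print(x)  # approaches the minimizer x = 0
```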

I understand that the gradient is the direction of steepest ascent (ref: Why is gradient the direction of steepest ascent? and Gradient of a function as the direction of steepest ascent/descent). However, I am not able to visualize it. The blue arrow is the one pointing towards the minimum/maximum. The gradient (black arrow) is not, and […]
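The picture described above can be checked numerically. This is my own small example, not the questioner's figure: for an elongated bowl, the direction of fastest increase at a point is the gradient direction, even though the gradient does not point at the minimum.

```python
# Numeric check (my own toy example): for f(x, y) = x^2 + 3*y^2 at the
# point p = (1, 1), compare the increase of f along many unit directions.
# The largest increase occurs along the (normalized) gradient, even though
# the gradient does not point at the minimum (0, 0).
import math

def f(x, y):
    return x**2 + 3*y**2

px, py = 1.0, 1.0
gx, gy = 2*px, 6*py                 # analytic gradient (2x, 6y) = (2, 6)
h = 1e-6                            # finite-difference step

best_angle, best_rate = None, -float("inf")
for k in range(3600):
    theta = 2*math.pi*k/3600
    ux, uy = math.cos(theta), math.sin(theta)
    rate = (f(px + h*ux, py + h*uy) - f(px, py)) / h  # directional derivative
    if rate > best_rate:
        best_rate, best_angle = rate, theta

grad_angle = math.atan2(gy, gx)
print(best_angle, grad_angle)  # nearly equal: steepest ascent is along the gradient
```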

I just read about projected gradient descent, but I did not see the intuition for using the projected version instead of ordinary gradient descent. Could you tell me the reason, and the situations in which projected gradient descent is preferable? What does the projection contribute?
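One way to see what the projection contributes is a constrained toy problem of my own (not from the question): without the projection step, the iterates leave the feasible set entirely.

```python
# Hedged sketch of projected gradient descent: minimize f(x) = (x - 5)^2
# subject to x in [0, 2]. After each plain gradient step we project back
# onto the feasible set; without the projection the iterates would run off
# toward the unconstrained minimizer x = 5, which is infeasible. The
# constrained minimizer is the boundary point x = 2.

def grad_f(x):
    return 2.0 * (x - 5.0)

def project(x, lo=0.0, hi=2.0):
    # Euclidean projection onto the interval [lo, hi] is just clipping
    return max(lo, min(hi, x))

x = 0.5
alpha = 0.1
for _ in range(50):
    x = project(x - alpha * grad_f(x))

print(x)  # converges to the boundary point 2.0, the constrained minimizer
```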

In order to find the local minima of a scalar function $p(x), x\in \mathbb{R}^3$, I know we can use the gradient descent method: $$x_{k+1}=x_k-\alpha_k \nabla_x p(x_k)$$ where $\alpha_k$ is the step size and $\nabla_x p(x_k)$ is the gradient of $p$ at $x_k$. My question is: what if $x$ must be constrained to lie on a sphere, i.e., $\|x_k\|=1$? Then we are […]
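For the sphere constraint specifically, the Euclidean projection onto $\|x\|=1$ is simply $x \mapsto x/\|x\|$, so one natural scheme is to renormalize after each gradient step. A minimal sketch with a toy objective of my own choosing (a linear function, whose constrained minimizer on the unit sphere is known in closed form):

```python
# Minimal sketch (my toy example): gradient descent with the iterate
# renormalized onto the unit sphere after each step, i.e. the projection
# x -> x / ||x||. For the linear function p(x) = c . x, the constrained
# minimizer on ||x|| = 1 is -c/||c||, which the iteration recovers.
import math

c = [1.0, 2.0, 2.0]                    # so ||c|| = 3

def grad_p(x):
    return c[:]                        # gradient of c . x is c

def normalize(x):
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]

x = normalize([1.0, 0.0, 0.0])
alpha = 0.1
for _ in range(500):
    step = [xi - alpha * gi for xi, gi in zip(x, grad_p(x))]
    x = normalize(step)                # project back onto the unit sphere

print(x)  # close to -c/||c|| = [-1/3, -2/3, -2/3]
```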

I am trying to really understand, intuitively, why the gradient of a function gives the direction of steepest ascent. Assuming that the function is differentiable at the point in question: a) I had a look at a few resources online, and also at this: Why is gradient the direction of steepest ascent?, a […]
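The short derivation usually given in answer to this question fits in two lines; a sketch, assuming $f$ is differentiable at the point $a$ and $u$ ranges over unit vectors:

```latex
% Rate of change of f along a unit vector u is the directional derivative:
D_u f(a) = \nabla f(a) \cdot u
         = \|\nabla f(a)\|\,\|u\|\cos\theta
         = \|\nabla f(a)\|\cos\theta,
% where theta is the angle between u and \nabla f(a). Since cos(theta) is
% maximized at theta = 0, the rate of increase is largest when u points
% along \nabla f(a): the steepest-ascent property.
```

The same computation with $\theta=\pi$ gives $-\|\nabla f(a)\|$, so the negative gradient is the direction of steepest descent.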

I’m studying the stochastic gradient descent algorithm for optimization. It looks like this: $L(w) = \frac{1}{N} \sum_{n=1}^{N} L_n(w)$, $w^{(t+1)} = w^{(t)} - \gamma \nabla L_n(w^{(t)})$. I assume that $n$ is chosen randomly at each iteration of the algorithm (is that correct?). The problem comes when my notes state that $E[\nabla L_n(w)] = \nabla L(w)$. Where does this come from?
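The identity follows from $n$ being uniform on $\{1,\dots,N\}$: the expectation of $\nabla L_n(w)$ is then the plain average $\frac{1}{N}\sum_n \nabla L_n(w)$, which by linearity is $\nabla L(w)$. A numeric check with toy data of my own choosing:

```python
# Numeric check of E[grad L_n(w)] = grad L(w) (toy data, my own choice):
# with L_n(w) = (w - a_n)^2 and n drawn uniformly from {0, ..., N-1},
# the expectation of the stochastic gradient is the average of the
# per-sample gradients, which equals the gradient of L(w) = (1/N) sum L_n(w).

a = [1.0, 2.0, 4.0, 9.0]           # per-sample targets, N = 4
w = 0.5

def grad_Ln(w, n):
    return 2.0 * (w - a[n])        # gradient of (w - a_n)^2

# E[grad L_n(w)] under uniform n is the plain average of the N gradients:
expected_stochastic = sum(grad_Ln(w, n) for n in range(len(a))) / len(a)

# gradient of L(w) = (1/N) sum_n (w - a_n)^2 is 2*(w - mean(a)):
full_gradient = 2.0 * (w - sum(a) / len(a))

print(expected_stochastic, full_gradient)  # the two values agree
```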
