Articles on gradient descent

Trying to understand the math behind backpropagation in neural nets

I am currently trying to understand the math used to train a neural network, in which gradient descent is used to minimize the error between the target and the extracted output. I am currently following/reading this tutorial. So, as an example: given a network like this, we wish to minimize the error function, which for one training example (x,y) is \begin{align} J(W,b; […]
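
The excerpt is cut off before the cost function, but tutorials of this kind typically use the squared error $J(W,b;x,y)=\tfrac{1}{2}\|h_{W,b}(x)-y\|^2$ and update each parameter by one gradient-descent step. The sketch below is my own minimal illustration under that assumption (a hypothetical 2-3-1 sigmoid network, not the tutorial's exact example):

```python
import numpy as np

# Minimal sketch: one hidden layer, sigmoid activations,
# squared-error loss J = 0.5 * ||h(x) - y||^2.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 2 inputs, 3 hidden units, 1 output.
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

x, y = np.array([0.5, -1.0]), np.array([1.0])
alpha = 0.1  # learning rate (step size)

# Forward pass.
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)           # network output h_{W,b}(x)

# Backward pass (chain rule), giving dJ/dW and dJ/db.
delta2 = (a2 - y) * a2 * (1 - a2)    # output-layer error
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# One gradient-descent step on each parameter.
W2 -= alpha * np.outer(delta2, a1); b2 -= alpha * delta2
W1 -= alpha * np.outer(delta1, x);  b1 -= alpha * delta1
```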

Intuition for gradient descent with Nesterov momentum

A clear article on Nesterov’s Accelerated Gradient Descent (S. Bubeck, April 2013) says: “The intuition behind the algorithm is quite difficult to grasp, and unfortunately the analysis will not be very enlightening either.” This seems odd for such a powerful method (“you do not really understand something unless you can explain it to your grandmother”). […]
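
For concreteness, one common way to state the update is the “look-ahead” formulation popularized by Sutskever et al.: the gradient is evaluated at the anticipated position $x + \mu v$ rather than at the current iterate. The sketch below uses that formulation; `grad_f`, the step size, and the momentum coefficient are placeholders, not values from the cited article:

```python
import numpy as np

def nesterov_gd(grad_f, x0, lr=0.01, momentum=0.9, steps=100):
    """Sketch of Nesterov's accelerated gradient (look-ahead form):
    the gradient is taken at x + momentum*v, not at x itself."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        lookahead = x + momentum * v          # peek ahead along the momentum
        v = momentum * v - lr * grad_f(lookahead)
        x = x + v
    return x

# Example on a simple quadratic f(x) = 0.5 * x^T A x with A = diag(1, 10).
A = np.diag([1.0, 10.0])
x_min = nesterov_gd(lambda x: A @ x, x0=[5.0, 5.0], lr=0.05, momentum=0.9)
```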

Gradient is NOT the direction that points to the minimum or maximum

I understand that the gradient is the direction of steepest ascent (ref: Why is gradient the direction of steepest ascent? and Gradient of a function as the direction of steepest ascent/descent). However, I am not able to visualize it. The blue arrow is the one pointing towards the minimum/maximum. The gradient (black arrow) is not, and […]
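
A concrete example (my own, not from the linked questions) makes the distinction easy to draw: take the elongated bowl $f(x,y)=x^2+10y^2$, whose minimum is at the origin, and look at the point $(1,1)$:

$$\nabla f(1,1) = (2,\,20), \qquad \text{direction toward the minimum} = (0,0)-(1,1) = (-1,-1).$$

The negative gradient $(-2,-20)$ is the locally steepest downhill direction, but it is far from parallel to $(-1,-1)$. Only when the level sets are perfect circles (e.g. $f=x^2+y^2$) does the gradient line up with the direction to the minimizer.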

What is the difference between projected gradient descent and ordinary gradient descent?

I just read about projected gradient descent, but I did not see the intuition for using the projected version instead of ordinary gradient descent. Could you tell me the reason for it, and the situations in which projected gradient descent is preferable? What does the projection contribute?
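
For reference, the mechanical difference is small: projected gradient descent takes an ordinary gradient step and then projects the result back onto the feasible set $C$, so the iterates never leave the constraint set. The sketch below uses a hypothetical choice of $C$ (the unit $\ell_2$ ball), just to make the projection concrete:

```python
import numpy as np

def project_onto_l2_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad_f, project, x0, lr=0.1, steps=200):
    """Sketch: an ordinary gradient step followed by a projection
    back onto the feasible set C."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = project(x - lr * grad_f(x))   # step, then project
    return x

# Hypothetical example: minimize ||x - c||^2 subject to ||x|| <= 1.
c = np.array([3.0, 4.0])
x_star = projected_gradient_descent(lambda x: 2 * (x - c),
                                    project_onto_l2_ball,
                                    x0=np.zeros(2))
# The constrained minimizer lies on the boundary, at c / ||c||.
```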

Gradient descent with constraints

In order to find a local minimum of a scalar function $p(x), x\in \mathbb{R}^3$, I know we can use the gradient descent method: $$x_{k+1}=x_k-\alpha_k \nabla_x p(x_k)$$ where $\alpha_k$ is the step size and $\nabla_x p(x_k)$ is the gradient of $p$ at $x_k$. My question is: what if $x$ must be constrained to a sphere, i.e., $\|x_k\|=1$? Then we are […]
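
One common approach (a sketch of the general idea, not necessarily what the answers to this question propose) is to project each iterate back onto the unit sphere, i.e. renormalize after every step; a Riemannian variant also removes the radial component of the gradient first, so the step stays on the sphere to first order. The objective `p` here is a hypothetical linear function used only to test the loop:

```python
import numpy as np

def sphere_constrained_gd(grad_p, x0, alpha=0.1, steps=500):
    """Sketch of projected gradient descent on the unit sphere ||x|| = 1:
    take the usual step, then renormalize the iterate."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    for _ in range(steps):
        g = grad_p(x)
        g_tangent = g - np.dot(g, x) * x   # drop the radial component
        x = x - alpha * g_tangent
        x = x / np.linalg.norm(x)          # project back onto ||x|| = 1
    return x

# Hypothetical example: p(x) = a . x, minimized on the sphere at -a/||a||.
a = np.array([1.0, 2.0, 2.0])
x_min = sphere_constrained_gd(lambda x: a, x0=np.array([1.0, 0.0, 0.0]))
```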

Gradient of a function as the direction of steepest ascent/descent

I am trying to really understand, intuitively, why the gradient of a function gives the direction of steepest ascent. Assuming that the function is differentiable at the point in question: a) I had a look at a few resources online and also looked at Why is gradient the direction of steepest ascent?, a […]
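
For reference, the standard one-line argument uses the directional derivative of a differentiable $f$ at a point $\mathbf p$ in a unit direction $\mathbf u$:

$$D_{\mathbf u} f(\mathbf p) = \nabla f(\mathbf p) \cdot \mathbf u = \|\nabla f(\mathbf p)\|\,\|\mathbf u\|\cos\theta .$$

Among all unit vectors $\mathbf u$, this is largest when $\cos\theta = 1$, i.e. when $\mathbf u$ points along $\nabla f(\mathbf p)$, and most negative when $\mathbf u$ points along $-\nabla f(\mathbf p)$ (steepest descent). The intuition the question asks for is exactly why this dot-product picture holds.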

Expectation of gradient in stochastic gradient descent algorithm

I’m studying the stochastic gradient descent algorithm for optimization. It looks like this: $$L(w) = \frac{1}{N} \sum_{n=1}^{N} L_n(w), \qquad w^{(t+1)} = w^{(t)} - \gamma \nabla L_n(w^{(t)})$$ I assume that $n$ is chosen randomly each time the algorithm iterates. The problem comes when my notes state that $E[\nabla L_n(w)] = \nabla L(w)$. Where does this come from?
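
Assuming, as the question does, that $n$ is drawn uniformly from $\{1,\dots,N\}$ and independently of $w$, the identity follows directly from the definition of expectation and the linearity of the gradient:

$$E[\nabla L_n(w)] = \sum_{n=1}^{N} P(n)\,\nabla L_n(w) = \frac{1}{N}\sum_{n=1}^{N}\nabla L_n(w) = \nabla\!\left(\frac{1}{N}\sum_{n=1}^{N} L_n(w)\right) = \nabla L(w).$$

In other words, the stochastic gradient is an unbiased estimator of the full gradient, which is why SGD can be viewed as gradient descent on $L$ "in expectation".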