I want to understand intuitively why it is that the gradient gives the direction of steepest ascent. (I will consider the case of $f:\mathbb{R}^2\to\mathbb{R}$)
The standard proof is to note that, for a unit vector $v$ making angle $\theta$ with $\nabla f$, the directional derivative is $$D_vf=v\cdot \nabla f=|\nabla f|\,\cos\theta,$$ which is maximized at $\theta=0$. This is a good verification, but it doesn’t really help me understand the result.
Maybe the following helps to understand the intuition behind the object $\langle \nabla f,v\rangle$ occurring in the standard proof: $\nabla f(x)$ is the vector composed of the directional derivatives of $f$ in the directions of the $n$ standard basis vectors $e_1,\ldots,e_n$. Now consider a unit vector $v$ in the 1-norm, i.e. $\sum |v_i|=1$. For simplicity let’s think of the case $v_i\geq 0$.
Then $\langle \nabla f(x),v\rangle = \sum \frac{\partial f}{\partial x_i}(x) v_i$ is a convex combination of the directional derivatives along the basis directions, and it equals the directional derivative in the corresponding convex combination $v=\sum v_i e_i$ of those directions. (Remember that derivatives are, intuitively, linear approximations to the function.) This is the equation $D_v f(x) = \langle \nabla f(x),v\rangle$. Thus, if we want to find the $v$ with maximal value of $D_v f(x)$, then we have to maximize $\langle \nabla f(x),v\rangle$.
Now the intuition behind $\langle u,v\rangle$ comes from thinking in terms of orthogonal projections. To compare directions fairly, take $v$ of Euclidean length 1: the scalar product then equals the (signed) length of the projection of $u$ onto the line given by the direction $v$. This length can only be maximal if nothing is lost during the projection, i.e. if $u$ has no component orthogonal to $v$. Therefore $u$ must be a multiple of $v$, and a positive multiple because we want a maximum.
Putting everything together: assuming $\nabla f(x)\neq 0$, among unit vectors $v$ the value $D_vf(x)$ is maximal iff $v$ points in the direction of $\nabla f(x)$.
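The conclusion can also be checked by brute force. The following is only a minimal sketch: the function `f`, its hand-computed gradient `grad_f`, the base point and the helper `best_direction` are all hypothetical choices made for illustration. It scans many Euclidean unit vectors and picks the one with the largest value of $\langle \nabla f(x),v\rangle$.

```python
import math

def f(x, y):
    # Hypothetical example function, chosen only for illustration.
    return x**2 + 3.0 * x * y

def grad_f(x, y):
    # Gradient of the example f above, computed by hand.
    return (2.0 * x + 3.0 * y, 3.0 * x)

def best_direction(x0, y0, samples=3600):
    """Scan unit vectors (cos a, sin a) and return the one maximizing <grad f, v>."""
    gx, gy = grad_f(x0, y0)
    best_angle = max(
        (2.0 * math.pi * k / samples for k in range(samples)),
        key=lambda a: gx * math.cos(a) + gy * math.sin(a),
    )
    return (math.cos(best_angle), math.sin(best_angle))

x0, y0 = 1.0, 2.0  # arbitrary base point (an assumption)
gx, gy = grad_f(x0, y0)
norm = math.hypot(gx, gy)
unit_grad = (gx / norm, gy / norm)
v_best = best_direction(x0, y0)

print("normalized gradient:   ", unit_grad)
print("best sampled direction:", v_best)
```

Up to the resolution of the angular grid, the best sampled direction should agree with the normalized gradient.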
$\newcommand{\R}{\mathbf{R}}$Let $U$ be an open set in $\R^{2}$ and $f:U \to \R$ a differentiable function. If $x_{0} \in U$, then by definition there exists a linear function $Df(x_{0}):\R^{2} \to \R$ such that
$$
\lim_{x \to x_{0}} \frac{|f(x) - f(x_{0}) - Df(x_{0})(x - x_{0})|}{\|x - x_{0}\|} = 0.
$$
If $e_{1}$ and $e_{2}$ denote the standard basis of $\R^{2}$, then the partial derivatives of $f$ at $x_{0}$ are defined to be the components of $Df(x_{0})$:
$$
f_{1}(x_{0}) = Df(x_{0})(e_{1}),\qquad
f_{2}(x_{0}) = Df(x_{0})(e_{2}).
$$
That is, $[\begin{matrix} f_{1}(x_{0}) & f_{2}(x_{0})\end{matrix}]$ is the standard matrix of $Df(x_{0})$.
The gradient vector $\nabla f(x_{0})$ is defined to be the transpose,
$$
\nabla f(x_{0})
= \left[\begin{matrix}
f_{1}(x_{0}) \\
f_{2}(x_{0})
\end{matrix}\right].
$$
Rearranging the definition of the derivative gives the linear approximation formula
$$
f(x) = f(x_{0}) + Df(x_{0})(x - x_{0}) + o\bigl(\|x - x_{0}\|\bigr).
$$
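As a rough numerical illustration of what the $o\bigl(\|x - x_{0}\|\bigr)$ term means, the sketch below divides the remainder by the step length and lets the step shrink. The function `f`, the point `x0` and the unit vector `direction` are hypothetical choices, not part of the argument.

```python
import math

def f(x, y):
    # Hypothetical smooth example function.
    return math.sin(x) * math.exp(y)

def grad_f(x, y):
    # Partial derivatives of the example f, computed by hand.
    return (math.cos(x) * math.exp(y), math.sin(x) * math.exp(y))

x0 = (0.5, 0.2)          # arbitrary base point (an assumption)
direction = (0.6, 0.8)   # Euclidean unit vector, so the step has length t
g = grad_f(*x0)

ratios = []
for k in range(1, 6):
    t = 10.0 ** (-k)
    h = (t * direction[0], t * direction[1])
    # remainder = f(x) - f(x0) - Df(x0)(x - x0), with x = x0 + h
    remainder = f(x0[0] + h[0], x0[1] + h[1]) - f(*x0) - (g[0] * h[0] + g[1] * h[1])
    ratios.append(abs(remainder) / t)
    print(f"t = {t:.0e}   |remainder| / t = {ratios[-1]:.3e}")
```

The printed ratios should shrink toward zero as `t` decreases, which is what the little-o notation asserts.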
In particular, if
$v = \left[\begin{matrix}
v_{1} \\
v_{2}
\end{matrix}\right]$
is an arbitrary vector, then
\begin{align*}
f(x_{0} + tv)
&= f(x_{0}) + Df(x_{0})(tv) + o(t) \\
&= f(x_{0}) + t\, Df(x_{0})(v) + o(t) \\
&= f(x_{0}) + t\bigl(f_{1}(x_{0})v_{1} + f_{2}(x_{0})v_{2}\bigr) + o(t) \\
&= f(x_{0}) + t\, \nabla f(x_{0})\cdot v + o(t).
\end{align*}
(The first equality is the linear approximation formula with $x = x_{0} + tv$; the second follows from linearity of $Df(x_{0})$; the third comes from multiplying matrices; the fourth is the formula for the dot product.)
In terms of the function $g_{v}(t) = f(x_{0} + tv)$, the preceding equation reads $g_{v}(t) = g_{v}(0) + t\,\nabla f(x_{0})\cdot v + o(t)$, which says precisely that
$$
g_{v}'(0) = \nabla f(x_{0})\cdot v.
$$
This derivative is the rate of change of $f$ at $x_{0}$ in the direction $v$.
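The identity $g_{v}'(0) = \nabla f(x_{0})\cdot v$ can be sanity-checked with a finite difference. This is only a sketch under stated assumptions: the example `f`, its hand-computed `grad_f`, the point `x0`, the vector `v` and the helper `g_prime_at_zero` are hypothetical.

```python
def f(x, y):
    # Hypothetical example function.
    return x**2 * y + y**3

def grad_f(x, y):
    # Gradient of the example f, computed by hand.
    return (2.0 * x * y, x**2 + 3.0 * y**2)

def g_prime_at_zero(x0, v, h=1e-6):
    """Central-difference estimate of g_v'(0), where g_v(t) = f(x0 + t v)."""
    forward = f(x0[0] + h * v[0], x0[1] + h * v[1])
    backward = f(x0[0] - h * v[0], x0[1] - h * v[1])
    return (forward - backward) / (2.0 * h)

x0 = (1.0, 2.0)   # arbitrary base point (an assumption)
v = (0.6, -0.8)   # arbitrary direction; it need not be a unit vector here

numeric = g_prime_at_zero(x0, v)
g = grad_f(*x0)
exact = g[0] * v[0] + g[1] * v[1]   # the dot product grad f(x0) . v

print("finite difference:", numeric)
print("grad f . v:       ", exact)
```

The two printed numbers should agree up to the discretization and rounding error of the central difference.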
If $\nabla f(x_{0}) \neq (0, 0)$, and if $v$ is a unit vector making angle $\theta$ with $\nabla f(x_{0})$, then
$$
f(x_{0} + tv) = f(x_{0}) + t\|\nabla f(x_{0})\|\cos\theta + o(t).
$$
That is, $g_{v}'(0) = \|\nabla f(x_{0})\|\cos\theta$.
It follows immediately that:
If $\theta = 0$, i.e., if $v = \dfrac{\nabla f(x_{0})}{\|\nabla f(x_{0})\|}$, then $g_{v}'(0)$ is maximized over all unit vectors.
If $\theta = \pi$, i.e., if $v = -\dfrac{\nabla f(x_{0})}{\|\nabla f(x_{0})\|}$, then $g_{v}'(0)$ is minimized over all unit vectors.
If $\theta = \pi/2$, i.e., if $v \cdot \nabla f(x_{0}) = 0$, then $g_{v}'(0) = 0$, signifying that $f$ is constant to first order at $x_{0}$ in the direction $v$, namely that $v$ is tangent to the level curve of $f$ through $x_{0}$.
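The three cases can be illustrated numerically by rotating the normalized gradient through the angle $\theta$ and evaluating $\nabla f(x_{0})\cdot v$. In this sketch the gradient `grad_f` (of an example function chosen for illustration), the point `x0` and the helper `rotate` are hypothetical.

```python
import math

def grad_f(x, y):
    # Gradient of the hypothetical example f(x, y) = x*y + y**2.
    return (y, x + 2.0 * y)

def rotate(u, theta):
    """Rotate the plane vector u counterclockwise by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * u[0] - s * u[1], s * u[0] + c * u[1])

x0 = (1.0, 1.0)   # arbitrary base point with nonzero gradient (an assumption)
g = grad_f(*x0)
norm = math.hypot(*g)
u = (g[0] / norm, g[1] / norm)   # unit vector along the gradient

rates = {}
for name, theta in [("0", 0.0), ("pi/2", math.pi / 2), ("pi", math.pi)]:
    v = rotate(u, theta)                      # unit vector at angle theta from the gradient
    rates[name] = g[0] * v[0] + g[1] * v[1]   # g_v'(0) = grad f(x0) . v
    print(f"theta = {name:>4}: rate = {rates[name]: .6f}, "
          f"|grad f| cos(theta) = {norm * math.cos(theta): .6f}")
```

The rate should equal $\|\nabla f(x_{0})\|$ at $\theta = 0$, its negative at $\theta = \pi$, and be zero (up to rounding) at $\theta = \pi/2$.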