How to prove the chain rule?

I have just learnt about the chain rule, but my book doesn’t give a proof of it. I tried to write a proof myself but couldn’t manage it. Could someone please explain the proof of the chain rule in elementary terms? I have just started learning calculus.


Assuming everything behaves nicely ($f$ and $g$ are differentiable, and $g(x) \neq g(a)$ whenever $x$ is close to, but not equal to, $a$), the derivative of $f(g(x))$ at the point $x = a$ is given by
$$
\begin{aligned}
&\lim_{x \to a}\frac{f(g(x)) - f(g(a))}{x-a}\\
&\quad= \lim_{x\to a}\frac{f(g(x)) - f(g(a))}{g(x) - g(a)}\cdot \frac{g(x) - g(a)}{x-a}
\end{aligned}
$$
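where, on the second line, the first factor tends to $f'(g(a))$ (since $g$ is continuous at $a$, we have $g(x) \to g(a)$ as $x \to a$) and the second factor tends to $g'(a)$, by the definition of the derivative. The limit is therefore $f'(g(a))\cdot g'(a)$.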
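
As a concrete check of this factorization (an illustrative choice of functions, not taken from the question), take $f(u) = u^2$ and $g(x) = 3x + 1$, so that $g(x) \neq g(a)$ whenever $x \neq a$:
$$
\frac{f(g(x)) - f(g(a))}{x - a}
= \frac{g(x)^2 - g(a)^2}{g(x) - g(a)}\cdot\frac{g(x) - g(a)}{x - a}
= \bigl(g(x) + g(a)\bigr)\cdot 3
\;\longrightarrow\; 2g(a)\cdot 3 = 6(3a + 1)
$$
as $x \to a$, which agrees with $f'(g(a))\cdot g'(a)$ since $f'(u) = 2u$ and $g'(x) = 3$.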

One approach is to use the fact that “differentiability” is equivalent to “approximate linearity”, in the sense that if $f$ is defined in some neighborhood of $a$, then
$$
f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}\quad\text{exists}
$$
if and only if
$$
f(a + h) = f(a) + f'(a) h + o(h)\quad\text{at $a$ (i.e., “for small $h$”).}
\tag{1}
$$
(As usual, “$o(h)$” denotes a function satisfying $o(h)/h \to 0$ as $h \to 0$.)
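
For instance (an illustrative example, not part of the original argument), take $f(x) = x^2$. Then
$$
f(a + h) = (a + h)^2 = a^2 + 2a\,h + h^2,
$$
which has the form (1) with $f'(a) = 2a$ and remainder $h^2$; the remainder is $o(h)$ because $h^2/h = h \to 0$ as $h \to 0$.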

If $f$ is differentiable at $a$ and $g$ is differentiable at $b = f(a)$, and if we write $b + k = y = f(x) = f(a + h)$, then
$$
k = y - b = f(a + h) - f(a) = f'(a) h + o(h),
$$
so $o(k) = o(h)$, i.e., any quantity negligible compared to $k$ is negligible compared to $h$. Now we simply compose the linear approximations of $g$ and $f$:
\begin{align*}
f(a + h) &= f(a) + f'(a) h + o(h), \\
g(b + k) &= g(b) + g'(b) k + o(k), \\
(g \circ f)(a + h)
&= (g \circ f)(a) + g'\bigl(f(a)\bigr)\bigl[f'(a) h + o(h)\bigr] + o(k) \\
&= (g \circ f)(a) + \bigl[g'\bigl(f(a)\bigr) f'(a)\bigr] h + o(h).
\end{align*}
Since the right-hand side has the form of a linear approximation, (1) implies that $(g \circ f)'(a)$ exists, and is equal to the coefficient of $h$, i.e.,
$$
(g \circ f)'(a) = g'\bigl(f(a)\bigr) f'(a).
$$
One nice feature of this argument is that it generalizes with almost no modifications to vector-valued functions of several variables.
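
The formula $(g \circ f)'(a) = g'\bigl(f(a)\bigr) f'(a)$ can also be sanity-checked numerically. The sketch below is only an illustration, not a proof: the inner function $f(x) = x^2$, the outer function $g(y) = \sin y$, the point $a = 1.3$, and the helper names are all assumptions chosen for the example. It compares the difference quotient of $g \circ f$ with the chain-rule value as $h$ shrinks.

```python
import math


def f(x):
    """Inner function (assumed for illustration): f(x) = x^2."""
    return x * x


def f_prime(x):
    """Derivative of the inner function: f'(x) = 2x."""
    return 2.0 * x


def g(y):
    """Outer function (assumed for illustration): g(y) = sin(y)."""
    return math.sin(y)


def g_prime(y):
    """Derivative of the outer function: g'(y) = cos(y)."""
    return math.cos(y)


def composite(x):
    """The composition (g o f)(x) = g(f(x))."""
    return g(f(x))


def difference_quotient(func, a, h):
    """Forward difference quotient (func(a + h) - func(a)) / h."""
    return (func(a + h) - func(a)) / h


def chain_rule_value(a):
    """Value predicted by the chain rule: g'(f(a)) * f'(a)."""
    return g_prime(f(a)) * f_prime(a)


if __name__ == "__main__":
    a = 1.3  # arbitrary sample point
    exact = chain_rule_value(a)
    for h in (1e-2, 1e-4, 1e-6):
        approx = difference_quotient(composite, a, h)
        # The gap should shrink roughly in proportion to h.
        print(f"h={h:g}  quotient={approx:.8f}  gap={abs(approx - exact):.2e}")
```

The gap between the two values should shrink roughly in proportion to $h$, which is what the $o(h)$ remainder in (1) predicts.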