$\beta_k$ for Conjugate Gradient Method

I followed the derivation for the Conjugate Gradient method from the documents shared below



I understand almost all of this, except how $\beta_k$ is derived.

From the Wiki derivation I understand we can get

$$ r_{k+1} = r_k – \alpha_k A d_k$$

reworking this

$$Ad_k = -\frac{1}{\alpha_k} (r_{k+1} – r_k)$$

And $\alpha_k$ is

$$ \alpha_k = \frac{r_k^Tr_k}{d_k^TAd_k} $$

Now, somehow all derivations I see $\beta_k$ declared to be

$$ \beta_k = \frac{r_{k+1}^Tr_{k+1}}{r_k^Tr_k} $$

and imply that it follows

$$ d_{k+1} = r_{k+1} + \beta_k d_k$$

I was not able to verify this however, my algebra must have gone wrong somewhere. Also, it seems $d_k$ and $r_k$ are assumed to be the same thing, I explain that as first direction $p_0$ is the same as the residual $r_0$ and from there we always calculate the next orthogonal direction, so $d_k$ and $r_k$ remain equal.

Any help would be appreciated,

Solutions Collecting From Web of "$\beta_k$ for Conjugate Gradient Method"

There seems to be a tendency for expositions of the conjugate gradient method to get bogged down in the details of arithmetic manipulations of coefficients that hide the bigger picture. A lot of texts that explain the method in detail with formulas have already been cited in your question, your own answer and the comments; I’ll try to describe on a more intuitive level how I think the method can best be understood.

The Wikipedia article, your question and the texts you link to use slightly different index offsets. In the Wikipedia article and in your question the coefficient linking $d_i$ and $d_{i+1}$ is $\beta_i$, whereas in the paper and the notes it’s $\beta_{i+1}$. I’ll go with the notes since I’ll be referring to them every once in a while.

First, as in the notes, it’s helpful to distinguish between the specific choice of the conjugate search directions and the general method of searching along conjugate directions.

So let’s first assume, as in the notes, that we’re given some $A$-orthogonal basis $\{d_i\}$ of search directions, that is, $d_i^\top A d_j=\delta_{ij}$, where $A$ is the positive-definite symmetric matrix of a linear system $Ax=b$ to be solved or of a quadratic function $\frac12x^\top Ax-b^\top x$ to be minimized, or the Hessian of a non-linear optimization problem.

I find it helpful to keep in mind, in addition to the space in which $x$ lives, a space related by $y=A^{1/2}x$, in which the quadratic function becomes $\frac12y^\top y-b^\top A^{-1/2}x$, the level sets become hyperspheres and $A$-orthogonality becomes orthogonality.

A key property to understand is that the first $n$ residuals and the first $n$ search directions span the same space, so it means the same thing that a vector is orthogonal to the first $n$ residuals or to the first $n$ search directions. This is because in each step the search direction is formed as the residual plus a vector from the previously spanned space.

In the $x$ space, the residuals are mutually orthogonal, and in the $y$ space, the search directions are mutually orthogonal. Because the first $n$ residuals and the first $n$ search directions span the same space, this also implies that in the $x$ space the residuals are orthogonal to all previous search directions and in the $y$ space the search directions are orthogonal to all previous residuals.

Each search ends at a minimum along the search direction, or equivalently at a point where the gradient is orthogonal to the search direction. Since the residual is the (negative) gradient, that immediately implies $r_{i+1}\perp d_i$, and it’s rather misleading that the notes rederive this with complicated arithmetic as (23) instead of noting that $\alpha_k$ was chosen so as to make this true. It then follows by induction that $r_{i+1}\perp d_j$ for all $j\le i$, since $r_{i+1}=r_i+A(x_{i+1}-x_i)$, $r_i$ is orthogonal to all $r_j$ by the induction hypothesis, $x_{i+1}-x_i$ is a vector along the search direction $d_i$ and $Ad_i$ is orthogonal to all $d_j$ (by the $A$-orthogonality of the search directions), and thus to all $r_j$.

All this holds for any given $A$-orthogonal basis $\{d_i\}$ of search directions. Now comes the part where we choose this basis as we go along, based on the residuals. So far it’s been a one-way street from the given search directions to the residuals, using the coefficients $\alpha_k$ to choose the local minimum in each step and make the residuals orthogonal; now we also build the search directions from the residuals, using the coefficients $\beta_k$ to make the search directions $A$-orthogonal.

This is the part that often seems a bit magical in the expositions, but it’s rather straightforward if you keep in mind the above. That a single $A$-orthogonalization suffices is a not-quite-so-magical consequence of the orthogonality of the residuals. Just like the mutual orthogonality of the residuals was ensured by induction for earlier residuals and by the choice of $\alpha_k$ for the current pair of residuals, so the mutual $A$-orthogonality of the search directions is ensured by induction for earlier search directions and by the choice of $\beta_k$ for the current pair of search directions.

We want to use the residual $r_i$ to construct a new search direction $d_i$ that’s $A$-orthogonal to all previous $d_j$. As described in the notes, we could do a full Gram-Schmidt-style $A$-orthogonalization to all previous $d_j$, but that’s not necessary because $r_i$ is already $A$-orthogonal to all previous $d_j$ except $d_{i-1}$ (to which it’s orthogonal). This is because the difference between successive residuals $r_{j+1}$ and $r_j$ is proportional to $Ad_j$, and $r_i$ is orthogonal to $r_{j+1}$ and $r_j$ (and thus $A$-orthogonal to $d_j$) unless $i=j+1$. So there’s only a single search direction $d_{i-1}$ to $A$-orthogonalize to, and this determines the parameter $\beta_i$. We want $d_i=r_i+\beta_id_{i-1}$ to be $A$-orthogonal to $d_{i-1}$. Since $Ad_{i-1}$ is proportional to $r_i-r_{i-1}$, that means


Then using the orthogonality of $r_i$ to $r_{i-1}$ and $d_{i-1}$ yields

$$\beta_i=\frac{r_i^\top r_i}{r_{i-1}^\top d_{i-1}}\;,$$

and using $d_{i-1}=r_{i-1}+\beta_{i-1}d_{i-2}$ and $r_{i-1}\perp d_{i-2}$ this can be rewritten as

$$\beta_i=\frac{r_i^\top r_i}{r_{i-1}^\top r_{i-1}}\;.$$

I hope I’ve contributed a little to making things a bit clearer with a minimal amount of arithmetic; I’ve probably repeated a lot of things that you’d already understood, but I thought it would be valuable to have this here for the future even if you’d already answered your original question in the meantime.

This link


pg 4-6 has the best info on this. The key is using the recurrence relationship on the residuals for the numerator (to make $A$ disappear), and using the definition of $\alpha_i$ for the denominator. It’s also best to think of the whole $\beta$ business as a piece of a modified Gram-Schmidt.

In this link


pg 23-24, it is proved that $d_k = r_k$, that is residual equals search direction, so that’s how they were able to use use a pure $r$ based equation in the code.