Why is variance squared?

The mean absolute deviation is:

$$\dfrac{\sum_{i=1}^{n}|x_i-\bar x|}{n}$$

The variance is: $$\dfrac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$$

  • So the mean absolute deviation and the variance seem to measure the same thing, yet the variance squares the difference. Why? Squaring makes every term positive, so the sum can’t be zero, but the absolute value does that too.
  • Why isn’t it $|x_i-\bar x|^2$, then? Squaring just enlarges the values, so why do we need it?

A similar question is here, but mine is a little different.
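The two formulas are easy to compare numerically. A minimal sketch, using made-up illustrative data (not from the question):

```python
# Compare the mean absolute deviation with the sample variance
# on a small, hand-checkable data set.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n                                 # 5.0

mad = sum(abs(x - mean) for x in data) / n           # mean absolute deviation
var = sum((x - mean) ** 2 for x in data) / (n - 1)   # sample variance (n - 1)

print(mad)  # 1.5
print(var)  # 32/7 ~ 4.571
```

Note that the two numbers are not directly comparable: as the first answer below explains, they are not even in the same units.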


Answers

They don’t measure the same thing. To see this, think about physical units.

Suppose the value of $x$ is measured in seconds. For example, $n$ people do a 100-meter race and the values $x_i$ are how many seconds it took each one to finish.

The formula $|x_i - \bar x|$ measures the difference of two times, so it’s also measured in seconds.

The mean absolute deviation is therefore an average of second-values, so it’s also measured in seconds.

However, the formula $(x_i - \bar x)^2$ squares the difference of two times, so it’s measured in seconds squared. The variance is therefore also in seconds squared. They don’t belong to the same physical space of variables, so they measure different things.

The standard deviation, however (the square root of the variance) is again measured in seconds, so it measures something similar (at least, physically similar).

As for why we like the square-root-of-average-of-squares better than the average-of-absolute-values – the square has better mathematical properties, as shown in other answers and in the link you referred to (particularly Rich’s answer).

A late answer, just for completeness with a different view on the thing.

You might look at your data as living in a multidimensional space, where each subject is a dimension and each item is a vector in that space, pointing from the origin to the item’s measurements over the full set of subjects.
Additional remark: this view has a further nice flavour because it uncovers the condition that the subjects are assumed independent of each other. That assumption is what makes the data space Euclidean; dropping it changes the mathematics of the space, which then has correlated (or “oblique”) axes.

Now the distance from one vector’s arrowhead to another is just the distance formula in Euclidean space: the square root of the sum of squared coordinate differences (from the Pythagorean theorem): $$d = \sqrt { (x_1-y_1)^2+(x_2-y_2)^2+ \cdots+(x_n-y_n)^2}$$ And the standard deviation is that value, normed by the number of subjects, when the mean vector is taken as the $y$-vector.
$$\text{sdev} = \sqrt { {(x_1- \bar x)^2 +(x_2-\bar x)^2+ \cdots +(x_n-\bar x)^2 \over n} }$$
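This geometric view is easy to check numerically. A minimal sketch, using an illustrative data vector of my own choosing:

```python
import math

# The standard deviation as the Euclidean distance from the data vector
# to the constant mean vector, normed by sqrt(n).
x = [2.0, 4.0, 6.0, 8.0]
n = len(x)
x_bar = sum(x) / n                                  # 5.0

# Euclidean distance between x and the vector (x_bar, ..., x_bar)
d = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))

sdev = d / math.sqrt(n)                             # population standard deviation
print(sdev)  # sqrt(5) ~ 2.236
```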

They don’t measure the same thing. The mean absolute deviation and standard deviation measure the same thing (notice the similarity of their names).

The variance is convenient because it satisfies the property that the variance of independent random variables is the sum of the variances.

First of all, $|\cdot|^2$ is exactly the same as $(\cdot)^2$ for real arguments. As you mentioned, the two have some similar characteristics, but for many optimization problems involving Gaussian densities, the optimum is achieved by squaring. You might want to look at the Viterbi detector, for example, or, to give another example from estimation theory, the energy detector.

One can still use the sample absolute deviation instead of the sample variance and obtain very good performance, but for the examples I gave the result will NOT be optimal.

Variance is, as you say, a measure of deviation. Or, rather, standard deviation (the square root of the variance) is a measure of deviation. So it’s really standard deviation and average deviation you ought to compare.

The difference is the following: If $d_i = |x_i-\bar x|$ are the absolute value deviations, then average deviation is
$$\frac{d_1 + d_2 + \cdots + d_n}{n}$$
while standard deviation is
$$\sqrt{\frac{d_1^2 + d_2^2 + \cdots + d_n^2}{n}}$$
The normal average uses what is called the arithmetic mean, and the standard deviation uses what is called the quadratic mean. It is not very difficult to show that, as long as not all the $d_i$ are equal, the standard deviation is strictly larger.
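The inequality between the two means is easy to observe. A minimal sketch with illustrative data containing one outlier:

```python
import math

# The quadratic mean of the deviations (standard deviation) is at least
# their arithmetic mean (average deviation), and strictly larger unless
# all deviations are equal.
x = [1.0, 2.0, 2.0, 3.0, 10.0]
n = len(x)
x_bar = sum(x) / n                                   # 3.6
d = [abs(xi - x_bar) for xi in x]                    # absolute deviations

avg_dev = sum(d) / n                                 # arithmetic mean, 2.56
std_dev = math.sqrt(sum(di ** 2 for di in d) / n)    # quadratic mean

print(avg_dev, std_dev)
assert std_dev > avg_dev  # the outlier 10.0 weighs more after squaring
```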

So standard deviation is more affected by outliers than is the average deviation. That is really all there is to it.

A similar case arises in linear regression, where the “least squares method” is used instead of, for example, a (fictitious) “least absolute values method”. There the reason is that squaring has better properties with respect to differentiation (when minimizing the variability).

Similar reasons apply in the case above: they have to do with estimating the bias (of the corresponding sample measure) or with other calculations, such as determining the distribution of a sample statistic. Moreover, squaring the absolute value is the same as squaring the value itself, i.e. $$|x_i-\bar x|^2=(x_i-\bar x)^2$$ so that alteration does not lead to a noticeable difference.

If you don’t have a preference for exactly how you measure deviation, then you should choose the measure that’s easiest to compute with.

The standard deviation (the square root of the variance) is rather nice for doing actual computations, because the variance has all sorts of nice properties: e.g. the function defining the variance is everywhere differentiable (in fact, analytic), and it is additive for independent random variables, i.e. $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ when $X$ and $Y$ are independent.
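The additivity of variance, and its failure for the mean absolute deviation, can be checked by simulation. A sketch under the illustrative assumption of two independent uniform variables:

```python
import random

# For independent random variables, variances add; mean absolute
# deviations do not.
random.seed(0)
N = 200_000
xs = [random.uniform(0, 1) for _ in range(N)]
ys = [random.uniform(0, 1) for _ in range(N)]

def var(sample):
    m = sum(sample) / len(sample)
    return sum((s - m) ** 2 for s in sample) / len(sample)

def mad(sample):
    m = sum(sample) / len(sample)
    return sum(abs(s - m) for s in sample) / len(sample)

sums = [a + b for a, b in zip(xs, ys)]

# Var(X + Y) ~ Var(X) + Var(Y)  (each summand's variance is 1/12)
print(var(sums), var(xs) + var(ys))

# MAD(X + Y) != MAD(X) + MAD(Y): the sum's MAD (1/3) falls well short
# of the sum of the individual MADs (1/4 + 1/4 = 1/2)
print(mad(sums), mad(xs) + mad(ys))
```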

There is a very simple explanation for this: it allows for the calculation of analytical solutions for many interesting problems.

As others have pointed out before, $x^2$ is differentiable, whereas $|x|$ is not. Hence, in problems where quadratic terms are present, one can differentiate them to find optimal solutions analytically.

On the other hand, with $|x|$, one often has to resort to numerical schemes to handle the absolute value. Another flip side to using quadratic terms is that the outliers (i.e. large and small $x$ values) have a much higher influence on the $x^2$ terms when compared to their influence on $|x|$. This may be good or bad depending on your application.
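The differentiability point has a concrete payoff: minimizing a sum of squares can be solved by setting a derivative to zero, while minimizing a sum of absolute values cannot. A sketch with illustrative data:

```python
# Minimizing sum((x - c)^2) analytically: the derivative is
# -2 * sum(x - c), and setting it to zero gives c = mean.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
mean = sum(data) / len(data)           # 22.0: pulled hard by the outlier

# Minimizing sum(|x - c|) has no derivative at the kinks; its minimizer
# is the median, found by sorting rather than by calculus.
median = sorted(data)[len(data) // 2]  # 3.0: barely affected by the outlier

print(mean, median)
```

This also illustrates the outlier sensitivity mentioned above: the squared criterion drags its optimum toward the extreme value, the absolute one does not.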

You say the variance is $\dfrac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$.

What if I told you the variance is $\dfrac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n}$?

You can find both in textbooks. The fact is, dividing by $n-1$ rather than $n$ is properly done (if at all) ONLY when one is estimating the population variance from a finite sample $x_1,\ldots,x_n$ that is not the whole population. If $x_1,\ldots,x_n$ is the whole population and each point is equally probable, then the variance of that population is given by the second expression above, not the first.
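The difference between the two denominators shows up when estimating from samples. A minimal simulation sketch, using a standard normal population (an illustrative choice) whose true variance is $1$:

```python
import random

# Dividing by n - 1 (Bessel's correction) makes the sample variance an
# unbiased estimator of the population variance; dividing by n biases
# it low by a factor (n - 1)/n.
random.seed(1)

def sample_var(sample, ddof):
    m = sum(sample) / len(sample)
    return sum((s - m) ** 2 for s in sample) / (len(sample) - ddof)

trials = 20_000
n = 5
biased, unbiased = 0.0, 0.0
for _ in range(trials):
    s = [random.gauss(0, 1) for _ in range(n)]
    biased += sample_var(s, 0)      # divide by n
    unbiased += sample_var(s, 1)    # divide by n - 1

print(biased / trials)    # ~ 0.8  (underestimates the true variance 1.0)
print(unbiased / trials)  # ~ 1.0
```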

Now here’s the important point:

$$\operatorname{var}(X_1+\cdots+X_n) = \operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n) \tag 1$$
if $X_1,\ldots,X_n$ are independent random variables.

That does not work with mean absolute deviation. (It also does not work in the version with $n-1$ instead of $n$.)

Now suppose $n=1800$ and each $X_i$ is the number of “heads” seen on the $i$th coin toss, so $X_i$ is either $0$ or $1$. Then the sum is the number of “heads” in $1800$ tosses. What is the probability that that number is at least $890$ but not more than $905$? To answer that, one approximates the distribution of the number of “heads” by the normal distribution with the same expected value and the same variance. Without the identity $(1)$, one would not know what that variance is! Abraham de Moivre discovered all this in the $18$th century. And that is why standard deviations rather than mean absolute deviations are used.
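De Moivre's approximation can be carried out directly. A sketch of the computation, with a continuity correction (a standard refinement, not mentioned in the text above):

```python
import math

# Number of heads in 1800 fair tosses: approximately normal with
# mean n*p = 900 and variance n*p*(1-p) = 450, by identity (1).
n, p = 1800, 0.5
mu = n * p                            # 900.0
sigma = math.sqrt(n * p * (1 - p))    # sqrt(450) ~ 21.2

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(890 <= heads <= 905), with a continuity correction of 1/2
prob = Phi((905.5 - mu) / sigma) - Phi((889.5 - mu) / sigma)
print(prob)  # ~ 0.29
```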