What exactly is a probability measure in simple words?

Can someone explain probability measure in simple words? This term has been haunting me all my life.

Today I came across the Kullback-Leibler divergence. The KL divergence between probability measures $P$ and $Q$ is defined by

$$KL(P,Q)= \begin{cases}
\int \log\left(\frac{dP} {dQ}\right)dP & \text{if}\ P\ll Q, \\
\infty & \text{otherwise}.
\end{cases}$$

I have no idea what I just read. I looked up probability measure, and it refers to probability space. I looked that up, and it refers to $\sigma$-algebra. I told myself I had to stop.

So, is a probability measure just a probability density under a broader and fancier name? Am I overlooking a simple concept, or is this topic just that hard? Thanks in advance!


A probability space consists of:

  1. A sample space $X$, which is the set of all possible outcomes of an experiment
  2. A collection of events $\Sigma$, which are subsets of $X$
  3. A function $\mu$, called a probability measure, that assigns to each event in $\Sigma$ a nonnegative real number

Let’s consider the simple example of flipping a coin. In that case, we have $X=\{H,T\}$ for heads and tails respectively, $\Sigma=\{\varnothing,\{H\},\{T\},X\}$, and $\mu(\varnothing)=0$, $\mu(\{H\})=\mu(\{T\})=\frac{1}{2},$ and $\mu(X)=1$. All of this is a fancy way of saying that when I flip a coin, I have a $0$ percent chance of flipping nothing, a $50$ percent chance of flipping heads, a $50$ percent chance of flipping tails, and a $100$ percent chance of flipping something, heads or tails. This is all very intuitive.
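To make this concrete, here is a minimal Python sketch (not from the original answer; the names `X`, `Sigma`, and `mu` are just illustrative) of the coin-flip probability space above:

```python
# A minimal sketch of the coin-flip probability space (X, Sigma, mu).
# Events are modeled as frozensets so they can be used as dictionary keys.
X = frozenset({"H", "T"})                                     # sample space
Sigma = [frozenset(), frozenset({"H"}), frozenset({"T"}), X]  # all events

mu = {frozenset(): 0.0,
      frozenset({"H"}): 0.5,
      frozenset({"T"}): 0.5,
      X: 1.0}                                                 # the probability measure

# The whole space has measure 1, and disjoint events add up:
assert mu[X] == 1.0
assert mu[frozenset({"H"})] + mu[frozenset({"T"})] == mu[X]
```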

Now, getting back to the abstract definition, there are certain natural requirements that $\Sigma$ and $\mu$ must satisfy. For example, it is natural to require that $\varnothing$ and $X$ are elements of $\Sigma$, and that $\mu(\varnothing)=0$ and $\mu(X)=1$. This is just saying that when performing an experiment, the probability that no outcome occurs is $0$, while the probability that some outcome occurs is $1$.

Similarly, it is natural to require that $\Sigma$ is closed under complements, and if $E\in\Sigma$ is an event, then $\mu(E^c)+\mu(E)=1$. This is just saying that when performing an experiment, the probability that event $E$ occurs or doesn’t occur must be $1$.

There are other requirements of $\Sigma$ which make it a $\sigma$-algebra, and other requirements of $\mu$ which make it a (finite) measure, and to rigorously study probability, one must eventually become familiar with these notions.

A probability measure is more like a cumulative distribution function.

It gives, for any set of values, the probability of the random variable being in that set. And of course, it has to be defined in a way that makes sense: if $A \cap B = \emptyset$, then $\mu(A \cup B) = \mu(A) + \mu(B)$, and the probability of the entire range is one, and no set has a negative probability.

Agreed that Wikipedia does a poor job getting the basic ideas across; it seems to be written by experts for experts and is very jargon-y in many cases…

Pictorially, imagine that you have many items, and that a probability measure is a scale telling you the weight of any subset of them. The total weight of everything you have is always one. If you put a couple of items on the scale separately, one by one, the sum of their weights will be the same as if you weighed them all together at once.

A funny thing happens with grains of sand: They each have individual weight zero, but when you get a jar of them together (think uncountably many, that’s important!), then they can have a total weight bigger than zero.

Think of grains of sand here as being uncountably many in total, like real numbers in an interval. The above is not true if there are only countably many! But for real numbers, for instance, each number in the interval has probability measure zero, but the whole interval has some positive measure.

Perhaps I can help clarify things a bit without getting super technical.

A sample space is simply the collection of all the possible outcomes that can happen. So, if you are flipping a coin, the sample space is $\Omega = \{H, T\}$ since you can only flip heads or tails. The $\sigma$-algebra that was mentioned is also conceptually simple – it collects subsets of your sample space (the events) into a new set (of course, a $\sigma$-algebra has certain properties, but for the sake of simplicity I am skipping those). Therefore, an example of a $\sigma$-algebra $F$ on $\Omega$ would be the power set of $\Omega$ (the set of all subsets), $F = \{\emptyset , \{H\}, \{T\}, \{H,T\} \}$.

The reason the $\sigma$-algebra is important is that it is the set of events to which a probability measure gives weights. Therefore, a probability space $(\Omega, F, P)$ is a sample space $\Omega$, combined with a $\sigma$-algebra $F$ on that space and a probability measure $P$ on the $\sigma$-algebra.

So, a probability measure simply gives weights (probabilities) to each set within the $\sigma$-algebra, where the weight of the whole space must be $1$, along with a few other properties (countable additivity, for example).
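As a rough illustration of this answer (a hypothetical sketch; `power_set`, `weights`, and the other names are made up for the example), one can build the power-set $\sigma$-algebra for the coin flip and extend the outcome weights to every event by summation:

```python
from itertools import combinations

# Sketch: build the power set of Omega as a sigma-algebra, then extend
# the pointwise weights to every event by summing over its outcomes.
Omega = ("H", "T")
weights = {"H": 0.5, "T": 0.5}   # weights of the individual outcomes

def power_set(outcomes):
    s = list(outcomes)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

F = power_set(Omega)                                        # {}, {H}, {T}, {H, T}
P = {event: sum(weights[w] for w in event) for event in F}  # the measure

print(P[frozenset()])        # 0   -- the empty event
print(P[frozenset(Omega)])   # 1.0 -- all the weights add up to 1
```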

To describe a random variable $X$, we specify what the probability is that the outcome of $X$ is some value $x$. For example, with a fair die and $X$ standing for “the score of one roll of the die”, we’d say $$P(X=1)=P(X=2)=P(X=3)=P(X=4)=P(X=5)=P(X=6)=\frac16$$ and that’s it.
Our $X$ takes values only from the finite set $\Omega=\{1,2,3,4,5,6\}$.
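A small sketch of this example (hypothetical code, not part of the original answer): the fair-die measure assigns weight $\frac16$ to each outcome, and the probability of any event is the sum of the weights of the outcomes it contains.

```python
from fractions import Fraction

# Each outcome of the fair die gets weight 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def prob(event):
    # Probability of an event = sum of the weights of its outcomes.
    return sum(pmf[x] for x in event)

print(prob({1, 2, 3, 4, 5, 6}))   # 1    -- the whole sample space
print(prob({2, 4, 6}))            # 1/2  -- "the score is even"
```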

There are also random variables with (countably) infinitely many possible outcomes. For example, if $Y$ stands for “the number of throws of a fair coin until head appears the first time”, then
$$P(Y=1)=\frac12, P(Y=2)=\frac14, P(Y=3)=\frac18,\ldots $$
The set $\Omega$ of possible outcomes is now $\Omega=\mathbb N$.
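If it helps, here is a small numerical sketch (assuming Python; `sample_Y` is a made-up name) checking that these probabilities add up to $1$ and that a coin-flip simulation reproduces them:

```python
import random

# P(Y = k) = 1 / 2**k, and the partial sums of these probabilities approach 1.
partial = sum(1 / 2**k for k in range(1, 31))
print(partial)                    # very close to 1

# Simulate Y: flip a fair coin until the first head appears.
def sample_Y():
    k = 1
    while random.random() >= 0.5:   # tails with probability 1/2
        k += 1
    return k

samples = [sample_Y() for _ in range(100_000)]
print(sum(s == 1 for s in samples) / len(samples))   # roughly 0.5
print(sum(s == 2 for s in samples) / len(samples))   # roughly 0.25
```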

And finally there are random variables with uncountably many possible outcomes (e.g. let $Z$ stand for “select a random point uniformly on the unit interval $\Omega:=[0,1]$”). In these cases, usually for any individual value $x\in\Omega$, the probability $P(Z=x)$ is simply zero. Instead, we have positive probability only if we ask for certain infinite subsets of the space $\Omega$ of possible outcomes. For example, we can rightly say $P(\frac12< Z<\frac23)=\frac16$.
It would be nice if one could assign a probability value to any subset $S\subseteq \Omega$. However, it usually turns out that this is not possible in a consistent or well-defined manner.
One will still strive to make the collection of sets $S$ for which $P(X\in S)$ is defined/definable as large as possible.
For our example $Z$, we can certainly say $P(Z\in S)=b-a$ if $S$ is an interval $[a,b]$ or $]a,b[$ or $]a,b]$ or $[a,b[$ with $0\le a\le b\le 1$. In particular, $P(Z\in\emptyset)=0$ and $P(Z\in\Omega)=1$.
Also, if $A,B$ are disjoint and $P(X\in A)$ and $P(X\in B)$ make sense, then so does $P(X\in A\cup B)$, namely with the value $P(X\in A\cup B)=P(X\in A)+P(X\in B)$. In fact, if we have sets $A_1,A_2,\ldots$ and know $P(X\in A_n)$ for each $n$, then it turns out to be advisable to have
$$P\left(X\in\bigcup_{n=1}^\infty A_n\right)=\sum_{n=1}^\infty P(X\in A_n).$$
This is almost the concept of a $\sigma$-algebra: it is a collection of subsets of a given set $\Omega$. If we are lucky, such as in the finite case or the countable case (at least as it occurred with the random variable $Y$ we defined), this collection is the full power set of $\Omega$, but it may be smaller.
At any rate, it is large enough to be closed under certain operations, among which is the countable union of sets.
And this property is precisely what allows us to formulate the essential properties we want probabilities of a random variable being in a subset of $\Omega$ to have. Any function that assigns to each element of a given $\sigma$-algebra (i.e. to each sufficiently nice subset of $\Omega$) a value between $0$ and $1$ inclusive, such that the basic rules spelled out above hold for countable unions, complements, and the whole space, is then called a probability measure.

One important measure is the Lebesgue measure $\lambda$ on $[0,1]$ (which describes the random variable $Z$ above).
You may know it from integration theory, where it allows us to generalize (extend) Riemann integration.
You may know for example, that the expected value of a finite random variable is simply given by
$$\tag1E(X) = \sum_{x\in\Omega}x\cdot P(X=x) $$
or, more generally, the expected value of a function of $X$ is
$$\tag2 E(f(X))=\sum_{x\in\Omega}f(x)\cdot P(X=x).$$
These are just finite sums (hence they always work) if $X$ is a finite random variable. If $\Omega$ is countable, we can use the same formulas, but we have series instead of sums, and it may happen that the series does not converge.
For example $E(Y)=2$, but $E((-2)^Y)$ does not converge.
It becomes even worse when $P(X=x)=0$ for all $x\in\Omega$, as then the sums/series above simply result in $0$. In that case the sums/series are replaced with the corresponding integrals:
$$E(Z)=\int_0^1 x\,\mathrm dx =\frac12, \qquad E(f(Z))=\int_0^1 f(x)\,\mathrm dx.$$
Again, the second integral does not make sense for every possible $f$, it must be integrable.
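For what it’s worth, here is a quick numerical sketch (hypothetical code) of the two formulas above, computing $E(Y)$ by a truncated series and $E(Z)$ by a Riemann sum:

```python
# E(Y) = sum_k k * (1/2**k) should come out as 2 (truncated series here),
# and E(Z) = integral of x over [0, 1] should come out as 1/2 (midpoint Riemann sum).
E_Y = sum(k / 2**k for k in range(1, 200))
print(E_Y)                                        # 2.0 (up to rounding)

n = 1_000_000
E_Z = sum((i + 0.5) / n for i in range(n)) / n    # midpoint Riemann sum of x dx
print(E_Z)                                        # 0.5 (approximately)
```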

The step from sum to (first series and then) integral may look arbitrary, but it is indeed well-founded in measure theory – often enough one adjusts in the other direction and also writes series and sums as integrals (with respect to specific measures).

All this may still not be enough to grasp the formula you posted, but it should help you get started with the introductory texts you already tried to read.
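As a closing illustration (a hypothetical sketch, not a definitive implementation; the name `kl_divergence` is made up), in the discrete case $\frac{dP}{dQ}$ is just the ratio of the two probability mass functions, the integral becomes a sum, and $P\ll Q$ means that $Q(x)=0$ forces $P(x)=0$:

```python
from math import log, inf

# Sketch: for discrete distributions the formula in the question reduces to
# KL(P, Q) = sum_x P(x) * log(P(x) / Q(x)), with the convention 0 * log(0) = 0,
# and the result is infinity when P is not absolutely continuous w.r.t. Q.
def kl_divergence(P, Q):
    total = 0.0
    for x, p in P.items():
        q = Q.get(x, 0.0)
        if p == 0:
            continue        # 0 * log(0/q) contributes nothing
        if q == 0:
            return inf      # P is not absolutely continuous with respect to Q
        total += p * log(p / q)
    return total

fair   = {"H": 0.5, "T": 0.5}
biased = {"H": 0.9, "T": 0.1}
print(kl_divergence(fair, biased))   # about 0.51
print(kl_divergence(fair, fair))     # 0.0
```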