I’ve been trying to understand $\sigma$-algebras and how they encode information in the context of filtrations. While certain parts seem clear and logical, I can’t say I get the whole picture.
I’ll try to explain the counter-intuition I get with the classical example of the coin tossing: the probability space $\Omega = \{ HH, HT, TH, TT \}$ and a r.v. $X(\omega)$ equal to the number of heads.
At times $0$, $1$ and $2$ the available information is represented using $\sigma$-algebras $\mathcal{F}_0=\{\emptyset,\Omega\}$, $\mathcal{F}_1=\{\emptyset, \Omega, \{HH,HT\},\{TH,TT\}\}$ and $\mathcal{F}_2=\{\emptyset, \Omega,\{HH,HT\},\{TH,TT\},\{HH\},\{HT\},\{TH\},\{TT\}\}$.
One can notice that $X(\omega)$ is not measurable with respect to $\mathcal{F}_0$ or $\mathcal{F}_1$, because $X^{-1}((\frac{3}{2}; +\infty))=\{HH\}$. To me this is quite surprising: intuitively $X$ makes perfect sense at all times. In particular it has an expected value at time $0$, which I interpret as meaning that the probability and value of every outcome $\{\omega\}$ can be computed. How do you think of a non-measurable function?
Here’s another way of expressing the same confusion. The most natural choice of $\sigma$-algebra in a finite discrete case is $\mathcal{F}=2^\Omega$, and it is implicitly used in all elementary probability problems. However, this choice of $\mathcal{F}$ does not reflect the fact that some information is known or unknown; conditional probability does. Does it mean that the statement “a $\sigma$-algebra is known information” makes sense only in conditioning? Why is it convenient then?
On the third paragraph:
Precisely,
$\sigma(X) \doteq \{X^{-1}(B) \mid B \in \mathscr{B}\} = \{ \emptyset, \Omega, \{TT\}, \{TH, HT\}, \{HH\}, \{HH\}^C, \{TT\}^C, \{TH, HT\}^C \}$
$\sigma(X) \nsubseteq \mathscr{F}_0$,
$\sigma(X) \nsubseteq \mathscr{F}_1$
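To make this concrete, here is a minimal Python sketch that builds $\sigma(X)$ on the four-point sample space and checks whether it sits inside a given $\sigma$-algebra. The helper names (`sigma_of`, `is_measurable`) are mine, not standard, and the approach (taking all unions of level sets of $X$) only works because $\Omega$ is finite.

```python
from itertools import combinations

OMEGA = frozenset({"HH", "HT", "TH", "TT"})

def X(omega):
    """Number of heads in a two-toss outcome such as 'HT'."""
    return omega.count("H")

def sigma_of(rv, omega=OMEGA):
    """sigma-algebra generated by rv on a finite space: all unions of its level sets."""
    levels = {}
    for w in omega:
        levels.setdefault(rv(w), set()).add(w)
    atoms = [frozenset(a) for a in levels.values()]
    events = set()
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            events.add(frozenset().union(*combo))
    return events

def is_measurable(rv, sigma_algebra, omega=OMEGA):
    """rv is measurable w.r.t. sigma_algebra iff sigma(rv) is contained in it."""
    return sigma_of(rv, omega) <= set(sigma_algebra)

F0 = {frozenset(), OMEGA}
F1 = F0 | {frozenset({"HH", "HT"}), frozenset({"TH", "TT"})}
POWER_SET = {
    frozenset(c)
    for r in range(len(OMEGA) + 1)
    for c in combinations(sorted(OMEGA), r)
}

print(is_measurable(X, F0))         # False
print(is_measurable(X, F1))         # False
print(is_measurable(X, POWER_SET))  # True
```

The three level sets of $X$ are $\{TT\}$, $\{TH, HT\}$ and $\{HH\}$, so $\sigma(X)$ has $2^3 = 8$ elements, matching the list above.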
Intuitively,
When we say $X(\omega)$ is measurable with respect to $\mathscr{F}_0$ or $\mathscr{F}_1$, we mean that the value of $X$ is known at time 0 or time 1, resp. The value of $X$ is the number of heads that come up after 2 future (or past but unknown as of right now) tosses. How can we mere mortals know how many heads there are after just 0 or 1 toss from this information alone?
If you somehow have other information, perhaps from an otherworldly being or because you are clairvoyant, then $\mathscr{F}_0$ or $\mathscr{F}_1$ as given do not apply to you. Hence, $X$ can be measurable at time 0 or time 1 if we are given additional information.
Let’s say a devil tells you that the result will be either TH or HT. Supposing the devil lied, this means that:
your $\mathscr{F}_0$ (your information at time 0, change it to $\mathscr{G}$ if you like) $= \{ \emptyset, \Omega, \{HT, TH\}, \{HT, TH\}^C \}$.
your $\mathscr{F}_1$ (your information at time 1, change it to $\mathscr{G}$ if you like) $= \{ \emptyset, \Omega, \{HT, TH\}, \{HT, TH\}^C, \{HH\}, \{HH\}^C, \{TT\}, \{TT\}^C \}$
$X$ is not measurable with respect to your $\mathscr{F}_0$, but it is measurable with respect to your $\mathscr{F}_1$.
Convince yourself that knowing the result of only one toss will now make $X$ $\mathscr{F}_1$-measurable (under your new $\mathscr{F}_1$, of course).
Now say an accurate angel tells you that the first toss will be tails and the second toss will be heads. This means that your $\mathscr{F_0}$ (your information at time 0, change it to G if you like) $= 2^{\Omega}$.
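It seems to me that the obvious way for $X$ to be $\mathscr{F}_0$-measurable is to have $\mathscr{F}_0 = 2^{\Omega}$, although strictly all that is needed is $\sigma(X) \subseteq \mathscr{F}_0$. Intuitively, the only way we can know the value of $X$ before the tosses is if we already know how many heads the tosses will produce.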
$E(X) = 2p$ where p is the probability of heads. If the coin is fair, the expected value is 1. This means that if the coin is fair, and we repeat this experiment several times, we expect the average of the results to be around 1. If the coin is slightly biased towards heads, say $p = 2/3$, then we expect the result to be closer to 2 than to 0 ($E(X) = 4/3$).
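As a quick check of these numbers, here is a small Python sketch that computes $E(X)$ by summing over the four outcomes. It assumes the two tosses are independent with the same heads probability $p$ (the formula $E(X) = 2p$ itself only needs linearity of expectation); the function name `expected_heads` is just for illustration.

```python
from fractions import Fraction

def expected_heads(p):
    """E(X) for two independent tosses, each landing heads with probability p."""
    q = 1 - p
    # Probability of each outcome under the independence assumption.
    prob = {"HH": p * p, "HT": p * q, "TH": q * p, "TT": q * q}
    return sum(outcome.count("H") * pr for outcome, pr in prob.items())

print(expected_heads(Fraction(1, 2)))  # 1
print(expected_heads(Fraction(2, 3)))  # 4/3
```

Note that nothing here uses $\mathscr{F}_0$ or $\mathscr{F}_1$: the expectation is computed from $\mathbb{P}$ on the whole space.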
So, you’re right in saying that we can compute the probabilities of any of the 4 possible outcomes. But $X$ being measurable at time 0 doesn’t mean that we can compute the probabilities at that time (in fact, knowing those probabilities at time 0 is part of the very setup in which $X$ is defined).
$X$ being measurable at time 0 or time 1 means that, given what we know at that time, the probability of each of its preimages is either 0 or 1 (it’s not enough to know the probabilities of the elements of its preimages),
or equivalently that all its preimages are in $\mathscr{F}_0$ or $\mathscr{F}_1$, resp.
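Here is a rough Python sketch of the "0 or 1" criterion, assuming a fair coin and describing each $\sigma$-algebra by its atoms (the smallest non-empty events it contains, which is enough on a finite space). The names `conditional_law` and `determined` are made up for this illustration.

```python
from fractions import Fraction

def X(w):
    """Number of heads in a two-toss outcome."""
    return w.count("H")

# Assumption: fair coin, so each outcome has probability 1/4.
P = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}

def conditional_law(atom):
    """Distribution of X given that the outcome lies in `atom`."""
    total = sum(P[w] for w in atom)
    law = {}
    for w in atom:
        law[X(w)] = law.get(X(w), 0) + P[w] / total
    return law

def determined(atoms):
    """True if, on every atom, each value of X has conditional probability 0 or 1."""
    return all(set(conditional_law(a).values()) <= {0, 1} for a in atoms)

atoms_F0 = [{"HH", "HT", "TH", "TT"}]
atoms_F1 = [{"HH", "HT"}, {"TH", "TT"}]
atoms_F2 = [{"HH"}, {"HT"}, {"TH"}, {"TT"}]

print(determined(atoms_F0))  # False
print(determined(atoms_F1))  # False
print(determined(atoms_F2))  # True
```

After one toss, say heads, $X$ is 1 or 2 with probability $1/2$ each, so its value is still not pinned down; only once both tosses are known do all the conditional probabilities collapse to 0 or 1.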
On the last paragraph:
A random variable is defined on $(\Omega, \mathscr{F}, \mathbb{P})$. We can choose:
$\mathscr{F} = 2^{\Omega}$, the biggest it can be,
$\mathscr{F} = \sigma(X)$, the smallest it can be or
$\mathscr{F} = \mathscr{G}$, where $\sigma(X) \ \subset \ \mathscr{G} \ \subset \ 2^{\Omega}$.
I think the point of the probability space is to be able to accommodate other random variables. For instance, $X$ and $Y$ can both be random variables on $(\Omega, \mathscr{F}, \mathbb{P})$ if $\mathscr{F} = 2^{\Omega}$. If you chose $\mathscr{F} = \mathscr{G}$, where $\sigma(X) \ \subset \ \mathscr{G} \ \subset \ 2^{\Omega}$, it may be that $\sigma(Y) \nsubseteq \mathscr{F}$, in which case $Y$ is not a random variable on that space. So if it’s not too much trouble, why not just choose $\mathscr{F} = 2^{\Omega}$?
‘What is the purpose of measuring a variable if all information is always known? Where is randomness in that?’
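I hope this is clear now. Information is not known at the very beginning. We can compute the probabilities, but being able to compute them does not make $X$ measurable: if, given our current information, not all the probabilities of the preimages are 0 or 1, then $X$ is not measurable with respect to that information.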
‘OK. I think I see your point. Just filling the blanks. Isn’t it logical that in real world the distribution of Xk is most needed at times before k? This requires calculating the measure with respect to the power set (in a simple case), not Fi. It looks like the notion of σ-algebra serves for two purposes: it may contain the events that already happened (i.e. is a storage of information) or it may contain the possible future events (to compute the probabilities). Is it a correct way of thinking, or am I missing something?’
I’ll assume you mean $F_k$ and $X_k$ in copper.hat’s example. In the real world (no longer probability here, I think. statistics here we come), we don’t know the distribution of $X_k$. Here, we do. In the real world, we want to know $X_k$ before time k. We do this by dividing up the sample space even more (practically, this means, I don’t know, reading the news or insider trading maybe?).
Anyway, if we stand at time $k$, the past refers to $F_k, F_{k-1}, \dots, F_0$, our information at times $k, k-1, \dots, 0$. You can simply use $F_k$ if $F_0 \subseteq \dots \subseteq F_{k-1} \subseteq F_k$ (which means that you never forget past information: past information is always contained in present and future information). The future is $F_{k+1}, F_{k+2}, \dots$. Sometimes we can define $F = \sigma\left(\cup_k F_{k}\right)$ (the union by itself need not be a $\sigma$-algebra in general), which I guess is omniscience.
Clearly, for such an increasing filtration, $\cap_k F_{k} = F_0$. When $F_0$ is the trivial $\sigma$-algebra $\{\emptyset, \Omega\}$, as in the coin example, this says that at any time we know the probability that nothing will happen and the probability that something will happen.
PS I think you’re missing some sets in $\mathscr{F}_2$.