# Distance between the product of marginal distributions and the joint distribution

Given a joint distribution $P(A,B,C)$, we can compute various marginal distributions. Now suppose:
\begin{align}
P_1(A,B,C) &= P(A) P(B) P(C) \\
P_2(A,B,C) &= P(A,B) P(C) \\
P_3(A,B,C) &= P(A,B,C)
\end{align}
Is it true that $d(P_1,P_3) \geq d(P_2,P_3)$, where $d$ is the total variation distance?

In other words, is it provable that $P(A,B) P(C)$ is a better approximation of $P(A,B,C)$ than $P(A) P(B) P(C)$ in terms of the total variation distance? Intuitively I think it’s true, but I could not find a proof.
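
For reference, here is the standard form of the total variation distance on a finite sample space. The symbols $\Omega$, $E$ and $\omega$ are notation assumed here and are not used in the question itself; in this setting each $\omega$ would be a triple $(a,b,c)$ of values of $A,B,C$.

$$d(P,Q) = \sup_{E\subseteq\Omega}|P(E)-Q(E)| = \frac12\sum_{\omega\in\Omega}|P(\omega)-Q(\omega)|$$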

#### Answers

(Note: This is a full reworking of my initial answer. Credit is due to user @Did, who criticized my original answer and pressed for something to be done about it. So here it is.)

The OP provided a specific counterexample, which shows that the initial conjecture does not hold in general. The result still appears counter-intuitive, because instinctively we think “two sources of divergence should make matters worse than one”. But there is another way to gain intuition here: sometimes “a second divergence may offset the first one”. This is a more general phenomenon. For example, in welfare economics it has been proven that if just one of the conditions for welfare maximization cannot be satisfied, we may be better off violating a second optimality condition than trying to satisfy as many of the remaining ones as possible (or in more vulgar terms, “if not perfect, then anything goes”).
Since the question’s conjecture has been disproved, the purpose of my answer is to formalize a bit why it does not hold, and to link intuition to the mathematical expressions.

For $P_3(A,B,C)$ to be non-zero, we assume $P(C|{A,B})\gt 0$ and $P(A|B)\gt 0$. Note that all probabilities are taken with respect to the true probability measure $P_3=P$: the only source of divergence in $P_1$ and $P_2$ is the assumptions made about the dependence structure, not measurement.

Decomposing the three distributions we have
\begin{align}
P_1(A,B,C) &= P(A) P(B) P(C) \\
P_2(A,B,C) &= P(A,B) P(C) \\
P_3(A,B,C) &= P(A,B,C) = P(A,B)P(C|{A,B})
\end{align}

Then the TVD measures can be expressed as

$$\begin{aligned} d(P_1,P_3) &= \sup|P_1 - P_3| = \sup|P(A) P(B) P(C) - P(A,B)P(C|{A,B})| \\ &= \sup\left|P(A,B)\left[\frac {P(A)}{P(A|B)}P(C) - P(C|{A,B})\right]\right| \\ &= \sup P(A,B)\,\sup\left|\frac {P(A)}{P(A|B)}P(C) - P(C|{A,B})\right| \end{aligned}$$

and

$$\begin{aligned} d(P_2,P_3) &= \sup|P_2 - P_3| = \sup|P(A,B) P(C) - P(A,B)P(C|{A,B})| \\ &= \sup\left|P(A,B)\left[P(C) - P(C|{A,B})\right]\right| \\ &= \sup P(A,B)\,\sup|P(C) - P(C|{A,B})| \end{aligned}$$

Let’s write the two TVDs side by side, scaled by their common factor:

$$\bar d(P_1,P_3) = \sup\left|\frac {P(A)}{P(A|B)}P(C) - P(C|{A,B})\right|\;,\qquad \bar d(P_2,P_3) = \sup|P(C) - P(C|{A,B})|$$

The $P_2$ TVD is affected only by the one wrong assumption (independent $C$), while the $P_1$ TVD is affected by both, as should be expected. But how is it affected? The second wrong assumption ($A$ and $B$ independent) is “represented” by the factor $\frac {P(A)}{P(A|B)}$. This factor does not multiply the whole distance $\sup|P(C) - P(C|{A,B})|$, but only one of its endpoints, $P(C)$, which is the endpoint that reflects the other wrong assumption. We will show it in a while, but it is already evident that, since we are calculating maximum absolute values here, the effect of this factor is not monotonic: it can be higher or lower than unity, and it can increase or decrease the supremum involved. The second “mistake” in $P_1$ affects the consequences of the first, and it may make the overall distance longer or shorter.
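
As a pointwise illustration with purely hypothetical numbers (they are not taken from any distribution in this thread), suppose that at some point $P(C)=0.5$ and $P(C|{A,B})=0.6$. Depending on the value of the factor $\frac{P(A)}{P(A|B)}$, the gap at that point can shrink or grow:

$$\begin{aligned} \text{no factor:}\quad & |0.5 - 0.6| = 0.1 \\ \tfrac{P(A)}{P(A|B)} = 1.2:\quad & |1.2\cdot 0.5 - 0.6| = 0 \\ \tfrac{P(A)}{P(A|B)} = 0.4:\quad & |0.4\cdot 0.5 - 0.6| = 0.4 \end{aligned}$$

This only looks at a single point, whereas the distances above take a supremum over all points, but it shows why no monotonic effect should be expected.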

To show this more formally, and to declutter the notation, we define $a\equiv \frac {P(A)}{P(A|B)} \in (0,M)$, where $M$ is some positive number, and $x\equiv P(C)\in(0,1)\;,\;y\equiv P(C|{A,B})\in (0,1)$.
Then the conjecture $d(P_1,P_3)\ge d(P_2,P_3)$ boils down to whether $\sup|ax-y|\ge \sup|x-y|$.
Now, given that $x \in(0,1)\;,y \in (0,1)$, the maximum possible range of the function $|x-y|$ defines the maximal set $H_2=(0,1)$, a bounded subset of $\Bbb R$. By the same reasoning, the maximum possible range of $ax$ is $(0, M)$, and then the maximum possible range of $|ax-y|$ defines the maximal set $H_1=\left(0,\max (M,1)\right)$, which is also a bounded subset of $\Bbb R$. Denote by $h_1$ and $h_2$ the actual range sets produced by $P_1$ and $P_2$ respectively. By construction $h_1\subseteq H_1\;,\;h_2\subseteq H_2$.
Then we have two cases:
$$M\lt 1 \Rightarrow h_1\subseteq H_1\subset H_2\;\Rightarrow\;\sup h_1\le\sup H_1 \lt \sup H_2\;,\; \text{and} \;\sup h_2\le\sup H_2$$
$$M\gt 1 \Rightarrow h_2\subseteq H_2\subset H_1\;\Rightarrow\;\sup h_2\le\sup H_2 \lt \sup H_1\;,\; \text{and} \;\sup h_1\le\sup H_1$$

But in neither case can we infer that $\sup h_1\le \sup h_2$ or $\sup h_2\le \sup h_1$: the magnitude of the factor representing the second wrong assumption cannot determine that. The result is distribution-specific, and since the source of everything is the joint distribution, which can describe arbitrary dependence structures as long as it sums up to unity when needed, we conclude that OP’s counterexample is not an outlier but representative of the situation: anything goes.

Finally, one could think that the problem may be the distance measure used, the total variation distance. Not really: in OP’s counterexample, if one computes the Hellinger distance $H(P,Q) = \frac {1}{\sqrt 2} \left(\sum_i\left[\sqrt{p_i}-\sqrt{q_i}\right]^2\right)^\frac 12$ one will find $H(P_1,P_3)=0.13$ while $H(P_2,P_3)=0.31$. Same qualitative result, $P_1$ is “closer” to $P_3$ than $P_2$ is.

I just found the following counterexample. Suppose $A,B,C$ are discrete variables. $A,B$ can each take two values while $C$ can take three values.
The joint distribution $P(A,B,C)$ is:

\begin{array}{cccc}
A & B & C & P(A,B,C) \\
1 & 1 & 1 & 0.1/3 \\
1 & 1 & 2 & 0.25/3 \\
1 & 1 & 3 & 0.25/3 \\
1 & 2 & 1 & 0.4/3 \\
1 & 2 & 2 & 0.25/3 \\
1 & 2 & 3 & 0.25/3 \\
2 & 1 & 1 & 0.4/3 \\
2 & 1 & 2 & 0.25/3 \\
2 & 1 & 3 & 0.25/3 \\
2 & 2 & 1 & 0.1/3 \\
2 & 2 & 2 & 0.25/3 \\
2 & 2 & 3 & 0.25/3 \\
\end{array}

So the marginal distribution $P(A,B)$ is:
\begin{array}{ccc}
A & B & P(A,B) \\
1 & 1 & 0.2 \\
1 & 2 & 0.3 \\
2 & 1 & 0.3 \\
2 & 2 & 0.2 \\
\end{array}

The marginal distributions $P(A), P(B)$ and $P(C)$ are uniform.
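
The marginals can be checked mechanically. The following is a minimal sketch in Python using exact fractions; the names `joint` and `marginal` are illustrative helpers introduced here, not part of the original post.

```python
from fractions import Fraction
from itertools import product

# Joint table P(A,B,C) from the post: the C=1 entry depends on (A,B),
# while the C=2 and C=3 entries are always 0.25/3 = 1/12.
c1 = {(1, 1): Fraction(1, 30), (1, 2): Fraction(4, 30),
      (2, 1): Fraction(4, 30), (2, 2): Fraction(1, 30)}
joint = {(a, b, c): (c1[a, b] if c == 1 else Fraction(1, 12))
         for a, b, c in product((1, 2), (1, 2), (1, 2, 3))}

def marginal(joint, keep):
    """Sum the joint over every coordinate whose index is not in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0) + p
    return out

p_ab = marginal(joint, (0, 1))  # P(A,B)
p_a = marginal(joint, (0,))     # P(A)
p_b = marginal(joint, (1,))     # P(B)
p_c = marginal(joint, (2,))     # P(C)

for name, table in [("P(A,B)", p_ab), ("P(A)", p_a), ("P(B)", p_b), ("P(C)", p_c)]:
    print(name, {k: str(v) for k, v in table.items()})
```

Exact fractions are used so that the comparison with the tabulated values does not depend on floating-point rounding.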

So we can compute that:
\begin{align}
d(P_1,P_3) &= 0.1 \\
d(P_2,P_3) &= 0.4/3
\end{align}
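
These two values can be reproduced with a short script. This is a sketch that assumes the total variation distance on a finite space is computed as half the sum of absolute differences; the function name `tvd` and the dictionary layout are choices made here for illustration.

```python
from fractions import Fraction
from itertools import product

states = list(product((1, 2), (1, 2), (1, 2, 3)))

# P3 is the joint table from the post (0.1/3 = 1/30, 0.4/3 = 4/30, 0.25/3 = 1/12).
c1 = {(1, 1): Fraction(1, 30), (1, 2): Fraction(4, 30),
      (2, 1): Fraction(4, 30), (2, 2): Fraction(1, 30)}
p3 = {(a, b, c): (c1[a, b] if c == 1 else Fraction(1, 12)) for a, b, c in states}

# Marginals of P3.
p_ab = {(a, b): sum(p3[a, b, c] for c in (1, 2, 3)) for a in (1, 2) for b in (1, 2)}
p_a = {a: sum(p_ab[a, b] for b in (1, 2)) for a in (1, 2)}
p_b = {b: sum(p_ab[a, b] for a in (1, 2)) for b in (1, 2)}
p_c = {c: sum(p3[a, b, c] for a in (1, 2) for b in (1, 2)) for c in (1, 2, 3)}

# The two approximations from the question.
p1 = {(a, b, c): p_a[a] * p_b[b] * p_c[c] for a, b, c in states}
p2 = {(a, b, c): p_ab[a, b] * p_c[c] for a, b, c in states}

def tvd(p, q):
    """Total variation distance on a finite space: half the L1 distance."""
    return sum(abs(p[s] - q[s]) for s in p) / 2

d13 = tvd(p1, p3)
d23 = tvd(p2, p3)
print(d13, d23)  # 1/10 2/15
```

Since $2/15 = 0.4/3 \gt 0.1$, the fully factorized $P_1$ is indeed closer to $P_3$ in total variation than $P_2$ is for this table.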