Conditional Independence#
In data science, variables often appear related.
However, once we account for the right context, some relationships disappear.
This idea is called conditional independence, and it is important for:

- probabilistic reasoning
- Naive Bayes classifiers
- Bayesian networks
- causal thinking
Definition#
Two variables \(X\) and \(Y\) are conditionally independent given \(Z\) if:
\(P(X \mid Y, Z) = P(X \mid Z)\)
Knowing \(Y\) provides no additional information about \(X\) once \(Z\) is known.
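Equivalently, conditional independence means the joint distribution factorizes given \(Z\):

\(P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)\)

This factorized form is the one exploited when a model (such as Naive Bayes) multiplies per-variable conditional probabilities.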
Figure: Conditional independence illustration. Source: Nagwa.
Intuition#
- \(X\) and \(Y\) may look related at first.
- After conditioning on \(Z\), the relationship disappears.
- In simple terms, \(Z\) explains the relationship.
Example#
Let:

- \(X\) = whether a student gets a high exam score
- \(Y\) = whether the student attends review sessions
- \(Z\) = how much the student studied
Exam scores and review attendance may appear related.
But once we know how much a student studied, attending review sessions may not add new information about the exam score.
In this case, \(X\) and \(Y\) are conditionally independent given \(Z\).
Small Python Simulation#
We simulate a classic structure: \(Z\) influences both \(X\) and \(Y\), so \(X\) and \(Y\) appear dependent, but they become independent once we condition on \(Z\).
```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 50_000

# Z is a hidden factor
Z = np.random.binomial(1, 0.5, size=n)

# X and Y depend on Z, but not on each other directly
X = np.random.binomial(1, 0.8 * Z + 0.2 * (1 - Z))
Y = np.random.binomial(1, 0.7 * Z + 0.3 * (1 - Z))

df = pd.DataFrame({"X": X, "Y": Y, "Z": Z})
df.head()
```
```python
# Check dependence vs conditional independence

# P(X=1 | Y=1)
p_x_given_y = df.loc[df["Y"] == 1, "X"].mean()

# P(X=1 | Y=1, Z=1)
p_x_given_y_z1 = df.loc[(df["Y"] == 1) & (df["Z"] == 1), "X"].mean()

# P(X=1 | Z=1)
p_x_given_z1 = df.loc[df["Z"] == 1, "X"].mean()

p_x_given_y, p_x_given_y_z1, p_x_given_z1
```
```
(np.float64(0.6200700962816742),
 np.float64(0.7999884386380716),
 np.float64(0.8002969264104004))
```
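The same check can be run in the other stratum, \(Z = 0\). Below is a small sketch that regenerates the same simulated structure (so it runs standalone) and compares \(P(X=1 \mid Y=1, Z=0)\) to \(P(X=1 \mid Z=0)\); by construction both should be close to 0.2.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 50_000

# Same simulated structure as above: Z drives both X and Y
Z = np.random.binomial(1, 0.5, size=n)
X = np.random.binomial(1, 0.8 * Z + 0.2 * (1 - Z))
Y = np.random.binomial(1, 0.7 * Z + 0.3 * (1 - Z))
df = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

# P(X=1 | Y=1, Z=0) vs P(X=1 | Z=0): both should be close to 0.2
p_x_given_y_z0 = df.loc[(df["Y"] == 1) & (df["Z"] == 0), "X"].mean()
p_x_given_z0 = df.loc[df["Z"] == 0, "X"].mean()
print(p_x_given_y_z0, p_x_given_z0)
```

Conditional independence holds within each value of \(Z\), not just the one we happened to inspect, so checking both strata makes the verification complete.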
Notice:

- If \(P(X \mid Y) \neq P(X)\), then \(X\) and \(Y\) are dependent.
- If \(P(X \mid Y, Z) = P(X \mid Z)\), then \(X\) and \(Y\) are conditionally independent given \(Z\).
- Once \(Z\) is known, \(Y\) adds no extra information about \(X\).
Data Science Insight#
Conditional independence simplifies probability models and is a key assumption in Naive Bayes and Bayesian networks.
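To make that concrete, here is a minimal Naive Bayes sketch: assuming the features are conditionally independent given the class, the likelihood factors into per-feature terms. The tiny binary dataset and the `predict` helper are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical binary dataset: rows are (feature_0, feature_1), labels in y
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 1, 0])

def predict(x, X, y):
    """Score each class by prior * product of per-feature likelihoods."""
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        # Naive Bayes assumption: P(x | c) = prod_j P(x_j | c),
        # with Laplace smoothing to avoid zero probabilities
        likelihood = 1.0
        for j in range(X.shape[1]):
            p1 = (np.sum(Xc[:, j]) + 1) / (len(Xc) + 2)
            likelihood *= p1 if x[j] == 1 else (1 - p1)
        scores[c] = prior * likelihood
    return max(scores, key=scores.get)

print(predict(np.array([1, 1]), X, y))  # -> 1
```

Without the conditional independence assumption, we would need to estimate the full joint \(P(x_0, x_1 \mid c)\), which grows exponentially in the number of features; the factorization reduces it to one small table per feature.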