import graphviz as gv
import numpy as np
import pandas as pd
from scipy.stats import chi2
from IPython.display import display, Latex
Consider random variables $(X_1, X_2, X_3)$. In each of the following cases, draw a graph that has the given independence relations.
(a) $X_1 \amalg X_3 \mid X_2$.
(b) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$.
(c) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$ and $X_2 \amalg X_3 \mid X_1$.
(a)
g = gv.Graph()
# Chain X1 -- X2 -- X3: X2 separates X1 from X3, and this is the only
# conditional independence the graph encodes.
g.edge("X1", "X2")
g.edge("X2", "X3")
g
(b)
g = gv.Graph()
# X2 -- X3 with X1 isolated: X1 is separated from X2 and from X3,
# but X2 and X3 are adjacent.
g.edge("X2", "X3")
g.node("X1")
g
(c)
g = gv.Graph()
# Empty graph: every pair is separated given the remaining variable.
g.node("X1")
g.node("X2")
g.node("X3")
g
Consider random variables $(X_1, X_2, X_3, X_4)$. In each of the following cases, draw a graph that has the given independence relations.
(a) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_1 \amalg X_4 \mid X_2, X_3$ and $X_2 \amalg X_4 \mid X_1, X_3$.
g = gv.Graph()
# Chain X1 -- X2 -- X3 -- X4: each non-adjacent pair is separated by the
# other two variables.
g.edge("X1", "X2")
g.edge("X2", "X3")
g.edge("X3", "X4")
g
(b) $X_1 \amalg X_2 \mid X_3, X_4$ and $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_3 \mid X_1, X_4$.
g = gv.Graph()
# Star centered at X4: X4 separates every pair among X1, X2, X3.
g.edge("X1", "X4")
g.edge("X3", "X4")
g.edge("X2", "X4")
g
(c) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_4 \mid X_1, X_3$.
g = gv.Graph()
# 4-cycle: {X2, X4} separates X1 from X3, and {X1, X3} separates X2 from X4.
g.edge("X1", "X2")
g.edge("X2", "X3")
g.edge("X3", "X4")
g.edge("X1", "X4")
g
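These answers can be double-checked mechanically: in an undirected graph, $X_a \amalg X_b \mid S$ holds exactly when removing $S$ disconnects $X_a$ from $X_b$. Below is a minimal sketch of that check, assuming networkx is available (it is not imported above), applied to the 4-cycle from (c); the helper name separated is our own.
import networkx as nx

def separated(G, a, b, S):
    # True if removing the conditioning set S disconnects a from b
    H = G.copy()
    H.remove_nodes_from(S)
    return not nx.has_path(H, a, b)

G = nx.Graph([("X1", "X2"), ("X2", "X3"), ("X3", "X4"), ("X1", "X4")])
print(separated(G, "X1", "X3", ["X2", "X4"]))  # True: X1 ⊥ X3 | {X2, X4}
print(separated(G, "X2", "X4", ["X1", "X3"]))  # True: X2 ⊥ X4 | {X1, X3}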
A conditional independence relation between a pair of variables is minimal if it is not possible to use the Separation Theorem to eliminate any variable from the conditioning set, i.e. from the right-hand side of the bar (Whittaker, 1990). Write down the minimal conditional independencies from
(a) Figure 18.10.
(b) Figure 18.11.
(c) Figure 18.12.
(d) Figure 18.13.
(a)
(b)
(c)
(d)
Let $X_1, X_2, X_3$ be binary random variables. Construct the likelihood ratio test for
$$H_0: X_1 \amalg X_2 \mid X_3 \qquad \text{versus} \qquad H_1 : X_1 \text{ is not independent of } X_2 \mid X_3$$
Solution:
Recall the likelihood ratio test statistic:
$$ \begin{align*} \lambda &= 2 \log \left( \frac{\sup_{\theta \in \Theta} \mathcal{L}(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta)}\right) \\ &= 2 \left( \ell(\hat{\theta}) - \ell(\hat{\theta}_0)\right) \end{align*} $$where $\hat{\theta}$ and $\hat{\theta}_0$ are the MLEs under $H_1$ and $H_0$, respectively. Let $p_{ijk}$ be the probability of observing the triple $(i,j,k)$, let $C_{ijk}$ be the number of observed triples $(i,j,k)$ in the data, and let dots denote summation over the corresponding indices. Then the MLE of $p_{ijk}$ under $H_1$ is:
$$\hat{p}_{ijk} = C_{ijk} / C_{\cdot\cdot\cdot} = C_{ijk} / n$$Therefore,
$$\ell(\hat{\theta}) = \sum_{i,j,k} C_{ijk} \log \hat{p}_{ijk} = \sum_{i,j,k} C_{ijk} \log \frac{C_{ijk}}{n}$$Now, under $H_0$, we have:
$$ \begin{align*} p_{ijk} &= \mathbb{P}(X_1 = i, X_2 = j, X_3 = k) \\ &= \mathbb{P}(X_1 = i, X_2 = j \mid X_3 = k) \mathbb{P}(X_3 = k) \\ &= \mathbb{P}(X_1 = i \mid X_3 = k) \mathbb{P}(X_2 = j \mid X_3 = k) \mathbb{P}(X_3 = k) \\ &= p_{i \mid k} p_{j \mid k} p_{k} \end{align*} $$where $p_{i \mid k}$ is the probability of observing $X_1 = i$ given that $X_3 = k$ is observed. Similarly, $p_{j \mid k}$ is the probability of observing $X_2 = j$ given that $X_3 = k$ is observed. Finally, $p_k$ is the probability of observing $X_3 = k$. Then, the MLE of $p_{ijk}$ under $H_0$ is:
$$ \begin{align*} \hat{p}_{ijk} &= \hat{p}_{i \mid k} \hat{p}_{j \mid k} \hat{p}_{k} \\ &= \frac{C_{i \cdot k}}{C_{\cdot \cdot k}} \frac{C_{\cdot jk}}{C_{\cdot \cdot k}} \frac{C_{\cdot \cdot k}}{n} \\ &= \frac{C_{i \cdot k} C_{\cdot jk} }{n C_{\cdot \cdot k}} \end{align*} $$Therefore,
$$\ell(\hat{\theta}_0) = \sum_{i,j,k} C_{ijk} \log \left( \frac{C_{i \cdot k} C_{\cdot jk} }{n C_{\cdot \cdot k}} \right)$$and
$$ \begin{align*} \lambda &= 2 \sum_{i,j,k} C_{ijk} \log \left( \frac{C_{ijk} C_{\cdot \cdot k}}{C_{i \cdot k} C_{\cdot jk}} \right). \end{align*} $$Note the resemblance of the above expression to that in Theorem 15.3 (we merely add an extra summation over $k$).
Note there are 7 degrees of freedom under $H_1$ (since there are 8 parameters which must sum to 1), while there are 5 degrees of freedom under $H_0$ (the five parameters $p_{k}$, $p_{i \mid k =0}$, $p_{i \mid k =1}$, $p_{j \mid k = 0}$, and $p_{j \mid k =1}$). Thus the test has $7 - 5 = 2$ degrees of freedom, and the asymptotic distribution of $\lambda$ is $\chi^2_2$.
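Since we will need this statistic in question 5, here is a minimal numpy sketch of the test, reusing the np and chi2 imports above (the function name lr_test_ci and the $2 \times 2 \times 2$ array layout are our own conventions; it assumes every cell count is positive so each logarithm is defined).
def lr_test_ci(counts):
    # LR test of X1 ⊥ X2 | X3 for a 2x2x2 array of counts C[i, j, k];
    # returns lambda and its asymptotic chi^2_2 p-value
    C = np.asarray(counts, dtype=float)
    C_k = C.sum(axis=(0, 1), keepdims=True)  # C_{..k}
    C_ik = C.sum(axis=1, keepdims=True)      # C_{i.k}
    C_jk = C.sum(axis=0, keepdims=True)      # C_{.jk}
    lam = 2 * np.sum(C * np.log(C * C_k / (C_ik * C_jk)))
    return lam, chi2.sf(lam, df=2)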
Here are breast cancer data from Morrison et al. (1973) on diagnostic center ($X_1$), nuclear grade ($X_2$), and survival ($X_3$):
$$ \begin{array}{cccccc} \hline & X_2 & \text{malignant} & \text{malignant} & \text{benign} & \text{benign} \\ & X_3 & \text{died} & \text{survived} & \text{died} & \text{survived} \\ \hline X_1 & \text{Boston} & 35 & 59 & 47 & 112 \\ & \text{Glamorgan} & 42 & 77 & 26 & 76 \\ \hline \end{array} $$(a) Treat this as a multinomial and find the maximum likelihood estimator.
(b) If someone has a tumor classified as benign at the Glamorgan clinic, what is the estimated probability that they will die? Find the standard error for this estimate.
(c) Test the following hypotheses:
$$ \begin{gather*} X_1 \amalg X_2 \mid X_3 \qquad \text{versus} \qquad X_1 \not\amalg X_2 \mid X_3 \\ X_1 \amalg X_3 \mid X_2 \qquad \text{versus} \qquad X_1 \not\amalg X_3 \mid X_2 \\ X_2 \amalg X_3 \mid X_1 \qquad \text{versus} \qquad X_2 \not\amalg X_3 \mid X_1 \end{gather*} $$Use the test from question 4. Based on the results of your tests, draw and interpret the resulting graph.
Solution:
Encode $\text{Boston} \leftrightarrow 0$, $\text{Glamorgan} \leftrightarrow 1$, $\text{malignant} \leftrightarrow 0$, $\text{benign} \leftrightarrow 1$, and $\text{died} \leftrightarrow 0$, $\text{survived} \leftrightarrow 1$.
counts = pd.DataFrame(
    data=[[35, 59, 47, 112], [42, 77, 26, 76]],
    index=["Boston", "Glamorgan"],
    columns=[
        ["malignant", "malignant", "benign", "benign"],
        ["died", "survived", "died", "survived"],
    ],
)
counts
|           | malignant, died | malignant, survived | benign, died | benign, survived |
|-----------|----------------:|--------------------:|-------------:|-----------------:|
| Boston    | 35 | 59 | 47 | 112 |
| Glamorgan | 42 | 77 | 26 | 76 |
(a) Treating the table as a single multinomial, the MLEs are simply the sample proportions $\hat{p}_{ijk} = C_{ijk} / n$:
mle = counts / counts.values.sum()
mle
|           | malignant, died | malignant, survived | benign, died | benign, survived |
|-----------|----------------:|--------------------:|-------------:|-----------------:|
| Boston    | 0.073840 | 0.124473 | 0.099156 | 0.236287 |
| Glamorgan | 0.088608 | 0.162447 | 0.054852 | 0.160338 |
(b) With the encoding above, "died" corresponds to $X_3 = 0$, so the probability in question is
$$ \tau = P(X_3 = 0 \mid X_1 = 1, X_2 = 1) = \frac{P(X_1 = 1, X_2 = 1, X_3 = 0)}{P(X_1 = 1, X_2 = 1, X_3 = 0) + P(X_1 = 1, X_2 = 1, X_3 = 1)}, $$with plug-in estimate
$$ \hat{\tau} = \frac{\hat{p}_{110}}{\hat{p}_{110} + \hat{p}_{111}} = \frac{0.054852}{0.054852 + 0.160338} \approx 0.2549. $$We'll find the standard error using the delta method. The log-likelihood is:
$$\ell(p) = \sum_{ijk} C_{ijk} \log(p_{ijk})$$and so the Hessian is
$$ H = - \begin{bmatrix} C_{000} / p_{000}^2 & 0 & \dots & 0 \\ 0 & C_{001} / p_{001}^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & C_{111} / p_{111}^2 \\ \end{bmatrix} $$which has negated expected value:
$$ I_n = -\mathbb{E}[H] = n \begin{bmatrix} 1/ p_{000} & 0 & \dots & 0 \\ 0 & 1/ p_{001} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 / p_{111} \\ \end{bmatrix} $$since $\mathbb{E}[C_{ijk}] = n p_{ijk}$. Therefore,
$$ J_n = I_n^{-1} = \frac{1}{n} \begin{bmatrix} p_{000} & 0 & \dots & 0 \\ 0 & p_{001} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & p_{111} \\ \end{bmatrix} $$Now, let $g(p) = \frac{p_{110}}{p_{111} + p_{110}}$. We have:
$$ \nabla g = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \frac{p_{111}}{(p_{111} + p_{110})^2} \\ -\frac{p_{110}}{(p_{111} + p_{110})^2} \end{bmatrix} $$where the last two entries are $\partial g / \partial p_{110}$ and $\partial g / \partial p_{111}$ (the sign of the last entry does not affect the quadratic form below). The delta method says the estimated standard error of $\hat{\tau}$ is
$$ \begin{align*} \hat{\texttt{se}}(\hat{\tau}) &= \sqrt{(\hat{\nabla} g)^T \hat{J}_n (\hat{\nabla} g)} \tag{hats indicate evaluation at the MLE}\\ &= \sqrt{\frac{\hat{p}_{110}^2 \hat{p}_{111}}{n (\hat{p}_{111} + \hat{p}_{110})^4} + \frac{\hat{p}_{110} \hat{p}_{111}^2}{n (\hat{p}_{111} + \hat{p}_{110})^4}} \\ &= \sqrt{\frac{\hat{p}_{110}\hat{p}_{111}}{n(\hat{p}_{111} + \hat{p}_{110})^3}} \\ &= \sqrt{\frac{C_{110} C_{111}}{(C_{110} + C_{111})^3}} \\ &= \frac{1}{C_{110} + C_{111}} \sqrt{\frac{C_{110} C_{111}}{C_{110} + C_{111}}} \\ &\approx 0.043 \end{align*} $$Note that we also could have computed the standard error by recognizing that $\tau$ is the parameter of the Bernoulli-distributed $X_3 \mid X_1 = 1, X_2 = 1$, and using the familiar formula $\sqrt{\hat{\tau}(1 - \hat{\tau})/n'}$, where $n' = C_{110} + C_{111}$.
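As a quick numeric check of the closed-form expressions above, using the benign/Glamorgan counts $C_{110} = 26$ and $C_{111} = 76$ from the table:
C110, C111 = 26, 76  # benign tumors at Glamorgan: died, survived
n_prime = C110 + C111
tau_hat = C110 / n_prime
se_delta = np.sqrt(C110 * C111 / n_prime**3)               # delta-method form
se_bernoulli = np.sqrt(tau_hat * (1 - tau_hat) / n_prime)  # Bernoulli form
print(tau_hat, se_delta, se_bernoulli)  # ≈ 0.255, 0.043, 0.043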
(c)
# counts[i][j][k]: i = center (0 Boston, 1 Glamorgan),
# j = grade (0 malignant, 1 benign), k = survival (0 died, 1 survived)
counts = [[[35, 59], [47, 112]], [[42, 77], [26, 76]]]
def C(i, j, k):
    return counts[i][j][k]
Testing:
$$ X_1 \amalg X_2 \mid X_3 \qquad \text{versus} \qquad X_1 \not\amalg X_2 \mid X_3 $$
lam = 0
for i in [0, 1]:
    for j in [0, 1]:
        for k in [0, 1]:
            # num / den = C_ijk * C_..k / (C_i.k * C_.jk)
            num = C(i, j, k) * (C(0, 0, k) + C(0, 1, k) + C(1, 0, k) + C(1, 1, k))
            den = (C(i, 0, k) + C(i, 1, k)) * (C(0, j, k) + C(1, j, k))
            lam += 2 * C(i, j, k) * np.log(num / den)
display(Latex(f"$\lambda = {lam:.4f}$"))
display(Latex(f"$\chi^2_{{2, \\alpha}} = {chi2.ppf(0.95, 2):.4f}$"))
Since $\lambda > \chi^2_{2, \alpha}$, we reject the null hypothesis that $X_1 \amalg X_2 \mid X_3$.
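As a cross-check, the lr_test_ci sketch from question 4 reproduces this statistic directly from the nested counts (its array layout matches counts[i][j][k]):
lam_check, p_value = lr_test_ci(counts)  # same lambda as the loop above
display(Latex(f"$\\lambda = {lam_check:.4f}$, $p = {p_value:.4f}$"))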
Testing:
$$ X_1 \amalg X_3 \mid X_2 \qquad \text{versus} \qquad X_1 \not\amalg X_3 \mid X_2 $$
lam = 0
for i in [0, 1]:
    for j in [0, 1]:
        for k in [0, 1]:
            # num / den = C_ijk * C_.j. / (C_.jk * C_ij.)
            num = C(i, j, k) * (C(0, j, 0) + C(0, j, 1) + C(1, j, 0) + C(1, j, 1))
            den = (C(0, j, k) + C(1, j, k)) * (C(i, j, 0) + C(i, j, 1))
            lam += 2 * C(i, j, k) * np.log(num / den)
display(Latex(f"$\lambda = {lam:.4f}$"))
display(Latex(f"$\chi^2_{{2, \\alpha}} = {chi2.ppf(0.95, 1):.4f}$"))
Since $\lambda < \chi^2_{2, \alpha}$, we cannot reject the null hypothesis. Finally, testing:
$$ X_2 \amalg X_3 \mid X_1 \qquad \text{versus} \qquad X_2 \not\amalg X_3 \mid X_1 $$
lam = 0
for i in [0, 1]:
    for j in [0, 1]:
        for k in [0, 1]:
            # num / den = C_ijk * C_i.. / (C_i.k * C_ij.)
            num = C(i, j, k) * (C(i, 0, 0) + C(i, 0, 1) + C(i, 1, 0) + C(i, 1, 1))
            den = (C(i, 0, k) + C(i, 1, k)) * (C(i, j, 0) + C(i, j, 1))
            lam += 2 * C(i, j, k) * np.log(num / den)
display(Latex(f"$\lambda = {lam:.4f}$"))
display(Latex(f"$\chi^2_{{2, \\alpha}} = {chi2.ppf(0.95, 2):.4f}$"))
Once again, we cannot reject the null hypothesis. Thus we have evidence for the following statements:
$$ \begin{gather*} X_1 \not\amalg X_2 \mid X_3 \\ X_1 \amalg X_3 \mid X_2 \\ X_2 \amalg X_3 \mid X_1 \end{gather*} $$Expressed as an undirected graph, we have the following:
g = gv.Graph()
# X1 -- X2 with X3 isolated: encodes X1 ⊥ X3 | X2 and X2 ⊥ X3 | X1
# (and, since X3 is disconnected, marginal independence of X3 as well).
g.edge("X1", "X2")
g.node("X3")
From this graph, we see that $X_1$ (diagnostic center) and $X_2$ (nuclear grade) are dependent, while $X_3$ (survival) is independent of both: conditionally, as tested above, and marginally as well, since $X_3$ is disconnected from the rest of the graph.