TODO: #1 (arbitrary Bayes risk)

In [1]:

import numpy as np

from matplotlib import pyplot as plt
from scipy.stats import norm
from tqdm.notebook import tqdm

1¶

In each of the following models, find the Bayes risk and the Bayes estimator, using squared error loss.

(a) $X \sim \text{Binomial}(n, p)$, $p \sim \text{Beta}(\alpha, \beta)$.

(b) $X \sim \text{Poisson}(\lambda)$, $\lambda \sim \text{Gamma}(\alpha, \beta)$.

(c) $X \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known and $\theta \sim N(a, b^2)$.

Solution:

(a) Since we are using squared error loss, Theorem 12.8 says the Bayes estimator is the posterior mean.

The posterior is

\begin{align*} f(p \mid X=x) &\propto {n \choose x} p ^ x (1 - p) ^ {n - x} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} p ^ {\alpha - 1}(1 - p) ^ {\beta - 1} \\ &= \propto p^{\alpha + x - 1} (1 - p) ^ {\beta + n - x - 1}. \end{align*}

Thus, $p \mid X \sim \text{Beta}(\alpha + x, \beta + n - x)$, which has mean $\frac{\alpha + x}{\alpha + \beta + n}$.

(b) We show in Chapter 11, #6(a) that $\lambda \mid X \sim \text{Gamma}\left(\alpha + x, \frac{\beta}{\beta + 1}\right)$, which has mean $\frac{(\alpha + x)\beta}{\beta + 1}$.

(c) We show in Chapter 11, #1 that $\theta \mid X \sim N(\bar{\theta}, \tau^2)$, where

$$\bar{\theta} = wX + (1-w)a,$$$$w = \frac{\frac{1}{\text{se}^2}}{\frac{1}{\text{se}^2} + \frac{1}{b^2}}, \frac{1}{\tau^2} = \frac{1}{\text{se}^2} + \frac{1}{b^2},$$

and $\text{se} = \sigma$ is the standard error of the MLE $X$. Hence the Bayes estimator is $\bar{\theta}$.

2¶

Let $X_1, \dots, X_n \sim N(\theta, \sigma^2)$ and suppose we estimate $\theta$ with loss function $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 / \sigma^2$. Show that $\bar{X}$ is admissable and minimax.

Solution:

Since the risk is constant:

\begin{align*} R(\theta, \bar{X}) &= \int \frac{(\theta - \bar{X})^2}{\sigma^2} f(x; \theta) \, dx \\ &= \frac{1}{\sigma^2}\left(\text{bias}^2_{\theta}(\bar{X}) + V_{\theta}(\bar{X}) \right) \\ &= \frac{1}{\sigma^2}\left(0 + \frac{\sigma^2}{n} \right) = \frac{1}{n}, \end{align*}

by Theorem 12.21, it suffices to show that $\hat{\theta}$ is admissable.

We know, from Theorem 12.20, that $\hat{\theta}$ is admissable under squared error loss. Let $R^*(\theta, \hat{\theta})$ be the risk under squared loss. Note that $R^*(\theta, \hat{\theta}) = \sigma^2 R(\theta, \hat{\theta})$. Suppose that $\hat{\theta}$ is inadmissable under $L$. Then, there exists a $\hat{\theta'}$ such that: $$R(\theta, \hat{\theta'}) \le R(\theta, \hat{\theta})\quad \text{for all } \theta$$ and $$R(\theta, \hat{\theta'}) < R(\theta, \hat{\theta})\quad \text{for at least one } \theta$$ But then $$R^*(\theta, \hat{\theta'}) = \sigma^2 R(\theta, \hat{\theta'}) \le \sigma^2 R(\theta, \hat{\theta}) = R^*(\theta, \hat{\theta}) \quad \text{for all } \theta$$ and $$R^*(\theta, \hat{\theta'}) = \sigma^2 R(\theta, \hat{\theta'}) < \sigma^2 R(\theta, \hat{\theta}) = R^*(\theta, \hat{\theta}) \quad \text{for at least one } \theta$$ a contradiction.

3¶

Let $\Theta = \{\theta_1,\dots,\theta_k\}$ be a finite parameter space. Prove that the posterior mode is the Bayes estimator under zero–one loss.

Solution:

By Theorem 12.7, the choice of $\hat{\theta}$ that minimizes the posterior risk is the Bayes estimator. The posterior risk is:

\begin{align*} r(\hat{\theta} \mid x) &= \int L(\theta, \hat{\theta}(x))f(\theta \mid x) \, d\theta \\ &= \sum_{i=1}^k \mathbb{1}_{\hat{\theta} \ne \theta_i} f(\theta_i \mid x) \end{align*}

which is minimized when choosing $\hat{\theta}$ to be the choice of $\theta_i$ that maximizes $f(\theta_i \mid x)$, i.e., the posterior mode.

4¶

Let $X_1, \dots, X_n$ be a sample from a distribution with variance $\sigma^2$. Consider estimators of the form $bS^2$ where $S^2$ is the sample variance. Let the loss function for estimating $\sigma^2$ be

$$L(\sigma^2, \hat{\sigma}^2) = \frac{\hat{\sigma}^2}{\sigma^2} - 1 - \log \left( \frac{\hat{\sigma}^2}{\sigma^2} \right).$$

Find the optimal value of $b$ that minimizes the risk for all $\sigma^2$.

Solution:

Computing the risk:

\begin{align*} R(\sigma^2, bS^2) &= \int L(\sigma^2, bS^2) f(x; \sigma^2) \, dx \\ &= \int \left[\frac{bS^2}{\sigma^2} - 1 - \log \left( \frac{bS^2}{\sigma^2} \right) \right] f(x; \sigma^2) \, dx \\ &= \frac{b}{\sigma^2}\mathbb{E}(S^2) - 1 - \log(b) - \mathbb{E}(\log(S^2)) - \mathbb{E}(\log(\sigma^2)) \\ &= b - 1 - \log(b) + C. \end{align*}

Thus, the derivative of the risk with respect to $b$ is $1 - \frac{1}{b}$. Equating with 0 yields the optimal value $b^* = 1$.

5¶

Let $X \sim \text{Binomial}(n, p)$ and suppose the loss function is

$$L(p, \hat{p}) = \left(1 - \frac{\hat{p}}{p}\right)^2$$

where $0 < p < 1$. Consider the estimator $\hat{p}(X) = 0$. This estimator falls outside the parameter space $(0, 1)$ but we will allow this. Show that $\hat{p}(X)$ is the unique, minimax rule.

Solution: The risk of an estimator $\hat{p}'$ is: \begin{align*} R(p, \hat{p}') &= \int \left(1 - \frac{\hat{p}'}{p}\right)^2 f(x, p) \, dx \\ &= \frac{1}{p^2} \int (p - \hat{p}')^2 f(x, p) \, dx \\ &= \frac{1}{p^2} \left(p^2 - 2p\mathbb{E}(\hat{p}') + \mathbb{E}\hat{p}'^2\right) \\ \end{align*}

Since the moments of $\hat{p}$ are 0, the risk of $\hat{p}$ is 1 for all $p$. To show that $\hat{p}$ is the unique minimax rule, we must show that for any other estimator, $\hat{p}'$, we can identify a $p$ such that the risk is strictly greater than 1. Assume $\hat{p}'$ is restricted to $(0,1)$. Let $a$ and $b$ be the first and second moments of $\hat{p}'$, respectively. Then $a, b \in (0, 1)$, and,

\begin{align*} R(p, \hat{p}') &= \frac{1}{p^2} \left(p^2 - 2pa + b\right) \\ &= 1 - \frac{2a}{p} + \frac{b}{p^2} \end{align*}

To make the risk greater than one, we ensure:

\begin{align*} \frac{b}{p^2} &\ge \frac{2a}{p} \\ \Rightarrow p \le \frac{b}{2a} \end{align*}

6¶

(Computer Experiment.) Compute the risk of the MLE and the James Stein estimator by simulation. Try various values of $n$ and various vectors $\theta$. Summarize your findings.

In [2]:

n = 10000
k_max = 8
K = np.arange(1, k_max + 1, 1)

def loss(theta, theta_hat):
    return np.sum(np.power(theta - theta_hat, 2), axis=1)


thetas = [[np.random.normal(scale=1) for _ in range(k_max)],
          [k for k in range(k_max)]]


theta_names = ['$N(0,1)^k$', 'range(k)']

for theta, theta_name in zip(thetas, theta_names):
    mle_risks = []
    jse_risks = []
    for k in K:
        theta = [np.random.random() for _ in range(k)]

        X = norm.rvs(loc=theta[:k], scale=1, size = (n, k))

        mle = X
        jse = np.max(1 - (k - 2) / np.sum(np.power(X, 2)), 0) * X

        mle_risks.append(np.mean(loss(theta, mle)))
        jse_risks.append(np.mean(loss(theta, jse)))

    plt.plot(K, np.array(mle_risks) - np.array(jse_risks), label=f'theta={theta_name}')
    
plt.grid()
plt.title("(Simulated) Risk of MLE - Risk of James-Stein Estimator")
plt.xlabel("k")
plt.ylabel("Difference in Risk")
plt.legend()
plt.show()

For all vectors $\theta$ tested, the James Stein Estimator has larger risk than the MLE when $k = 1$, the same risk when $k=2$, and smaller risk for $k > 2$. The difference appears to be linear.