import numpy as np
import pandas as pd
import math
from numpy.random import default_rng
rng = default_rng(42)
from scipy.stats import norm
import matplotlib as mpl
from matplotlib import pyplot as plt
import plotly.graph_objects as go
from tqdm import tqdm
mpl.rcParams['font.size'] = 18
Prove Theorem 7.3.
Solution:
\begin{align*} E\left[\hat{F}_n(x) \right] &= \frac{1}{n} \sum_{i=1}^n E[\mathbb{1}_{X_i \le x}] \\ &= \frac{1}{n} \sum_{i=1}^n P(X_i \le x) \\ &= \frac{1}{n} \sum_{i=1}^n F(x) = F(x) \end{align*}\begin{align*} V\left[\hat{F}_n(x)\right] &= E\left[(\hat{F}_n(x))^2 \right] - E\left[\hat{F}_n(x) \right]^2 \\ &= E\left[\frac{1}{n^2} \left(\sum_{i=1}^n \mathbb{1}_{X_i \le x}^2 + \sum_{i=1}^n \sum_{j \ne i} \mathbb{1}_{X_i \le x} \mathbb{1}_{X_j \le x}\right) \right] - F(x)^2\\ &= \frac{1}{n^2} \left(\sum_{i=1}^n F(x) + \sum_{i=1}^n \sum_{j \ne i} F(x) F(x)\right) - F(x)^2 \tag{$\mathbb{1}_{X_i \le x}^2 = \mathbb{1}_{X_i \le x}$}\\ &= \frac{1}{n^2} \left( nF(x) + n(n-1)F(x)^2\right) - F(x)^2 = \frac{F(x)(1-F(x))}{n} \end{align*}By the bias-variance decomposition, the mean squared error is equal to the variance, which decays to $0$ as $n\to\infty$.
Since the mean squared error converges to $0$, $\hat{F}_n(x) \xrightarrow{qm} F(x)$, and therefore, , $\hat{F}_n(x) \xrightarrow{P} F(x)$.
Let $X_1, \dots, X_n \sim \text{Bernoulli}(p)$ and let $Y_1, \dots, Y_m \sim \text{Bernoulli}(q)$. Find the plug-in estimator and estimated standard error for $p$. Find an approximate 90 percent confidence interval for $p$. Find the plug-in estimator and estimated standard error for $p - q$. Find an approximate 90 percent confidence interval for $p - q$.
Solution:
The plug-in estimator for $p$ is $\bar{X}_n$, which has variance $p(1-p) / n$. The estimated standard error is then $\sqrt{\bar{X}_n(1-\bar{X}_n) / n}$, and the 90% confidence interval is then $\bar{X}_n \pm z_{0.05}\sqrt{\bar{X}_n(1-\bar{X}_n) / n}$ ($z_{0.05} \approx 1.645$).
The plug-in estimator for $p - q$ is $\bar{X} - \bar{Y}$. The standard error is
\begin{align*} \text{se} &= \sqrt{V(p - q)} \\ &= \sqrt{V(p) + V(q)} \\ &= \sqrt{\frac{p(1-p)}{n} + \frac{q(1-q)}{m}} \end{align*}and the estimated standard error is
\begin{align*} \hat{\text{se}} &= \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{\hat{q}(1-\hat{q})}{m}}\\ &= \sqrt{ \frac{\bar{X}_n(1-\bar{X}_n)}{n} + \frac{\bar{Y}_m(1-\bar{Y}_m)}{m} }\\ \end{align*}and the 90% confidence interval is
\begin{align*} \bar{X} - \bar{Y} \pm z_{0.05} \sqrt{ \frac{\bar{X}_n(1-\bar{X}_n)}{n} + \frac{\bar{Y}_m(1-\bar{Y}_m)}{m} } \end{align*}Generate 100 observations from a $N(0,1)$ distribution. Compute a 95 percent confidence band for the CDF $F$ (as described in the appendix). Repeat this 1000 times and see how often the confidence band contains the true distribution function. Repeat using data from a Cauchy distribution.
Solution:
Computing a 95% confidence band for the CDF $F$:
n = 100 # number of observations
m = 1 # number of trials
X = np.sort(rng.normal(size=n))
alpha = 0.05
eps = math.sqrt((1 / (2 * n)) * math.log(2 / alpha))
fig = go.Figure()
fig.add_trace(go.Scatter(x=X, y=norm.cdf(X), name='True CDF', marker = {'color': 'black'}))
fig.add_trace(go.Scatter(x=X, y=np.arange(0, 1, 1 / n), name='Empirical Distribution Function'))
fig.add_trace(go.Scatter(x=X, y=np.minimum(np.arange(0, 1, 1 / n) + eps,1), name=f'{100*(1-alpha)}% UB'))
fig.add_trace(go.Scatter(x=X, y=np.maximum(np.arange(0, 1, 1 / n) - eps, 0), name=f'{100*(1-alpha)}% LB'))
fig.update_layout(
title=f"Empirical CDF with Confidence Bands",
xaxis_title=r"x")
fig.show()
Repeating 1000 times and see how often the confidence band contains the true distribution function.
contained_count = 0
m = 1000
for j in range(m):
X = rng.normal(size=n)
U = lambda x : np.minimum(np.sum(X < x) / n + eps, 1)
L = lambda x : np.maximum(np.sum(X < x) / n - eps, 0)
above = all(np.array((list(map(U, X)))) >= np.array(list(map(norm.cdf, X))))
below = all(np.array((list(map(L, X)))) <= np.array(list(map(norm.cdf, X))))
contained_count += (above and below)
print(f'The confidence band contains the true distribution in {100 * contained_count / m}% of the iterations')
The confidence band contains the true distribution in 96.6% of the iterations
Repeating with Cauchy distribution:
n = 100 # number of observations
m = 1 # number of trials
X = np.sort(rng.standard_cauchy(size=n))
alpha = 0.05
eps = math.sqrt((1 / (2 * n)) * math.log(2 / alpha))
fig = go.Figure()
fig.add_trace(go.Scatter(x=X, y=norm.cdf(X), name='True CDF', marker = {'color': 'black'}))
fig.add_trace(go.Scatter(x=X, y=np.arange(0, 1, 1 / n), name='Empirical Distribution Function'))
fig.add_trace(go.Scatter(x=X, y=np.minimum(np.arange(0, 1, 1 / n) + eps,1), name=f'{100*(1-alpha)}% UB'))
fig.add_trace(go.Scatter(x=X, y=np.maximum(np.arange(0, 1, 1 / n) - eps, 0), name=f'{100*(1-alpha)}% LB'))
fig.update_layout(
title=f"Empirical CDF with Confidence Bands",
xaxis_title=r"x")
fig.show()
contained_count = 0
m = 1000
for j in range(m):
X = rng.standard_cauchy(size=n)
U = lambda x : np.minimum(np.sum(X < x) / n + eps, 1)
L = lambda x : np.maximum(np.sum(X < x) / n - eps, 0)
above = all(np.array((list(map(U, X)))) >= np.array(list(map(norm.cdf, X))))
below = all(np.array((list(map(L, X)))) <= np.array(list(map(norm.cdf, X))))
contained_count += (above and below)
print(f'The confidence band contains the true distribution in {100 * contained_count / m}% of the iterations')
The confidence band contains the true distribution in 21.8% of the iterations
Let $X_1, \dots, X_n \sim F$ and let $\hat{F}_n(x)$ be the empirical distribution function. For a fixed $x$, use the central limit theorem to find the limiting distribution of $\hat{F}_n(x)$.
Solution:
Fixing $x$, and defining the random variable $Y_i = \mathbb{1}_{X_i \le x}$, we can express $\hat{F}_n(x)$ as a sample average:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i \le x} = \bar{Y}_n$$Since $E[Y_i] = F(x)$ and $V(Y_i) = F(x)(1-F(x))$, the CLT implies $\hat{F}_n(x)$ is distributed, approximately, as $N \left( F(x), \frac{F(x)(1-F(x))}{n} \right)$.
Let $x$ and $y$ be two distinct points. Find $\text{Cov}(\hat{F}_n(x), \hat{F}_n(y))$.
Solution:
\begin{align*} \text{Cov}(\hat{F}_n(x), \hat{F}_n(y)) &= E \left[\hat{F}_n(x)\hat{F}_n(y) \right] - E[\hat{F}_n(x)]E[\hat{F}_n(y)] \\ &= E \left[\frac{1}{n^2} \left(\sum_{i=1}^n \mathbb{1}_{X_i \le x}\right) \left( \sum_{i=1}^n \mathbb{1}_{X_i \le y} \right)\right] - F(x)F(y) \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n \sum_{j=1}^n P(X_i \le x \cap X_j \le y) \right) - F(x)F(y) \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n \sum_{j=1}^n P(X_i \le x \mid X_j \le y)P(X_j \le y) \right) - F(x)F(y) \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n F(\min\{x,y\}) + \sum_{i=1}^n \sum_{j \ne i} F(x)F(y) \right) - F(x)F(y) \\ &= \frac{1}{n}F(\min\{x,y\}) - \frac{1}{n}F(x)F(y) \end{align*}If $x \le y$, \begin{align*} \text{Cov}(\hat{F}_n(x), \hat{F}_n(y)) &= \frac{F(x)(1-F(y))}{n} \end{align*}
Let $X_1, \dots, X_n \sim F$ and let $\hat{F}$ be the empirical distribution function. Let $a < b$ be fixed numbers and define $\theta = T(F) = F(b) - F(a)$. Let $\hat{\theta} = T(\hat{F}_n) = \hat{F}_n(b) - \hat{F}_n(a)$. Find the estimated standard error of $\hat{\theta}$. Find an expression for an approximate $1 - \alpha$ confidence interval for $\theta$.
Solution: Using the results of Problem #1 and #5: \begin{align*} V(\hat{\theta}) &= V(\hat{F}_n(b) - \hat{F}_n(a)) \\ &= V(\hat{F}_n(b)) + V(\hat{F}_n(a)) - 2Cov(\hat{F}_n(a), \hat{F}_n(b)) \\ &= \frac{F(b)(1-F(b))}{n} + \frac{F(a)(1-F(a))}{n} - 2\frac{F(a)(1-F(b))}{n} \\ &= \frac{1}{n} \left(F(b) - F(a)\right)\left(1-(F(b) - F(a))\right)\\ &= \frac{\theta(1-\theta)}{n} \end{align*}
Thus, the estimated standard error is
\begin{align*} \hat{se} = \sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}} \end{align*}and a $1-\alpha$ confidence interval is given by $\hat{\theta} \pm z_{\alpha/2} \sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}}$.
Data on the magnitudes of earthquakes near Fiji are available on the website for this book. Estimate the CDF $F(x)$. Compute and plot a 95 percent confidence envelope for $F$ (as described in the appendix). Find an approximate 95 percent confidence interval for $F(4.9) - F(4.3)$.
Solution:
df = pd.read_csv('data/fijiquakes.txt', delim_whitespace=True)
magnitudes = np.sort(df['mag'].to_numpy())
n = magnitudes.size
alpha = 0.05
eps = math.sqrt((1 / (2 * n)) * math.log(2 / alpha))
F_hat = lambda x : np.sum(magnitudes < x) / n
U = lambda x : np.minimum(F_hat(x) + eps, 1)
L = lambda x : np.maximum(F_hat(x) - eps, 0)
plt.plot(np.sort(magnitudes), np.array(list(map(F_hat, magnitudes))), label='Empirical CDF')
plt.plot(np.sort(magnitudes), np.array(list(map(U, magnitudes))), label='UB')
plt.plot(np.sort(magnitudes), np.array(list(map(L, magnitudes))), label='LB')
plt.legend()
plt.grid()
plt.show()
theta_hat = F_hat(4.9) - F_hat(4.3)
se_hat = math.sqrt(theta_hat * (1 - theta_hat) / n)
z = norm.ppf(1-alpha/2)
print(f"95% confidence interval: ({theta_hat - z * se_hat:.3f}, {theta_hat + z * se_hat:.3f})")
95% confidence interval: (0.526, 0.588)
Get the data on eruption times and waiting times between eruptions of the old faithful geyser from the course website. Estimate the mean waiting time and give a standard error for the estimate. Also, give a 90% confidence interval for the mean waiting time. Now estimate the median waiting time. In the next chapter we will see how to get the standard error for the median.
Solution:
df = pd.read_csv('data/faithful.txt', skiprows=25, delim_whitespace=True)
waiting_times = df['waiting'].to_numpy()
n = waiting_times.size
mu_hat = np.sum(waiting_times) / n
se_hat = np.std(waiting_times, ddof=1) / np.sqrt(n)
print(f"Estimated mean: {mu_hat:.3f}")
print(f"Estimated standard error: {se_hat:.3f}")
alpha = 0.1
z = norm.ppf(1-alpha/2)
print(f"90% confidence interval for mean: ({mu_hat - z * se_hat:.3f},{mu_hat + z * se_hat:.3f})")
print(f"Estimated median: {np.median(waiting_times):.3f}")
Estimated mean: 70.897 Estimated standard error: 0.824 90% confidence interval for mean: (69.541,72.253) Estimated median: 76.000
100 people are given a standard antibiotic to treat an infection and another 100 are given a new antibiotic. In the first group, 90 people recover; in the second group, 85 people recover. Let $p_1$ be the probability of recovery under the standard treatment, and let $p_2$ be the probability of recovery under the new treatment. We are interested in estimating $\theta = p_1 - p_2$. Provide an estimate, standard error, an 80% confidence interval and a 95% confidence interval for $\theta$.
Solution:
We have $\hat{\theta} = \hat{p}_1 - \hat{p}_2$. From Problem #2, the standard error is
$$\sqrt{\frac{p_1(1-p_1)}{100} + \frac{p_2(1-p_2)}{100}},$$an estimated standard error is
$$\hat{se} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{100} + \frac{\hat{p}_2(1-\hat{p}_2)}{100}},$$and a $(1-\alpha)$ confidence interval is given by $\hat{\theta} \pm z_{\alpha/2} \hat{se}$.
p_1_hat = 0.9
p_2_hat = 0.85
theta_hat = p_1_hat - p_2_hat
print(f"Estimate: {theta_hat:.3f}")
se_hat = math.sqrt((1 / 100) * (p_1_hat * (1 - p_1_hat) + p_2_hat * (1 - p_2_hat)))
print(f"Estimated standard error: {se_hat:.3f}")
alpha= 0.2
z = norm.ppf(1 - alpha/2)
print(f"80% confidence interval: ({theta_hat - z * se_hat:.3f},{theta_hat + z * se_hat:.3f})")
alpha= 0.05
z = norm.ppf(1 - alpha/2)
print(f"95% confidence interval: ({theta_hat - z * se_hat:.3f},{theta_hat + z * se_hat:.3f})")
Estimate: 0.050 Estimated standard error: 0.047 80% confidence interval: (-0.010,0.110) 95% confidence interval: (-0.041,0.141)
In 1975, an experiment was conducted to see if cloud seeding produced rainfall. 26 clouds were seeded with silver nitrate and 26 were not. The decision to seed or not was made at random. Get the data from the provided link. Let $\theta$ be the difference in the mean precipitation from the two groups. Estimate $\theta$. Estimate the standard error of the estimate and produce a 95% confidence interval.
Solution:
\begin{align*} \text{se}^2 &= V(\theta)\\ &= V(\mu_1 - \mu_2) \\ &= V(\mu_1) + V(\mu_2) \\ &= \frac{1}{n^2} \sum_{i=1}^n V(X_{1i}) + \frac{1}{n^2} \sum_{i=1}^n V(X_{2i}) \\ &= \frac{1}{n}\left(\sigma^2_1 + \sigma^2_2\right) \end{align*}So the estimated standard error is
\begin{align*} \hat{se} &= \sqrt{\frac{1}{n}\left(\hat{\sigma}^2_1 + \hat{\sigma}^2_2\right)} \end{align*}df = pd.read_csv('data/clouds.txt', skiprows=29, delim_whitespace=True)
df
X_1 = df['Unseeded_Clouds'].to_numpy()
X_2 = df['Seeded_Clouds'].to_numpy()
mu_1_hat = np.mean(X_1)
mu_2_hat = np.mean(X_2)
theta_hat = mu_2_hat - mu_1_hat
print(f"The estimated difference in mean precipiation: {theta_hat:.3f}")
n = X_1.size
sigma_1 = np.std(X_1, ddof=1)
sigma_2 = np.std(X_2, ddof=1)
se_hat = math.sqrt((1 / n) * (sigma_1 ** 2 + sigma_2 ** 2))
print(f"The estimated standard error: {se_hat:.3f}")
alpha = 0.05
z = norm.ppf(1 - alpha / 2)
print(f"95% confidence interval for mean: ({theta_hat - z * se_hat:.3f},{theta_hat + z * se_hat:.3f})")
The estimated difference in mean precipiation: 277.396 The estimated standard error: 138.820 95% confidence interval for mean: (5.314,549.478)