import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy.stats import norm, t
from sklearn.linear_model import LinearRegression
pd.set_option("display.max_rows", 100)
pd.set_option("display.precision", 3)
Prove Theorem 13.4:
The least squares estimates are given by
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^n (X_i - \bar{X}_n)^2},$$$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n.$$An unbiased estimate of $\sigma^2$ is
$$\hat{\sigma}^2 = \left(\frac{1}{n-2}\right)\sum_{i=1}^n \hat{\epsilon}_i^2.$$Solution:
We first identify the least squares estimates. Writing out the residual sums of squares:
\begin{align*} \text{RSS} &= \sum_{i=1}^n \hat{\epsilon}_i^2 \\ &= \sum_{i=1}^n \left[Y_i - \left(\hat{\beta}_0 + \hat{\beta}_1X_i\right)\right]^2 \end{align*}Setting the derivative with respect to $\hat{\beta}_0$ equal to 0:
\begin{align*} \frac{\partial \text{RSS}}{\partial \hat{\beta}_0} &= \sum_{i=1}^n -2 \left[Y_i - \left(\hat{\beta}_0 + \hat{\beta}_1X_i\right)\right] = 0 \\ \Rightarrow \sum_{i=1}^n Y_i &= n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^n X_i \Rightarrow \hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n \end{align*}Making the above substitution, then doing the same for $\hat{\beta}_1$:
\begin{align*} \frac{\partial \text{RSS}}{\partial \hat{\beta}_1} &= \sum_{i=1}^n -2 \left[Y_i - \left(\bar{Y}_n - \hat{\beta}_1\bar{X}_n + \hat{\beta}_1X_i\right)\right](X_i - \bar{X}_n) = 0 \\ &= \sum_{i=1}^n -2 \left[(Y_i - \bar{Y}_n) - \hat{\beta}_1\left(X_i - \bar{X}_n \right)\right](X_i - \bar{X}_n) = 0 \\ \Rightarrow \hat{\beta}_1\sum_{i=1}^n \left(X_i - \bar{X}_n \right)^2 &= \sum_{i=1}^n (Y_i - \bar{Y}_n)(X_i - \bar{X}_n) \\ \Rightarrow \hat{\beta}_1 &= \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^n (X_i - \bar{X}_n)^2}. \end{align*}Now we show that $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$. We shall prove the general case with $k$ covariates, following this excellent Stack Exchange answer and filling in some gaps.
The residuals can be expressed as a linear transformation of the error terms:
\begin{align*} \hat{\epsilon} &= Y - \hat{Y} = Y - X (X^T X) ^ {-1} X^T Y\\ &= (I - X (X^T X) ^ {-1} X^T) Y \\ &= QY = Q(X\beta + \epsilon) = Q\epsilon. \end{align*}The last equality holds because $QX = X - X (X^T X) ^ {-1} X^T X = 0$. We observe some properties of the linear transformation. We have symmetry: \begin{align*} Q^T &= (I - X (X^T X) ^ {-1} X^T)^T \\ &= I^T - X (X^T X) ^ {-1} X^T = Q, \end{align*} idempotence: \begin{align*} Q^2 &= (I - X (X^T X) ^ {-1} X^T)(I - X (X^T X) ^ {-1} X^T) \\ &= I - 2X (X^T X) ^ {-1} X^T + X (X^T X) ^ {-1} X^T X (X^T X) ^ {-1} X^T \\ &= I - X (X^T X) ^ {-1} X^T = Q \\ \end{align*} and, by properties of the trace operator: \begin{align*} \text{tr}(Q) &= \text{tr}(I - X (X^T X) ^ {-1} X^T) \\ &= \text{tr}(I) - \text{tr}(X (X^T X) ^ {-1} X^T) \tag{linearity} \\ &= n - \text{tr}(X^T X (X^T X) ^ {-1}) \tag{invariance under cyclic permutation} \\ &= n - \text{tr}(I_k) = n - k. \end{align*}
Since $Q$ is idempotent, its eigenvalues can only be 0 or 1: for any eigenpair $(\lambda, v)$, we have
$$\lambda v = Qv = Q^2v = \lambda^2 v \Rightarrow \lambda = \lambda^2.$$Since the trace of a matrix is the sum of its eigenvalues (counting multiplicities), it must be the case that 1 is an eigenvalue with multiplicity $n - k$ and 0 is an eigenvalue with multiplicity $k$.
Since $Q$ is real and symmetric, it is normal (i.e., it commutes with its conjugate transpose). Then, by the spectral theorem, it is unitarily diagonalizable. That is, there exists a unitary $U$ such that:
$$U^TQU = D = \text{diag}(\underbrace{1,\dots,1}_{n-k \text{ times}},\underbrace{0,\dots,0}_{k \text{ times}}).$$Let $K = U^T \hat{\epsilon}$. Since $U$ is unitary, so is $U^T$, and because $U^T$ is unitary, it is an isometry with respect to the usual norm. Thus, $||K||^2 = ||U^T \hat{\epsilon} || ^2 = ||\hat{\epsilon}||^2 = \sum_{i=1}^n \hat{\epsilon}_i^2$.
Meanwhile, since $\epsilon \sim N(0, \sigma^2 I)$, $\hat{\epsilon} = Q\epsilon \sim N(0, \sigma^2 Q)$, and $K = U^T\hat{\epsilon} \sim N(0, \sigma^2 D)$ (cf. the Stack Exchange answer referenced above). Thus, $K_{n-k+1} = \dots = K_n = 0$, and $||K||^2 / \sigma^2$ is a sum of the squares of $n-k$ independent standard normal random variables. That is, $||K||^2 / \sigma^2 \sim \chi^2_{n-k}.$ Thus,
$$\frac{\sum_{i=1}^n \hat{\epsilon}_i^2}{\sigma^2} \sim \chi^2_{n-k},$$and, taking expectation, we have
$$\mathbb{E}(\hat{\sigma}^2) = \mathbb{E} \left[ \frac{\sum_{i=1}^n \hat{\epsilon}_i^2}{n-k} \right] = \sigma^2.$$We don't actually need normality of the error terms. Since $\text{Var}(\epsilon) = \sigma^2 I$, $\text{Var}(\hat{\epsilon}) = \text{Var}(Q\epsilon) = Q (\sigma^2 I) Q^T = \sigma^2 Q$, and $\text{Var}(K) = \text{Var}(U^T \hat{\epsilon}) = \sigma^2 U^TQU = \sigma^2D$. Meanwhile, $\mathbb{E}(K) = 0$. Thus, $||K||^2 / \sigma^2$ is the sum of the squares of $n-k$ uncorrelated random variables, each with expectation 0 and variance 1, so its expectation is $n-k$ and the same conclusion follows.
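As a quick numerical check (a sketch on simulated data, not part of the original solution; the names `n_chk`, `k_chk`, `X_chk` are illustrative), we can verify the three properties of $Q$ used above:
# Numerical check of Q = I - X (X^T X)^{-1} X^T: symmetry, idempotence, tr(Q) = n - k
rng = np.random.default_rng(0)
n_chk, k_chk = 50, 3
X_chk = rng.normal(size=(n_chk, k_chk))
Q = np.eye(n_chk) - X_chk @ np.linalg.inv(X_chk.T @ X_chk) @ X_chk.T
print(np.allclose(Q, Q.T), np.allclose(Q @ Q, Q), np.isclose(np.trace(Q), n_chk - k_chk))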
Prove the formulas for the standard errors in Theorem 13.8. You should regard the $X_i$'s as fixed constants.
Solution:
We shall compute the covariance matrix of $\hat{\beta}$, conditional on $X^n$. For notational simplicity, all variances and covariances are assumed to be conditional on $X^n$.
First, note that
\begin{align*} \sum_i \left[(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)\right] &= \sum_i \left[(X_i - \bar{X}_n)Y_i\right] - \bar{Y}_n\sum_i \left[X_i - \bar{X}_n \right] \\ &= \sum_i \left[(X_i - \bar{X}_n)Y_i\right] - \bar{Y}_n\left[n\bar{X}_n - n\bar{X}_n \right] \\ &= \sum_i \left[(X_i - \bar{X}_n)Y_i\right], \end{align*}so
\begin{align*} \hat{\beta_1} = \frac{\sum_i \left[(X_i - \bar{X}_n)Y_i\right]}{\sum_i \left[(X_i - \bar{X}_n)^2 \right]}, \end{align*}and we may compute the variance of $\hat{\beta}_1$: \begin{align*} V(\hat{\beta_1}) &= V \left[ \frac{\sum_i \left[(X_i - \bar{X}_n)Y_i\right]}{\sum_i \left[(X_i - \bar{X}_n)^2 \right]} \right] \\ &= \frac{1}{(ns_X^2)^2} V \left[\sum_i (X_i - \bar{X}_n)Y_i\right] \\ &= \frac{1}{(ns_X^2)^2} V \left[\sum_i (X_i - \bar{X}_n)(\beta_0 + \beta_1 X_i + \epsilon_i)\right] \\ &= \frac{1}{(ns_X^2)^2} V \left[\sum_i (X_i - \bar{X}_n)\epsilon_i\right] \tag{Only $\epsilon_i$ is random}\\ &= \frac{ns_X^2 \sigma^2}{(ns_X^2)^2} = \frac{\sigma^2}{ns_X^2}. \tag{1} \end{align*}
Meanwhile,
\begin{align*} V(\hat{\beta_0}) &= V\left[\bar{Y}_n - \hat{\beta}_1\bar{X}_n\right] \\ &= V(\bar{Y}_n) + \bar{X}_n^2 V(\hat{\beta}_1) - 2 \bar{X}_n \text{Cov}(\bar{Y}_n, \hat{\beta}_1). \end{align*}Computing each term:
\begin{align*} V(\bar{Y}_n) &= V \left[\frac{1}{n} \sum_i Y_i \right] \\ &= \frac{1}{n^2} V \left[ \sum_i \beta_0 + \beta_1 X_i + \epsilon_i \right] = \frac{1}{n^2} V \left[\sum_i \epsilon_i \right] \\ &= \frac{1}{n^2} (n \sigma^2) = \frac{\sigma^2}{n}, \end{align*}\begin{align*} \bar{X}_n^2 V(\hat{\beta}_1) &= \frac{\sigma^2 \bar{X}_n^2 }{ns_X^2}, \end{align*}\begin{align*} \text{Cov}(\bar{Y}_n, \hat{\beta}_1) &= \text{Cov} \left (\left[\frac{1}{n} \sum_i Y_i \right], \left[ \frac{\sum_i \left[(X_i - \bar{X}_n)Y_i\right]}{\sum_i \left[(X_i - \bar{X}_n)^2 \right]} \right] \right) \\ &= \frac{1}{n\sum_i \left[(X_i - \bar{X}_n)^2 \right]} \text{Cov} \left (\sum_i Y_i, \left[\sum_i (X_i - \bar{X}_n)Y_i \right] \right) \\ \end{align*}Now, by the distributive property of covariance: \begin{align*} \text{Cov} \left (\sum_i Y_i, \left[\sum_i (X_i - \bar{X}_n)Y_i \right] \right) &= \sum_i \text{Cov} \left ( Y_i, \left[\sum_j \left[(X_j - \bar{X}_n)Y_j\right] \right] \right) \\ &= \sum_i \sum_j (X_j - \bar{X}_n) \text{Cov} \left ( Y_i, Y_j \right) \\ &= \sum_i (X_i - \bar{X}_n) \sigma^2 = 0. \tag{$\text{Cov} \left ( Y_i, Y_j \right) = \delta_{ij}\sigma^2$}\\ \end{align*}
So,
\begin{align*} V(\hat{\beta_0}) &= \frac{\sigma^2}{n} + \frac{\sigma^2 \bar{X}_n^2 }{ns_X^2} \\ &= \frac{\sigma^2 (s_X^2 + \bar{X}_n^2)}{ns_X^2} = \frac{\sigma^2 \sum_i X_i^2}{n^2s_X^2}, \tag{2} \end{align*}where the last equality uses $s_X^2 + \bar{X}_n^2 = \frac{1}{n}\sum_i X_i^2$. Finally, we compute the covariance of the slope and intercept terms:
\begin{align*} \text{Cov}(\hat{\beta_0}, \hat{\beta_1}) &= \text{Cov}(\bar{Y}_n - \hat{\beta}_1 \bar{X}_n, \hat{\beta}_1) \\ &= \text{Cov}(\bar{Y}_n, \hat{\beta}_1) - \bar{X}_n V(\hat{\beta}_1) \\ &= 0 - \frac{\sigma^2 \bar{X}_n}{n s_X^2}. \tag{3} \end{align*}From (1), (2), and (3), we get the covariance matrix. The standard errors are obtained by taking the square roots of the corresponding diagonal terms, and the estimated standard errors by then substituting the estimate $\hat{\sigma}$ for $\sigma$.
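As a sanity check (a small sketch on simulated covariates, not part of the original solution), the entries (1)-(3) agree with the matrix form $\sigma^2 (X^T X)^{-1}$ for the design matrix $[\,\mathbf{1}, X\,]$:
# Compare the derived variances/covariance (1)-(3) with sigma^2 (X^T X)^{-1}
rng = np.random.default_rng(0)
n_chk, sigma_chk = 30, 2.0
x_chk = rng.normal(size=n_chk)
s2_chk = np.mean((x_chk - x_chk.mean()) ** 2)
design = np.column_stack([np.ones(n_chk), x_chk])
V_matrix = sigma_chk**2 * np.linalg.inv(design.T @ design)
V_formula = np.array(
    [
        [sigma_chk**2 * np.sum(x_chk**2) / (n_chk**2 * s2_chk), -(sigma_chk**2) * x_chk.mean() / (n_chk * s2_chk)],
        [-(sigma_chk**2) * x_chk.mean() / (n_chk * s2_chk), sigma_chk**2 / (n_chk * s2_chk)],
    ]
)
print(np.allclose(V_matrix, V_formula))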
Consider the regression through the origin model:
$$Y_i = \beta X_i + \epsilon_i.$$Find the least squares estimate for $\beta$. Find the standard error of the estimate. Find conditions that guarantee that the estimate is consistent.
Solution:
The residual is:
\begin{align*} \hat{\epsilon}_i &= Y_i - \hat{Y}_i \\ &= Y_i - \hat{\beta} X_i; \end{align*}whence we get the residual sums of squares (RSS):
\begin{align*} \sum_i \hat\epsilon_i^2 = \sum_i (Y_i - \hat\beta X_i)^2. \end{align*}Minimizing the RSS with respect to $\hat\beta$ yields the least squares estimate for $\beta$:
\begin{align*} 0 &= \sum_i -2(Y_i - \hat{\beta} X_i) X_i \\ \Rightarrow \hat{\beta} &= \frac{\sum_i X_i Y_i}{\sum_i X_i^2}, \end{align*}which has variance:
\begin{align*} V(\hat{\beta}) &= V \left[ \frac{\sum_i X_i Y_i}{\sum_i X_i^2} \right] \\ &= \frac{1}{(\sum_i X_i^2 ) ^2} V(\sum_i X_i \epsilon_i) \\ &= \frac{\sigma^2}{\sum_i X_i^2}. \end{align*}Thus, the standard error is
$$\frac{\sigma}{\sqrt{\sum_i X_i^2}}.$$Since $\hat{\beta}$ is unbiased and $V(\hat{\beta}) = \sigma^2 / \sum_i X_i^2$, the estimate is consistent provided $\sum_i X_i^2 \to \infty$ as $n \to \infty$, so that the variance goes to 0.
Prove equation 13.25:
$$\text{bias}(\hat{R}_{\text{tr}}(S)) = E(\hat{R}_{\text{tr}}(S)) - R(S) = -2 \sum_{i=1}^n \text{Cov}(\hat{Y}_i, Y_i).$$Solution:
\begin{align*} E(\hat{R}_{\text{tr}}(S)) - R(S) &= E \left[ \sum_i (\hat{Y}_i - Y_i)^2 \right] - E \left[ \sum_i (\hat{Y}_i - Y_i^*)^2\right] \\ &= \sum_i \left( E\left[\hat{Y}_i^2\right] - 2E\left[\hat{Y}_i Y_i\right] + E \left[Y_i^2\right] - E\left[\hat{Y}_i^2\right] + 2 E \left[ \hat{Y}_i Y_i^* \right] - E \left[ (Y_i^*)^2\right] \right) \end{align*}Note that $E \left[ (Y_i^*)^2\right] = V\left[Y_i^*\right] + E \left[Y_i^*\right]^2 = V[\epsilon_{\text{pred}}] + E[X_i\beta]^2 = V[\epsilon_{\text{tr}}] + E[X_i\beta]^2 = E \left[Y_i^2\right]$. Also, note that $\hat{Y}_i$ and $Y_i^*$ are independent, since they are generated from independent sources of randomness. Thus, $E \left[ \hat{Y}_i Y_i^* \right] = E \left[ \hat{Y}_i \right] E\left[Y_i^* \right]$. Finally, note that $E[Y_i^*] = X_i\beta = E[Y_i]$. Therefore,
\begin{align*} E(\hat{R}_{\text{tr}}(S)) - R(S) &= -2 \sum_i \left( E\left[\hat{Y}_i Y_i\right] - E \left[ \hat{Y}_i \right] E\left[Y_i^* \right] \right) \\ &= -2 \sum_i \left( E\left[\hat{Y}_i Y_i\right] - E \left[ \hat{Y}_i \right] E\left[Y_i \right] \right) = -2 \sum_{i=1}^n \text{Cov}(\hat{Y}_i, Y_i). \end{align*}In the simple linear regression model, construct a Wald test for $H_0 : \beta_1 = 17\beta_0$ versus $H_1 : \beta_1 \ne 17\beta_0$.
Solution:
Let $\delta = \beta_1 - 17 \beta_0$. The MLE of $\delta$ is $\hat{\delta} = \hat{\beta}_1 - 17 \hat{\beta}_0$. The variance of $\hat{\delta}$ is:
\begin{align*} V(\hat{\delta}) &= V(\hat{\beta}_1) + 17^2 V(\hat{\beta}_0) - 2 \cdot 17 \, \text{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ &= \frac{\sigma^2}{ns_X^2} + \frac{17^2 \sigma^2 \sum_i X_i^2}{n^2s_X^2} + \frac{2 \cdot 17 \, \sigma^2 \bar{X}_n}{n s_X^2} \\ &= \frac{\sigma^2}{ns_X^2}\left(1 + \frac{17^2}{n}\sum_i X_i^2 + 2 \cdot 17 \, \bar{X}_n\right). \end{align*}Substituting $\hat{\sigma}$ for $\sigma$, the estimated standard error is:
\begin{align*} \widehat{\text{se}}(\hat{\delta}) = \frac{\hat{\sigma}}{\sqrt{n} s_X} \sqrt{1 + \frac{17^2}{n}\sum_i X_i^2 + 2 \cdot 17 \, \bar{X}_n} \end{align*}and the Wald test rejects the null hypothesis if the absolute value of
$$W = \frac{\hat{\beta}_1 - 17 \hat{\beta}_0}{\widehat{\text{se}}(\hat{\delta})}$$exceeds $z_{\alpha/2}$.
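For illustration only (a sketch on simulated data; `x_sim`, `y_sim`, and the true coefficients are made up), the statistic can be computed from the estimated covariance matrix of $(\hat{\beta}_0, \hat{\beta}_1)$:
# Wald test of H0: beta_1 = 17 * beta_0 on simulated data where H0 is false
rng = np.random.default_rng(0)
n_sim = 200
x_sim = rng.uniform(0, 10, n_sim)
y_sim = 1.0 + 2.0 * x_sim + rng.normal(0, 1.0, n_sim)  # beta_1 != 17 * beta_0
design = np.column_stack([np.ones(n_sim), x_sim])
beta_sim = np.linalg.lstsq(design, y_sim, rcond=None)[0]
resid = y_sim - design @ beta_sim
sigma2_sim = resid @ resid / (n_sim - 2)
cov_sim = sigma2_sim * np.linalg.inv(design.T @ design)  # estimated Cov of (beta_0_hat, beta_1_hat)
a = np.array([-17.0, 1.0])  # delta_hat = a @ beta_sim = beta_1_hat - 17 * beta_0_hat
W_stat = (a @ beta_sim) / np.sqrt(a @ cov_sim @ a)
print(abs(W_stat) > norm.ppf(0.975))  # reject H0 at the 5% level?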
Get the passenger car mileage data from
http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html
(a) Fit a simple linear regression model to predict MPG (miles per gallon) from HP (horsepower). Summarize your analysis including a plot of the data with the fitted line.
(b) Repeat the analysis but use $\log$(MPG) as the response. Compare the analyses.
As of 2022-12-28, the above link is broken, but links to all data can be found here: https://www.stat.cmu.edu/~larry/all-of-statistics/.
# getting data
!curl -s --insecure https://www.stat.cmu.edu/~larry/all-of-statistics/=data/carmileage.dat --output data/carmileage.dat
!sed -i 's/Subaru Loyale/SubaruLoyale/g' data/carmileage.dat
data = pd.read_csv(
    "data/carmileage.dat",
    skiprows=28,
    sep=r"\s+",
    names=["Name", "VOL", "HP", "MPG", "SP", "WT"],
    index_col="Name",
    usecols=["Name", "HP", "MPG"],
)
X = data["HP"].values
Y = data["MPG"].values
n = X.size
X_bar = X.mean()
Y_bar = Y.mean()
beta_1_hat = np.dot((X - X_bar), (Y - Y_bar)) / np.linalg.norm(X - X_bar) ** 2
beta_0_hat = Y_bar - beta_1_hat * X_bar
Y_hat = beta_0_hat + beta_1_hat * X
RSS = np.sum(np.power(Y - Y_hat, 2))
sigma_2_hat = (1 / (n - 2)) * RSS
sigma_hat = np.sqrt(sigma_2_hat)
s_X_2 = (1 / n) * np.sum(np.power(X - X_bar, 2))
s_X = np.sqrt(s_X_2)
se_beta_0 = sigma_hat / (s_X * np.sqrt(n)) * np.sqrt(np.sum(np.power(X, 2)) / n)
se_beta_1 = sigma_hat / (s_X * np.sqrt(n))
plt.scatter(X, Y, label="Passenger Car Mileage Data")
xx = np.arange(min(X), max(X))
yy = beta_1_hat * xx + beta_0_hat
plt.plot(
xx, yy, "--", color="orange", label=f"MPG={beta_0_hat:.3f} + {beta_1_hat:.3f} * HP"
)
plt.xlabel("HP")
plt.ylabel("MPG")
plt.grid()
plt.legend()
plt.show()
results = {
"Covariate": ["constant", "HP"],
"Coefficient": [beta_0_hat, beta_1_hat],
"Std. Error": [se_beta_0, se_beta_1],
"t value": [beta_0_hat / se_beta_0, beta_1_hat / se_beta_1],
"p value": [
2 * norm.cdf(-abs(beta_0_hat / se_beta_0)),
2 * norm.cdf(-abs(beta_1_hat / se_beta_1)),
],
}
pd.DataFrame(results).set_index("Covariate")
| Covariate | Coefficient | Std. Error | t value | p value |
|---|---|---|---|---|
| constant | 50.066 | 1.569 | 31.900 | 2.700e-223 |
| HP | -0.139 | 0.012 | -11.519 | 1.055e-30 |
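As an optional sanity check (assuming SciPy ≥ 1.6 for the `intercept_stderr` attribute), the hand-computed estimates and standard errors can be compared with `scipy.stats.linregress`:
from scipy.stats import linregress

lr = linregress(X, Y)
print(lr.intercept, lr.slope)          # compare with beta_0_hat, beta_1_hat
print(lr.intercept_stderr, lr.stderr)  # compare with se_beta_0, se_beta_1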
X = data["HP"].values
Y = np.log(data["MPG"].values)
n = X.size
X_bar = X.mean()
Y_bar = Y.mean()
beta_1_hat = np.dot((X - X_bar), (Y - Y_bar)) / np.linalg.norm(X - X_bar) ** 2
beta_0_hat = Y_bar - beta_1_hat * X_bar
Y_hat = beta_0_hat + beta_1_hat * X
RSS = np.sum(np.power(Y - Y_hat, 2))
sigma_2_hat = (1 / (n - 2)) * RSS
sigma_hat = np.sqrt(sigma_2_hat)
s_X_2 = (1 / n) * np.sum(np.power(X - X_bar, 2))
s_X = np.sqrt(s_X_2)
se_beta_0 = sigma_hat / (s_X * np.sqrt(n)) * np.sqrt(np.sum(np.power(X, 2)) / n)
se_beta_1 = sigma_hat / (s_X * np.sqrt(n))
plt.scatter(X, Y, label="Passenger Car Mileage Data")
xx = np.arange(min(X), max(X))
yy = beta_1_hat * xx + beta_0_hat
plt.plot(
xx,
yy,
"--",
color="orange",
label=f"log(MPG)={beta_0_hat:.3f} + {beta_1_hat:.3f} * HP",
)
plt.xlabel("HP")
plt.ylabel("log(MPG)")
plt.grid()
plt.legend()
plt.show()
results = {
"Covariate": ["constant", "HP"],
"Coefficient": [beta_0_hat, beta_1_hat],
"Std. Error": [se_beta_0, se_beta_1],
"t value": [beta_0_hat / se_beta_0, beta_1_hat / se_beta_1],
"p value": [
2 * norm.cdf(-abs(beta_0_hat / se_beta_0)),
2 * norm.cdf(-abs(beta_1_hat / se_beta_1)),
],
}
pd.DataFrame(results).set_index("Covariate")
| Covariate | Coefficient | Std. Error | t value | p value |
|---|---|---|---|---|
| constant | 4.013 | 4.012e-02 | 100.021 | 0.000e+00 |
| HP | -0.005 | 3.085e-04 | -14.873 | 4.926e-50 |
Get the passenger car mileage data from
http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html
(a) Fit a multiple linear regression model to predict MPG (miles per gallon) from the other variables. Summarize your analysis.
(b) Use Mallows $C_p$ to select a best sub-model. To search through the models try (i) forward stepwise, (ii) backward stepwise. Summarize your findings.
(c) Use the Zheng-Loh model selection method and compare to (b).
(d) Perform all possible regressions. Compare $C_p$ and $\text{BIC}$. Compare the results.
Solution:
(a) First performing the regression on the full model, and obtaining an estimate of $\sigma$ via:
$$\hat{\sigma}^2 = \frac{1}{n - k} \sum_{i=1}^n \left(Y_i - (X\hat{\beta})_i\right)^2.$$
data = pd.read_csv(
    "data/carmileage.dat",
    skiprows=28,
    sep=r"\s+",
    names=["Name", "VOL", "HP", "MPG", "SP", "WT"],
    index_col="Name",
)
all_features = ["VOL", "HP", "SP", "WT"]
X = np.array(data[all_features].values)
X = np.concatenate((X, np.ones((X.shape[0], 1))), axis=1)
y = np.array(data["MPG"].values)
n, k = X.shape
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
y_pred = X @ beta_hat
eps_hat = y_pred - y
sigma_2_hat = (1 / (n - k)) * np.linalg.norm(eps_hat) ** 2
V = sigma_2_hat * np.linalg.inv(X.T @ X)
plt.scatter(X[:, 1], y, label="True")
plt.scatter(X[:, 1], np.array(y_pred), label="Predicted")
plt.xlabel("HP")
plt.ylabel("MPG")
plt.grid()
plt.legend()
plt.show()
results = {}
results["Feature"] = all_features + ["intercept"]
results["Coefficient"] = beta_hat
results["std. error"] = np.sqrt(np.diag(V))
results["t value"] = beta_hat / results["std. error"]
results["p-value"] = 2 * norm.cdf(-np.abs(results["t value"]))
pd.DataFrame(results).set_index("Feature")
| Feature | Coefficient | std. error | t value | p-value |
|---|---|---|---|---|
| VOL | -0.016 | 0.023 | -0.685 | 4.931e-01 |
| HP | 0.392 | 0.081 | 4.818 | 1.453e-06 |
| SP | -1.295 | 0.245 | -5.290 | 1.224e-07 |
| WT | -1.860 | 0.213 | -8.717 | 2.866e-18 |
| intercept | 192.438 | 23.532 | 8.178 | 2.890e-16 |
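As a numerical aside (not part of the original analysis), the same coefficients can be obtained with `np.linalg.lstsq`, which avoids forming $(X^T X)^{-1}$ explicitly:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_hat_lstsq))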
(b) Implementing a class to get the Mallows $C_p$ value given a set of features, using the estimate of $\sigma^2$ from the full model. In anticipation of part (d), we also implement a method to calculate the BIC.
full_model_sigma_2_hat = sigma_2_hat
class LinearRegression:
    def __init__(self, features):
        X = np.array(data[features].values)
        self.X = np.concatenate((X, np.ones((X.shape[0], 1))), axis=1)
        self.s = self.X.shape[1]  # number of parameters, counting the intercept column
        self.beta_hat = np.linalg.inv(self.X.T @ self.X) @ self.X.T @ y
        self.eps_hat = self.X @ self.beta_hat - y
        self.rss = self.eps_hat.T @ self.eps_hat
        self.V = sigma_2_hat * np.linalg.inv(self.X.T @ self.X)
        self.se = np.sqrt(np.diag(self.V))

    @property
    def mallows_cp(self):
        return self.rss + 2 * self.s * full_model_sigma_2_hat

    @property
    def bic(self):
        return n * math.log(self.rss / n) + self.s * math.log(n)
Performing forward stepwise regression:
def get_score(features):
    model = LinearRegression(features)
    return model.mallows_cp
print("Subset:", "Score")
selected_features = []
remaining_features = all_features.copy()
best_score = get_score(selected_features)
print(f"{tuple(selected_features)}: {best_score:.3f}")
while True:
    candidate_feature = min(
        [feature for feature in remaining_features],
        key=lambda feature: get_score(selected_features + [feature]),
    )
    new_score = get_score(selected_features + [candidate_feature])
    if new_score < best_score:
        best_score = new_score
        selected_features.append(candidate_feature)
        remaining_features.remove(candidate_feature)
        print(f"{tuple(selected_features)}: {best_score:.3f}")
    else:
        break
Subset: Score
(): 8134.148
('WT',): 1519.372
('WT', 'SP'): 1463.077
('WT', 'SP', 'HP'): 1140.391
Performing backward stepwise regression:
print("Subset:", "Score")
selected_features = all_features.copy()
remaining_features = all_features.copy()
best_score = get_score(selected_features)
print(f"{tuple(selected_features)}: {best_score:.3f}")
while True:
    candidate_feature = min(
        [feature for feature in remaining_features],
        key=lambda feature: get_score(list(set(selected_features) - set([feature]))),
    )
    new_score = get_score(list(set(selected_features) - set([candidate_feature])))
    if new_score < best_score:
        best_score = new_score
        selected_features.remove(candidate_feature)
        remaining_features.remove(candidate_feature)
        print(f"{tuple(selected_features)}: {best_score:.3f}")
    else:
        break
Subset: Score
('VOL', 'HP', 'SP', 'WT'): 1160.808
('HP', 'SP', 'WT'): 1140.391
Both forward and backward stepwise regression select HP, SP, and WT as covariates.
(c) Implementing the Zheng-Loh model selection method:
# order features by the absolute value of their Wald statistic in the full model, in decreasing order
full_model = LinearRegression(all_features)
W = {all_features[i]: full_model.beta_hat[i] / full_model.se[i] for i in range(len(all_features))}
sorted_features = sorted(all_features, key=lambda f: abs(W[f]), reverse=True)
# Find the j that minimizes
# RSS + j * sigma_2_hat * log(n),
# where the RSS is computed using the model constructed
# from the first j elements of the above list
best_score = np.inf
for j in range(len(W) + 1):
    model = LinearRegression(sorted_features[:j])
    score = model.rss + j * sigma_2_hat * math.log(n)
    print(sorted_features[:j], score)
    if score < best_score:
        best_score, j_hat = (score, j)
zheng_loh_features = sorted_features[:j_hat]
print("Features selected by Zheng Loh method:", tuple(zheng_loh_features))
[] 8107.462560975609
['WT'] 1524.7987933782406
['WT', 'SP'] 1500.6151441827385
['WT', 'SP', 'HP'] 1210.0414343972777
['WT', 'SP', 'HP', 'VOL'] 1262.5701385588484
Features selected by Zheng Loh method: ('WT', 'SP', 'HP')
The Zheng-Loh model selection method chooses the same covariates as the forward and backward stepwise regression methods.
(d) Performing all possible regressions:
def get_bic(features):
    model = LinearRegression(features)
    return model.bic
from itertools import combinations
all_features = ["VOL", "HP", "SP", "WT"]
subsets = sum(
[
list(map(list, combinations(all_features, i)))
for i in range(len(all_features) + 1)
],
[],
)
results = []
for subset in subsets:
    row = {feature: "\u2713" if feature in subset else "" for feature in all_features}
    row["Mallows $C_p$"] = get_score(subset)
    row["BIC"] = get_bic(subset)
    results.append(row)
df = pd.DataFrame(results)
df.sort_values("Mallows $C_p$", ignore_index=True)
VOL | HP | SP | WT | Mallows $C_p$ | BIC | |
---|---|---|---|---|---|---|
0 | ✓ | ✓ | ✓ | 1140.391 | 225.426 | |
1 | ✓ | ✓ | ✓ | ✓ | 1160.808 | 229.334 |
2 | ✓ | ✓ | ✓ | 1443.795 | 246.530 | |
3 | ✓ | ✓ | 1463.077 | 244.895 | ||
4 | ✓ | ✓ | ✓ | 1507.484 | 250.346 | |
5 | ✓ | ✓ | 1510.678 | 247.670 | ||
6 | ✓ | 1519.372 | 245.267 | |||
7 | ✓ | ✓ | 1542.175 | 249.456 | ||
8 | ✓ | ✓ | ✓ | 2147.887 | 281.220 | |
9 | ✓ | ✓ | 2354.823 | 285.699 | ||
10 | ✓ | ✓ | 2436.579 | 288.594 | ||
11 | ✓ | ✓ | 3056.599 | 307.748 | ||
12 | ✓ | 3102.806 | 305.325 | |||
13 | ✓ | 4318.235 | 332.832 | |||
14 | ✓ | 7059.223 | 373.532 | |||
15 | 8134.148 | 381.100 |
There is good (but not perfect) agreement in the rankings of Mallows $C_p$ and BIC. Both metrics identify the same best set of features as the other approaches.
Assume a linear regression model with Normal errors. Take $\sigma$ known. Show that the model with highest AIC (equation (13.27)) is the model with the lowest Mallows $C_p$ statistic.
Solution:
The Akaike Information Criterion (AIC) is:
$\mathcal{l}_S - |S|$
where $\mathcal{l}_S$ is the log-likelihood of the model evaluated at the MLE, and $S$ is a subset of the covariates.
The likelihood of a linear regression model with normal errors is
$$\mathcal{L}_S = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[-\frac{1}{2}\left( \frac{\hat{Y}_i(S) - Y_i}{\sigma}\right)^2\right]$$where $\hat{Y}_i(S)$ is the predicted value of observation $i$ by the model, and $Y_i$ the actual value. The log-likelihood is then
$$\mathcal{l}_S = n \log\left[\frac{1}{\sqrt{2 \pi \sigma^2}}\right] - \frac{1}{2\sigma^2}\sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2.$$The AIC is then
$$-\frac{1}{2\sigma^2}\sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2 - |S| + C$$where $C$ is a constant that doesn't depend on the choice of model. The maximizer of the above will minimize:
$$\sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2 + 2 |S| \sigma^2$$(multiplying through by the positive constant $2\sigma^2$), which is the Mallows $C_p$ statistic. Hence the model with the highest AIC is the model with the lowest $C_p$.
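A small simulation sketch (entirely illustrative; `fit_rss` and the simulated data are made up for the check) confirming that, with $\sigma$ known, the subset maximizing the AIC is the subset minimizing Mallows $C_p$:
from itertools import combinations

# Simulate a small regression problem with known sigma and three candidate covariates
rng = np.random.default_rng(1)
n_sim, sigma_true = 200, 1.5
X_sim = rng.normal(size=(n_sim, 3))
y_sim = 2.0 + 1.0 * X_sim[:, 0] - 0.5 * X_sim[:, 1] + rng.normal(0, sigma_true, n_sim)

def fit_rss(cols):
    # least squares fit on the selected columns plus an intercept; returns the RSS
    A = np.column_stack([X_sim[:, cols], np.ones(n_sim)])
    resid = y_sim - A @ np.linalg.lstsq(A, y_sim, rcond=None)[0]
    return resid @ resid

subsets_sim = [list(c) for r in range(4) for c in combinations(range(3), r)]
aic_sim = {tuple(s): -fit_rss(s) / (2 * sigma_true**2) - (len(s) + 1) for s in subsets_sim}
cp_sim = {tuple(s): fit_rss(s) + 2 * (len(s) + 1) * sigma_true**2 for s in subsets_sim}
print(max(aic_sim, key=aic_sim.get), min(cp_sim, key=cp_sim.get))  # the same subset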
In this question we will take a closer look at the AIC method. Let $X_1, \dots, X_n$ be IID observations. Consider two models $\mathcal{M}_0$ and $\mathcal{M}_1$. Under $\mathcal{M}_0$ the data are assumed to be $N(0,1)$ while under $\mathcal{M}_1$ the data are assumed to be $N(\theta, 1)$ for some unknown $\theta \in \mathbb{R}$:
$$\mathcal{M}_0 : X_1, \dots, X_n \sim N(0, 1)$$$$\mathcal{M}_1 : X_1, \dots, X_n \sim N(\theta, 1), \theta \in \mathbb{R}.$$This is just another way to view the hypothesis testing problem: $H_0 : \theta = 0$ versus $H_1 : \theta \ne 0$. Let $\mathcal{l}_n(\theta)$ be the log-likelihood function. The AIC score for a model is the log-likelihood at the MLE minus the number of parameters. (Some people multiply this score by 2 but that is irrelevant.) Thus, the AIC score for $\mathcal{M}_0$ is $AIC_0 = \mathcal{l}_n(0)$ and the AIC score for $\mathcal{M}_1$ is $AIC_1 = \mathcal{l}_n(\hat{\theta}) - 1$. Suppose we choose the model with the highest AIC score. Let $J_n$ denote the selected model:
$$ J_n = \begin{cases} 0 & \text{if } AIC_0 > AIC_1 \\ 1 & \text{if } AIC_1 > AIC_0. \end{cases} $$(a) Suppose that $\mathcal{M}_0$ is the true model, i.e., $\theta = 0$. Find
$$\lim_{n\to\infty} \mathbb{P}(J_n = 0).$$Now compute $\lim_{n\to\infty} \mathbb{P}(J_n = 0)$ when $\theta \ne 0$.
Solution:
Recall the likelihood for the normal distribution:
\begin{align*} \mathcal{L}_n(\mu, \sigma) &= \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2} \left( \frac{X_i - \mu}{\sigma} \right)^2 \right], \end{align*}whence we get the log-likelihood:
\begin{align*} \mathcal{l}_n(\mu, \sigma) &= - \frac{n}{2} \log \left(2 \pi\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2. \end{align*}So,
\begin{align*} \mathcal{l}_n(0) &= - \frac{n}{2} \log \left(2 \pi\right) - \frac{1}{2} \sum_{i=1}^n X_i^2, \end{align*}and
\begin{align*} \mathcal{l}_n(\hat{\theta}) &= \mathcal{l}_n(\bar{X}) \\ &= - \frac{n}{2} \log \left(2 \pi\right) - \frac{1}{2} \sum_{i=1}^n (X_i - \bar{X})^2. \end{align*}Therefore,
\begin{align*} \mathbb{P}(J_n = 0) &= \mathbb{P}(AIC_0 > AIC_1) \\ &= \mathbb{P}(\mathcal{l}_n(0) > \mathcal{l}_n(\hat{\theta}) - 1) \\ &= \mathbb{P}(\mathcal{l}_n(\hat{\theta}) - \mathcal{l}_n(0) < 1) \\ &= \mathbb{P} \left( \sum_{i=1}^n X_i^2 - \sum_{i=1}^n ( X_i^2 - 2 \bar{X}X_i + \bar{X}^2) < 2 \right) \\ &= \mathbb{P} \left( 2n\bar{X}^2 - n\bar{X}^2 < 2 \right) \\ &= \mathbb{P} \left( -\sqrt{\frac{2}{n}} < \bar{X} < \sqrt{\frac{2}{n}} \right). \end{align*}Since $\left\{X_i\right\}_{i=1}^n$ are IID and normally distributed with variance 1, $\bar{X}$ is also normally distributed, with mean 0 and variance $\frac{1}{n}$ (Theorem 3.17). Hence:
\begin{align*} \mathbb{P}(J_n = 0) &= \mathbb{P} \left( -\sqrt{n}\sqrt{\frac{2}{n}} < \frac{\bar{X} - 0}{\sqrt{1 / n}} < \sqrt{n}\sqrt{\frac{2}{n}} \right) \\ &= \mathbb{P} \left( -\sqrt{2} < Z < \sqrt{2} \right) \\ &= \Phi(\sqrt{2}) - \Phi(-\sqrt{2}) \approx 0.843. \end{align*}The result holds for all $n \in \mathbb{Z}^+$, and also in the limit.
If $\theta \ne 0$, then the mean of $\bar{X}$ is $\theta$, and
\begin{align*} \mathbb{P}(J_n = 0) &= \mathbb{P} \left( -\sqrt{n}\left(\sqrt{\frac{2}{n}} + \theta\right) < \frac{\bar{X} - \theta}{\sqrt{1 / n}} < \sqrt{n}\left(\sqrt{\frac{2}{n}}-\theta\right) \right) \\ &= \mathbb{P} \left( -\sqrt{2} - \sqrt{n}\theta < Z < \sqrt{2} - \sqrt{n}\theta \right) \\ &= \Phi(\sqrt{2} - \sqrt{n}\theta) - \Phi(-\sqrt{2} - \sqrt{n}\theta). \end{align*}Since $$ \lim_{n\to\infty} \Phi(\sqrt{2} - \sqrt{n}\theta) = \lim_{n\to\infty} \Phi(-\sqrt{2} - \sqrt{n}\theta) = \begin{cases} 0 \text{ if } \theta > 0 \\ 1 \text{ if } \theta < 0, \end{cases} $$
$\lim_{n\to\infty} \mathbb{P}(J_n = 0) = 0$.
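A quick Monte Carlo check (a sketch; the sample size and replication count are arbitrary) of the value $\approx 0.843$ when $\theta = 0$:
# Estimate P(J_n = 0) under theta = 0; J_n = 0 exactly when n * X_bar^2 < 2
rng = np.random.default_rng(0)
n_obs, n_sims = 100, 100_000
X_bar_sim = rng.normal(0.0, 1.0, size=(n_sims, n_obs)).mean(axis=1)
print(np.mean(n_obs * X_bar_sim**2 < 2))  # close to 2 * norm.cdf(np.sqrt(2)) - 1 ~ 0.843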
(b) The fact that $\lim_{n\to\infty} \mathbb{P}(J_n = 0) \ne 1$ when $\theta = 0$ is why some people say that AIC "overfits." But this is not quite true as we shall now see. Let $\phi_{\theta}(x)$ denote a normal density function with mean $\theta$ and variance 1. Define
$$ \hat{f}_n(x) = \begin{cases} \phi_0(x) &\text{ if } J_n = 0 \\ \phi_{\hat{\theta}}(x) &\text{ if } J_n = 1. \end{cases} $$If $\theta = 0$, show that $D(\phi_0, \hat{f}_n) \xrightarrow{P} 0$ as $n \to \infty$ where
$$D(f, g) = \int f(x) \log \left( \frac{f(x)}{g(x)} \right) \, dx$$is the Kullback-Leibler distance. Show also that $D(\phi_{\theta}, \hat{f}_n) \to 0$ if $\theta \ne 0$. Hence, AIC consistently estimates the true density even if it "overshoots" the correct model.
Solution:
We have
\begin{align*} D(\phi_{\theta}, \hat{f}_n) &= \int \phi_{\theta}(x) \log \left( \frac{\phi_{\theta}(x)}{\hat{f}_n(x)} \right) \, dx \\ &= \int \phi_{\theta}(x) \left( \log(\phi_{\theta}(x)) - \log(\hat{f}_n(x)) \right) \, dx \\ &= \int \phi_0(z) \left(\log(\phi_0(z)) - \log(\hat{f}_n(z + \theta)) \right) \, dz \tag{$z = x - \theta$}\\ &= \mathbb{E}\left[\log(\phi_{0}(Z)) - \log(\hat{f}_n(Z + \theta))\right]. \tag{$Z \sim N(0,1)$} \\ \end{align*}Let $\mu =\hat{\theta}\mathbb{1}_{J_n = 1}$. Note that:
\begin{align*} \log(\phi_0(x)) - \log(\hat{f}_n(x + \theta)) &= \left(\log\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{x^2}{2}\right) - \left(\log\left(\frac{1}{\sqrt{2\pi}}\right) - \frac{(x + \theta - \mu)^2}{2} \right) \\ &= - \frac{1}{2} \left( x^2 - (x + (\theta - \mu))^2 \right) \\ &= \frac12 \left((\mu - \theta)^2 - 2(\mu - \theta)x\right). \end{align*}Therefore,
\begin{align*} D(\phi_\theta, \hat{f}_n) &= \frac12 \mathbb{E}\left[(\mu - \theta)^2 - 2(\mu - \theta) Z\right] \\ &= \frac{(\mu - \theta)^2}{2}, \end{align*}where the expectation is over $Z$ only ($\mu$ depends on the sample $X_1, \dots, X_n$, which is independent of $Z$, and $\mathbb{E}[Z] = 0$). Suppose $\theta = 0$. Then, $D(\phi_{0}, \hat{f}_n) = \frac{\mu^2}{2}$. Let $\epsilon > 0$. We have:
\begin{align*} \mathbb{P}\left(\frac{\mu^2}{2} > \epsilon\right) &= \mathbb{P} \left((\hat{\theta}\mathbb{1}_{J_n = 1})^2 > 2\epsilon \right) \\ &\le \mathbb{P} \left( |\hat{\theta}| > \sqrt{2\epsilon} \right) \\ &= \mathbb{P} \left( \left|\frac{\hat{\theta}}{1 / \sqrt{n}}\right| > \sqrt{2n\epsilon} \right) \\ &= \mathbb{P} \left( |Z| > \sqrt{2n\epsilon}\right) \tag{Since $\theta = 0$, $\hat{\theta} \sim N(0, 1 / n)$}\\ &= 2 \Phi \left( - \sqrt{2n\epsilon} \right) \to 0. \end{align*}Thus, $D(\phi_0, \hat{f}_n) \xrightarrow{P} 0$.
Now suppose $\theta \ne 0$. Then, $D(\phi_{\theta}, \hat{f}_n) = (\mu - \theta)^2 / 2$. By the Weak Law of Large Numbers, $\hat{\theta} = \bar{X} \xrightarrow{P} \theta$. By the work in part (a), $\mathbb{1}_{J_n = 1} \xrightarrow{P} 1$. Thus, by Theorem 5.5d, $\mu = \hat{\theta}\mathbb{1}_{J_n = 1} \xrightarrow{P} \theta$. Finally, by Theorems 5.5a and 5.5f, $(\mu - \theta)^2 / 2 \xrightarrow{P} 0$.
(c) Repeat this analysis for the BIC which is the log-likelihood minus $(p/2) \log n$ where $p$ is the number of parameters and $n$ is sample size.
In this case, $\text{BIC}_0 = \mathcal{l}_n(0)$ and $\text{BIC}_1 = \mathcal{l}_n(\hat{\theta}) - \frac{\log n}{2}$. The calculation of $\mathbb{P}(J_n = 0)$ follows the same steps, and, if $\theta = 0$, we get:
$$\mathbb{P}(J_n = 0) = \Phi(\sqrt{\log n}) - \Phi(-\sqrt{\log n}) \rightarrow 1,$$and if $\theta \ne 0$, we get
$$\mathbb{P}(J_n = 0) = \Phi(\sqrt{\log n} - \sqrt{n}\theta) - \Phi(-\sqrt{\log n} - \sqrt{n}\theta) \rightarrow 0.$$The Kullback-Leibler divergence calculations are exactly as before.
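Evaluating the $\theta = 0$ expression numerically (a quick check using `scipy.stats.norm`) shows the BIC's probability of selecting the true model tending to 1:
for n_obs in [10, 100, 1_000, 10_000, 100_000]:
    prob = norm.cdf(np.sqrt(np.log(n_obs))) - norm.cdf(-np.sqrt(np.log(n_obs)))
    print(f"n = {n_obs:>6}: P(J_n = 0) = {prob:.4f}")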
In this question we take a closer look at prediction intervals. Let $\theta = \beta_0 + \beta_1 X_*$ and let $\hat{\theta} = \hat{\beta}_0 + \hat{\beta}_1 X_*$. Thus, $\hat{Y}_* = \hat{\theta}$ while $Y_* = \theta + \epsilon$. Now, $\hat{\theta} \approx N(\theta, \text{se}^2)$ where
$$\text{se}^2 = V(\hat{\theta}) = V(\hat{\beta}_0 + \hat{\beta}_1 x_*).$$Note that $V(\hat{\theta})$ is the same as $V(\hat{Y}_*)$. Now, $\hat{\theta} \pm 2 \sqrt{V(\hat{\theta})}$ is an approximate 95 percent confidence interval for $\theta = \beta_0 + \beta_1 x_*$ using the usual argument for a confidence interval. But, as you shall now show, it is not a valid confidence interval for $Y_*$.
(a) Let $s = \sqrt{V(\hat{Y}_*)}$. Show that
\begin{align*} \mathbb{P}(\hat{Y}_* - 2s < Y_* < \hat{Y}_* + 2s) &\approx \mathbb{P} \left(-2 < N \left(0, 1 + \frac{\sigma^2}{s^2}\right) < 2 \right) \\ &\ne 0.95 \end{align*}Solution:
Observe \begin{align*} \mathbb{P}(\hat{Y}_* - 2s < Y_* < \hat{Y}_* + 2s) &= \mathbb{P}\left(-2 < \frac{Y_* - \hat{Y}_*}{s} < 2\right) \end{align*} and that \begin{align*} \frac{Y_* - \hat{Y}_*}{s} &= \frac{\theta - \hat{\theta}}{s} + \frac{\epsilon}{s}. \end{align*}
Note that $\hat{\theta}$ and $\epsilon$ are independent, since $\hat{\theta}$ depends only on the training data. Thus, the above quantity is approximately distributed as $N \left(0, 1 + \frac{\sigma^2}{s^2}\right)$, so the coverage probability is approximately $\mathbb{P} \left(-2 < N \left(0, 1 + \frac{\sigma^2}{s^2}\right) < 2 \right) \ne 0.95$.
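A simulation sketch of this undercoverage (all names and parameter values here are illustrative):
# Coverage of the naive interval Y_hat_* +/- 2 s for a future Y_*, where s^2 = V_hat(Y_hat_*)
rng = np.random.default_rng(0)
n_sim, sigma_sim, x_star = 50, 1.0, 5.0
x_tr = rng.uniform(0, 10, n_sim)
design = np.column_stack([np.ones(n_sim), x_tr])
a = np.array([1.0, x_star])
n_rep, cover = 20_000, 0
for _ in range(n_rep):
    y_tr = 1.0 + 2.0 * x_tr + rng.normal(0, sigma_sim, n_sim)
    beta_sim = np.linalg.lstsq(design, y_tr, rcond=None)[0]
    resid = y_tr - design @ beta_sim
    sigma2_hat_sim = resid @ resid / (n_sim - 2)
    s2 = sigma2_hat_sim * (a @ np.linalg.inv(design.T @ design) @ a)  # V_hat(Y_hat_*)
    y_star = 1.0 + 2.0 * x_star + rng.normal(0, sigma_sim)  # a future observation at x_*
    cover += abs(y_star - a @ beta_sim) < 2 * np.sqrt(s2)
print(cover / n_rep)  # far below 0.95; adding sigma2_hat_sim to s2 restores ~0.95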
(b) The problem is that the quantity of interest $Y_*$ is equal to a parameter $\theta$ plus a random variable. We can fix this by defining
$$\xi_n^2 = V(\hat{Y}_*) + \sigma^2 = \left[ \frac{\sum_i (x_i - x_*)^2}{n\sum_i (x_i - \bar{x})^2} + 1 \right]\sigma^2.$$In practice, we substitute $\hat{\sigma}$ for $\sigma$ and we denote the resulting quantity by $\hat{\xi}_n$. Now consider the interval $\hat{Y}_* \pm 2 \hat{\xi}_n$. Show that
$$\mathbb{P}(\hat{Y}_* - 2 \hat{\xi}_n < Y_* < \hat{Y}_* + 2 \hat{\xi}_n) \approx \mathbb{P}(-2 < N(0,1) < 2) \approx 0.95.$$Solution:
Observe:
\begin{align*} \mathbb{P}(\hat{Y}_* - 2 \hat{\xi}_n < Y_* < \hat{Y}_* + 2 \hat{\xi}_n) &= \mathbb{P}\left( -2 < \frac{Y_* - \hat{Y}_*}{\hat{\xi}_n} < 2\right) \end{align*}and that
\begin{align*} \frac{Y_* - \hat{Y}_*}{\hat{\xi}_n} &= \frac{\theta - \hat{\theta} + \epsilon}{\sqrt{s^2 + \sigma^2}} \\ &= \frac{\theta - \hat{\theta}}{s} \frac{s}{\sqrt{s^2 + \sigma^2}} + \frac{\epsilon}{\sigma} \frac{\sigma}{\sqrt{s^2 + \sigma^2}} \\ &\leadsto N \left(0, \frac{s^2}{s^2 + \sigma^2}\right) + N\left(0, \frac{\sigma^2}{s^2 + \sigma^2} \right) = N(0, 1). \end{align*}Get the Coronary Risk-Factor Study (CORIS) data from the book web site. Use backward stepwise logistic regression based on AIC to select a model. Summarize your results.
Solution: Downloading the data:
!curl -s --insecure https://www.stat.cmu.edu/~larry/all-of-statistics/=data/coris.dat --output data/coris.dat
data = pd.read_csv(
"data/coris.dat",
skiprows=3,
names="sbp tobacco ldl adiposity famhist typea obesity alcohol age chd".split(),
)
data.head()
| | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 160 | 12.00 | 5.73 | 23.11 | 1 | 49 | 25.30 | 97.20 | 52 | 1 |
| 2 | 144 | 0.01 | 4.41 | 28.61 | 0 | 55 | 28.87 | 2.06 | 63 | 1 |
| 3 | 118 | 0.08 | 3.48 | 32.28 | 1 | 52 | 29.14 | 3.81 | 46 | 0 |
| 4 | 170 | 7.50 | 6.41 | 38.03 | 1 | 51 | 31.99 | 24.26 | 58 | 1 |
| 5 | 134 | 13.60 | 3.50 | 27.78 | 1 | 60 | 25.99 | 57.34 | 49 | 1 |
all_features = data.columns.drop("chd").tolist()
X = data[all_features].values
X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
y = data["chd"].values
Implementing the reweighted least squares algorithm:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def logit(x):
    return np.log(x / (1 - x))
beta_hat = {}
p = {}
s = 0
tol = 1e-9
beta_hat[0] = np.zeros(X.shape[1])
while True:
    p[s] = sigmoid(X @ beta_hat[s])
    # working response and weights for the current iterate
    Z = logit(p[s]) + (y - p[s]) / (p[s] * (1 - p[s]))
    W = np.diag(p[s] * (1 - p[s]))
    s += 1
    beta_hat[s] = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ Z
    if np.linalg.norm(beta_hat[s] - beta_hat[s - 1]) < tol:
        break
beta_hat_final = beta_hat[s]
We get the coefficients:
dict(zip(all_features + ['intercept'], beta_hat_final))
{'sbp': 0.006504017125714047, 'tobacco': 0.07937644573028917, 'ldl': 0.17392389811148842, 'adiposity': 0.018586568160066826, 'famhist': 0.9253704193666236, 'typea': 0.03959502497737503, 'obesity': -0.06290986927786987, 'alcohol': 0.00012166240142633388, 'age': 0.04522534963462028, 'intercept': -6.150720864983774}
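As an optional cross-check (assuming scikit-learn ≥ 1.2, where `penalty=None` disables regularization; older versions use `penalty="none"`), the coefficients can be compared against `sklearn.linear_model.LogisticRegression`:
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

sk_model = SkLogisticRegression(penalty=None, max_iter=10_000)
sk_model.fit(data[all_features].values, y)
print(dict(zip(all_features, sk_model.coef_[0])), sk_model.intercept_[0])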
Implementing a class to get the AIC for various models:
class LogisticRegression:
    def __init__(self, features):
        X = data[features].values
        self.X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
        y = data["chd"].values
        beta_hat = {}
        p = {}
        s = 0
        tol = 1e-9
        beta_hat[0] = np.zeros(self.X.shape[1])
        while True:
            p[s] = sigmoid(self.X @ beta_hat[s])
            Z = logit(p[s]) + (y - p[s]) / (p[s] * (1 - p[s]))
            W = np.diag(p[s] * (1 - p[s]))
            s += 1
            beta_hat[s] = np.linalg.inv(self.X.T @ W @ self.X) @ self.X.T @ W @ Z
            if np.linalg.norm(beta_hat[s] - beta_hat[s - 1]) < tol:
                break
        self.beta_hat_final = beta_hat[s]
        self.y_pred = sigmoid(self.X @ self.beta_hat_final)

    @property
    def aic(self):
        log_likelihood = np.sum(y * np.log(self.y_pred) + (1 - y) * np.log(1 - self.y_pred))
        return log_likelihood - self.X.shape[0]
def get_aic(features):
    model = LogisticRegression(features)
    return model.aic
Performing backward stepwise regression with the AIC criterion, we find the best model is the full model:
print("Subset:", "Score")
selected_features = all_features.copy()
remaining_features = all_features.copy()
best_score = get_aic(selected_features)
print(f"{tuple(selected_features)}: {best_score:.3f}")
while True:
    candidate_feature = min(
        [feature for feature in remaining_features],
        key=lambda feature: get_aic(list(set(selected_features) - set([feature]))),
    )
    new_score = get_aic(list(set(selected_features) - set([candidate_feature])))
    if new_score > best_score:
        best_score = new_score
        selected_features.remove(candidate_feature)
        remaining_features.remove(candidate_feature)
        print(f"{tuple(selected_features)}: {best_score:.3f}")
    else:
        break
Subset: Score
('sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea', 'obesity', 'alcohol', 'age'): -698.070
Indeed, enumerating all subsets, the full model has the largest AIC:
subsets = sum(
[
list(map(list, combinations(all_features, i)))
for i in range(len(all_features) + 1)
],
[],
)
results = []
for subset in subsets:
    row = {feature: "\u2713" if feature in subset else "" for feature in all_features}
    row["AIC"] = get_aic(subset)
    results.append(row)
df = pd.DataFrame(results)
df
| | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | AIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | | | | | -760.054 |
| 1 | ✓ | | | | | | | | | -751.661 |
| 2 | | ✓ | | | | | | | | -739.324 |
| 3 | | | ✓ | | | | | | | -744.139 |
| 4 | | | | ✓ | | | | | | -744.526 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 507 | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | -698.273 |
| 508 | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -702.535 |
| 509 | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -702.837 |
| 510 | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -698.719 |
| 511 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -698.070 |
512 rows × 10 columns