import warnings
import matplotlib.pyplot as plt
import numpy as np
from numpy import log
from numpy.linalg import det, eigh, inv
import pandas as pd
from scipy.optimize._linesearch import LineSearchWarning
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.discriminant_analysis import (
LinearDiscriminantAnalysis as sk_LDA,
QuadraticDiscriminantAnalysis as sk_QDA,
)
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression as sk_LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
pd.set_option("display.precision", 3)
plt.style.use("default.mplstyle")
Prove Theorem 22.5:
The Bayes rule is optimal, that is, if $h$ is any other classification rule then $L(h^*) \le L(h)$.
Solution:
We may express the conditional misclassification probability of a classifier in terms of the regression function:
$$ \begin{align*} P(h(x) \ne Y \mid X = x) &= P(h(x) = 1, Y = 0 \mid X = x) + P(h(x) = 0, Y = 1 \mid X = x) \\ &= \mathbb{1}_{h(x) = 1} P(Y = 0 \mid X = x) + \mathbb{1}_{h(x) = 0} P(Y = 1 \mid X = x) \\ &= (1 - \mathbb{1}_{h(x) = 0}) (1 - r(x)) + \mathbb{1}_{h(x) = 0}r(x) \\ &= 1 - \mathbb{1}_{h(x) = 0} + (2 \cdot \mathbb{1}_{h(x) = 0} - 1)r(x). \end{align*} $$Thus, the difference in conditional misclassification probabilities between a generic classifier $h$ and the Bayes classifier $h^*$ can be expressed as:
$$ \begin{align*} P(h(x) \ne Y \mid X = x) - P(h^*(X) \ne Y \mid X = x) &= (1 - 2r(x))( \mathbb{1}_{h^*(x) = 0} - \mathbb{1}_{h(x) = 0}). \end{align*} $$Since $h^*(x) = 1 \iff r(x) > 1/2$, the RHS is nonnegative. To see this, consider the cases:
- If $r(x) > 1/2$, then $h^*(x) = 1$, so $\mathbb{1}_{h^*(x) = 0} = 0$ and $1 - 2r(x) < 0$; the RHS equals $(2r(x) - 1)\mathbb{1}_{h(x) = 0} \ge 0$.
- If $r(x) \le 1/2$, then $h^*(x) = 0$, so $\mathbb{1}_{h^*(x) = 0} = 1$ and $1 - 2r(x) \ge 0$; the RHS equals $(1 - 2r(x))(1 - \mathbb{1}_{h(x) = 0}) \ge 0$.
Therefore,
$$ \begin{align*} L(h) - L(h^*) &= P(h(X) \ne Y) - P(h^*(X) \ne Y) \\ &= \int P(h(x) \ne Y \mid X = x) - P(h^*(X) \ne Y \mid X = x) \, dP(x) \\ &\ge 0 \tag{integrand is nonnegative} \end{align*} $$Prove Theorem 22.7:
If $X \mid Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X \mid Y = 1 \sim N(\mu_1, \Sigma_1)$, then the Bayes rule is
$$ h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2 \log\left( \frac{\pi_1}{\pi_0} \right) + \log \left( \frac{|\Sigma_0|}{|\Sigma_1|}\right) \\ 0 & \text{otherwise} \end{cases} $$where
$$r_i^2 = (x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i), \,\,i = 0, 1$$is the Mahalanobis distance. An equivalent way of expressing the Bayes rule is
$$h^*(x) = \argmax_k \delta_k(x)$$where
$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$$and $|A|$ denotes the determinant of a matrix $A$.
Solution:
From Theorem 22.6, the Bayes rule is
\begin{align*} h^*(x) &= \argmax_k P(Y = k \mid X = x) \\ &= \argmax_k \pi_k f_k(x) \end{align*}where
$$P(Y = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_r f_r(x) \pi_r},$$$\pi_r = P(Y = r)$ and $f_r(x) = f(x \mid Y = r)$.
In our case,
\begin{align*} \pi_k f_k(x) &= \frac{\pi_k}{(2 \pi)^{d/2} |\Sigma_k|^{1/2}}\exp \left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\} \\ &= \frac{\pi_k}{(2 \pi)^{d/2} |\Sigma_k|^{1/2}}\exp \left\{ -\frac{1}{2} r_k^2 \right\} \\ \end{align*}Note that the maximizer (over $k$) of $\pi_k f_k(x)$ also maximizes $\delta_k(x) \equiv \log \pi_k f_k(x) + (d/2) \log(2 \pi)$, which is:
\begin{align*} \delta_k(x) &= \log(\pi_k) - (1/2)\log(|\Sigma_k|) - (1/2)r^2_k \end{align*}If $h^*(x) = 1$, it must be the case that $\argmax \delta_k(x) = 1$, and thus
\begin{align*} \delta_0(x) &\le \delta_1(x) \\ \Rightarrow \log(\pi_0) - (1/2)\log(|\Sigma_0|) - (1/2)r^2_0 &\le \log(\pi_1) - (1/2)\log(|\Sigma_1|) - (1/2)r^2_1 \\ \Rightarrow r_1^2 \le r_0^2 + 2 \log\left( \frac{\pi_1}{\pi_0} \right) + \log \left( \frac{|\Sigma_0|}{|\Sigma_1|}\right). \end{align*}Download the spam data from:
http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html
The data file can also be found on the course web page. The data contain 57 covariates relating to email messages. Each email message was classified as spam ($Y = 1$) or not spam ($Y = 0$). The outcome $Y$ is the last column in the file. The goal is to predict whether an email is spam or not.
(a) Construct classification rules using (i) LDA, (ii) QDA, (iii) logistic regression, and (iv) a classification tree. For each, report the observed misclassification error rate and construct a 2-by-2 table of the form
|  | $\hat{h}(x) = 0$ | $\hat{h}(x) = 1$ |
|---|---|---|
| $Y = 0$ | ?? | ?? |
| $Y = 1$ | ?? | ?? |
(b) Use 5-fold cross validation to estimate the prediction accuracy of LDA and logistic regression.
(c) Sometimes it helps to reduce the number of covariates. One strategy is to compare $X_i$ for the spam and email group. For each of the 57 covariates, test whether the mean of the covariate is the same or different between the two groups. Keep the 10 covariates with the smallest $p$-values. Try LDA and logistic regression using only these 10 variables.
! curl -s --insecure https://www.stat.cmu.edu/~larry/all-of-statistics/=data/spam.dat --output data/spam.dat
def get_spam_data():
    spam_df = pd.read_csv("data/spam.dat", sep=r"\s+", header=None)
X = spam_df[[x for x in range(57)]].values # 4601x57
y = spam_df[57].values # 4601x1
return X, y
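A quick sanity check of the loaded data (not in the original notebook). The counts noted in the comments are what the standard spambase data set contains and are consistent with the confusion-matrix row sums reported below; treat them as assumptions if the local copy differs.
X_check, y_check = get_spam_data()
print(X_check.shape, y_check.shape)        # expected: (4601, 57) (4601,)
print(pd.Series(y_check).value_counts())   # expected: 2788 zeros (not spam), 1813 ones (spam)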
def display_results(y_pred, y_true):
print(
pd.DataFrame(
            confusion_matrix(y_true, y_pred),  # rows: true class, columns: predicted class
index=["True Negative", "True Positive"],
columns=["Predicted Negative", "Predicted Positive"],
)
)
print(
f"\nMisclassification rate: {100 *(1 - accuracy_score(y_true, y_pred)):.2f}%"
)
class DiscriminantAnalysis:
def __init__(self, kind="linear"):
self.type = kind
def fit(self, X, y):
N = len(X)
class_indices = {k: np.flatnonzero(y == k) for k in [0, 1]}
self.X_ = {k: X[class_indices[k]] for k in [0, 1]}
self.n = {k: len(class_indices[k]) for k in [0, 1]}
self.pi = {k: self.n[k] / N for k in [0, 1]}
self.mu = {k: np.mean(self.X_[k], axis=0) for k in [0, 1]}
self.S = {k: np.cov(self.X_[k], rowvar=False) for k in [0, 1]}
self.U = {}
self.D = {}
for k in [0, 1]:
d, self.U[k] = eigh(self.S[k])
self.D[k] = np.diag(d)
if self.type == "linear":
self.S_common = (self.n[0] * self.S[0] + self.n[1] * self.S[1]) / (
self.n[0] + self.n[1]
)
    def _mahalanobis_distance_squared(self, x, k):
diff = (x - self.mu[k]) @ self.U[k]
return np.sum(np.dot(diff, inv(self.D[k])) * diff, axis=1)
def _quadratic_delta(self, k, x):
result = (-1 / 2) * log(det(self.D[k]))
        result += (-1 / 2) * self._mahalanobis_distance_squared(x, k)
result += log(self.pi[k])
return result
def _linear_delta(self, k, x):
S_common_inv = inv(self.S_common)
result = np.sum(np.dot(x, S_common_inv) * self.mu[k], axis=1)
result += -(1 / 2) * self.mu[k].T @ S_common_inv @ self.mu[k]
result += log(self.pi[k])
return result
def predict(self, x):
if self.type == "linear":
return np.where(
self._linear_delta(0, x) < self._linear_delta(1, x), 1, 0
)
else:
return np.where(
self._quadratic_delta(0, x) < self._quadratic_delta(1, x), 1, 0
)
X, y = get_spam_data()
results = {}
lda = DiscriminantAnalysis()
lda.fit(X, y)
y_pred = lda.predict(X)
results["LDA"] = accuracy_score(y, y_pred)
print("LDA")
display_results(y_pred, y)
qda = DiscriminantAnalysis(kind="quadratic")
qda.fit(X, y)
y_pred = qda.predict(X)
results["QDA"] = accuracy_score(y, y_pred)
print("\n\nQDA")
display_results(y_pred, y)
LDA
               Predicted Negative  Predicted Positive
True Negative                2663                 125
True Positive                 387                1426

Misclassification rate: 11.13%


QDA
               Predicted Negative  Predicted Positive
True Negative                2101                 687
True Positive                  82                1731

Misclassification rate: 16.71%
clf = sk_LDA()
clf.fit(X, y)
y_pred = clf.predict(X)
print("\n\nLDA (scikit learn)")
display_results(y_pred, y)
clf = sk_QDA()
clf.fit(X, y)
y_pred = clf.predict(X)
print("\n\nQDA (scikit learn)")
display_results(y_pred, y)
LDA (scikit learn)
               Predicted Negative  Predicted Positive
True Negative                2663                 125
True Positive                 387                1426

Misclassification rate: 11.13%


QDA (scikit learn)
               Predicted Negative  Predicted Positive
True Negative                2101                 687
True Positive                  82                1731

Misclassification rate: 16.71%
class LogisticRegression(BaseEstimator, ClassifierMixin):
    def __init__(self, tol=1e-4):
        self.tol = tol
def fit(self, X, y):
beta_hat = {}
p = {}
s = 0
self.X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
beta_hat[0] = np.zeros(self.X.shape[1])
k = 0
        while k < 5:  # at most 5 IRLS (Newton) iterations
p[s] = self._sigmoid(self.X @ beta_hat[s])
Z = self._logit(p[s]) + (y - p[s]) / (p[s] * (1 - p[s]))
W = np.diag(p[s] * (1 - p[s]))
s += 1
beta_hat[s] = (
np.linalg.inv(self.X.T @ W @ self.X) @ self.X.T @ W @ Z
)
if np.linalg.norm(beta_hat[s] - beta_hat[s - 1]) < self.tol:
break
k = k + 1
self.beta_hat_final = beta_hat[s]
return self
    def predict(self, X):
        return np.round(self.predict_proba(X))

    def predict_proba(self, X):
        # Add an intercept column to the new data, matching the design matrix used in fit.
        X_design = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
        return self._sigmoid(X_design @ self.beta_hat_final)
    def _sigmoid(self, x):
        # The 1e-6 keeps the probabilities strictly below 1 so the IRLS weights p(1 - p) never vanish.
        return 1 / (1 + np.exp(-x) + 1e-6)
def _logit(self, x):
return np.log(x / (1 - x))
log_reg = LogisticRegression()
log_reg.fit(X, y)
y_pred = log_reg.predict(X)
print("Logistic Regression")
display_results(y_pred, y)
Logistic Regression
               Predicted Negative  Predicted Positive
True Negative                2660                 128
True Positive                 201                1612

Misclassification rate: 7.15%
log_reg = sk_LogisticRegression(
penalty="none",
max_iter=1000,
tol=1e-4,
fit_intercept=True,
solver="lbfgs",
)
log_reg.fit(X, y)
y_pred = log_reg.predict(X)
print("Logistic Regression (scikit-learn)")
display_results(y_pred, y)
Logistic Regression (scikit-learn)
               Predicted Negative  Predicted Positive
True Negative                2664                 124
True Positive                 197                1616

Misclassification rate: 6.98%
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
tree = DecisionTreeClassifier()
tree.fit(X, y)
y_pred = tree.predict(X)
print("Decision Tree")
display_results(y_pred, y)
Decision Tree
               Predicted Negative  Predicted Positive
True Negative                2788                   0
True Positive                   3                1810

Misclassification rate: 0.07%
The decision tree is overfitting. The average out-of-sample misclassification rate is much higher:
misclassification_rates = cross_val_score(
tree,
X,
y,
scoring=make_scorer(
lambda y_true, y_pred: 1 - accuracy_score(y_true, y_pred)
),
)
print(
f"Cross-validated misclassification rate: {100 * np.mean(misclassification_rates):.2f}%"
)
Cross-validated misclassification rate: 11.91%
Obtaining cross-validation results for logistic regression and LDA, we observe that logistic regression performs slightly better:
_original_showwarning = warnings.showwarning

def custom_warning_handler(
    message, category, filename, lineno, file=None, line=None
):
    # Suppress line-search and convergence warnings; forward everything else to the
    # original handler (calling warnings.showwarning here would recurse into this override).
    if not isinstance(message, (LineSearchWarning, ConvergenceWarning)):
        _original_showwarning(message, category, filename, lineno, file, line)

warnings.showwarning = custom_warning_handler
lda = sk_LDA()
log_reg = sk_LogisticRegression()
print("5-Fold CV Accuracy Metrics:\n")
for name, estimator in zip(["LDA", "Logistic Regression"], [lda, log_reg]):
score = cross_val_score(estimator, X, y, scoring="accuracy")
print(f"{name} : {np.mean(score):.3f}")
5-Fold CV Accuracy Metrics:

LDA : 0.880
Logistic Regression : 0.901
Running CV again, restricting covariates to the top 10 (as measured by significance of class mean differences):
from scipy.stats import ttest_ind
_, p_values = ttest_ind(X[np.flatnonzero(y == 0)], X[np.flatnonzero(y == 1)])
selected_cols = np.argpartition(p_values, 10)[:10]  # indices of the 10 smallest p-values
X_sub = X[:, selected_cols]
print("5-Fold CV Accuracy Metrics:\n")
for name, estimator in zip(["LDA", "Logistic Regression"], [lda, log_reg]):
score = cross_val_score(estimator, X_sub, y, scoring="accuracy")
print(f"{name} : {np.mean(score):.3f}")
5-Fold CV Accuracy Metrics:

LDA : 0.832
Logistic Regression : 0.865
In this case, dropping variables does not improve performance.
Let $\mathcal{A}$ be the set of two-dimensional spheres. That is, $A \in \mathcal{A}$ if $A = \{(x,y) : (x - a)^2 + (y - b)^2 \le c^2\}$ for some $a, b, c$. Find the VC-dimension of $\mathcal{A}$.
Solution:
The VC dimension is 3. Three points that are not collinear can be shattered by disks (each of the 8 subsets can be enclosed by a disk that excludes the rest), so the VC dimension is at least 3; a numerical check appears after this argument. To show it is at most 3, we demonstrate that for any configuration of 4 points there is a subset of three points that is impossible to pick out using a member of $\mathcal{A}$. Denote the 4 points $A$, $B$, $C$ and $D$.
Suppose first that the 4 points are not in convex position, so that one of them lies in the (possibly degenerate) triangle formed by the other three. WLOG, let the interior point be $A$. Any member of $\mathcal{A}$ containing $B$, $C$, and $D$ is convex and therefore contains the triangle $BCD$, and hence $A$. Therefore, it is impossible to pick out the set $\{B, C, D\}$.
Suppose now that the 4 points are in convex position, and suppose for contradiction that the set can be shattered. WLOG, label the points so that $A$ and $B$ are opposite vertices of the quadrilateral (its vertices in order are $A, C, B, D$) and the interior angles at $A$ and $B$ sum to at most 180°; equivalently, the interior angles at $C$ and $D$, which are exactly $\angle ACB$ and $\angle ADB$, sum to at least 180°. Observe that no disk can pick out only $A$ and $B$: if a disk contained $A$ and $B$ but excluded $C$ and $D$, then $\angle ACB + \angle ADB$ would be strictly less than 180° (points outside a circle subtend a chord at smaller angles than points on it, and inscribed angles on the two sides of a chord sum to 180°), a contradiction.
Alternative proof: Suppose again that the 4 points are in convex position and that the set can be shattered. Let $A$ and $B$ be opposite vertices of the quadrilateral, and consider two disks, one picking out $\{A, B\}$ and one picking out $\{C, D\}$. Consider the symmetric difference of the two disks: all four points lie in it, and it would have to consist of at least 4 distinct regions, which is impossible, since the boundary circles of two distinct disks meet in at most two points and the symmetric difference therefore has at most two regions. This is a contradiction.
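As promised above, a quick numerical check of the lower bound (not part of the original argument): the three vertices of an equilateral triangle can be shattered by disks. The centers and radii below are ad-hoc illustrative choices.
# Illustrative check that 3 points (an equilateral triangle) are shattered by disks.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def picked_out(center, radius):
    # Indices of the points lying inside the closed disk.
    return frozenset(
        int(i) for i in np.flatnonzero(np.linalg.norm(pts - center, axis=1) <= radius)
    )

disks = {
    frozenset(): (np.array([10.0, 10.0]), 0.1),        # empty subset: a disk far away
    frozenset({0}): (pts[0], 0.1),
    frozenset({1}): (pts[1], 0.1),
    frozenset({2}): (pts[2], 0.1),
    frozenset({0, 1}): ((pts[0] + pts[1]) / 2, 0.51),  # pairs: disk around the midpoint
    frozenset({0, 2}): ((pts[0] + pts[2]) / 2, 0.51),
    frozenset({1, 2}): ((pts[1] + pts[2]) / 2, 0.51),
    frozenset({0, 1, 2}): (pts.mean(axis=0), 0.6),     # all three: near-circumscribed disk
}
assert all(picked_out(c, r) == s for s, (c, r) in disks.items())
print("All 8 subsets of the 3 points are picked out by some disk.")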
Classify the spam data using support vector machines. Free software for the support vector machine is at http://svmlight.joachims.org/
svc = SVC()
print(
f"Cross validation accuracy score : {np.mean(cross_val_score(svc, X, y)):.3f}"
)
print(
f"Accuracy (in-sample): {accuracy_score(y, svc.fit(X,y).predict(X)):.3f}"
)
Cross validation accuracy score : 0.705
Accuracy (in-sample): 0.727
SVM doesn't perform as well as Logistic Regression or LDA. Restricting covariates doesn't help:
svc = SVC()
print(
f"Cross validation accuracy score : {np.mean(cross_val_score(svc, X_sub, y)):.3f}"
)
print(
f"Accuracy (in-sample): {accuracy_score(y, svc.fit(X_sub,y).predict(X_sub)):.3f}"
)
Cross validation accuracy score : 0.690
Accuracy (in-sample): 0.693
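The default SVC uses an RBF kernel, which is sensitive to the scale of the covariates, and the spam features are used unscaled here. A hedged sketch (results not shown) of how one could check whether standardizing the features helps, using sklearn's StandardScaler in a pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features before the RBF-kernel SVM; X, y are the full spam data.
scaled_svc = make_pipeline(StandardScaler(), SVC())
print(
    f"Cross validation accuracy score (scaled) : {np.mean(cross_val_score(scaled_svc, X, y)):.3f}"
)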
Use VC theory to get a confidence interval on the true error rate of the LDA classifier for the iris data (from the book web site).
Solution: Note that LDA is a linear classifier. Thus, by Theorem 22.26, the $1 - \alpha$ confidence interval for the true error rate is $\hat{L}(\hat{h}) \pm \epsilon$ where
$$\epsilon_n^2 = \frac{32}{n} \log \left( \frac{8(n^{d+1} + 1)}{\alpha}\right).$$In the case of the iris data, $n=150$ and $d=4$. Therefore,
$$ \begin{align*} \epsilon_n^2 &= \frac{32}{150} \log \left( \frac{8(150^{5} + 1)}{0.05}\right) \\ &\approx 6.43 \Rightarrow \epsilon_n \approx 2.54. \end{align*} $$Meanwhile, $\hat{L} \approx 0.02$. Since $\epsilon_n$ exceeds 1, the interval $\hat{L} \pm \epsilon_n$, truncated to $[0, 1]$, is all of $[0, 1]$: with so few observations the VC bound is uninformative.
print(f"epsilon = {np.sqrt((32 / 150) * np.log((8 * ((150**5) + 1)) / 0.05)):.3f}")
epsilon = 2.535
!curl -s --insecure https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data --output data/iris.data
def get_iris_data():
iris_df = pd.read_csv("data/iris.data", header=None, names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"])
y = iris_df["class"].values
X = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
return X, y
X, y = get_iris_data()
lda = sk_LDA()
lda.fit(X, y)
print(f"Error rate (in sample): {(1 - accuracy_score(y, lda.predict(X))):.3f}")
Error rate (in sample): 0.020
Suppose that $X_i \in \mathbb{R}$ and that $Y_i = 1$ whenever $|X_i| \le 1$ and $Y_i = 0$ whenever $|X_i| > 1$. Show that no linear classifier can perfectly classify these data. Show that the kernelized data $Z_i = (X_i, X_i^2)$ can be linearly separated.
Solution:
A linear classifier in one dimension is a constant threshold. Observe that there is no threshold for $X$ such that $Y = 0$ on one side and $Y = 1$ on the other side.
Now consider the kernelized data. Let $u$ and $v$ denote the variables in the two-dimensional feature space. Observe that $Z_i$ lies on the parabola $v = u^2$ for all $i$. Moreover, if $Y_i = 1$ (i.e., $|X_i| \le 1$), then $Z_i$ lies at or below $v = 1$, and if $Y_i = 0$, $Z_i$ lies above $v = 1$. Thus, the line $v = 1$ linearly separates the two classes.
u = np.random.random(size=2**6) * 4 - 2
index_lists = [np.flatnonzero(u**2 <= 1), np.flatnonzero(u**2 > 1)]
colors = ["blue", "orange"]
labels = ["$Y=1$", "$Y=0$"]  # the first index list is |u| <= 1, i.e. Y = 1
for i, index_list in enumerate(index_lists):
plt.scatter(
u[index_list],
np.power(u, 2)[index_list],
color=colors[i],
label=labels[i],
)
plt.plot([-2, 2], [1, 1], color="green", linewidth=3, label="v = 1")
plt.xlabel("$u$")
plt.ylabel("$v$")
plt.legend()
plt.show()
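As a numerical companion (not part of the original solution), a linear SVM fit on the kernelized features $Z_i = (X_i, X_i^2)$ should separate the two classes, since they are linearly separable in that space:
# Linear SVM on the kernelized features; a large C approximates a hard margin.
x_samp = np.random.random(size=2**6) * 4 - 2
y_samp = (np.abs(x_samp) <= 1).astype(int)
Z = np.column_stack([x_samp, x_samp**2])
linear_svm = SVC(kernel="linear", C=1e6)
linear_svm.fit(Z, y_samp)
print(f"Training accuracy on Z: {accuracy_score(y_samp, linear_svm.predict(Z)):.3f}")  # typically 1.0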
Repeat question 5 using the kernel $K(x, \tilde{x}) = (1 + x^T\tilde{x})^p$. Choose $p$ by cross-validation.
Solution: This is facilitated by sklearn's implementation of the Support Vector Machine, which allows specification of, among others, the polynomial kernel.
from sklearn.svm import SVC
X, y = get_spam_data()
p_values = np.arange(0, 4, 1, dtype=int)
cv_results = {}
for p in p_values:
svc = SVC(kernel="poly", degree=p)
score = np.mean(cross_val_score(svc, X, y))
cv_results[p] = score
print(f"p = {p}:\tscore = {score:.3f}")
p = 0:  score = 0.606
p = 1:  score = 0.701
p = 2:  score = 0.669
p = 3:  score = 0.649
The best choice for $p$ appears to be $1$.
Apply the $k$ nearest neighbors classifier to the "iris data". Choose $k$ by cross-validation.
X, y = get_iris_data()
k_values = np.arange(1, 20, 1, dtype=np.int32)
cv_result = {}
for k in k_values:
clf = KNeighborsClassifier(n_neighbors=k)
cv_result[k] = np.mean(cross_val_score(clf, X, y))
plt.plot(list(cv_result.keys()), list(cv_result.values()))
plt.xticks(k_values)
plt.grid()
plt.xlabel("$k$")
plt.ylabel("Accuracy (5-fold CV)")
plt.title("KNN on Iris Data")
plt.show()
Based on this I would choose $k = 9$.
(Curse of Dimensionality.) Suppose that $X$ has a uniform distribution on the $d$-dimensional cube $[-1/2, 1/2]^d$. Let $R$ be the distance from the origin to the closest neighbor. Show that the median of $R$ is
$$\left( \frac{1 - \left(\frac{1}{2}\right)^{1/n}}{v_d(1)} \right)^{1/d}$$where
$$v_d(r) = r^d \frac{\pi^{d/2}}{\Gamma ((d/2) + 1)}$$is the volume of a sphere of radius $r$. For what dimension $d$ does the median of $R$ exceed the edge of the cube when $n=100$, $n=1,000$, $n=10,000$? (Hastie et al. (2001), p. 22-27.)
Solution:
Let $Z$ be the random variable denoting the distance from the origin to the closest data point. We seek $m$ such that $P(Z \le m) = 1 / 2$. Let $X_k$ be the distance from the origin to the $k^{\text{th}}$ data point. We have:
$$ \begin{align*} 1 / 2 = P(Z \le m) &= 1 - P(Z > m)\\ &= 1 - P(X_k > m ,\ \forall k)\\ &= 1 - P(X_k > m)^n\\ &= 1 - (1 - P(X_k \le m))^n. \end{align*} $$Note that $$P(X_k \le m) = \frac{\text{Volume of the $d$-dimensional ball of radius $m$}}{\text{Volume of the $d$-dimensional cube}} = v_d(m) = m^d v_d(1),$$ and therefore $$1 / 2 = 1 - (1 - m^d v_d(1))^n \Rightarrow m = \left( \frac{1 - \left(\frac{1}{2}\right)^{1 / n}}{v_d(1)} \right) ^{1 / d}$$
I assume the edge of the cube refers to the distance to the nearest face, which is always $1/2$.
from scipy.special import gamma

def v_d(r, d):
    # Volume of the d-dimensional ball of radius r.
    return r**d * np.pi ** (d / 2) / gamma(d / 2 + 1)

def get_median(n, d):
    # Median distance from the origin to the nearest of n uniform points on the cube.
    return ((1 - (1 / 2) ** (1 / n)) / v_d(1, d)) ** (1 / d)

print("dimension required for median to exceed distance to face:")
for n in [100, 1_000, 10_000]:
    d = 1
    while get_median(n, d) <= 0.5:
        d += 1
    print(f"n : {n}, d = {d}")
dimension required for median to exceed distance to face:
n : 100, d = 9
n : 1000, d = 12
n : 10000, d = 14
Fit a tree to the data in question 3. Now apply bagging and report your results.
X, y = get_spam_data()
tree = DecisionTreeClassifier()
np.mean(cross_val_score(tree, X, y))
0.8811093801633385
from sklearn.ensemble import BaggingClassifier
clf = BaggingClassifier(DecisionTreeClassifier())
np.mean(cross_val_score(clf, X, y))
0.9047986593022707
Fit a tree that uses only one split on one variable to the data in question 3. Now apply boosting.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y)
pred = stump.predict(X)
1 - accuracy_score(y, pred)
0.20625950880243427
X, y = get_spam_data()
n = len(X)
w = np.repeat(1 / n, n)
J = 3
stumps = [DecisionTreeClassifier(max_depth=1) for _ in range(J)]
alphas = []
for j in range(J):
stumps[j].fit(X, y, sample_weight=w)
pred = stumps[j].predict(X)
L_hat = np.dot(w, pred != y) / np.sum(w)
alpha = np.log((1 - L_hat) / L_hat)
w = w * np.exp(alpha * (pred != y))
alphas.append(alpha)
all_preds = np.array([stump.predict(X) for stump in stumps])
# Predictions are in {0, 1}, so combine them by a weighted majority vote
# (compare the weighted sum against half the total weight) rather than a sign.
final_preds = (alphas @ all_preds >= 0.5 * np.sum(alphas)).astype(int)
1 - accuracy_score(y, final_preds)
0.43077591827863504
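As a cross-check on the hand-rolled loop above (not in the original solution), sklearn's AdaBoostClassifier with the same depth-1 base learner and three boosting rounds could be used; a sketch, results not shown:
from sklearn.ensemble import AdaBoostClassifier

# Three rounds of boosting on depth-1 trees, mirroring the loop above.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=3)
ada.fit(X, y)
print(f"In-sample misclassification rate: {1 - accuracy_score(y, ada.predict(X)):.3f}")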
Let $r(x) = \mathbb{P}(Y = 1 \mid X = x)$ and let $\hat{r}(x)$ be an estimate of $r(x)$. Consider the classifier
$$ h(x) = \begin{cases} 1 & \text{if} \,\, \hat{r}(x) \ge 1 / 2 \\ 0 & \text{otherwise.} \end{cases} $$Assume that $\hat{r}(x) \approx N(\bar{r}(x), \sigma^2(x))$ for some functions $\bar{r}(x)$ and $\sigma^2(x)$. Show that, for fixed $x$,
$$ \begin{align*} \mathbb{P}(Y \ne h(x)) &\approx \mathbb{P}(Y \ne h^*(x)) + \left|2r(x) - 1 \right| \times \left[1 - \Phi\left( \frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - (1/2))}{\sigma(x)}\right) \right] \end{align*} $$where $\Phi$ is the standard Normal CDF and $h^*$ is the Bayes rule. Regard $\text{sign}\left((r(x) - (1/2))(\bar{r}(x) - (1/2))\right)$ as a type of bias term. Explain the implications for the bias-variance tradeoff in classification (Friedman (1997)).
Hint: first show that
$$\mathbb{P}(Y \ne h(x)) = |2r(x) - 1|\mathbb{P}(h(x) \ne h^*(x)) + \mathbb{P}(Y \ne h^*(x)).$$Solution:
Observe that for any classifier $h(x)$,
$$ \begin{align*} \mathbb{P}(Y \ne h(x)) &= \mathbb{P}(Y = 1, h(x) = 0) + \mathbb{P}(Y = 0, h(x) = 1) \\ &= \mathbb{1}_{h(x) = 0} \mathbb{P}(Y = 1 \mid X) + \mathbb{1}_{h(x) = 1} \mathbb{P}(Y = 0 \mid X) \\ &= (1 - \mathbb{1}_{h(x) = 1}) r(x) + \mathbb{1}_{h(x) = 1} (1 - r(x)) \\ &= r(x) - \mathbb{1}_{h(x) = 1}(2r(x) - 1). \end{align*} $$Therefore,
$$ \begin{align*} \mathbb{P}(Y \ne h(x)) - \mathbb{P}(Y \ne h^*(x)) &= (\mathbb{1}_{h^*(x) = 1} - \mathbb{1}_{h(x) = 1})(2r(x) - 1) \end{align*} $$Considering cases:
- If $h(x) = h^*(x)$, the difference is $0$.
- If $h^*(x) = 1$ and $h(x) = 0$, then $r(x) > 1/2$ and the difference is $2r(x) - 1 = |2r(x) - 1|$.
- If $h^*(x) = 0$ and $h(x) = 1$, then $r(x) \le 1/2$ and the difference is $1 - 2r(x) = |2r(x) - 1|$.
Thus the difference equals $\mathbb{1}_{h(x) \ne h^*(x)}|2r(x) - 1|$, and taking the expectation over the randomness in $\hat{r}(x)$ (which determines $h(x)$),
$$\mathbb{P}(Y \ne h(x)) = |2r(x) - 1| \, \mathbb{P}(h(x) \ne h^*(x)) + \mathbb{P}(Y \ne h^*(x)),$$so the hint statement holds. Now,
$$ \begin{align*} \mathbb{P}(h(x) \ne h^*(x)) &= \mathbb{P}(h(x) = 1, h^*(x) = 0) + \mathbb{P}(h(x) = 0, h^*(x) = 1) \\ &= \mathbb{P}(\hat{r}(x) \ge 1/2, r(x) < 1/2) + \mathbb{P}(\hat{r}(x) < 1/2, r(x) \ge 1/2) \\ &= \begin{cases} \mathbb{P}(\hat{r}(x) \ge 1/2) & r(x) < 1/2 \\ \mathbb{P}(\hat{r}(x) < 1/2) & r(x) \ge 1/2 \\ \end{cases} \\ &= \begin{cases} \mathbb{P}(Z \ge (1/2 - \bar{r}(x)) / \sigma(x)) & r(x) < 1/2 \\ \mathbb{P}(Z < (1/2 - \bar{r}(x)) / \sigma(x)) & r(x) \ge 1/2 \\ \end{cases} \tag{$Z \sim N(0,1)$}\\ &= \begin{cases} 1 - \Phi((1/2 - \bar{r}(x)) / \sigma(x)) & r(x) - (1/2) < 0 \\ 1 - \Phi((\bar{r}(x) - 1/2) / \sigma(x)) & r(x) - (1/2) \ge 0 \\ \end{cases} \tag{$\Phi(a) = 1 - \Phi(-a)$}\\ &= 1 - \Phi\left(\frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - 1/2)}{\sigma(x)}\right)\\ \end{align*} $$Recall the bias-variance tradeoff. As $\sigma(x) \to \infty$, $\Phi\left(\frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - 1/2)}{\sigma(x)}\right) \to 1/2$, meaning
$$\mathbb{P}(Y \ne h(x)) \to |r(x) - 1/2| + \mathbb{P}(Y \ne h^*(x)).$$Meanwhile, as $\sigma(x) \to 0$, $\Phi\left(\frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - 1/2)}{\sigma(x)}\right) \to 1$ if $\bar{r}(x)$ lies on the same side of $1/2$ as $r(x)$ (so that the numerator is positive and the limiting classifier agrees with $h^*$), and $\to 0$ otherwise. Thus,
$$ \mathbb{P}(Y \ne h(x)) \to \begin{cases} |2r(x) - 1| + \mathbb{P}(Y \ne h^*(x)) & h(x) \ne h^*(x) \\ \mathbb{P}(Y \ne h^*(x)) & h(x) = h^*(x) \\ \end{cases} $$In other words, the bias of $\hat{r}(x)$ matters only through its sign: as long as $\bar{r}(x)$ falls on the same side of $1/2$ as $r(x)$, shrinking the variance drives the misclassification probability down to the Bayes risk no matter how large $|\bar{r}(x) - r(x)|$ is, whereas if $\bar{r}(x)$ falls on the wrong side, reducing the variance makes the classifier worse. This is the sense in which the bias-variance interaction in classification differs from the squared-error decomposition in regression (Friedman (1997)).
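A small Monte Carlo check of the approximation (the particular values of $r(x)$, $\bar{r}(x)$ and $\sigma(x)$ below are arbitrary illustrative choices, not from the text):
from scipy.stats import norm

rng = np.random.default_rng(0)
r, r_bar, sigma = 0.7, 0.45, 0.1   # r(x), mean and sd of r_hat(x) at a fixed x
n_sim = 1_000_000

r_hat = rng.normal(r_bar, sigma, size=n_sim)   # draws of the estimator at x
h = (r_hat >= 0.5).astype(int)                 # plug-in classifier at x
y_sim = rng.binomial(1, r, size=n_sim)         # Y | X = x ~ Bernoulli(r(x))

bayes_error = min(r, 1 - r)
formula = bayes_error + abs(2 * r - 1) * (
    1 - norm.cdf(np.sign(r - 0.5) * (r_bar - 0.5) / sigma)
)
print(f"simulated P(Y != h(x)) : {np.mean(y_sim != h):.4f}")
print(f"formula                : {formula:.4f}")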