In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
from numpy import log
from numpy.linalg import det, eigh, inv
import pandas as pd
from scipy.optimize._linesearch import LineSearchWarning
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis as sk_LDA,
    QuadraticDiscriminantAnalysis as sk_QDA,
)
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression as sk_LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


pd.set_option("display.precision", 3)
plt.style.use("default.mplstyle")

1¶

Prove Theorem 22.5:

The Bayes rule is optimal, that is, if $h$ is any other classification rule then $L(h^*) \le L(h)$.

Solution:

We may express the conditional misclassification probability of a classifier in terms of the regression function:

$$ \begin{align*} P(h(x) \ne Y \mid X = x) &= P(h(x) = 1, Y = 0 \mid X = x) + P(h(x) = 0, Y = 1 \mid X = x) \\ &= \mathbb{1}_{h(x) = 1} P(Y = 0 \mid X = x) + \mathbb{1}_{h(x) = 0} P(Y = 1 \mid X = x) \\ &= (1 - \mathbb{1}_{h(x) = 0}) (1 - r(x)) + \mathbb{1}_{h(x) = 0}r(x) \\ &= 1 - \mathbb{1}_{h(x) = 0} + (2 \cdot \mathbb{1}_{h(x) = 0} - 1)r(x). \end{align*} $$

Thus, the difference in conditional misclassification probabilities between a generic classifier $h$ and the Bayes classifier $h^*$ can be expressed as:

$$ \begin{align*} P(h(x) \ne Y \mid X = x) - P(h^*(X) \ne Y \mid X = x) &= (1 - 2r(x))( \mathbb{1}_{h^*(x) = 0} - \mathbb{1}_{h(x) = 0}). \end{align*} $$

Since $h^*(x) = 1 \iff r(x) > 1/2$, the RHS is nonnegative. To see this, consider the cases:

  • If $h(x) = h^*(x)$, then the second factor is 0.
  • If $h(x) \ne h^*(x)$ and $h^*(x) = 0$, then $r(x) \le 1/2$ and $h(x) = 1$, and thus both factors are nonnegative.
  • If $h(x) \ne h^*(x)$ and $h^*(x) = 1$, then $r(x) > 1/2$ and $h(x) = 0$, and thus both factors are strictly negative and their product is positive.

Therefore,

$$ \begin{align*} L(h) - L(h^*) &= P(h(X) \ne Y) - P(h^*(X) \ne Y) \\ &= \int P(h(x) \ne Y \mid X = x) - P(h^*(X) \ne Y \mid X = x) \, dP(x) \\ &\ge 0 \tag{integrand is nonnegative} \end{align*} $$
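
As a quick numerical sanity check (not part of the proof; the regression function below is made up), simulate data with a known $r(x)$ and compare the empirical error of the Bayes rule with that of an arbitrary competing threshold rule:

# Sketch: with r(x) known, the Bayes rule 1{r(x) > 1/2} should have no higher
# empirical error than any other rule (here, an arbitrary shifted threshold).
rng = np.random.default_rng(0)
x_sim = rng.uniform(-3, 3, size=200_000)
r_sim = 1 / (1 + np.exp(-2 * x_sim))  # assumed regression function P(Y=1 | X=x)
y_sim = rng.binomial(1, r_sim)
bayes_rule = (r_sim > 0.5).astype(int)  # equivalently x_sim > 0
other_rule = (x_sim > 1.0).astype(int)  # a competing (suboptimal) threshold
print(np.mean(bayes_rule != y_sim), np.mean(other_rule != y_sim))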

2¶

Prove Theorem 22.7:

If $X \mid Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X \mid Y = 1 \sim N(\mu_1, \Sigma_1)$, then the Bayes rule is

$$ h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2 \log\left( \frac{\pi_1}{\pi_0} \right) + \log \left( \frac{|\Sigma_0|}{|\Sigma_1|}\right) \\ 0 & \text{otherwise} \end{cases} $$

where

$$r_i^2 = (x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i), \,\,i = 0, 1$$

is the (squared) Mahalanobis distance. An equivalent way of expressing the Bayes rule is

$$h^*(x) = \argmax_k \delta_k(x)$$

where

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$$

and $|A|$ denotes the determinant of a matrix $A$.

Solution:

From Theorem 22.6, the Bayes rule is

\begin{align*} h(x) &= \argmax_k P(Y = k \mid X = x) \\ &= \argmax_k \pi_k f_k(x) \end{align*}

where

$$P(Y = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_r f_r(x) \pi_r},$$

$\pi_r = P(Y = r)$ and $f_r(x) = f(x \mid Y = r)$.

In our case,

\begin{align*} \pi_k f_k(x) &= \frac{\pi_k}{(2 \pi)^{d/2} |\Sigma_k|^{1/2}}\exp \left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\} \\ &= \frac{\pi_k}{(2 \pi)^{d/2} |\Sigma_k|^{1/2}}\exp \left\{ -\frac{1}{2} r_k^2 \right\} \\ \end{align*}

Note that the maximizer (over $k$) of $\pi_k f_k(x)$ also maximizes $\delta_k(x) \equiv \log(\pi_k f_k(x)) + (d/2) \log(2 \pi)$, which is:

\begin{align*} \delta_k(x) &= \log(\pi_k) - (1/2)\log(|\Sigma_k|) - (1/2)r^2_k \end{align*}

If $h^*(x) = 1$, it must be the case that $\argmax_k \delta_k(x) = 1$ (taking the argmax to break ties in favor of class 0), and thus

\begin{align*} \delta_0(x) &< \delta_1(x) \\ \Rightarrow \log(\pi_0) - (1/2)\log(|\Sigma_0|) - (1/2)r^2_0 &< \log(\pi_1) - (1/2)\log(|\Sigma_1|) - (1/2)r^2_1 \\ \Rightarrow r_1^2 &< r_0^2 + 2 \log\left( \frac{\pi_1}{\pi_0} \right) + \log \left( \frac{|\Sigma_0|}{|\Sigma_1|}\right). \end{align*}
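
As a small numerical check (with made-up Gaussian parameters, not from the text), the $\delta_k$ comparison and the Mahalanobis-distance inequality give identical decisions:

# Sketch: verify on synthetic 2-D Gaussians that argmax_k delta_k(x) agrees with
# the threshold form of the Bayes rule derived above.
rng = np.random.default_rng(1)
mu = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 1.0])}
Sigma = {0: np.array([[1.0, 0.3], [0.3, 1.0]]), 1: np.array([[2.0, -0.4], [-0.4, 1.0]])}
prior = {0: 0.6, 1: 0.4}
x_test = rng.normal(size=(5, 2))

def r_squared(x, k):
    # squared Mahalanobis distance to class k
    d = x - mu[k]
    return np.einsum("ij,jk,ik->i", d, inv(Sigma[k]), d)

def delta(x, k):
    return -0.5 * log(det(Sigma[k])) - 0.5 * r_squared(x, k) + log(prior[k])

rule_delta = (delta(x_test, 1) > delta(x_test, 0)).astype(int)
rule_threshold = (
    r_squared(x_test, 1)
    < r_squared(x_test, 0)
    + 2 * log(prior[1] / prior[0])
    + log(det(Sigma[0]) / det(Sigma[1]))
).astype(int)
print(np.all(rule_delta == rule_threshold))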

3¶

Download the spam data from:

http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html

The data file can also be found on the course web page. The data contain 57 covariates relating to email messages. Each email message was classified as spam ($Y = 1$) or not spam ($Y = 0$). The outcome $Y$ is the last column in the file. The goal is to predict whether an email is spam or not.

(a) Construct classification rules using (i) LDA, (ii) QDA, (iii) logistic regression, and (iv) a classification tree. For each, report the observed misclassification error rate and construct a 2-by-2 table of the form

          $\hat{h}(x) = 0$    $\hat{h}(x) = 1$
$Y = 0$          ??                  ??
$Y = 1$          ??                  ??

(b) Use 5-fold cross validation to estimate the prediction accuracy of LDA and logistic regression.

(c) Sometimes it helps to reduce the number of covariates. One strategy is to compare $X_i$ for the spam and email group. For each of the 57 covariates, test whether the mean of the covariate is the same or different between the two groups. Keep the 10 covariates with the smallest $p$-values. Try LDA and logistic regression using only these 10 variables.

In [2]:
! curl -s --insecure https://www.stat.cmu.edu/~larry/all-of-statistics/=data/spam.dat --output data/spam.dat

def get_spam_data():
    # 57 covariates in the first 57 columns; the last column is the spam label.
    spam_df = pd.read_csv("data/spam.dat", sep=r"\s+", header=None)
    X = spam_df[[x for x in range(57)]].values  # 4601x57
    y = spam_df[57].values  # 4601x1
    return X, y

def display_results(y_pred, y_true):
    # confusion_matrix(y_pred, y_true) places predicted labels on the rows
    # and true labels on the columns.
    print(
        pd.DataFrame(
            confusion_matrix(y_pred, y_true),
            index=["Predicted Negative", "Predicted Positive"],
            columns=["True Negative", "True Positive"],
        )
    )
    print(
        f"\nMisclassification rate: {100 * (1 - accuracy_score(y_true, y_pred)):.2f}%"
    )
In [3]:
class DiscriminantAnalysis:
    """LDA/QDA via the discriminant functions delta_k derived in problem 2."""

    def __init__(self, kind="linear"):
        self.type = kind

    def fit(self, X, y):
        N = len(X)
        class_indices = {k: np.flatnonzero(y == k) for k in [0, 1]}
        self.X_ = {k: X[class_indices[k]] for k in [0, 1]}
        self.n = {k: len(class_indices[k]) for k in [0, 1]}
        self.pi = {k: self.n[k] / N for k in [0, 1]}  # class priors
        self.mu = {k: np.mean(self.X_[k], axis=0) for k in [0, 1]}  # class means
        self.S = {k: np.cov(self.X_[k], rowvar=False) for k in [0, 1]}  # class covariances
        # Eigendecomposition S_k = U_k D_k U_k^T, used for the Mahalanobis distances.
        self.U = {}
        self.D = {}
        for k in [0, 1]:
            d, self.U[k] = eigh(self.S[k])
            self.D[k] = np.diag(d)
        if self.type == "linear":
            # Pooled (common) covariance estimate for LDA.
            self.S_common = (self.n[0] * self.S[0] + self.n[1] * self.S[1]) / (
                self.n[0] + self.n[1]
            )

    def _mahalanobis_distance_squared(self, x, k):
        diff = (x - self.mu[k]) @ self.U[k]
        return np.sum(np.dot(diff, inv(self.D[k])) * diff, axis=1)

    def _quadratic_delta(self, k, x):
        result = (-1 / 2) * log(det(self.D[k]))
        result += (-1 / 2) * self._mahalanobis_distance_squared(x, k)
        result += log(self.pi[k])
        return result

    def _linear_delta(self, k, x):
        # Terms that do not depend on k are dropped, since only differences matter.
        S_common_inv = inv(self.S_common)
        result = np.sum(np.dot(x, S_common_inv) * self.mu[k], axis=1)
        result += -(1 / 2) * self.mu[k].T @ S_common_inv @ self.mu[k]
        result += log(self.pi[k])
        return result

    def predict(self, x):
        if self.type == "linear":
            return np.where(
                self._linear_delta(0, x) < self._linear_delta(1, x), 1, 0
            )
        else:
            return np.where(
                self._quadratic_delta(0, x) < self._quadratic_delta(1, x), 1, 0
            )

X, y = get_spam_data()
results = {}
lda = DiscriminantAnalysis()
lda.fit(X, y)
y_pred = lda.predict(X)
results["LDA"] = accuracy_score(y, y_pred)
print("LDA")
display_results(y_pred, y)

qda = DiscriminantAnalysis(kind="quadratic")
qda.fit(X, y)
y_pred = qda.predict(X)
results["QDA"] = accuracy_score(y, y_pred)
print("\n\nQDA")
display_results(y_pred, y)
LDA
                    True Negative  True Positive
Predicted Negative           2663            387
Predicted Positive            125           1426

Misclassification rate: 11.13%


QDA
                    True Negative  True Positive
Predicted Negative           2101             82
Predicted Positive            687           1731

Misclassification rate: 16.71%
In [4]:
clf = sk_LDA()
clf.fit(X, y)
y_pred = clf.predict(X)
print("\n\nLDA (scikit learn)")
display_results(y_pred, y)

clf = sk_QDA()
clf.fit(X, y)
y_pred = clf.predict(X)
print("\n\nQDA (scikit learn)")
display_results(y_pred, y)

LDA (scikit learn)
                    True Negative  True Positive
Predicted Negative           2663            387
Predicted Positive            125           1426

Misclassification rate: 11.13%


QDA (scikit learn)
                    True Negative  True Positive
Predicted Negative           2101             82
Predicted Positive            687           1731

Misclassification rate: 16.71%
In [5]:
class LogisticRegression(BaseEstimator, ClassifierMixin):
    """Logistic regression fit by iteratively reweighted least squares (Newton's method)."""

    def __init__(self, tol=1e-4):
        self.tol = tol

    def fit(self, X, y):
        beta_hat = {}
        p = {}
        s = 0
        X_aug = self._add_intercept(X)
        beta_hat[0] = np.zeros(X_aug.shape[1])
        k = 0
        # At most 5 Newton steps; stop early once the update is smaller than tol.
        while k < 5:
            p[s] = self._sigmoid(X_aug @ beta_hat[s])
            # Adjusted response and weights for the weighted least squares step.
            Z = self._logit(p[s]) + (y - p[s]) / (p[s] * (1 - p[s]))
            W = np.diag(p[s] * (1 - p[s]))
            s += 1
            beta_hat[s] = (
                np.linalg.inv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ Z
            )
            if np.linalg.norm(beta_hat[s] - beta_hat[s - 1]) < self.tol:
                break
            k = k + 1
        self.beta_hat_final = beta_hat[s]
        return self

    def predict(self, X):
        return np.round(self.predict_proba(X))

    def predict_proba(self, X):
        return self._sigmoid(self._add_intercept(X) @ self.beta_hat_final)

    def _add_intercept(self, X):
        return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)

    def _sigmoid(self, x):
        # The 1e-6 keeps p strictly below 1 so that logit(p) and 1/(p(1-p)) stay finite.
        return 1 / (1 + np.exp(-x) + 1e-6)

    def _logit(self, x):
        return np.log(x / (1 - x))

log_reg = LogisticRegression()
log_reg.fit(X, y)
y_pred = log_reg.predict(X)
print("Logistic Regression")
display_results(y_pred, y)
Logistic Regression
                    True Negative  True Positive
Predicted Negative           2660            201
Predicted Positive            128           1612

Misclassification rate: 7.15%
In [6]:
log_reg = sk_LogisticRegression(
    penalty="none",
    max_iter=1000,
    tol=1e-4,
    fit_intercept=True,
    solver="lbfgs",
)
log_reg.fit(X, y)
y_pred = log_reg.predict(X)
print("Logistic Regression (scikit-learn)")
display_results(y_pred, y)
Logistic Regression (scikit-learn)
                    True Negative  True Positive
Predicted Negative           2664            197
Predicted Positive            124           1616

Misclassification rate: 6.98%
/home/eric-maria/eric/self_study/homl/.venv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [7]:
tree = DecisionTreeClassifier()
tree.fit(X, y)
y_pred = tree.predict(X)
print("Decision Tree")
display_results(y_pred, y)
Decision Tree
                    True Negative  True Positive
Predicted Negative           2788              3
Predicted Positive              0           1810

Misclassification rate: 0.07%

The decision tree is overfitting. The average out-of-sample misclassification rate is much higher:

In [8]:
misclassification_rates = cross_val_score(
    tree,
    X,
    y,
    scoring=make_scorer(
        lambda y_true, y_pred: 1 - accuracy_score(y_true, y_pred)
    ),
)
print(
    f"Cross-validated misclassification rate: {100 * np.mean(misclassification_rates):.2f}%"
)
Cross-validated misclassification rate: 11.91%
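
A common remedy (a side sketch, not in the original; the max_depth value is arbitrary) is to limit the complexity of the tree and cross-validate the result:

# Sketch: a depth-limited tree overfits less than a fully grown one.
pruned_tree = DecisionTreeClassifier(max_depth=6)
print(
    f"Cross-validated accuracy (max_depth=6): {np.mean(cross_val_score(pruned_tree, X, y)):.3f}"
)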

Obtaining cross validation results for Logistic Regression and LDA, we observe logistic regression performs slightly better:

In [9]:
_original_showwarning = warnings.showwarning


def custom_warning_handler(
    message, category, filename, lineno, file=None, line=None
):
    # Suppress line-search and convergence warnings; pass everything else on to
    # the original handler (saved above to avoid infinite recursion).
    if isinstance(message, (LineSearchWarning, ConvergenceWarning)):
        pass
    else:
        _original_showwarning(message, category, filename, lineno, file, line)


warnings.showwarning = custom_warning_handler
lda = sk_LDA()
log_reg = sk_LogisticRegression()
print("5-Fold CV Accuracy Metrics:\n")
for name, estimator in zip(["LDA", "Logistic Regression"], [lda, log_reg]):
    score = cross_val_score(estimator, X, y, scoring="accuracy")
    print(f"{name} : {np.mean(score):.3f}")
5-Fold CV Accuracy Metrics:

LDA : 0.880
Logistic Regression : 0.901

Running CV again, restricting covariates to the top 10 (as measured by significance of class mean differences):

In [10]:
from scipy.stats import ttest_ind

# Two-sample t-test of equal means for each covariate; keep the 10 columns
# with the smallest p-values.
_, p_values = ttest_ind(X[np.flatnonzero(y == 0)], X[np.flatnonzero(y == 1)])
selected_cols = np.argpartition(p_values, 10)[:10]
X_sub = X[:, selected_cols]

print("5-Fold CV Accuracy Metrics:\n")
for name, estimator in zip(["LDA", "Logistic Regression"], [lda, log_reg]):
    score = cross_val_score(estimator, X_sub, y, scoring="accuracy")
    print(f"{name} : {np.mean(score):.3f}")
5-Fold CV Accuracy Metrics:

LDA : 0.832
Logistic Regression : 0.865

In this case, dropping variables does not improve performance.

4¶

Let $\mathcal{A}$ be the set of two-dimensional spheres. That is, $A \in \mathcal{A}$ if $A = \{(x,y) : (x - a)^2 + (y - b)^2 \le c^2\}$ for some $a, b, c$. Find the VC-dimension of $\mathcal{A}$.

Solution:

The VC dimension is 3. Three points in general position (say, the vertices of a triangle) can be shattered, since each of the eight subsets can be picked out by a suitable disk (verified numerically at the end of this solution), so the VC dimension is at least 3. To see that no set of 4 points can be shattered, we show that for any configuration of 4 points there is a subset that is impossible to pick out using a member of $\mathcal{A}$. Denote the 4 points $A$, $B$, $C$ and $D$.

Suppose the 4 points are not in convex position. Then one of them lies in the triangle formed by the remaining three (or on a segment joining two of them); WLOG let this point be $A$. Since disks are convex, any member of $\mathcal{A}$ which contains $B$, $C$, and $D$ necessarily contains $A$. Therefore, it is impossible to pick out the set $\{B, C, D\}$.

Suppose now that the 4 points are in convex position, and suppose for contradiction that the set can be shattered. Since the four interior angles of the quadrilateral sum to $360°$, we may label the points so that $A$ and $B$ are opposite (diagonal) vertices whose interior angles sum to at most $180°$; the angles at $C$ and $D$ then sum to at least $180°$. Observe that no disk can pick out exactly $\{A, B\}$: if a circle contained $A$ and $B$ while excluding $C$ and $D$, then $C$ and $D$, lying outside the circle on opposite sides of the segment $AB$, would have interior angles $\angle ACB$ and $\angle ADB$ summing to strictly less than $180°$, a contradiction.

Alternative proof: Suppose again that the 4 points are in convex position and that the set can be shattered, with $A$ and $B$ opposite vertices of the quadrilateral. Take disks $D_1$ picking out $\{A, B\}$ and $D_2$ picking out $\{C, D\}$. A point $P$ lies in a disk with center $O$ and radius $c$ exactly when $\|P - O\|^2 - c^2 \le 0$; the difference of these two quadratic functions for $D_1$ and $D_2$ is affine in $P$, so the set where they are equal is a line. Since $A, B \in D_1 \setminus D_2$ and $C, D \in D_2 \setminus D_1$, this line strictly separates $\{A, B\}$ from $\{C, D\}$. But the diagonals $AB$ and $CD$ of a convex quadrilateral intersect, so no line can separate these pairs, a contradiction.
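
To back up the lower bound (that three points can be shattered), here is a small check on the vertices of an equilateral triangle, with disks chosen by hand for each of the eight subsets:

# Sketch: exhibit a disk that picks out each subset of three triangle vertices.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def in_disk(p, center, radius):
    return np.linalg.norm(p - center) <= radius

disks = {  # subset (as sorted tuple of indices) -> (center, radius)
    (): (np.array([10.0, 10.0]), 0.1),
    (0,): (pts[0], 0.3),
    (1,): (pts[1], 0.3),
    (2,): (pts[2], 0.3),
    (0, 1): ((pts[0] + pts[1]) / 2, 0.6),
    (0, 2): ((pts[0] + pts[2]) / 2, 0.6),
    (1, 2): ((pts[1] + pts[2]) / 2, 0.6),
    (0, 1, 2): (np.array([0.5, np.sqrt(3) / 6]), 1.0),
}
for subset, (center, radius) in disks.items():
    picked = tuple(i for i, p in enumerate(pts) if in_disk(p, center, radius))
    assert picked == subset, (subset, picked)
print("all 8 subsets picked out")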

5¶

Classify the spam data using support vector machines. Free software for the support vector machine is at http://svmlight.joachims.org/

In [11]:
svc = SVC()
print(
    f"Cross validation accuracy score : {np.mean(cross_val_score(svc, X, y)):.3f}"
)
print(
    f"Accuracy (in-sample): {accuracy_score(y, svc.fit(X,y).predict(X)):.3f}"
)
Cross validation accuracy score : 0.705
Accuracy (in-sample): 0.727

SVM with the default RBF kernel doesn't perform as well as Logistic Regression or LDA, likely because the covariates are not standardized and SVC is sensitive to feature scales (see the sketch after the next cell). Restricting covariates doesn't help:

In [12]:
svc = SVC()
print(
    f"Cross validation accuracy score : {np.mean(cross_val_score(svc, X_sub, y)):.3f}"
)
print(
    f"Accuracy (in-sample): {accuracy_score(y, svc.fit(X_sub,y).predict(X_sub)):.3f}"
)
Cross validation accuracy score : 0.690
Accuracy (in-sample): 0.693
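
Because SVC is scale-sensitive, a natural follow-up (not part of the original run, so no score is reported) is to standardize the covariates in a pipeline before fitting the SVM:

# Sketch: standardize features before the RBF SVM; StandardScaler and
# make_pipeline are standard scikit-learn utilities.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_svc = make_pipeline(StandardScaler(), SVC())
print(
    f"Cross validation accuracy score : {np.mean(cross_val_score(scaled_svc, X, y)):.3f}"
)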

6¶

Use VC theory to get a confidence interval on the true error rate of the LDA classifier for the iris data (from the book web site).

Solution: Note that LDA is a linear classifier. Thus, by Theorem 22.26, the $1 - \alpha$ confidence interval for the true error rate is $\hat{L}(\hat{h}) \pm \epsilon$ where

$$\epsilon_n^2 = \frac{32}{n} \log \left( \frac{8(n^{d+1} + 1)}{\alpha}\right).$$

In the case of the iris data, $n=150$ and $d=4$. Therefore,

$$ \begin{align*} \epsilon_n^2 &= \frac{32}{150} \log \left( \frac{8\left((150)^{4 + 1} + 1\right)}{0.05}\right) \\ &\approx 6.427 \Rightarrow \epsilon_n \approx 2.535. \end{align*} $$

Meanwhile, $\hat{L} \approx 0.02$. Since $\epsilon_n \approx 2.54$ is larger than 1, the interval $\hat{L} \pm \epsilon_n$ covers all of $[0, 1]$: with only $n = 150$ observations the VC bound is vacuous.

In [13]:
np.round(np.sqrt((32 / 150) * np.log((8 * ((150**5) + 1)) / 0.05)), 4)
Out[13]:
2.5352
In [14]:
!curl -s --insecure https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data --output data/iris.data

def get_iris_data():
    iris_df = pd.read_csv("data/iris.data", header=None, names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"])
    y = iris_df["class"].values
    X = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
    return X, y
X, y = get_iris_data()
lda = sk_LDA()
lda.fit(X, y)
print(f"Error rate (in sample): {(1 - accuracy_score(y, lda.predict(X))):.3f}")
Error rate (in sample): 0.020

7¶

Suppose that $X_i \in \mathbb{R}$ and that $Y_i = 1$ whenever $|X_i| \le 1$ and $Y_i = 0$ whenever $|X_i| > 1$. Show that no linear classifier can perfectly classify these data. Show that the kernelized data $Z_i = (X_i, X_i^2)$ can be linearly separated.

Solution:

A linear classifier in one dimension is a threshold rule: it predicts one class on one side of a cut-off and the other class on the other side. Since $Y = 1$ exactly on the interval $[-1, 1]$ and $Y = 0$ on both sides of it, no single threshold can separate the classes.

Now consider the kernelized data. Let $u$ and $v$ denote the coordinates of the two-dimensional feature space. Every $Z_i$ lies on the parabola $v = u^2$. Moreover, if $Y_i = 1$ (i.e. $|X_i| \le 1$), then $Z_i$ lies at or below $v = 1$, and if $Y_i = 0$ (i.e. $|X_i| > 1$), $Z_i$ lies strictly above $v = 1$. Thus, the line $v = 1$ linearly separates the two classes.

In [15]:
u = np.random.random(size=2**6) * 4 - 2
# By the problem statement, |u| <= 1  =>  Y = 1 and |u| > 1  =>  Y = 0.
index_lists = [np.flatnonzero(u**2 <= 1), np.flatnonzero(u**2 > 1)]
colors = ["blue", "orange"]
labels = ["$Y=1$", "$Y=0$"]
for i, index_list in enumerate(index_lists):
    plt.scatter(
        u[index_list],
        np.power(u, 2)[index_list],
        color=colors[i],
        label=labels[i],
    )
plt.plot([-2, 2], [1, 1], color="green", linewidth=3, label="v = 1")
plt.xlabel("$u$")
plt.ylabel("$v$")
plt.legend()
plt.show()

8¶

Repeat question 5 using the kernel $K(x, \tilde{x}) = (1 + x^T\tilde{x})^p$. Choose $p$ by cross-validation.

Solution: This is facilitated by scikit-learn's implementation of the support vector machine, which allows the polynomial kernel to be specified via kernel="poly" and degree (see also the note after the results below).

In [16]:
from sklearn.svm import SVC

X, y = get_spam_data()
p_values = np.arange(0, 4, 1, dtype=int)
cv_results = {}
for p in p_values:
    svc = SVC(kernel="poly", degree=p)
    score = np.mean(cross_val_score(svc, X, y))
    cv_results[p] = score
    print(f"p = {p}:\tscore = {score:.3f}")
p = 0:	score = 0.606
p = 1:	score = 0.701
p = 2:	score = 0.669
p = 3:	score = 0.649

The best choice for $p$ appears to be $1$.
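
Note that scikit-learn's polynomial kernel has the form $(\gamma x^T\tilde{x} + c_0)^p$ with defaults $\gamma =$ "scale" and $c_0 = 0$, so the runs above do not match the exercise's kernel $(1 + x^T\tilde{x})^p$ exactly. A hedged sketch (parameter grid chosen arbitrarily, not run here) that matches the kernel and tunes $p$ and $C$ jointly:

# Sketch: gamma=1, coef0=1 makes the poly kernel equal to (1 + x^T x~)^p.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    SVC(kernel="poly", gamma=1, coef0=1),
    param_grid={"degree": [1, 2, 3], "C": [0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)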

9¶

Apply the $k$ nearest neighbors classifier to the "iris data". Choose $k$ by cross-validation.

In [17]:
X, y = get_iris_data()
k_values = np.arange(1, 20, 1, dtype=np.int32)
cv_result = {}
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_result[k] = np.mean(cross_val_score(clf, X, y))

plt.plot(cv_result.keys(), cv_result.values())
plt.xticks(k_values)
plt.grid()
plt.xlabel("$k$")
plt.ylabel("Accuracy (5-fold CV)")
plt.title("KNN on Iris Data")
plt.show()

Based on this I would choose $k = 9$.
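
The same choice can also be read off programmatically from the cv_result dictionary above (a small addition):

# k with the highest mean CV accuracy (ties resolved by iteration order).
best_k = max(cv_result, key=cv_result.get)
print(f"best k = {best_k}, CV accuracy = {cv_result[best_k]:.3f}")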

10¶

(Curse of Dimensionality.) Suppose that $X$ has a uniform distribution on the $d$-dimensional cube $[-1/2, 1/2]^d$. Let $R$ be the distance from the origin to the closest neighbor. Show that the median of $R$ is

$$\left( \frac{1 - \left(\frac12\right)^{1/n}}{v_d(1)} \right)^{1/d}$$

where

$$v_d(r) = r^d \frac{\pi^{d/2}}{\Gamma ((d/2) + 1)}$$

is the volume of a sphere of radius $r$. For what dimension $d$ does the median of $R$ exceed the edge of the cube when $n=100$, $n=1,000$, $n=10,000$? (Hastie et al. (2001), p. 22-27.)

Solution:

Let $Z$ be the random variable denoting the distance from the origin to the closest data point. We seek $m$ such that $P(Z \le m) = 1 / 2$. Let $X_k$ be the distance from the origin to the $k^{\text{th}}$ data point. We have:

$$ \begin{align*} 1 / 2 = P(Z \le m) &= 1 - P(Z > m)\\ &= 1 - P(X_k > m ,\ \forall k)\\ &= 1 - P(X_k > m)^n\\ &= 1 - (1 - P(X_k \le m))^n. \end{align*} $$

Note that $$P(X_k \le m) = \frac{\text{Volume of the $d$-dimensional ball of radius $m$}}{\text{Volume of the $d$-dimensional cube}} = v_d(m) = m^d v_d(1),$$ and therefore $$1 / 2 = 1 - (1 - m^d v_d(1))^n \Rightarrow m = \left( \frac{1 - \left(\frac{1}{2}\right)^{1 / n}}{v_d(1)} \right) ^{1 / d}$$

I assume the "edge of the cube" refers to the distance from the origin to the nearest face, which is always $1/2$.
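
Since the ball of radius $1/2$ is inscribed in the cube, the expression for the median is exact up to $m = 1/2$, and the condition "median $> 1/2$" can be rearranged into a form that is easy to check numerically:

$$\left( \frac{1 - \left(\frac{1}{2}\right)^{1/n}}{v_d(1)} \right)^{1/d} > \frac{1}{2} \iff 1 - \left(\frac{1}{2}\right)^{1/n} > v_d(1/2) = 2^{-d}\, v_d(1).$$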

In [18]:
from scipy.special import gamma
In [19]:
def get_median(n, d):
    # v_d(1): volume of the unit ball in R^d (uses scipy.special.gamma imported above)
    v_d_1 = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return ((1 - (1 / 2) ** (1 / n)) / v_d_1) ** (1 / d)


print("dimension required for median to exceed distance to face:")
for n in [100, 1_000, 10_000]:
    d = 1
    while get_median(n, d) <= 0.5:
        d += 1
    print(f"n : {n}, d = {d}")
dimension required for median to exceed distance to face:
n : 100, d = 9
n : 1000, d = 12
n : 10000, d = 14

11¶

Fit a tree to the data in question 3. Now apply bagging and report your results.

In [20]:
X, y = get_spam_data()
tree = DecisionTreeClassifier()
np.mean(cross_val_score(tree, X, y))
Out[20]:
0.8811093801633385
In [21]:
from sklearn.ensemble import BaggingClassifier

clf = BaggingClassifier(DecisionTreeClassifier())
np.mean(cross_val_score(clf, X, y))
Out[21]:
0.9047986593022707
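
Bagging's variance reduction generally improves with more bootstrap replicates (BaggingClassifier defaults to a small number of estimators). A quick variant to try (not run here, so no score is reported):

# Sketch: more bootstrapped trees, at extra computational cost.
clf_100 = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
print(np.mean(cross_val_score(clf_100, X, y)))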

12¶

Fit a tree that uses only one split on one variable to the data in question 3. Now apply boosting.

In [22]:
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y)
pred = stump.predict(X)
1 - accuracy_score(y, pred)
Out[22]:
0.20625950880243427
In [23]:
X, y = get_spam_data()
n = len(X)
w = np.repeat(1 / n, n)
J = 3
stumps = [DecisionTreeClassifier(max_depth=1) for _ in range(J)]
alphas = []
for j in range(J):
    stumps[j].fit(X, y, sample_weight=w)
    pred = stumps[j].predict(X)
    L_hat = np.dot(w, pred != y) / np.sum(w)
    alpha = np.log((1 - L_hat) / L_hat)
    w = w * np.exp(alpha * (pred != y))
    alphas.append(alpha)

all_preds = np.array([stump.predict(X) for stump in stumps])
final_preds = np.sign(alphas @ all_preds)
1 - accuracy_score(y, final_preds)
Out[23]:
0.43077591827863504
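
The final vote above applies np.sign to stump predictions coded as $\{0, 1\}$, whereas AdaBoost's weighted-majority vote assumes $\pm 1$ votes; this likely explains why the boosted error comes out worse than the single stump. A hedged sketch of the standard combination, reusing alphas and all_preds from the cell above (not re-run here):

# Standard AdaBoost vote with {0,1}-coded predictions: predict 1 when the
# weighted vote for class 1 exceeds half of the total alpha weight.
weighted_vote = np.array(alphas) @ all_preds
final_preds = (weighted_vote >= 0.5 * np.sum(alphas)).astype(int)
print(f"Boosted misclassification rate: {1 - accuracy_score(y, final_preds):.4f}")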

13¶

Let $r(x) = \mathbb{P}(Y = 1 \mid X = x)$ and let $\hat{r}(x)$ be an estimate of $r(x)$. Consider the classifier

$$ h(x) = \begin{cases} 1 & \text{if} \,\, \hat{r}(x) \ge 1 / 2 \\ 0 & \text{otherwise.} \end{cases} $$

Assume that $\hat{r}(x) \approx N(\bar{r}(x), \sigma^2(x))$ for some functions $\bar{r}(x)$ and $\sigma^2(x)$. Show that, for fixed $x$,

$$ \begin{align*} \mathbb{P}(Y \ne h(x)) &\approx \mathbb{P}(Y \ne h^*(x)) + \left|2r(x) - 1 \right| \times \left[1 - \Phi\left( \frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - (1/2))}{\sigma(x)}\right) \right] \end{align*} $$

where $\Phi$ is the standard Normal CDF and $h^*$ is the Bayes rule. Regard $\text{sign}\left((r(x) - (1/2))(\bar{r}(x) - (1/2))\right)$ as a type of bias term. Explain the implications for the bias-variance tradeoff in classification (Friedman (1997)).

Hint: first show that

$$\mathbb{P}(Y \ne h(x)) = |2r(x) - 1|\mathbb{P}(h(x) \ne h^*(x)) + \mathbb{P}(Y \ne h^*(x)).$$

Solution:

Observe that for any classifier $h(x)$,

$$ \begin{align*} \mathbb{P}(Y \ne h(x)) &= \mathbb{P}(Y = 1, h(x) = 0) + \mathbb{P}(Y = 0, h(x) = 1) \\ &= \mathbb{1}_{h(x) = 0} \mathbb{P}(Y = 1 \mid X) + \mathbb{1}_{h(x) = 1} \mathbb{P}(Y = 0 \mid X) \\ &= (1 - \mathbb{1}_{h(x) = 1}) r(x) + \mathbb{1}_{h(x) = 1} (1 - r(x)) \\ &= r(x) - \mathbb{1}_{h(x) = 1}(2r(x) - 1). \end{align*} $$

Therefore,

$$ \begin{align*} \mathbb{P}(Y \ne h(x)) - \mathbb{P}(Y \ne h^*(x)) &= (\mathbb{1}_{h^*(x) = 1} - \mathbb{1}_{h(x) = 1})(2r(x) - 1) \end{align*} $$

Considering cases:

  • $h(x) = h^*(x)$ : The RHS $= 0$
  • $h(x) = 0$, $h^*(x) = 1$ : The RHS $=2r(x) - 1 = |2r(x) - 1|$
  • $h(x) = 1$, $h^*(x) = 0$ : The RHS $=1 - 2r(x) = |2r(x) - 1|$

Thus, conditional on $\hat{r}(x)$ (and hence on $h(x)$),

$$(\mathbb{1}_{h^*(x) = 1} - \mathbb{1}_{h(x) = 1})(2r(x) - 1) = \mathbb{1}_{h(x) \ne h^*(x)}|2r(x) - 1|,$$

and averaging over the randomness in $\hat{r}(x)$ gives $\mathbb{P}(Y \ne h(x)) = |2r(x) - 1|\mathbb{P}(h(x) \ne h^*(x)) + \mathbb{P}(Y \ne h^*(x))$, so the hint statement holds. Now,

$$ \begin{align*} \mathbb{P}(h(x) \ne h^*(x)) &= \mathbb{P}(h(x) = 1, h^*(x) = 0) + \mathbb{P}(h(x) = 0, h^*(x) = 1) \\ &= \mathbb{P}(\hat{r}(x) \ge 1/2, r(x) < 1/2) + \mathbb{P}(\hat{r}(x) < 1/2, r(x) \ge 1/2) \\ &= \begin{cases} \mathbb{P}(\hat{r}(x) \ge 1/2) & r(x) < 1/2 \\ \mathbb{P}(\hat{r}(x) < 1/2) & r(x) \ge 1/2 \\ \end{cases} \\ &=\begin{cases} \mathbb{P}(Z \ge (1/2 - \bar{r}(x)) / \sigma(x)) & r(x) < 1/2 \\ \mathbb{P}(Z < (1/2 - \bar{r}(x)) / \sigma(x)) & r(x) \ge 1/2 \\ \end{cases} \tag{$Z \sim N(0,1)$}\\ &= \begin{cases} 1 - \Phi((1/2 - \bar{r}(x)) / \sigma(x)) & r(x) - (1/2) < 0 \\ 1 - \Phi((\bar{r}(x) - 1/2) / \sigma(x)) & r(x) - (1/2) \ge 0 \\ \end{cases} \tag{$\Phi(a) = 1 - \Phi(-a)$}\\ &= 1 - \Phi\left(\frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - 1/2)}{\sigma(x)}\right)\\ \end{align*} $$

Recall the bias-variance tradeoff. As $\sigma(x) \to \infty$, $\Phi\left(\frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - 1/2)}{\sigma(x)}\right) \to \Phi(0) = 1/2$, meaning

$$\mathbb{P}(Y \ne h(x)) \to |r(x) - 1/2| + \mathbb{P}(Y \ne h^*(x)).$$

Meanwhile, as $\sigma(x) \to 0$, $\Phi\left(\frac{\text{sign}(r(x) - (1/2))(\bar{r}(x) - 1/2)}{\sigma(x)}\right) \to 1$ if the bias term $\text{sign}\left((r(x) - (1/2))(\bar{r}(x) - (1/2))\right)$ is positive, i.e. if $\bar{r}(x)$ lies on the same side of $1/2$ as $r(x)$, and $\to 0$ otherwise. Thus,

$$ \mathbb{P}(Y \ne h(x)) \to \begin{cases} |2r(x) - 1| + \mathbb{P}(Y \ne h^*(x)) & \bar{r}(x) \text{ and } r(x) \text{ on opposite sides of } 1/2 \\ \mathbb{P}(Y \ne h^*(x)) & \bar{r}(x) \text{ and } r(x) \text{ on the same side of } 1/2. \\ \end{cases} $$

The implication is that classification error depends on $\hat{r}(x)$ only through which side of $1/2$ it falls on: bias in $\hat{r}(x)$ is harmless so long as $\bar{r}(x)$ stays on the correct side of $1/2$, while variance hurts only insofar as it pushes $\hat{r}(x)$ across the boundary, which is why a biased but low-variance estimate can still classify well (Friedman (1997)).