1. In epsilon-greedy action selection, for the case of two actions and $\epsilon$=0.5, what is the probability that the greedy action is selected?
Solution: 0.75. With probability $1-\epsilon = 0.5$ we deliberately select the greedy action, and with probability $\epsilon = 0.5$ we select uniformly at random between the two actions, which yields the greedy action half of the time.
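In symbols:
$$\Pr(\text{greedy}) = (1-\epsilon) + \epsilon \cdot \tfrac{1}{2} = 0.5 + 0.25 = 0.75.$$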
2. Bandit example: Consider a $k$-armed bandit problem with $k=4$ actions, denoted 1, 2, 3 and 4. Consider applying to this problem a bandit algorithm using $\epsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$, for all $a$. Suppose the initial sequence of actions and rewards is $A_1 = 1$, $R_1 = -1$, $A_2 = 2$, $R_2 = 1$, $A_3 = 2$, $R_3 = -2$, $A_4 = 2$, $R_4 = 2$, $A_5 = 3$, $R_5 = 0$. On some of these time steps the $\epsilon$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
Solution: We can only say for sure that the $\epsilon$ case occurred when the chosen action was not greedy, i.e., when its estimated value was not the largest at the time of selection. This definitely occurred on time step 4, when action 2 was chosen even though its estimate ($-0.5$) was below those of actions 3 and 4 (both $0$), and on time step 5, when action 3 was chosen even though action 2 then had the largest estimate ($1/3$).
The $\epsilon$ case could, in principle, have occurred on any of the five time steps: whenever the chosen action happens to be greedy, a random selection is indistinguishable from a deliberate greedy one.
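A minimal sketch that replays the given episode (assuming sample-average estimates and $Q_1(a) = 0$, as stated in the exercise) and prints which actions were greedy before each selection:
import numpy as np
actions = [1, 2, 2, 2, 3]      # A_1 .. A_5
rewards = [-1, 1, -2, 2, 0]    # R_1 .. R_5
Q = np.zeros(5)                # indices 1..4 hold the estimates; Q_1(a) = 0
N = np.zeros(5)                # selection counts
for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = set((np.flatnonzero(Q[1:] == Q[1:].max()) + 1).tolist())  # greedy actions before this step
    verdict = "definitely exploratory" if a not in greedy else "could be either"
    print(f"step {t}: greedy set {greedy}, chose {a} -> {verdict}")
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # sample-average update
It reports "definitely exploratory" only on steps 4 and 5.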
3. In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
Solution: In the long run, once its value estimates are accurate, the $\epsilon = 0.01$ method selects the optimal action about 99.1% of the time (it exploits with probability 0.99 and also hits the optimal arm on 1/10 of its random explorations), whereas the $\epsilon = 0.1$ method does so only about 91% of the time. The smaller-$\epsilon$ method therefore does better asymptotically in both probability of selecting the best action and average (hence cumulative) reward, even though it learns more slowly at first.
The greedy method has some chance of outperforming both if its early estimates happen to make the optimal action look best, since it then exploits that action on every step; choosing optimistic initial values raises the probability of this happening, but it is not guaranteed.
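Quantitatively, with 10 arms the asymptotic probability of selecting the optimal action (assuming the value estimates have converged) is
$$\Pr(A_t = a_*) = (1-\epsilon) + \frac{\epsilon}{10} = \begin{cases} 0.91 & \epsilon = 0.1, \\ 0.991 & \epsilon = 0.01. \end{cases}$$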
4. If the step-size parameters, $\alpha_n$, are not constant, then the estimate $Q_n$ is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?
Solution:
\begin{align*} Q_{n+1} &= Q_n + \alpha_n [R_n - Q_n] \\ &= \alpha_n R_n + (1-\alpha_n) Q_n \\ &= \alpha_n R_n + (1-\alpha_n) \left[ \alpha_{n-1} R_{n-1} + (1-\alpha_{n-1}) Q_{n-1} \right] \\ &= \dots \\ &= \prod_{i=1}^{n} (1-\alpha_i)\, Q_1 + \sum_{i=1}^{n} \alpha_i \left[ \prod_{j=i+1}^{n} (1-\alpha_j) \right] R_i, \end{align*}
where the empty product $\prod_{j=n+1}^{n}(1-\alpha_j)$ is taken to be 1, so the $i=n$ term of the sum is simply $\alpha_n R_n$.
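A quick numerical check of this closed form (a sketch with arbitrary, made-up step sizes, rewards, and initial estimate):
import numpy as np
rng = np.random.default_rng(0)
n = 8
alphas = rng.uniform(0.1, 0.9, size=n)   # arbitrary step sizes alpha_1..alpha_n
rewards = rng.normal(size=n)             # arbitrary rewards R_1..R_n
Q1 = 5.0                                 # arbitrary initial estimate
# iterate Q <- Q + alpha_k * (R_k - Q)
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)
# closed-form weighted average
w_Q1 = np.prod(1 - alphas)
w_R = np.array([alphas[i] * np.prod(1 - alphas[i + 1:]) for i in range(n)])
assert np.isclose(Q, w_Q1 * Q1 + w_R @ rewards)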
5. Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed testbed in which all the $q_*(a)$ start out equal and then take independent random walks (say by adding a normally distributed increment with mean 0 and standard deviation 0.01 to all the $q_*(a)$ on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, $\alpha = 0.1$. Use $\epsilon = 0.1$ and longer runs, say of 10,000 steps.
Solution:
import numpy as np
from matplotlib import pyplot as plt

np.random.seed(42)

# Simulation parameters
k = 10
N_STEPS = 10_000
N_RUNS = 100
epsilon = 0.1
alpha = 0.1
methods = ["Sample Average", "Constant Step Size"]

# True action values: all start equal at 0, then take independent random walks
increments = np.random.normal(loc=0, scale=0.01, size=(N_RUNS, k, N_STEPS))
increments[:, :, 0] = 0  # so every q*(a) starts at exactly 0
q_star = np.cumsum(increments, axis=2)

# Run simulations
avg_reward = {method: np.zeros(N_STEPS) for method in methods}
opt_action = {method: np.zeros(N_STEPS) for method in methods}
for method in methods:
    for i in range(N_RUNS):
        q = np.zeros(k)  # action-value estimates
        n = np.zeros(k)  # number of times each action was taken
        reward = np.zeros(N_STEPS)
        action = np.zeros(N_STEPS, dtype=int)
        for j in range(N_STEPS):
            # epsilon-greedy action selection
            if np.random.random() < epsilon:
                a = np.random.randint(k)
            else:
                a = np.argmax(q)
            # reward is sampled from a normal distribution with mean q*(a) and variance 1
            r = np.random.normal(loc=q_star[i, a, j], scale=1)
            # update action-value estimates
            n[a] += 1
            if method == "Sample Average":
                q[a] += (r - q[a]) / n[a]
            else:  # constant step size
                q[a] += alpha * (r - q[a])
            # log reward and action
            reward[j] = r
            action[j] = a
        avg_reward[method] += reward
        # the optimal action at each step is the one with the highest true value
        opt_action[method] += (action == np.argmax(q_star[i], axis=0))
    avg_reward[method] /= N_RUNS
    opt_action[method] /= N_RUNS

# Plot results
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
plt.plot(avg_reward["Sample Average"], label="Sample Average")
plt.plot(avg_reward["Constant Step Size"], label="Constant Step Size")
plt.xlabel("Steps")
plt.ylabel("Average Reward")
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(100 * opt_action["Sample Average"], label="Sample Average")
plt.plot(100 * opt_action["Constant Step Size"], label="Constant Step Size")
plt.xlabel("Steps")
plt.ylabel("% Optimal Action")
plt.legend()
plt.show()
As the figure shows, the sample-average method performs worse than the constant step-size method: its effective step size $1/n$ shrinks over time, so it responds more and more slowly to the drifting action values. The constant step-size method (an exponential recency-weighted average) keeps weighting recent rewards heavily and tracks the nonstationary values much better.
6. (Mysterious Spikes) The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?
Solution: Since the initial estimates are much larger than any reward the levers actually produce, the first ten greedy selections are essentially forced to be one trial of each of the ten levers. On the eleventh step the method picks the lever with the highest (still optimistic) estimate, which is more likely than chance to be the optimal lever, producing a spike in reward. Even that reward is "disappointing" relative to the optimistic estimate, however, so the chosen lever's estimate drops and the method moves on; the twelfth step is therefore unlikely to select the optimal action, causing a sharp downturn. The oscillations attenuate as the estimates approach the true values.
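A minimal sketch of this effect (assuming pure greedy selection, optimistic initialization $Q_1 = 5$, constant step size $\alpha = 0.1$, and deterministic tie-breaking; roughly the setting of Figure 2.3):
import numpy as np
rng = np.random.default_rng(0)
N_RUNS, K = 2000, 10
optimal_frac = np.zeros(20)   # fraction of runs choosing the optimal lever at steps 1..20
for _ in range(N_RUNS):
    q_true = rng.normal(size=K)           # true action values for this task
    best = np.argmax(q_true)
    Q = np.full(K, 5.0)                   # optimistic initial estimates
    for t in range(20):
        a = np.argmax(Q)                  # pure greedy selection (epsilon = 0)
        r = rng.normal(loc=q_true[a], scale=1.0)
        Q[a] += 0.1 * (r - Q[a])          # constant step-size update
        optimal_frac[t] += (a == best)
optimal_frac /= N_RUNS
print(np.round(optimal_frac, 2))          # bump around step 11, then a dip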
7. (Unbiased Constant-Step-Size Trick) In most of this chapter we have used sample averages to estimate action values because sample averages do not produce the initial bias that constant step sizes do (see the analysis leading to (2.6)). However, sample averages are not a completely satisfactory solution because they may perform poorly on nonstationary problems. Is it possible to avoid the bias of constant step sizes while retaining their advantages on nonstationary problems? One way is to use a step size of
$$\beta_n = \alpha / \bar{o}_n$$
to process the $n^{\text{th}}$ reward for a particular action, where $\alpha > 0$ is a conventional constant step size, and $\bar{o}_n$ is a trace of one that starts at 0:
$$\bar{o}_n = \bar{o}_{n-1} + \alpha (1 - \bar{o}_{n-1}), \text{ for } n \geq 0, \text{ with } \bar{o}_0 = 0.$$
Carry out an analysis like that in (2.6) to show that $Q_n$ is an exponential recency-weighted average without initial bias.
Solution:
Recalling the formula derived in Problem #4 for the general case of non-constant step sizes, we have:
\begin{align*} Q_{n+1} &= \prod_{i=1}^{n} (1 - \beta_i)\, Q_1 + \sum_{i=1}^{n} \beta_i \left[ \prod_{j=i+1}^{n} (1-\beta_j) \right] R_i \\ &= \sum_{i=1}^{n} \beta_i \left[ \prod_{j=i+1}^{n} (1-\beta_j) \right] R_i \tag{since $\beta_1 = \alpha / \bar{o}_1 = \alpha / \alpha = 1$} \end{align*}
Observe that there is no dependence on the initial estimate $Q_1$: the factor $(1 - \beta_1) = 0$ eliminates it. Moreover, the reward weights still sum to 1, and since $\bar{o}_n \to 1$ and hence $\beta_n \to \alpha$ as $n$ grows, the weight on reward $R_i$ approaches $\alpha(1-\alpha)^{n-i}$. Thus $Q_{n+1}$ is an exponential recency-weighted average without initial bias.
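A quick numerical check of these properties (a sketch assuming $\alpha = 0.1$ and $n = 50$):
import numpy as np
alpha, n = 0.1, 50
obar = 0.0
betas = []
for _ in range(n):
    obar = obar + alpha * (1 - obar)    # trace of one, starting from obar_0 = 0
    betas.append(alpha / obar)          # beta_n = alpha / obar_n
betas = np.array(betas)
w_Q1 = np.prod(1 - betas)                                                  # weight on Q_1
w_R = np.array([betas[i] * np.prod(1 - betas[i + 1:]) for i in range(n)])  # weights on rewards
print(w_Q1)        # exactly 0, because beta_1 = 1
print(w_R.sum())   # the reward weights sum to 1
print(w_R[-3:])    # recent weights are close to alpha * (1 - alpha)^(n - i)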
8. (UCB Spikes) In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the 11th step. Why is this? Note that for your answer to be fully satisfactory it must explain both why the reward increases on the 11th step and why it decreases on the subsequent steps. Hint: If $c = 1$, then the spike is less prominent.
Solution: The explanation is similar to that for the optimistic method: untried actions are treated as maximizing, so the first ten steps are trials of each of the ten levers. On the eleventh step the confidence bonuses are all equal, so UCB picks the lever with the highest value estimate, which is disproportionately often the optimal one, producing the spike in reward. Choosing it, however, doubles its count $N_t(a)$ and shrinks its confidence bonus relative to the other levers, so on the twelfth step a different (typically worse) lever is selected and the reward drops. With a smaller $c$ (e.g., $c = 1$) the bonus plays a smaller role relative to the value estimates, which is why the spike is less prominent.
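A minimal sketch reproducing this (assuming the chapter's UCB rule with $c = 2$, sample-average estimates, and untried arms selected first):
import numpy as np
rng = np.random.default_rng(0)
N_RUNS, K, c = 2000, 10, 2.0
avg_reward = np.zeros(20)
for _ in range(N_RUNS):
    q_true = rng.normal(size=K)
    Q = np.zeros(K)   # value estimates
    N = np.zeros(K)   # selection counts
    for t in range(1, 21):
        untried = np.flatnonzero(N == 0)
        if untried.size > 0:
            a = untried[0]                                 # untried arms count as maximizing
        else:
            a = np.argmax(Q + c * np.sqrt(np.log(t) / N))  # UCB action selection
        r = rng.normal(loc=q_true[a], scale=1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                          # sample-average update
        avg_reward[t - 1] += r
avg_reward /= N_RUNS
print(np.round(avg_reward, 2))   # jump at step 11, drop at step 12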
9. Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.
Solution:
Let $H_1$ and $H_2$ be the numerical preferences for the two actions, and let $\sigma(x) = 1/(1 + e^{-x})$ denote the sigmoid function. Then the soft-max probability of choosing action 1 is:
\begin{align*} \pi_1 &= \frac{e^{H_1}}{e^{H_1} + e^{H_2}} \\ &= \frac{1}{1 + e^{H_2 - H_1}} \\ &= \sigma(H_1 - H_2). \end{align*}
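A quick numerical check of this identity (sketch with arbitrary preference values):
import numpy as np
def softmax(h):
    e = np.exp(h - np.max(h))        # subtract the max for numerical stability
    return e / e.sum()
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
h1, h2 = 0.7, -1.2                   # arbitrary preferences
assert np.isclose(softmax(np.array([h1, h2]))[0], sigmoid(h1 - h2))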
10. Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 10 and 20 with probability 0.5 (case A), and 90 and 80 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expected reward you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don't know the true action values). This is an associative search task. What is the best expected reward you can achieve in this task, and how should you behave to achieve it?
Solution: In the first case, the best expected reward is 50. All strategies have equal expected reward, so any strategy (e.g., always choose action 1) is sufficient. In the second case, the optimal strategy is to always choose action 2 in case A, and always choose action 1 in case B, and the expected reward is $20 \cdot \mathbb{P}(A) + 90 \cdot \mathbb{P}(B) = 55$.
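As a check of the first claim, each fixed choice has the same expectation:
$$\mathbb{E}[R \mid A = 1] = 0.5 \cdot 10 + 0.5 \cdot 90 = 50, \qquad \mathbb{E}[R \mid A = 2] = 0.5 \cdot 20 + 0.5 \cdot 80 = 50,$$
so no strategy can do better than 50 without knowing the case.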