1. Devise three simple tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three tasks as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.
Solution:
- Building a fire. The states are the amount of wood, the amount of matches, and the amount of kindling. The actions are to add wood, add kindling, light the fire, blow, and move around the wood. The reward is the amount of heat generated.
- Writing a poem. The states are the amount of paper, the amount of ink, and the amount of inspiration. The actions are to write, to think, to read, to erase, and to crumple. The reward is the quality of the poem (as judged by a panel of experts). I would say I'm stretching the limits here by making the reward a subjective measure.
- Playing a game of chess. The states are the positions of the pieces on the board. The actions are to move a piece, to capture a piece, to castle, to resign, and to offer a draw. The reward is determined by the outcome of the game (win, loss, or draw).
2. Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?
Solution:
I shall take "usefully represent" to mean that one could actually use MDPs to solve such tasks, rather than merely use them to describe the problem. Under that reading, there are several practical limitations:
- The possible actions and/or states may be infinite, unknown, or difficult to represent.
- Progress towards some goals cannot be represented as a scalar, e.g.:
  - it is difficult to assign value in a consistent way (writing a 'good' poem, being a 'good' person, etc.), or
  - there are multiple objectives (e.g., maximizing profits while expanding market share).
- One might reply that this is simply a failure to define the reward function coherently, not a limitation of the MDP framework itself.
3. Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?
Solution:
The right level is the one that makes the problem tractable while remaining accurate enough to be useful. There is no fundamental reason to prefer one location over another; the choice is free, but it is guided by the problem at hand. For example, if we are trying to build a self-driving car that takes the route as given, we would probably define the actions in terms of the accelerator, steering wheel, and brake. If we are trying to build a self-driving system that also chooses the route, we would define the actions in terms of the choices of where to drive.
4. Give a table analogous to that in Example 3.3, but for $p(s', r | s, a)$. It should have columns for $s$, $a$, $s'$, $r$, and $p(s', r | s, a)$, and a row for every 4-tuple for which $p(s', r | s, a) > 0$.
Solution:
$s$ | $a$ | $s'$ | $r$ | $p(s', r \mid s, a)$ |
---|---|---|---|---|
high | search | high | $r_{\text{search}}$ | $\alpha$ |
high | search | low | $r_{\text{search}}$ | $1 - \alpha$ |
high | wait | high | $r_{\text{wait}}$ | 1 |
low | search | high | -3 | $1 - \beta$ |
low | search | low | $r_{\text{search}}$ | $\beta$ |
low | wait | low | $r_{\text{wait}}$ | 1 |
low | recharge | high | 0 | 1 |
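As a quick sanity check, the table can be encoded directly and each $(s, a)$ row verified to sum to 1, as (3.3) requires. A minimal sketch; the numeric values substituted for $\alpha$, $\beta$, $r_{\text{search}}$, and $r_{\text{wait}}$ are arbitrary placeholders.

```python
# Recycling-robot dynamics p(s', r | s, a) as a dictionary of (s', r, prob) triples.
# alpha, beta, r_search, r_wait are placeholders; the values below are arbitrary.
alpha, beta = 0.8, 0.6
r_search, r_wait = 4.0, 1.0

p = {
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
    ("low",  "wait"):     [("low", r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

# Every row of the table for a fixed (s, a) must sum to 1 (cf. equation 3.3).
for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-12, (s, a, total)
print("all (s, a) rows sum to 1")
```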
5. The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).
Solution:
The modified version of (3.3) is:
$$ \sum_{s' \in \mathcal{S}^+} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1 \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s) $$
Note that the only difference is that the sum over $s'$ is now over $\mathcal{S}^+$ instead of $\mathcal{S}$.
6. Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for -1 upon failure. What then would the return be at each time? How does this return differ from that in the continuing formulation of this task?
Solution:
The return would be $G_t = -\gamma^{K-1}$, where $K$ is the number of time steps remaining until failure (so the reward of $-1$ arrives at time $t + K$). In the continuing formulation, failures keep occurring and the return accounts for all of them: $$G_t = -\sum_{i=1}^{\infty} \gamma^{K_i-1},$$ where $K_i$ is the number of time steps from $t$ until the $i$-th subsequent failure. The episodic return thus depends only on the first failure, whereas the continuing return is further penalized by every later failure, discounted by how far off it is.
7. Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?
Solution:
The reward signal is too sparse to communicate the goal. More importantly, because the task is episodic and undiscounted, every successful escape yields the same return of $+1$ no matter how long it takes, so nothing distinguishes fast escapes from slow ones and there is no signal for the agent to improve along. We have not effectively communicated what we actually want, which is to escape quickly; giving a reward of $-1$ per time step (or discounting the $+1$ for escaping) would do so.
8. Suppose $\gamma = 0.5$ and the following sequence of rewards is received: $R_1 = -1$, $R_2 = 2$, $R_3 = 6$, $R_4 = 3$, and $R_5 = 2$, with $T = 5$. What are $G_0$, $G_1$, $\dots$, $G_5$? Hint: Work backwards.
Solution:
We have:
$$ \begin{align*} G_5 &= 0 \quad \text{(by convention, the return from the terminal time is zero)} \\ G_4 &= R_5 + \gamma G_5 = 2 \\ G_3 &= R_4 + \gamma G_4 = 3 + 0.5 \cdot 2 = 4 \\ G_2 &= R_3 + \gamma G_3 = 6 + 0.5 \cdot 4 = 8 \\ G_1 &= R_2 + \gamma G_2 = 2 + 0.5 \cdot 8 = 6 \\ G_0 &= R_1 + \gamma G_1 = -1 + 0.5 \cdot 6 = 2 \end{align*} $$
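The same backward recursion is easy to verify with a few lines of code; a minimal sketch, using the convention $G_T = 0$ and $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
# Work backwards from G_T = 0 using G_t = R_{t+1} + gamma * G_{t+1}.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]            # R_1, ..., R_5 (so T = 5)

G = [0.0] * (len(rewards) + 1)        # G[5] = G_T = 0 by convention
for t in reversed(range(len(rewards))):
    G[t] = rewards[t] + gamma * G[t + 1]
print(G)                              # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```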
9. Suppose $\gamma = 0.9$ and the reward sequence is $R_1 = 2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?
Solution:
$$ \begin{align*} G_1 &= R_2 + \gamma \cdot R_3 + \gamma^2 \cdot R_4 + \dots \\ &= \sum_{k=0}^{\infty} \gamma^k R_{k+2} \\ &= \sum_{k=0}^\infty (0.9)^k \cdot 7 \\ &= 7 \cdot \frac{1}{1-0.9} = 70 \end{align*} $$
$$G_0 = R_1 + \gamma G_1 = 2 + 0.9 \cdot 70 = 65$$
10. Prove the second equality in (3.10):
For $\gamma \in [0, 1)$,
$$ \sum_{k=0}^\infty \gamma^k = \frac{1}{1 - \gamma} $$
Solution:
Let $s_n = \sum_{k=0}^n \gamma^k$. Then we have:
$$ \begin{align*} \gamma s_n - s_n &= \sum_{k=1}^{n+1} \gamma^k - \sum_{k=0}^n \gamma^k = \gamma^{n+1} - 1 \\ \Rightarrow s_n (\gamma - 1) &= \gamma^{n+1} - 1 \\ \Rightarrow s_n &= \frac{\gamma^{n+1} - 1}{\gamma - 1} = \frac{1 - \gamma^{n+1}}{1 - \gamma} \end{align*} $$
Since $0 \le \gamma < 1$, $\gamma^{n+1} \to 0$ as $n \to \infty$, and so $s_n \to \frac{1}{1 - \gamma}$.
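A quick numerical illustration of the limit (the value chosen for $\gamma$ is arbitrary):

```python
# Partial sums s_n of the geometric series approach 1 / (1 - gamma) when 0 <= gamma < 1.
gamma = 0.9
limit = 1 / (1 - gamma)
s_n = sum(gamma ** k for k in range(200))
print(s_n, limit)   # both approximately 10.0
```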
11. If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?
Solution:
$$\mathbb{E}[R_{t+1} \mid S_t = s, \pi] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \cdot p(s', r \mid s, a)$$
12. Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.
Solution:
By the law of total expectation (Adam's law),
$$ \begin{align*} v_\pi(s) &= \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] \\ &= \mathbb{E}_{a \sim \pi} \left[ \mathbb{E}_{\pi} \left[ G_t \mid S_t = s, A_t = a \right] \right] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) q_\pi(s, a) \end{align*} $$
13. Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.
Solution:
$$ \begin{align*} q_\pi(s, a) &= \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] \\ &= \sum_{r} \sum_{s'} (r + \gamma v_\pi(s')) p(s', r \mid s, a) \end{align*} $$
14. The Bellman equation (3.14) must hold for each state for the value function $v_\pi$ shown in Figure 3.2. Show numerically that this equation holds for the center state, valued at $+0.7$, with respect to its four neighboring states, valued at $+2.3$, $+0.4$, $-0.4$, and $+0.7$. (These numbers are accurate only to one decimal place.)
Solution:
The Bellman Equation:
$$ \begin{align*} v_\pi(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) q_\pi(s, a) \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{r} \sum_{s'} (r + \gamma v_\pi(s')) p(s', r \mid s, a) \end{align*} $$
In this case, with the equiprobable random policy ($\pi(a \mid s) = 0.25$ for each action), all rewards equal to $0$, and $\gamma = 0.9$, we have:
$$ \begin{align*} v_\pi(s) &= \pi(\text{up} \mid s) \cdot \left( 0 + 0.9 \cdot 2.3 \right) + \pi(\text{down} \mid s) \cdot \left( 0 + 0.9 \cdot 0.4 \right) + \pi(\text{right} \mid s) \cdot \left( 0 + 0.9 \cdot (-0.4) \right) + \pi(\text{left} \mid s) \cdot \left( 0 + 0.9 \cdot 0.7 \right) \\ &= 0.25 \cdot 0.9 \cdot (2.3 + 0.4 - 0.4 + 0.7) \\ &= 0.675 \approx 0.7 \end{align*} $$
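The arithmetic can be confirmed directly; a minimal sketch, hard-coding the neighbor values from Figure 3.2:

```python
# Bellman check at the center state of the gridworld (Figure 3.2).
gamma = 0.9
neighbors = [2.3, 0.4, -0.4, 0.7]     # values of the four neighboring states
v_center = sum(0.25 * (0 + gamma * v) for v in neighbors)
print(round(v_center, 3))             # 0.675, which rounds to 0.7
```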
15. In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant $v_c$ to the values of all states, and thus does not affect the relative values of any states under any policies.
Solution:
Let $v'_\pi$ be the value function for the modified rewards, and $v_\pi$ be the value function for the original rewards. Similarly let $G'_t$ be the return for the modified rewards, and $G_t$ be the return for the original rewards. Then we have:
$$ \begin{align*} G'_t &= (R_{t+1} + c) + \gamma (R_{t+2} + c) + \gamma^2 (R_{t+3} + c) + \dots \\ &= \sum_{k=0}^\infty \gamma^k (R_{t+k+1} + c) \\ &= \sum_{k=0}^\infty \gamma^k R_{t+k+1} + c \sum_{k=0}^\infty \gamma^k \\ &= G_t + \frac{c}{1 - \gamma} \\ &= G_t + v_c \end{align*} $$
where $v_c = \frac{c}{1 - \gamma}$. By Def 3.12, we have
$$ \begin{align*} v'_\pi(s) &= \mathbb{E}_\pi \left[ G'_t \mid S_t = s \right] \\ &= \mathbb{E}_\pi \left[ G_t + v_c \mid S_t = s \right] \\ &= v_\pi(s) + v_c \end{align*} $$
Hence, in the continuing case, only the intervals between rewards matter, not their signs: shifting all rewards by a constant shifts every state's value by the same constant $v_c$ and leaves the relative values of states unchanged under every policy.
16. Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task? Why?
Solution:
It would have an effect. In an episodic task the added constant contributes $c \cdot (T - t)$ to the undiscounted return, which depends on the length of the episode, so longer or shorter episodes become differently attractive. In maze running, if $c$ is large enough to make all rewards positive, the robot is no longer incentivized to escape the maze and can simply wander around collecting reward.
17. What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$.
Solution:
$$ \begin{align*} q_\pi(s, a) &= \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] \tag{Def. 3.13} \\ &= \mathbb{E}_\pi \left[ R_{t + 1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \tag{Eqn. 3.9} \\ &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s' \right] \right] \\ &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s', A_{t+1} = a' \right] \right] \\ &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right] \tag{Def. 3.13} \end{align*} $$
18. The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action. Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
Solution:
The first equation is:
$$v_\pi(s) = \mathbb{E}_\pi \left[ q_\pi(s, A_t) \mid S_t = s \right]$$
The second equation is:
$$v_\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) q_\pi(s, a)$$
19. The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states. Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.
Solution:
The first equation is:
$$q_\pi(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]$$
The second equation is:
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
20. Draw or describe the optimal state-value function for the golf example.
Solution:
The optimal state-value function for the golf example is a function that assigns to each state the expected return that can be obtained from that state under the optimal policy. The optimal policy is to use the putter on the green, and the driver otherwise. Then, the expected return, and hence the value, is:
- -1 on the green
- -2 on the region surrounding the green such that the green is reachable with the driver
- -3 on the region surrounding the above region such that the above region is reachable with the driver
- etc.
21. Draw or describe the contours of the optimal action-value function for putting, $q_*(s, \text{putter})$, for the golf example.
Solution:
- -1 on the green (a single putt holes out).
- -2 off the green, on the region from which the green can be reached with a single putt.
- Farther out, the first stroke is a putt and play is optimal thereafter (i.e., mostly with the driver), so $q_*(s, \text{putter}) = -1 + v_*(s')$, where $s'$ is the best point a putt can reach: -3 on the region from which a putt can reach the $v_* = -2$ region (one drive from the green), -4 one drive farther out, and so on.
Note that the innermost off-green contour (the -2 region) is much tighter than that of $q_*(s, \text{driver})$, since a putt travels a much shorter distance than a drive; beyond it, the contours follow those of $v_*$, displaced outward by one putt length.
22. Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, $\texttt{left}$ and $\texttt{right}$. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\texttt{left}}$ and $\pi_{\texttt{right}}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
Solution:
The Bellman Optimality Equation for the state-value function at the center node is:
$$ \begin{align*} v_*(\text{center}) &= \max_{a \in \mathcal{A}(\text{center})} \sum_{s', r} p(s', r \mid \text{center}, a) \left[ r + \gamma v_*(s') \right] \\ &= \max \left\{ 1 + \gamma v_*(\text{left}), 0 + \gamma v_*(\text{right}) \right\} \\ \end{align*} \tag{*} $$
At the left and right nodes, there are no decisions to be made, so the Bellman Optimality Equation is simply:
$$ \begin{align*} v_*(\text{left}) &= \gamma v_*(\text{center}) \\ v_*(\text{right}) &= 2 + \gamma v_*(\text{center}) \end{align*} \tag{**} $$
Plugging the values from (**) into (*), we have:
$$ \begin{align*} v_*(\text{center}) &= \max \left\{ 1 + \gamma \gamma v_*(\text{center}), 0 + \gamma (2 + \gamma v_*(\text{center})) \right\} \\ &= \max \left\{ 1 + \gamma^2 v_*(\text{center}), 2\gamma + \gamma^2 v_*(\text{center}) \right\} \\ \end{align*} $$
Left is preferred when $1 + \gamma^2 v_*(\text{center}) \ge 2\gamma + \gamma^2 v_*(\text{center})$, i.e., when $1 \ge 2\gamma$. Thus the optimal policy is $\pi_{\text{left}}$ if $\gamma < 0.5$ (in particular for $\gamma = 0$), $\pi_{\text{right}}$ if $\gamma > 0.5$ (in particular for $\gamma = 0.9$), and at $\gamma = 0.5$ the two policies are equally good.
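The comparison can also be checked numerically. Fixing each action in (*) and (**) gives the closed forms $v_{\pi_{\text{left}}}(\text{center}) = 1/(1 - \gamma^2)$ and $v_{\pi_{\text{right}}}(\text{center}) = 2\gamma/(1 - \gamma^2)$; a small sketch evaluating them:

```python
# Evaluate the decision (center) state under both deterministic policies.
def values(gamma):
    v_left_policy = 1 / (1 - gamma ** 2)           # reward 1 every other step
    v_right_policy = 2 * gamma / (1 - gamma ** 2)  # reward 2, delayed by one step
    return v_left_policy, v_right_policy

for gamma in (0.0, 0.5, 0.9):
    left, right = values(gamma)
    if abs(left - right) < 1e-12:
        best = "either (indifferent)"
    else:
        best = "pi_left" if left > right else "pi_right"
    print(f"gamma={gamma}: v_left={left:.3f}, v_right={right:.3f} -> {best}")
```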
23. Give the Bellman equation for $q_*$ for the recycling robot.
Solution:
The Bellman Equation for $q_*$ is:
$$ q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right] $$
For the recycling robot, letting $h$ and $l$ be the high and low states, and $s$, $w$, and $r$ be the search, wait, and recharge actions, we have:
$$ \begin{align*} q_*(h, s) &= \alpha (r_{s} + \gamma \max_{a'} q_*(h, a')) + (1 - \alpha) (r_{s} + \gamma \max_{a'} q_*(l, a')) \\ q_*(h, w) &= 1 \cdot (r_{w} + \gamma \max_{a'} q_*(h, a')) \\ q_*(l, s) &= \beta (r_{s} + \gamma \max_{a'} q_*(l, a')) + (1 - \beta) (-3 + \gamma \max_{a'} q_*(h, a')) \\ q_*(l, w) &= 1 \cdot (r_{w} + \gamma \max_{a'} q_*(l, a')) \\ q_*(l, r) &= 1 \cdot (0 + \gamma \max_{a'} q_*(h, a')). \end{align*} $$
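These five coupled equations can be solved by iterating them to a fixed point (Q-value iteration); a minimal sketch, with arbitrary placeholder values for $\alpha$, $\beta$, $r_{\text{search}}$, $r_{\text{wait}}$, and $\gamma$:

```python
# Q-value iteration for the recycling robot's Bellman optimality equations.
alpha, beta, gamma = 0.8, 0.6, 0.9     # arbitrary example values
r_s, r_w = 4.0, 1.0                    # r_search and r_wait

q = {("h", "s"): 0.0, ("h", "w"): 0.0, ("l", "s"): 0.0, ("l", "w"): 0.0, ("l", "r"): 0.0}
for _ in range(1000):
    max_h = max(q[("h", "s")], q[("h", "w")])
    max_l = max(q[("l", "s")], q[("l", "w")], q[("l", "r")])
    q = {
        ("h", "s"): alpha * (r_s + gamma * max_h) + (1 - alpha) * (r_s + gamma * max_l),
        ("h", "w"): r_w + gamma * max_h,
        ("l", "s"): beta * (r_s + gamma * max_l) + (1 - beta) * (-3 + gamma * max_h),
        ("l", "w"): r_w + gamma * max_l,
        ("l", "r"): gamma * max_h,
    }
print({k: round(v, 3) for k, v in q.items()})
```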
24. Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
Solution:
In state $A$, following the optimal policy, the reward sequence is: $10, 0, 0, 0, 0, 10, 0, 0, 0, 0, 10, \dots$. With a discount factor of $\gamma = 0.9$,
$$ \begin{align*} v_*(A) &= \mathbb{E}_{\pi_*} \left[ G_t \mid S_t = A \right] \\ &= \sum_{k=0}^\infty \gamma^{5k} \cdot 10 \\ &= 10 / (1 - 0.9^5) \\ &\approx 24.419 \end{align*} $$
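Checking the closed form numerically:

```python
# v_*(A) = 10 / (1 - gamma^5) for the gridworld, with gamma = 0.9.
gamma = 0.9
print(round(10 / (1 - gamma ** 5), 3))   # 24.419
```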
25. Give an equation for $v_*$ in terms of $q_*$.
Solution:
$$v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a)$$
26. Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.
Solution:
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$
27. Give an equation for $\pi_*$ in terms of $q_*$.
Solution:
$$\pi_*(s) = \arg \max_{a \in \mathcal{A}(s)} q_*(s, a)$$
28. Give an equation for $\pi_*$ in terms of $v_*$.
Solution:
$$\pi_*(s) = \arg \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$
29. Rewrite the four Bellman equations for the four value functions ($v_*$, $q_*$, $v_\pi$, and $q_\pi$) in terms of the three argument function $p$ (3.4) and the two-argument function $r$ (3.5).
Solution:
$$ \begin{align*} v_\pi(s) &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right] \\ &= \sum_{a \in \mathcal{A}(s)} \sum_{r} \sum_{s'} \pi(a \mid s) p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_{r} r \sum_{s'} p(s', r \mid s, a) + \gamma \sum_{s'}v_\pi(s') \sum_r p(s', r \mid s, a)\right] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ r(s,a) + \gamma \sum_{s'} v_\pi(s') p(s' \mid s, a) \right] \\ \end{align*} $$
$$ \begin{align*} v_*(s) = \max_{a \in \mathcal{A}(s)} \left[ r(s,a) + \gamma \sum_{s'} v_*(s') p(s' \mid s, a) \right] \end{align*} $$
$$ \begin{align*} q_\pi(s, a) &= \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right] \\ &= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \left[ \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right] \end{align*} $$
$$ q_*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} q_*(s', a') $$
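The reduced functions used above are marginals of the four-argument $p$: $r(s, a) = \sum_r r \sum_{s'} p(s', r \mid s, a)$ and $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$. A small sketch of this reduction, reusing the $(\text{high}, \text{search})$ row of the recycling-robot dynamics from Exercise 4 with arbitrary placeholder numbers:

```python
# Reduce the four-argument p(s', r | s, a) to r(s, a) and p(s' | s, a).
alpha, r_search = 0.75, 4.0   # arbitrary placeholder values
p4 = {
    # p4[(s, a)] is a list of (s', r, probability) triples
    ("high", "search"): [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
}

def r(s, a):
    """Two-argument expected reward: r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(prob * rew for _, rew, prob in p4[(s, a)])

def p3(s_next, s, a):
    """Three-argument transition probability: p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for sp, _, prob in p4[(s, a)] if sp == s_next)

print(r("high", "search"))            # 4.0  (= r_search)
print(p3("low", "high", "search"))    # 0.25 (= 1 - alpha)
```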