From Sutton and Barto, p. 81:

*Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited 10 dollars by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of 2 dollars per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables, meaning that the probability that the number is n is $\frac{\lambda^n}{n!}e^{-\lambda}$, where $\lambda$ is the expected number. Suppose $\lambda$ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be $\gamma = 0.9$ and consider the problem of finding the optimal policy for maximizing the expected total reward.*

The state is the number of cars at each location. The action is the number of cars to move from one location to the other. The reward is the profit from renting cars minus the cost of moving cars.

We identify the optimal policy using the policy iteration algorithm. We shall begin by reviewing the algorithm.

Letting $\pi(a \mid s)$ denote the probability of taking action $a$ in state $s$ under policy $\pi$, the state-value function $v_{\pi}(s)$ is the expected return starting from state $s$, and then following policy $\pi$. That is,

\[v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s],\]where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$ is the return, $R_k$ is the reward at time $k$, and $\gamma$ is the discount factor. Discounting is used to ensure that the return is finite, and to give more weight to immediate rewards. However, the particular value of $\gamma$ is arbitrary and must be specified in advance.

Letting $p(s', r \mid s, a)$ denote the probability of transitioning to state $s'$ and receiving reward $r$ given that we are in state $s$ and take action $a$, and letting $\gamma$ denote the discount factor, the Bellman Equation states:

\[v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a)[r + \gamma v_{\pi}(s')]\]which expresses the value of any state $s$ in terms of the expected value of the next state $s'$ and the expected reward $r$.

Provided a policy $\pi$, we can convert this into an iterative algorithm to find the corresponding value function, $v_{\pi}$. We start with an initial guess (e.g., $v_{\pi}^0 = \mathbf{0}$), and then repeatedly apply the Bellman Equation to each state until the value function converges. Explicitly, starting from $n=0$ we compute

\[v_{\pi}^{n+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a)[r + \gamma v_{\pi}^n(s')]\]until $\max_s \lvert v_{\pi}^{n+1}(s) - v_{\pi}^n(s) \rvert < \theta$ for some small $\theta$. This is called **iterative policy evaluation**, as it evaluates the value of the policy at each state.
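As a rough sketch, iterative policy evaluation might look like the following in Python (the two-state MDP here is made up purely for illustration; it is not the car rental problem):

```python
# Iterative policy evaluation on a tiny made-up MDP.
# p[s][a] is a list of (prob, next_state, reward) triples.
p = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
pi = {0: {"go": 1.0}, 1: {"stay": 1.0}}   # pi[s][a] = probability of taking a in s
gamma, theta = 0.9, 1e-8

v = {0: 0.0, 1: 0.0}                       # initial guess: v = 0 everywhere
while True:
    delta = 0.0
    for s in p:
        new = sum(pi[s].get(a, 0.0) * prob * (r + gamma * v[s2])
                  for a in p[s] for prob, s2, r in p[s][a])
        delta = max(delta, abs(new - v[s]))
        v[s] = new
    if delta < theta:
        break
# v now approximates v_pi; here v[1] -> 20 and v[0] -> 19.
```

Note this sweeps in place (each update immediately uses the freshest values) rather than keeping separate $v^n$ and $v^{n+1}$ arrays; both variants converge to $v_{\pi}$.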

Once we have the value function $v_{\pi}$, we can use it to improve the policy by acting greedily with respect to it. Explicitly, we can define a new policy $\pi'$ such that $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)[r + \gamma v_{\pi}(s')]$. This is called **policy improvement**. This new policy is guaranteed to be as good as or better than the old policy (with respect to the value function $v_{\pi}$).

We can then iterate between policy evaluation and policy improvement until the policy converges. This is called **policy iteration**. The pseudocode is as follows:

- Initialize $\pi$ arbitrarily
- Repeat until the policy converges:
    - Policy Evaluation:
        - Initialize $v_{\pi}^0 = \mathbf{0}$ (or some other initial guess)
        - Repeat until $\lvert v_{\pi}^{n+1}(s) - v_{\pi}^n(s) \rvert < \theta$ for all $s$:
            - For each $s$:
                - $v_{\pi}^{n+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)[r + \gamma v_{\pi}^n(s')]$
    - Policy Improvement:
        - For each $s$:
            - $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)[r + \gamma v_{\pi}(s')]$
        - If $\pi' = \pi$, stop: the policy has converged.
        - Otherwise, set $\pi = \pi'$
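Putting the two pieces together, here is a compact Python rendering of the loop (again on a made-up two-state MDP, purely for illustration):

```python
# Policy iteration on a small made-up MDP.
# p[s][a] is a list of (prob, next_state, reward) triples.
p = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma, theta = 0.9, 1e-8

def evaluate(pi):
    """Iterative policy evaluation for a deterministic policy pi[s] -> action."""
    v = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s in p:
            new = sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[s][pi[s]])
            delta, v[s] = max(delta, abs(new - v[s])), new
        if delta < theta:
            return v

def improve(v):
    """Act greedily with respect to v."""
    return {s: max(p[s], key=lambda a: sum(prob * (r + gamma * v[s2])
                                           for prob, s2, r in p[s][a]))
            for s in p}

pi = {s: "stay" for s in p}      # arbitrary initial policy
while True:
    v = evaluate(pi)
    pi_new = improve(v)
    if pi_new == pi:             # policy stable: optimal
        break
    pi = pi_new
```

On this toy problem the loop stabilizes after a single improvement step, with state 0 choosing "go" and state 1 choosing "stay".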

Starting from a policy that never moves cars ($\pi = \mathbf{0}$), the convergence of the first round of policy evaluation looks like the following:

Naturally, it’s better to have more cars available. After policy improvement, we get the new policy:

We then repeat the process, until we get the optimal policy:

This policy passes a sanity check: it is better to move cars from location 1 to location 2 when there are many more cars at location 1 than location 2, and vice versa. Furthermore, locations 1 and 2 are natural “sink” and “source” locations, respectively, as the expected number of cars requested at location 1 is lower than the expected number of cars requested at location 2, and the expected number of cars returned at location 1 is higher than the expected number of cars returned at location 2. Thus, we are more likely to have a surplus of cars at location 1, and a deficit of cars at location 2, and the policy reflects this.

The code is available in this notebook.

- This is a “model-based” approach, as we need to know the transition probabilities $p(s', r \mid s, a)$ to solve the problem. These values may be unknown or too numerous to compute in practice. Indeed, even in this toy problem, the most expensive step was constructing this table. For this reason, much of the book is devoted to “model-free” approaches, which do not require knowledge of the transition probabilities.
- Silent bugs are easy to introduce in this problem.
- There are subtle ambiguities in the problem statement. For example, while it is clear that the number of cars at a location is clipped to 20 at the end of a full move–rent–return cycle, it is not clear whether it is also clipped after moving but before renting and returning. Indeed, in the second plot on p. 81 of Sutton and Barto, the policy says to move one car from location 1 to location 2 when there are 20 cars at both locations, which would mean 21 cars would be at location 2 after the move.
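To give a feel for how the transition table gets assembled, here is one way to tabulate a capped Poisson demand distribution. The function name and the choice to fold the tail mass into the last bucket (mirroring the 20-car cap) are my own; this is a sketch, not the notebook's exact code.

```python
from math import exp, factorial

def capped_poisson(lam, cap):
    """pmf over 0..cap, with all tail mass folded into the last bucket,
    i.e. pmf[cap] = P(N >= cap) for N ~ Poisson(lam)."""
    pmf = [lam**n / factorial(n) * exp(-lam) for n in range(cap)]
    pmf.append(1.0 - sum(pmf))   # tail mass
    return pmf

requests_loc1 = capped_poisson(3, 20)   # rental requests at location 1, lambda = 3
returns_loc2 = capped_poisson(2, 20)    # returns at location 2, lambda = 2
```

Distributions like these are then combined over all (request, return) pairs to build the full table $p(s', r \mid s, a)$, which is the expensive step mentioned above.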

The aim is to create an application that takes in some user-entered text, and produces a prediction for what the next word is going to be. For instance, the user could type in “Friends, Romans, countrymen, lend me your”, and the application might suggest “ears” (or, “hand” or “car” or whatever).

The approach that text prediction models use is to imagine a probability distribution over all possible words, given a sequence of input words. For example, if the input words are “lend me your”, then perhaps the probability of “ears” is 90%, and the probability of “cat” is 10%, and the probability of all other words is 0% (the sum of the probabilities over all the words has to be 100% for it to be a probability distribution).

Let’s put this in math terms. Let $w_i$ denote the next word in the sequence (that is, the one we wish to predict), let $w_{i-1}$ be the previous word, $w_{i-2}$ the word before that, and so on ($w_0$ would be the word at the beginning of the user-entered text). So, in our example, $w_{i-1}$ is “your”, $w_{i-2}$ is “me”, and $w_{i-3}$ is “lend”. Then, we are searching for the word $w_i$ that maximizes the conditional probability:

\[P(w_i \mid w_0w_1\dots w_{i-1})\]So, “all” we need to do is calculate this conditional probability for every word ever. Then, we select the word with the greatest conditional probability as our prediction.

In most text prediction models, we start with some large source of text data called a corpus. For instance, we can use the complete works of Shakespeare. We then use the text in that corpus to develop our model. Our algorithm will, hopefully, predict words in such a way that the resulting text sounds “Shakespearean”, even if the specific text itself never appears in Shakespeare.

How can we use a corpus to determine the probability of a word, given a sequence of previous words? Consider calculating the probability of “head” given the input “off with his”. A natural way to do this is to simply count the number of times the phrase “off with his head” appears in the corpus, and divide that by the number of times “off with his” appears, i.e.:

\[P(\text{head} \mid \text{off with his}) \approx \frac{\text{Count}(\text{off with his head})}{\text{Count}(\text{off with his})}\]Apparently, “off with his” appears four times in Shakespeare:

```
## [1] " QUEEN MARGARET. Off with his head, and set it on York gates;"
## [2] " For Somerset, off with his guilty head."
## [3] " Off with his head! Now by Saint Paul I swear"
## [4] " KING RICHARD. Off with his son George's head!"
```

In two of them, the next word is “head”, but in one it’s “guilty” and in another it’s “son” (they’re all getting at the same idea, of course). We can then approximate the probability distribution of the words following “off with his” by saying there’s a 50% chance it’ll be “head”, a 25% chance of “guilty”, a 25% chance of “son”, and a 0% chance of every other word ever to exist, i.e.,

\[P(w_i \mid \text{off with his}) = \begin{cases} 0.5 & w_i = \text{head} \\ 0.25 & w_i = \text{guilty} \\ 0.25 & w_i = \text{son} \\ 0 & \text{otherwise} \end{cases}\]To approximate the probabilities in this way is to use the Maximum Likelihood Estimate: we choose the probability distribution that best matches the observed data. We can already see an issue with this approach: our model completely rules out words it’s never seen after “off with his”. However, it’s a pretty good starting point; this kind of model will predict “head” when a user types “off with his”.
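The count-and-divide estimate is easy to express in code. A sketch, with a tiny made-up corpus standing in for the full Shakespeare text:

```python
from collections import Counter

def next_word_mle(tokens, context):
    """MLE distribution over the word following `context` (a tuple of words):
    count each follower of the context and divide by the context's count."""
    n = len(context)
    followers = Counter(tokens[i + n] for i in range(len(tokens) - n)
                        if tuple(tokens[i:i + n]) == context)
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

# Toy corpus: "off with his" appears three times, followed by head/guilty/head.
corpus = "off with his head off with his guilty head off with his head".split()
dist = next_word_mle(corpus, ("off", "with", "his"))
# dist == {"head": 2/3, "guilty": 1/3}
```

Any word never seen after the context simply doesn’t appear in the returned dictionary, which is exactly the zero-probability issue described above.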

There’s another issue here. For many input phrases, there won’t be any instances from which to estimate probabilities. For instance, the phrase “i can’t afford a hundred thousand” never shows up! We can’t assign probabilities to any word, let alone make a prediction for the next word. However, the phrase “a hundred thousand” shows up nine times:

```
## [1] " MENENIUS. A hundred thousand welcomes. I could weep"
## [2] " And I will die a hundred thousand deaths"
## [3] " King. A hundred thousand rebels die in this!"
## [4] " Shall break into a hundred thousand flaws"
## [5] " The payment of a hundred thousand crowns;"
## [6] " A hundred thousand more, in surety of the which,"
## [7] " A hundred thousand crowns; and not demands,"
## [8] " On payment of a hundred thousand crowns,"
## [9] " As it should pierce a hundred thousand hearts;"
```

and, in three of the nine instances, the next word is “crowns”. For each of the other six instances, the resulting four-word phrase appears only that once in the corpus. So, it might be reasonable to predict “crowns” as the next word.

We can observe the trade-off here. The more words in the input, the less likely it is that we’ll possess reliable data with which to calculate our predictions. Meanwhile, if we have fewer words in the input, there are more instances to work with, but they may not match our meaning as well (it doesn’t seem likely someone would say they can’t afford a hundred thousand welcomes, unless they’re explaining why they can’t be a hotel clerk).

When we use the probabilities generated by only looking at the most recent words, we’re making the following assumption:

\(P(w_i \mid w_0w_1\dots w_{i-1}) = P(w_i \mid w_{i-n+1}\dots w_{i-1})\) where $n$ is some positive integer. That is, we assume the probability of word $w_i$, given the input text, is only dependent on the most recent $n$ words. This is called a Markov assumption (an assumption of “memorylessness”); our probabilities completely ignore the earlier words in the text. For instance, if $n=2$, then we are saying the probability of observing $w_i$ only depends on the previous word, $w_{i-1}$, and when $n=3$, it only depends on the two most recent words: $w_{i-1}$ and $w_{i-2}$. We call these models “n-gram” models. When using $n=2$, we call it a bigram model, and when $n=3$, a trigram model.
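The Markov assumption amounts to throwing away everything but the most recent $n-1$ words before looking anything up. As a one-line sketch (the function name is mine):

```python
def ngram_context(words, n):
    """Keep only the most recent n-1 words, per the Markov assumption
    of an n-gram model."""
    return tuple(words[-(n - 1):])

assert ngram_context(["lend", "me", "your"], 3) == ("me", "your")  # trigram
assert ngram_context(["lend", "me", "your"], 2) == ("your",)       # bigram
```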

Suppose we’re using a trigram model. That is, our prediction for the next word is going to be word $w_i$ that maximizes: \(P(w_i \mid w_{i-2}w_{i-1})\) We can estimate these probabilities by just looking at the counts:

\[P(w_i \mid w_{i-2}w_{i-1}) = \frac{\text{Count}(w_{i-2}w_{i-1}w_i)}{\text{Count}(w_{i-2}w_{i-1})}\]For instance, to predict the next word after the phrase “have to”, we could look at all the sequences of three words (trigrams) starting with “have to”, count the instances of each, and divide each by the number of instances of the bigram “have to”. We can then produce the following plot of the five most probable words (as estimated by the trigram model):

and observe “do” is the most likely candidate. We can use a similar process for larger $n$-grams. As discussed above, we need to keep in mind the trade-off between supplied information and data availability.

Suppose we have a trigram model, and we want to find the probability of the next word after “have to” being “run”. The phrase “have to run” never shows up, so our trigram model gives it a probability of zero. However, this doesn’t seem reasonable. Indeed, the phrase “to run” appears 14 times, so, given the verb is used by Shakespeare’s characters, it seems possible that a new Shakespearean character would want to describe their obligation to run.

Suppose we have a trigram model and a bigram model. If we had an enormous corpus, the former would almost certainly be a better predictor, since it has more context. However, with a limited corpus size, the trigram model has insufficient data to create a prediction (i.e., when the trigram never appears in the corpus). Meanwhile, the bigram model will have much more data, and it will often be the case that though a trigram doesn’t appear, the bigram consisting of the latter two words does. This motivates the following scheme: use the trigram probability if we can, and if not, use the bigram probability.

This is known as a “back-off” model, where the “backing off” is us retreating from a trigram model to a less ambitious bigram model. Suppose we are using the back-off model, and the input phrase is “get thee”. Looking at the counts of all the words after “get thee”:

we see that “gone” appears 26 times, “to” 10 times, and so on (the word “<s>” refers to the end of the sentence). In total, we have 16 unique choices that appear in Shakespeare. The trigram model would produce a probability distribution by simply rescaling the bars so that their heights sum to one, and creating a new bar of height 0 for all other words. However, we want to leave room for the possibility of a word besides the 16 possibilities. That is, we want to shift some of the probability mass in the above plot to “ALL OTHER WORDS”. This means we need to determine how much probability we want to give “ALL OTHER WORDS”, and how much we should take from the other possibilities. What we are doing here is “smoothing” the distribution. There are a number of schemes for smoothing. We are going to use a method called Good-Turing smoothing (yes, that guy!).

Here’s the idea. We observe there are 70 instances of ‘get thee’ in Shakespeare. In 26 of them the next word is ‘gone’; in ten of them, it’s ‘to’. For seven words (‘apart’, ‘before’, ‘further’, ‘glass’, ‘home’, ‘with’ and ‘wood’), the instance of “get thee” + the word is the only such instance in the whole corpus. If one were reading Shakespeare’s works start to finish (a weekend well spent), then of the 70 times they read ‘get thee’, there would be 7 instances where the following word made the phrase unique in the corpus. Thus, we might estimate the probability of seeing a new word to be 7/70. Indeed, in Good-Turing smoothing, that is the probability that we assign to “ALL OTHER WORDS”. However, we need to adjust the probabilities for the other possibilities, since they all need to sum to 1.

How do we do this? Consider the word ‘apart’, which appears once in our reading after ‘get thee’. What is the probability of seeing it a second time (e.g., our user entering ‘get thee apart’)? In our reading, there were two words such that “get thee” + the word appeared twice, namely, “back” and “is”. So, if you flip to a random page and see the phrase “get thee”, the probability that the next word is a word that appears twice in the corpus after “get thee” is $2 \cdot \frac{2}{70} = \frac{4}{70}$. Thus, we estimate the probability of the user entering “apart”, “before”, “further”, “glass”, “home”, “with”, or “wood” to be 4/70. We shall give each of these possibilities equal weight, meaning the probability of “apart” is one seventh of $\frac{4}{70}$, i.e., $\frac{2}{245}$ (this is also the probability for the other six).

How about for the word “back”? It appears twice after “get thee” in our reading, and we want to estimate the probability of seeing it a third time. In the corpus, there are 2 words that appear 3 times after “get thee”, namely, “away” and “in”. Thus, given that you see the phrase “get thee”, the probability the next word is a word that appears 3 times after “get thee” is $2 \cdot \frac{3}{70} = \frac{6}{70}$. So, we estimate the probability of a user entering “back” or “is” to be $\frac{6}{70}$. Once again, we give both equal weight, and so we set the probability of seeing “back” after “get thee” as one half of $\frac{6}{70}$, i.e., $\frac{3}{70}$.

What’s the general formula here? When we estimated the probability of seeing “back” (a word that follows “get thee” twice), we looked at the number of words that follow “get thee” 3 times (that value is 2), and multiplied by 3. This number gives the number of times, when reading the works of Shakespeare over the weekend, that you read a word that follows “get thee” which appears 3 times after “get thee” in the corpus. Dividing by 70 (the number of times “get thee” shows up at all), we get the probability of such a thing happening. Finally, we divide this value by the number of words that appear twice after “get thee” in the corpus to give equal weight to both possibilities. In our case, we divided it by 2.

To get the formula, we introduce a little notation. Let $N_c$ be the number of words that show up $c$ times after “get thee” in the corpus, and let $N$ be the number of times “get thee” shows up in the corpus. So, we have $N_1=7$, $N_2=2$, $N_3=2$, $N_{26}=1$, etc. Our calculation was then: \(P(\text{back} \mid \text{get thee}) = \frac{3N_{3}}{N N_2}\) and, in general, the estimated probability of a user typing a word that shows up $c$ times after “get thee” (call this value $p_c$) is: \(p_c = \frac{(c+1) N_{c+1}}{N N_c}\) This works well when $c$ is small. What about when $c=26$? There are no words that show up 27 times after “get thee”, so $N_{27}=0$, meaning $p_{26}=0$: we would give a probability of $0$ to seeing “gone”, which is clearly undesirable. Indeed, this will be true for any $c$ such that no word follows “get thee” $c+1$ times.
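In code, the unsmoothed Good-Turing estimate is a one-liner, using the frequencies of frequencies from the example:

```python
def gt_prob(c, freq_of_freq, N):
    """p_c = (c+1) * N_{c+1} / (N * N_c): the unsmoothed Good-Turing
    probability of one particular word seen c times in this context."""
    return (c + 1) * freq_of_freq.get(c + 1, 0) / (N * freq_of_freq[c])

# From the "get thee" example: N_1 = 7, N_2 = 2, N_3 = 2, N_26 = 1; N = 70.
ff = {1: 7, 2: 2, 3: 2, 26: 1}
p_apart = gt_prob(1, ff, 70)   # 2*2/(70*7) = 2/245
p_back = gt_prob(2, ff, 70)    # 3*2/(70*2) = 3/70
p_gone = gt_prob(26, ff, 70)   # 0, since N_27 = 0: exactly the problem noted above
```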

To address this, we can use approximate values for $N_c$. There are a number of ways to do this. We shall use the approach of Church and Gale (1991). Consider the frequency plot for our “get thee” example: We begin by smoothing these bars. For each $c$ such that $N_c > 0$, we identify the highest $b$ such that $b < c$ and $N_b > 0$, and the smallest $d$ such that $c < d$ and $N_d > 0$. For instance, if $c=10$, then $b=6$, and $d=26$. Then, we define the approximate frequency: \(Z_c = \frac{N_c}{0.5(d-b)}.\)

If $c=1$, we just set $Z_c = N_c$. If $c$ is the maximum, we let $d$ be such that $d - c = c - b$. The idea of this approximation is the following. Partition the $c$ axis, with splits at the midpoints between those $c$ for which $N_c > 0$. In our case, these splits happen at 1.5, 2.5, 4, 5.5, 8, and 18. For each interval, we assume the count inside it is actually spread evenly across the interval; $Z_c$ is the height of the resulting rectangle. We get the resulting plot for $Z_c$ vs $c$:

Then, a curve is fitted to these points by assuming a power-law relationship between $Z_c$ and $c$, that is, $Z_c = Ac^B$ for some constants $A$ and $B$. This can be done by performing linear regression on the equation $\log(Z_c) = B\log(c) + \log(A)$. Using the power law, we can then calculate the smoothed value of $Z_c$ for all $c$:

Now we have frequencies we can use to estimate our word probabilities. Let $c(w)$ be the number of times word $w$ shows up after “get thee”. If $c(w) > 0$, the smoothed “probability” is: \(p_s(w) = \frac{(c(w)+1)Z_{c(w)+1}}{Z_{c(w)} N},\) and if $c(w)=0$, the value is $\frac{Z_1}{N}$. However, these aren’t true probabilities, since they don’t sum to one (this happened because we approximated $N_c$ with $Z_c$). So, we normalize: letting $V$ be our list of possible next words, including the catch-all bucket “ALL OTHER WORDS”, we have: \(P_{GT}(w \mid \text{get thee}) = \frac{p_s(w)}{\sum_{w' \in V} p_s(w')}\) The Good-Turing distribution then looks like the following:

So, it looks like we leave around a 10% chance that the next word isn’t among the 16 seen in the corpus, since around 10% of the time, a word is new!
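Here is a sketch of the Church and Gale frequency smoothing. The counts $N_5$, $N_6$, and $N_{10}$ are not stated explicitly above; I assume they are 1 each, which is consistent with the quoted split points (1.5, 2.5, 4, 5.5, 8, 18).

```python
from math import log, exp

# Frequencies of frequencies for "get thee": c -> N_c (N_5, N_6, N_10 assumed).
N_counts = {1: 7, 2: 2, 3: 2, 5: 1, 6: 1, 10: 1, 26: 1}

# Z_c = N_c / (0.5 * (d - b)), spreading each count over its interval.
cs = sorted(N_counts)
Z = {}
for i, c in enumerate(cs):
    if i == 0:
        Z[c] = N_counts[c]                               # Z_1 = N_1, as in the text
        continue
    b = cs[i - 1]
    d = cs[i + 1] if i + 1 < len(cs) else 2 * c - b      # mirror the gap at the top
    Z[c] = N_counts[c] / (0.5 * (d - b))

# Fit the power law Z_c = A * c^B by least squares on log Z = B log c + log A.
xs, ys = [log(c) for c in cs], [log(Z[c]) for c in cs]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
B = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
A = exp(ybar - B * xbar)

def Z_smooth(c):
    """Smoothed frequency-of-frequency, defined for every c >= 1."""
    return A * c ** B
```

Unlike the raw $N_c$, the fitted `Z_smooth(c)` is positive for every $c$, so $p_{26}$ no longer collapses to zero.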

What do we do from here? We now back-off to the bigram model. We shall look at all bigrams starting with “thee” such that the second word is not among the 16 words already assigned probabilities. The remaining 10% is then proportionally split among the new options. From the bigram model, we can also back-off to a unigram model in a similar way.
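The control flow of the back-off itself is simple. A sketch, with hypothetical count tables (the names and numbers here are made up for illustration):

```python
def backoff_predict(trigrams, bigrams, unigrams, w1, w2):
    """Back off from trigram to bigram to unigram counts and return the
    most frequent follower. The count dicts are assumed precomputed."""
    if trigrams.get((w1, w2)):
        counts = trigrams[(w1, w2)]
    elif bigrams.get((w2,)):
        counts = bigrams[(w2,)]
    else:
        counts = unigrams
    return max(counts, key=counts.get)

# Hypothetical counts:
trigrams = {("have", "to"): {"do": 5, "say": 2}}
bigrams = {("to",): {"run": 14, "be": 30}}
unigrams = {"the": 100}
assert backoff_predict(trigrams, bigrams, unigrams, "have", "to") == "do"
assert backoff_predict(trigrams, bigrams, unigrams, "need", "to") == "be"
```

In the full scheme described above, the back-off distributions are also reweighted so that the lower-order model only fills the probability mass that Good-Turing smoothing reserved for unseen words.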

Long post, but I hope this gives a good feel for how Good-Turing smoothing works for those searching!

where $\mathbf{x}_i$ is the vector of population levels for each league in week $i$ and $\mathbf{A}$ is a tridiagonal, irreducible, left-stochastic matrix representing the transitions. Starting from a distribution where everyone is in the bottom three leagues, and assuming no one enters or leaves, we observe the existing transition rules push populations towards the higher leagues:

We can calculate this steady state distribution using linear algebra. In particular, we identify the principal eigenvector of $\mathbf{A}$, which will be associated with an eigenvalue of 1.
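In code, with a made-up 3-league matrix standing in for the full league ladder:

```python
import numpy as np

# A made-up tridiagonal, column-stochastic transition matrix:
# column j gives the distribution over next-week leagues for league j.
A = np.array([
    [0.5, 0.2, 0.0],
    [0.5, 0.6, 0.3],
    [0.0, 0.2, 0.7],
])
assert np.allclose(A.sum(axis=0), 1.0)   # left-stochastic: columns sum to 1

# The steady state is the eigenvector of A with eigenvalue 1.
vals, vecs = np.linalg.eig(A)
k = np.argmin(np.abs(vals - 1.0))
v = np.real(vecs[:, k])
v = v / v.sum()                          # normalize to a probability distribution
assert np.allclose(A @ v, v)
```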

A question I have: Can one construct a tridiagonal, irreducible, left-stochastic matrix $\mathbf{A}$, given a specified principal eigenvector? This would be of use to Duolingo, if say, they wanted most users to be *near* the top (i.e., in the pearl and obsidian leagues).

One idea is to cast this as a constraint satisfaction problem. In particular, we can create the following formulation, where the decision variables, $a_{ij}$, are the entries of the matrix $\mathbf{A}$, $\mathbf{v}$ is the (given) dominant eigenvector, and $\epsilon > 0$ is a given minimum transition proportion:

\[\begin{align} \sum_{j=1}^n a_{ij}v_j &= v_i \quad &\forall i \in \{1,\dots,n\} \\ \sum_{i=1}^n a_{ij} &= 1 \quad &\forall j \in \{1,\dots,n\} \\ a_{ij} &= 0 \quad &\forall i,j \in \{1,\dots,n\}, |i-j| > 1 \\ a_{ij} &\ge \epsilon \quad &\forall i,j \in \{1, \dots, n\}, |i-j| \le 1 \\ 0 \le a_{ij} &\le 1 \quad &\forall i,j \in \{1, \dots, n\} \end{align}\]Here, constraints (1) assert that $\mathbf{v}$ is an eigenvector of $\mathbf{A}$, with eigenvalue 1. Constraints (2) and (5) ensure that $\mathbf{A}$ is left-stochastic (by enforcing column sums to equal 1, and entries to be between 0 and 1). Constraints (3) limit our search to tridiagonal matrices. Constraints (4) enforce the matrix to be irreducible, by guaranteeing all elements on the tridiagonal are at least $\epsilon > 0$.

Constraints (4) are a little bit of a hack. You don’t need all elements on the tridiagonal to be strictly positive in order to guarantee irreducibility, and so we are cutting off feasible solutions. However, I haven’t yet found a way to express the irreducibility with a linear constraint.
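Here is a sketch of the same feasibility problem using `scipy.optimize.linprog` in place of CPLEX (the helper name is mine; minimizing a zero objective makes it a pure feasibility check):

```python
import numpy as np
from scipy.optimize import linprog

def find_transition_matrix(v, eps=0.01):
    """Feasibility LP: find a tridiagonal, column-stochastic A with A @ v = v
    and all tridiagonal entries >= eps. Returns A, or None if infeasible."""
    n = len(v)
    # Only the tridiagonal entries are decision variables (3n - 2 of them).
    idx = [(i, j) for i in range(n) for j in range(n) if abs(i - j) <= 1]
    m = len(idx)
    A_eq, b_eq = [], []
    for i in range(n):               # eigenvector constraints: sum_j a_ij v_j = v_i
        A_eq.append([v[j] if ii == i else 0.0 for (ii, j) in idx])
        b_eq.append(v[i])
    for j in range(n):               # column-sum constraints: sum_i a_ij = 1
        A_eq.append([1.0 if jj == j else 0.0 for (ii, jj) in idx])
        b_eq.append(1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(eps, 1.0)] * m)
    if not res.success:
        return None
    A = np.zeros((n, n))             # off-tridiagonal entries stay 0
    for k, (i, j) in enumerate(idx):
        A[i, j] = res.x[k]
    return A
```

For example, `find_transition_matrix([0.2, 0.3, 0.5])` returns a 3x3 matrix fixing that target as the steady state.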

With $\epsilon$ set to 0.01, for a given $\mathbf{v}$, CPLEX was able to find a solution. I set a target distribution of 5% for each of the bottom five leagues, 10% for emerald and amethyst, 20% for pearl, 30% for obsidian, and 5% for diamond. The resulting transition matrix then drives a random initial distribution towards the target. The convergence, however, is rather slow (it takes 3500 weeks to get there):
