A slot machine works on a variable, or random, schedule of reinforcement. The gambler never knows when a pull of the handle will be rewarded, but it can happen at any time. The reinforcement varies both in the amount of money given and in how frequently it is delivered.
Chapter 2 — Multi-armed Bandits
In the previous chapter we saw the framework in which the agent interacts with the environment by taking actions, and the environment sends back states and rewards. In this chapter we’ll look at the different ways in which the environment can respond to the agent’s actions.
In particular, this chapter discusses the evaluative aspect of RL in a simplified setting, one that involves learning to act in just one situation (e.g. N-armed bandits and contextual bandits).
We’ll also discuss the exploration-exploitation tradeoff, and how we can learn value functions through:
– Greedy action selection
– ε-greedy action selection
– UCB (Upper Confidence Bound)
– Gradient bandits
The N-Armed Bandit Problem
In this problem, imagine we have “N” slot machines as pictured above (just N instead of 3) in a certain casino. Our goal is to get the maximum reward playing in that casino over some period of time (say, 30 actions). Since we don’t know the probability of success of each slot machine, we simply try playing each one and see what reward it gives. Based on the rewards we get from each machine, we adjust our estimate of that slot machine’s expected value.
So how does that relate to our RL problem?
The agent sends an action (choosing which slot machine to play ).
The environment sends back a reward.
In the case of the N-armed bandit the state never changes: the environment sends the agent the same state we started with after every action.
Stationary vs non-stationary rewards
An interesting thing to note is that we assumed the reward distribution of each slot machine is stationary.
A stationary reward distribution means that playing machine A will always have the same expected reward.
A non-stationary reward distribution means that the reward distribution changes over time. You can see this when you play against another player, say, in chess. Even if you repeat the same moves, the other player may catch on and stop responding the same way (they will change their strategy). Therefore the reward distribution of making that move is non-stationary.
Example:
Imagine that the 3 slot machines above are called A, B, and C, and that we played each one 10 times.
We found that:
A gives 10 coins after 10 games.
B gives 5 coins after 10 games.
C gives 1 coin after 10 games.
So how do we find the best slot machine?
A simple way of going about it is to keep an estimate of each machine’s value: the sum of the rewards received when playing that machine, divided by the number of times you played that machine:

Q_t(a) = (sum of rewards received when playing a) / (number of times a was played)

where t is just a time step (i.e. the count of how many actions you have taken so far).
In our example, we got 10 coins when we played A over 10 games.
So Q(A) = 10/10 = 1
Note that if we knew in advance the expected value of each slot machine, then it would be pretty easy to solve the N-armed bandit problem: we would just always select the machine with the highest value.
Exploiting & exploring
We don’t know the true value of each machine, but we may have estimates.
If we keep estimates of all the action values, then at any point at least one action’s estimated value is the greatest.
Taking that action is called a greedy action.
If we select a greedy action, we say that we are exploiting our current knowledge. But if instead we decide to select one of the non-greedy actions, then we say that we are exploring, because this enables us to improve our estimate of the non-greedy action’s value.
More formally, the greedy action is the one with the highest estimated value:

A_t = argmax_a Q_t(a)
ε-greedy action selection
In greedy selection, we always choose the action with the highest value. But what happens if instead we decide to sometimes explore other machines?
That’s exactly what ε-greedy action selection is.
It means that you almost always select the greedy action, except that occasionally, with probability “ε”, you select a random action with uniform probability.
Example: if choosing machine A is our greedy action, then with probability “ε” we choose to play either B or C. Uniform probability means that machines B and C are equally likely to be chosen (i.e. 50% B, 50% C).
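As a sketch, ε-greedy selection can be written in a few lines of Python. Note one common variant: here the random draw is uniform over all arms (including the greedy one), which is the standard formulation, whereas the example above draws only among the non-greedy arms. The function name is mine, not from the text.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random arm; otherwise the greedy arm."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit
```

With epsilon = 0 this reduces to pure greedy selection; with epsilon = 1 it is pure random exploration.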
So as the number of actions we take goes to infinity, under the ε-greedy policy every action will be sampled an infinite number of times, and Q_t(a), our estimated value of action “a” at time step “t”, will converge to “q*(a)”.

When you see “q*(a)” in RL it just means the true (optimal) value of that action.
A problem: how should we keep calculating the estimates?
An easy answer is to maintain, for each action “a” (e.g. playing machine A), a record of all the rewards we received when choosing that action. Then, when the estimate of the value of action “a” is needed at time “t”, it can be computed according to

Q_t(a) = (R_1 + R_2 + … + R_Nt(a)) / Nt(a)

where R_1, …, R_Nt(a) are all the rewards received following all selections of action a prior to play t, and Nt(a) is the number of those selections.
However, as we want to play an infinite number of times, the memory and computational requirements of this approach grow without bound.
Instead, let Q_k be the estimate of a machine’s value after it has been played k − 1 times (i.e. the average of its first k − 1 rewards). Given this average and the k-th reward, R_k, the average of all k rewards can be computed by

Q_k+1 = (R_1 + R_2 + … + R_k) / k
      = Q_k + (1/k) * [R_k − Q_k]

In the first line we sum all the rewards that we got (10 in the example before) and divide by the number of plays (10 before). Following the second line, in order to find the next value estimate for an action (Q_k+1), we just need the previous estimate (Q_k), a count of how many rewards we had for that action (“k”), and the newest reward (R_k).
We don’t need to keep a count on all previous rewards anymore!
As an example, if the average over 5 games is 8, we only need to keep the count 5 and the average 8 to update the estimate in the future (we don’t need to keep all the individual rewards that sum up to 40 separately).
The general form of the equation is

NewEstimate ← OldEstimate + StepSize * [Target − OldEstimate]

The expression [Target − OldEstimate] is the error in the estimate. It is reduced by taking a step toward the “Target”.
So say we have the 5 samples as above, and we take another action and receive a reward of 12. Our step size is now 1/6 (since we had 5 samples and added another one).

Our new average will be:

8 + (1/6) * [12 − 8] ≈ 8.67
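The worked example above can be checked with a tiny incremental-update helper (a sketch; `update_average` is a hypothetical name, not from the text):

```python
def update_average(old_estimate, k, reward):
    """Incremental mean: Q_k+1 = Q_k + (1/k) * (R_k - Q_k), where k counts the new reward."""
    return old_estimate + (reward - old_estimate) / k

# Average 8 over 5 games, new reward 12, step size 1/6:
new_q = update_average(8.0, 6, 12.0)   # 8 + (1/6) * (12 - 8) ≈ 8.67
```

This gives exactly the same answer as recomputing the mean from scratch: (5 × 8 + 12) / 6.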
Greedy vs ε-greedy
There are situations in which the greedy algorithm has an advantage over ε-greedy.

In cases where there is no variance in the reward, greedy only needs to take each action once to know the reward it will get from that action. ε-greedy, on the other hand, does much better when rewards are noisy (i.e. the reward changes).
For example:
Take our 3 slot machines from before: A, B, C.
In situation 1 — reward doesn’t change: We just need to play each machine once to know which one is best. Hence, there is no need to keep exploring (e.g. using ε-greedy).
In situation 2 — rewards change: imagine playing each machine once and getting the rewards: A=0, B=0, C=1.
The greedy selection will always take C since it has the highest value.
But what if in the next time we play the rewards will be
A=100, B=5, C=1?
Clearly, greedy will not perform as well here. That’s why we should explore (e.g. using ε-greedy).
So a simple bandit algorithm looks as follows:
Where in every step we either take the action with the maximum value (argmax) with probability 1 − ε, or take a random action with probability ε.
We observe the reward that we get (R).
Increase the count of that action by 1 (N(A)).
And then update our sample average for that action (Q(A)).
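Putting the steps above together, a minimal version of this bandit loop might look as follows. This is a sketch under stated assumptions: each machine’s reward is drawn from a Gaussian around a true mean (an illustrative choice, not from the text), and the estimates use sample-average updates.

```python
import random

def run_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Simple bandit algorithm: epsilon-greedy selection with sample-average updates."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    Q = [0.0] * n_arms   # value estimate per arm
    N = [0] * n_arms     # times each arm was played
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                    # explore, prob. epsilon
        else:
            a = max(range(n_arms), key=lambda i: Q[i])   # exploit (argmax), prob. 1 - epsilon
        r = rng.gauss(true_means[a], 1.0)                # observe the reward R
        N[a] += 1                                        # N(A) <- N(A) + 1
        Q[a] += (r - Q[a]) / N[a]                        # Q(A) <- Q(A) + (1/N(A)) * [R - Q(A)]
        total += r
    return Q, N, total
```

Over enough steps, the arm with the highest true mean ends up both the most played and the most accurately estimated.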
Non-stationary problems
The averaging methods we have talked about until now are appropriate in stationary environments, but not if the bandit is changing over time (non-stationary problems). When we deal with non-stationary problems, it makes sense to care more about recent rewards than long-past ones.
One of the most popular ways of doing this is to use a constant step-size parameter:

Q_k+1 = Q_k + α * [R_k − Q_k]

where the step-size parameter α ∈ (0, 1] is constant. “α” tells us how much weight to give (you can think of it as how much we “care” about) each new update of the value function. Because α is constant, every new reward moves the estimate by the same fraction of the error, so the influence of old rewards fades as we move forward in time.
This gives us Q_k+1 as the sum of the initial estimate Q_1 and a weighted average of the past rewards:

Q_k+1 = (1 − α)^k * Q_1 + Σ_(i=1..k) α * (1 − α)^(k−i) * R_i

This is called the exponential recency-weighted average. Note that with this constant step-size parameter, the weight α(1 − α)^(k−i) given to reward R_i decreases exponentially with its age, according to the exponent on 1 − α: recent rewards count more, older rewards count less.
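We can verify the unrolled form numerically: applying the constant-α update a few times gives exactly the exponentially weighted sum. A small sketch with made-up numbers:

```python
def constant_alpha_update(q, reward, alpha):
    """Q_k+1 = Q_k + alpha * (R_k - Q_k): constant step size."""
    return q + alpha * (reward - q)

alpha, Q1, rewards = 0.5, 0.0, [1.0, 2.0, 3.0]

# Apply the recursive update once per reward.
q = Q1
for r in rewards:
    q = constant_alpha_update(q, r, alpha)

# Closed form: (1 - alpha)^k * Q1 + sum_i alpha * (1 - alpha)^(k - i) * R_i
k = len(rewards)
closed_form = (1 - alpha) ** k * Q1 + sum(
    alpha * (1 - alpha) ** (k - i) * r for i, r in enumerate(rewards, start=1)
)
# Both give 0.125 * 1 + 0.25 * 2 + 0.5 * 3 = 2.125
```

The most recent reward (3.0) carries weight α = 0.5, while the oldest carries only α(1 − α)² = 0.125.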
UCB (Upper confidence Bound)
UCB is another technique for balancing exploration and exploitation.
Remember that ε-greedy selects an action with a uniform prob. after deciding not to take the greedy action.
But is there a better technique than this uniform distribution selection?
Yes, there is. We can prefer non-greedy actions according to how promising they are: either because their estimates are nearly greedy, or because they haven’t been tried much yet.

A_t = argmax_a [ Q_t(a) + c * sqrt(ln(t) / N_t(a)) ]

The first term, our estimate “Q_t(a)”, is the “exploit” part. The added UCB term, “c * sqrt(ln(t) / N_t(a))”, is the “explore” part: ln(t)/N_t(a) is a measure of how uncertain we are about our estimate Q_t(a), and “c” weights the balance between the 2 terms.
UCB however is more difficult than ε-greedy to extend to a more general RL problem.
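A sketch of UCB action selection in Python, under one common convention I’m assuming here: an arm that has never been tried has unbounded uncertainty, so untried arms are selected first.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """UCB: argmax_a [ Q_t(a) + c * sqrt(ln(t) / N_t(a)) ]."""
    best_arm, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_values, counts)):
        if n == 0:
            return a  # untried arm: maximally uncertain, try it first
        score = q + c * math.sqrt(math.log(t) / n)  # estimate + uncertainty bonus
        if score > best_score:
            best_arm, best_score = a, score
    return best_arm
```

Notice how a rarely played arm can win even with a lower estimate: its small N_t(a) inflates the uncertainty bonus.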
Gradient bandits
Another way to balance exploration and exploitation is the gradient bandit algorithm.
So far we have considered methods that estimate action values and then use those estimates to select actions. This is often a good approach, but it is not the only one possible. We can also consider learning some preference Ht(a) (which is just a number) for each action a. The larger the preference, the more often that action is taken.
Note that the preference has no interpretation in terms of reward. Only the relative preference of one action over another is important.
The action probabilities are determined according to a soft-max distribution (i.e. the probabilities of taking the actions all sum up to 1):

π_t(a) = e^(H_t(a)) / Σ_b e^(H_t(b))

Note the new notation: π_t(a) is the probability of taking action “a” at time t.
When we’re doing this, we update the preferences by stochastic gradient ascent. After selecting action A_t and receiving reward R_t:

H_t+1(A_t) = H_t(A_t) + α * (R_t − R_bar_t) * (1 − π_t(A_t))
H_t+1(a) = H_t(a) − α * (R_t − R_bar_t) * π_t(a), for all a ≠ A_t

where R_bar_t is the average of all the rewards received so far, serving as a baseline: we take our current preference H_t(a) and adjust it based on how the received reward (R_t) compares to that baseline, scaled by the probability of taking the action π_t(a).
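A sketch of the soft-max distribution and one gradient bandit preference update (the function names are mine, not the book’s):

```python
import math

def softmax(prefs):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    m = max(prefs)                       # subtract max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_step(H, baseline, action, reward, alpha=0.1):
    """One preference update:
       H(A) += alpha * (R - baseline) * (1 - pi(A))   for the chosen action A
       H(a) -= alpha * (R - baseline) * pi(a)          for every other action a."""
    pi = softmax(H)
    for a in range(len(H)):
        if a == action:
            H[a] += alpha * (reward - baseline) * (1 - pi[a])
        else:
            H[a] -= alpha * (reward - baseline) * pi[a]
    return H
```

A reward above the baseline raises the chosen action’s preference and lowers the others’; a reward below the baseline does the opposite.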
Associative Search (Contextual Bandits)
So far we have only considered non-associative tasks, in which there is no need to associate different actions with different situations.
Imagine however, that there are several different N-armed bandit problems, and that on each play you randomly confront one.
Now suppose, that when a bandit problem is selected, you are given a clue about which problem it is (e.g. maybe the slot machine changes colors).
Using that clue you can learn a policy to associate each task with the color you see.
That’s the general idea.
Until next time,
Cheers.