Reinforcement Learning is a machine learning technique that focuses on learning from interaction: agents interact with their environment according to their "action policy" and receive rewards for their actions. The objective of reinforcement learning algorithms is to learn the right action policies, i.e. to find policies that maximize the rewards the agents receive for their actions.

There are different approaches to reinforcement learning and we will use a simple one known as Q-Learning in this notebook.

In Q-Learning, each agent has a "Q-Table", which stores the expected total reward the agent will get for performing a particular action when in a particular state. Agents start with an "empty" Q-Table and then fill it by learning through trial and error. At each step, an agent generally chooses the action that its Q-Table predicts will lead to the highest reward.

The following paragraphs dive a little deeper into the mathematics behind reinforcement learning. This is quite illustrative because it shows how you can deal with uncertainty and with (extrinsic) reward systems – two capabilities that are very important in any enterprise. Feel free to skip to the next chapter if you are not interested in mathematical formalism.

The following exposition of reinforcement learning closely follows that in the seminal book Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.

In order to make reinforcement learning computable, we need to find an appropriate formalism. One such formalism is to model such agent-environment interactions as *Markov Decision Processes* (MDPs):

At each timestep $!$t$!$ the agent receives some representation of the environment's state, $!$S_t \in \mathcal{S}$!$, and on that basis selects an action, $!$A_t \in \mathcal{A}$!$. One time step later, (in part) as a consequence of its action, the agent receives a reward $!$R_{t+1} \in \mathcal{R} \subset \mathbb{R}$!$ and finds itself in a new state $!$S_{t+1}$!$.

The interaction then gives rise to a *trajectory* that looks like this: $!$S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,...$!$

In a finite MDP, the sets of states, actions and rewards ($!$\mathcal{S}$!$, $!$\mathcal{A}$!$ and $!$\mathcal{R}$!$) have a finite number of elements and the random variables $!$R_t$!$ and $!$S_t$!$ have well defined probability distributions that depend only on the preceding state and action:

$!$p\left(s',r \mid s,a\right) = Pr\{S_t=s', R_t=r \mid S_{t-1}=s, A_{t-1}=a\}$!$

for all $!$s',s \in \mathcal{S}, r \in \mathcal{R}$!$, and $!$a \in \mathcal{A}(s)$!$. The function $!$p$!$ defines the dynamics of the MDP and specifies a probability distribution for each choice of $!$s$!$ and $!$a$!$:

$!$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p\left(s',r \mid s,a\right) = 1, \forall s \in \mathcal{S}, a \in \mathcal{A}(s)$!$

Using $!$p$!$, we can calculate anything else we need to know about the environment, e.g. the *state-transition probabilities*:

$!$p\left(s' \mid s,a\right) = Pr\{S_t=s'\mid S_{t-1}=s, A_{t-1}=a\}=\sum_{r \in \mathcal{R}} p\left(s',r \mid s,a\right)$!$

In particular, we can calculate the expected reward if we are in state $!$s$!$ and choose action $!$a$!$:

$!$r\left(s,a\right) = \mathbb{E} \lbrack R_t \mid S_{t-1}=s, A_{t-1}=a \rbrack = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p\left(s',r \mid s,a\right)$!$
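These definitions are easy to check numerically. The sketch below stores the dynamics $!$p$!$ of a tiny, entirely made-up two-state MDP as a nested dictionary and computes state-transition probabilities and expected rewards from it (all state names, action names, and numbers are illustrative assumptions, not part of the beer game):

```python
# A tiny, made-up MDP: dynamics p(s', r | s, a) stored as a nested dict
# keyed by (s, a) -> {(s', r): probability}.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 2.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
}

# Each (s, a) pair must define a proper probability distribution.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

def transition_prob(p, s_next, s, a):
    """State-transition probability: sum p(s', r | s, a) over rewards r."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """Expected reward r(s, a): each reward r weighted by p(s', r | s, a)."""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

print(transition_prob(p, "s1", "s0", "a0"))  # 0.5
print(expected_reward(p, "s0", "a0"))        # 0.5
```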

In order to get our agents to learn, we not only need to define the reward an agent gets for performing an action, but we also need to tell the agent to maximize the sum of rewards over its lifetime, i.e. the lifetime reward.

Formally, we can define the objective of reinforcement learning as maximizing the expected lifetime reward, which is defined as:

$!$G_t \doteq R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} +\dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} $!$

where $!$0 \leq \gamma \leq 1$!$ is the *discount rate* (which is needed to ensure that $!$G_t$!$ converges for tasks that continue indefinitely).

Note that $!$G$!$ can also be defined recursively:

$!$G_t = R_{t+1}+\gamma G_{t+1}$!$
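The recursion is easy to verify numerically on a short, made-up reward sequence (the rewards and discount rate below are purely illustrative):

```python
# Check numerically that G_t = R_{t+1} + gamma * G_{t+1} on a short,
# made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_1, R_2, R_3, R_4 (illustrative values)

def discounted_return(rewards, gamma, t=0):
    """G_t computed directly as the sum of gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

G0 = discounted_return(rewards, gamma, t=0)
G1 = discounted_return(rewards, gamma, t=1)

# The recursive form holds: G_0 = R_1 + gamma * G_1
assert abs(G0 - (rewards[0] + gamma * G1)) < 1e-12
```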

Now that we know we want to optimize the lifetime reward, we need to help our agent to learn an optimal policy. A *policy* is a mapping from states to the probability of selecting a possible action.

If the agent is following policy $!$\pi$!$ at time $!$t$!$, then $!$\pi(a \mid s)$!$ is the probability that $!$A_t=a$!$ if $!$S_t=s$!$.

Note that $!$\pi$!$ is distinct from the probability distribution $!$p$!$ defined above. The distribution $!$p$!$ describes the environment: it tells us the reward an agent will get for performing certain actions and which state the agent will transition to. The distribution $!$\pi$!$ describes the agent itself: it tells us the probability that the agent will choose a particular action in a particular situation.

This is illustrated in the diagram below.

Given $!$\pi$!$ and $!$p$!$ we can calculate the probability an agent will transition into a particular state $!$s'$!$ given that it is in state $!$s$!$:

$!$p_\pi (s' \mid s) = \sum_a \pi(a \mid s) \sum_{r} p\left(s',r \mid s,a\right)$!$

and the expected reward when transitioning from state $!$s$!$ at time $!$t$!$:

$!$r_\pi (s) = \mathbb{E}_{\pi}\lbrack R_{t+1} \mid S_t=s \rbrack = \sum_{r} r \sum_a \pi(a \mid s) \sum_{s'} p\left(s',r \mid s,a\right)$!$
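Both quantities follow mechanically from $!$\pi$!$ and $!$p$!$. The sketch below computes them for a single made-up state with two actions (all names and numbers are illustrative assumptions):

```python
# Policy-weighted transition probability p_pi(s' | s) and expected
# reward r_pi(s), computed from a made-up policy and dynamics.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 2.0): 1.0},
}
pi = {("s0", "a0"): 0.25, ("s0", "a1"): 0.75}  # pi(a | s0)

def transition_prob_pi(s_next, s):
    """p_pi(s' | s) = sum over a of pi(a|s) * sum over r of p(s', r | s, a)."""
    return sum(
        pi[(ss, a)] * prob
        for (ss, a), dist in p.items() if ss == s
        for (sp, r), prob in dist.items() if sp == s_next
    )

def expected_reward_pi(s):
    """r_pi(s) = sum over a, s', r of pi(a|s) * r * p(s', r | s, a)."""
    return sum(
        pi[(ss, a)] * r * prob
        for (ss, a), dist in p.items() if ss == s
        for (sp, r), prob in dist.items()
    )

print(transition_prob_pi("s1", "s0"))  # 0.875
print(expected_reward_pi("s0"))        # 1.625
```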

Using the action policy $!$\pi$!$ and the environment dynamics $!$p$!$, we can now calculate the value we expect to achieve from the policy: the *value function* of a state $!$s$!$ under a policy $!$\pi$!$, denoted $!$v_\pi (s)$!$, is the expected return when starting in $!$s$!$ and following $!$\pi$!$ thereafter.

For our MDPs, we can define $!$v_\pi$!$ by:

$!$v_\pi (s) \doteq \mathbb{E}_{\pi}\lbrack G_t \mid S_t=s \rbrack = \mathbb{E}_{\pi}\lbrack \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t=s \rbrack, \forall s \in \mathcal{S} $!$

We can then define the value of taking action $!$a$!$ in state $!$s$!$ under a policy $!$\pi$!$ as follows:

$!$q_\pi (s,a) \doteq \mathbb{E}_{\pi}\lbrack G_t \mid S_t=s, A_t=a \rbrack = \mathbb{E}_{\pi}\lbrack \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t=s, A_t=a \rbrack, \forall s \in \mathcal{S}, a \in \mathcal{A}$!$

$!$q_\pi$!$ is called the *action value function* for policy $!$\pi$!$. It represents the expected value of our policy if we start in state $!$s$!$, then take action $!$a$!$ and follow the policy $!$\pi$!$ thereafter. The important point here is that $!$q_\pi$!$ not only captures the immediate reward we get for our next action, but also the expected reward the agent will get thereafter.

Given the recursive nature of the lifetime reward, we can ask how the value of an agent's current state relates to the value of its future state if it follows a policy and takes an action now: after all, once we have transitioned from $!$s$!$ to $!$s'$!$, nothing changes from a conceptual perspective.

$!$\begin{aligned}
v_\pi (s) & \doteq \mathbb{E}_\pi \lbrack G_t \mid S_t=s \rbrack \\
& = \mathbb{E}_{\pi}\lbrack R_{t+1}+\gamma G_{t+1} \mid S_t=s \rbrack \\
& = \mathbb{E}_{\pi}\lbrack R_{t+1} \mid S_t=s \rbrack +\gamma \mathbb{E}_{\pi}\lbrack G_{t+1} \mid S_t=s \rbrack
\end{aligned}$!$

The first summand here is simply the reward we expect from taking an action and moving from $!$s$!$ to $!$s'$!$, more explicitly:

$!$r_\pi (s) = \mathbb{E}_{\pi}\lbrack R_{t+1} \mid S_t=s \rbrack = \sum_{r} r \sum_a \pi(a \mid s) \sum_{s'} p\left(s',r \mid s,a\right)$!$

The second part is slightly more complicated – it is the expected lifetime reward from time $!$t+1$!$ onwards, conditioned on being in state $!$s$!$ at time $!$t$!$. To calculate it, we need the value function of each potential future state $!$s'$!$, weighted by the probability of actually reaching that state:

$!$\mathbb{E}_{\pi}\lbrack G_{t+1} \mid S_t=s \rbrack = \sum_{s'} v_\pi (s') p_\pi (s' \mid s) = \sum_{s'} v_\pi (s') \sum_a \pi(a \mid s) \sum_{r} p\left(s',r \mid s,a\right) $!$

The second part of the equation is visualized in the diagram below.

Putting these terms together and rearranging the sums, we end up with:

$!$v_\pi (s) = \sum_a \pi(a \mid s)\sum_{s'} \sum_{r} p\left(s',r \mid s,a\right)\lbrack r+\gamma v_\pi (s')\rbrack $!$

This is known as the *Bellman equation* for $!$v_\pi$!$. It expresses a relationship between the value of a state and the values of successor states: The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected in getting to that state.
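One way to make the Bellman equation concrete is *iterative policy evaluation*: start with arbitrary values and repeatedly apply the equation as an update rule until the values stop changing. The sketch below does this for a small, made-up MDP under a uniform random policy (all names and numbers are illustrative assumptions):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation for
# v_pi as an update rule until the values converge to a fixed point.
gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]

# Made-up deterministic dynamics p(s', r | s, a) as {(s, a): {(s', r): prob}}.
p = {
    ("s0", "a0"): {("s0", 0.0): 1.0},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 2.0): 1.0},
}
# Uniform random policy: pi(a | s) = 0.5 for both actions.
pi = {(s, a): 0.5 for s in states for a in actions}

v = {s: 0.0 for s in states}
for _ in range(1000):
    # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {
        s: sum(
            pi[(s, a)] * sum(prob * (r + gamma * v[sp])
                             for (sp, r), prob in p[(s, a)].items())
            for a in actions
        )
        for s in states
    }

print(v)
```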

Now that we know about policies and value functions, we can look for the best possible value function, i.e. the one that achieves the highest reward over the long run.

Using value functions, we can define an ordering of policies: $!$\pi \geq \pi'$!$ if and only if $!$v_\pi (s) \geq v_{\pi'} (s)$!$ for all $!$s \in \mathcal{S}$!$.

There will always be at least one policy that is better than or equal to all other policies – this is an *optimal policy*. Let's denote this by $!$\pi_*$!$, which has the *optimal state-value function* $!$v_*$!$, defined as:

$!$v_* (s) \doteq \max\limits_\pi v_\pi (s)$!$

Optimal policies share the same *optimal action-value function* $!$q_*$!$, defined as:

$!$\begin{aligned}
q_* (s,a) & \doteq \max\limits_\pi q_\pi (s,a) \\
& = \mathbb{E}\lbrack R_{t+1}+\gamma v_* (S_{t+1}) \mid S_t=s, A_t=a \rbrack
\end{aligned}$!$

Because $!$v_*$!$ is the value function for a policy, it must satisfy the Bellman equation. However, because it is the optimal value function, we can actually write this in a special form without reference to any specific policy. This form is called the *Bellman optimality equation*:

$!$\begin{aligned}
v_* (s) & = \max\limits_{a \in \mathcal{A}(s)} q_{\pi_*} (s,a) \\
& = \max\limits_a \mathbb{E}_{\pi_*}\lbrack R_{t+1}+\gamma G_{t+1} \mid S_t=s, A_t=a \rbrack \\
& = \max\limits_a \mathbb{E}_{\pi_*}\lbrack R_{t+1}+\gamma v_* (S_{t+1}) \mid S_t=s, A_t=a \rbrack \\
&= \max\limits_a \sum_{s', r} p\left(s',r \mid s,a\right)\lbrack r+\gamma v_* (s')\rbrack
\end{aligned}$!$

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

The Bellman optimality equation for $!$q_*$!$ is:

$!$\begin{aligned}
q_* (s,a) & = \mathbb{E}\lbrack R_{t+1}+\gamma \max\limits_{a'} q_* (S_{t+1},a') \mid S_t=s, A_t=a \rbrack \\
&= \sum_{s', r} p\left(s',r \mid s,a\right)\lbrack r+\gamma \max\limits_{a'} q_* (s',a')\rbrack
\end{aligned}$!$

Reinforcement learning then essentially boils down to solving the Bellman optimality equation. For finite MDPs, such a solution always exists, because the Bellman optimality equation defines a system of equations, one for each state: if there are $!$n$!$ states, you end up with $!$n$!$ equations in $!$n$!$ unknowns.
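One standard way to solve (or approximately solve) this system is *value iteration*: apply the Bellman optimality equation itself as an update rule until the values converge. A sketch on a made-up two-state MDP (all names and numbers are illustrative assumptions):

```python
# Value iteration: turn the Bellman optimality equation into an update
# rule and iterate it until convergence.
gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]

# Made-up deterministic dynamics p(s', r | s, a).
p = {
    ("s0", "a0"): {("s0", 0.0): 1.0},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 2.0): 1.0},
}

v = {s: 0.0 for s in states}
for _ in range(1000):
    # v(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {
        s: max(
            sum(prob * (r + gamma * v[sp])
                for (sp, r), prob in p[(s, a)].items())
            for a in actions
        )
        for s in states
    }

print(v)
```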

Once you have $!$v_*$!$, it is quite easy to derive the optimal policy: for each state $!$s$!$, there will be at least one action for which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy.

If you have $!$q_*$!$, choosing an optimal policy is even easier – you simply have to find an action that maximizes $!$q_* (s,a)$!$:

$!$\begin{aligned}
\pi_* (a \mid s) & > 0, \quad a \in \operatorname{argmax}_a q_*(s,a) \\
\pi_* (a \mid s) & = 0, \quad a \notin \operatorname{argmax}_a q_*(s,a)
\end{aligned}$!$

Because the state space can be extremely large, it is rarely practical to find an exact solution for $!$v_*$!$ or $!$q_*$!$. Fortunately, a number of approaches exist for finding approximate solutions.

One way of estimating $!$v_\pi$!$ is to use a Monte Carlo approach: we start with an initial approximation $!$V$!$ for $!$v_\pi$!$ (i.e. for the expected total reward following each state) and then run through our simulation a large number of times – each run wanders through a different trajectory determined by the policy $!$\pi$!$ and the environment dynamics $!$p$!$.

During each run, we cache the actual value for $!$G_t$!$ (i.e. the actual total reward following that state) for each state and timestep. At the end of the run we update $!$V(S_t)$!$ using the following formula, where $!$\alpha$!$ is a constant that defines how quickly we move towards the new value:

$!$V(S_t) \gets V(S_t) + \alpha \lbrack G_t - V(S_t) \rbrack$!$
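A sketch of this update, applied once per visited state after a completed episode (the states and observed returns below are made-up values):

```python
# Monte Carlo update: after an episode finishes, move each V(S_t)
# towards the observed return G_t by a step of size alpha.
alpha = 0.2
V = {"s0": 0.0, "s1": 0.0}

# Made-up (state, observed return G_t) pairs from one finished episode.
episode = [("s0", 3.0), ("s1", 2.0), ("s0", 1.0)]

for state, G in episode:
    V[state] = V[state] + alpha * (G - V[state])

print(V)
```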

The Monte Carlo approach works fine, but it does have one issue: you need to wait until the end of each simulation to actually calculate $!$G_t$!$, which means this process is very slow.

Fortunately, there is a much faster approach, which allows us to update $!$V$!$ after each timestep. This approach works thanks to the recursive nature of the Bellman equation.

Remembering that $!$G_t = R_{t+1}+\gamma G_{t+1}$!$, we can use the following formula to update $!$V$!$ after every timestep:

$!$V(S_t) \gets V(S_t) + \alpha \lbrack R_{t+1}+\gamma V(S_{t+1}) - V(S_t) \rbrack$!$

This approach to learning is referred to as Temporal Difference Learning because it compares values of $!$V$!$ that are separated by a (short) temporal difference.
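The temporal-difference update can be sketched as a small function (the state names, initial values, and observed reward are illustrative assumptions):

```python
# TD(0): update V(S_t) right after observing R_{t+1} and S_{t+1},
# using the current estimate V(S_{t+1}) in place of the full return.
alpha, gamma = 0.2, 0.9
V = {"s0": 0.0, "s1": 1.0}  # made-up initial value estimates

def td_update(V, s, r, s_next):
    """One temporal-difference update step."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

td_update(V, "s0", 1.0, "s1")  # observed reward 1.0 on the step s0 -> s1
print(V["s0"])
```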

In this notebook, we focus on a variant of Temporal Difference Learning known as Q-Learning, which finds a solution for $!$q_*$!$ instead of $!$v_*$!$:

$!$Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha \lbrack R_{t+1}+\gamma \max\limits_a Q(S_{t+1},a) - Q(S_t,A_t) \rbrack$!$
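As a sketch, this update rule translates almost literally into code, with the Q-Table stored as a dictionary keyed by (state, action) pairs and missing entries treated as 0 (the names and numbers are illustrative assumptions):

```python
# The Q-Learning update as a function over a Q-table stored as a dict;
# Q.get(..., 0.0) treats unseen (state, action) pairs as value 0.
alpha, gamma = 0.2, 0.9

def q_update(Q, s, a, r, s_next, actions):
    """One Q-Learning update after observing reward r and next state s_next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - Q.get((s, a), 0.0))

Q = {}
q_update(Q, "s0", "a1", 1.0, "s1", ["a0", "a1"])
print(Q[("s0", "a1")])
```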

Given this, we can define the basic Q-Learning algorithm as follows:

Start with an initial (random) $!$Q(s,a)$!$, ensuring that $!$Q(\text{terminal state}, \cdot) = 0$!$.

Then run the simulation a large number of times (episodes), with the following loop for each episode:

- Initialize $!$S$!$
- Loop for each step of the episode, until $!$S$!$ is terminal or the simulation reaches a predefined end:
  - Choose $!$A$!$ from $!$S$!$ using the $!$\epsilon$!$-greedy policy based on $!$Q$!$
  - Take action $!$A$!$, observe $!$R$!$, $!$S'$!$
  - $!$Q(S,A) \gets Q(S,A) + \alpha \lbrack R+\gamma \max\limits_a Q(S',a) - Q(S,A) \rbrack$!$
  - $!$S \gets S'$!$
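The algorithm can be sketched end-to-end in code. The environment below – a tiny four-state chain where the agent must walk right to reach a terminal, rewarding state – is entirely made up for illustration; only the learning loop mirrors the steps described above:

```python
import random

random.seed(0)
alpha, gamma, epsilon = 0.2, 0.9, 0.1
actions = ["left", "right"]

def step(s, a):
    """Made-up chain environment: states 0..3, state 3 is terminal."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    reward = 1.0 if s_next == 3 else 0.0
    return s_next, reward

Q = {}  # sparse Q-Table: missing (state, action) entries count as 0

def choose_action(s):
    """Epsilon-greedy: mostly exploit Q, sometimes explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

for _ in range(500):  # episodes
    s = 0  # initialize S
    while s != 3:  # state 3 is terminal, so Q(3, .) is never written
        a = choose_action(s)
        s_next, r = step(s, a)
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))
        s = s_next

# After training, the greedy policy walks right towards the goal state.
policy = {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in range(3)}
print(policy)
```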

We use the $!$\epsilon$!$-greedy policy when choosing actions to ensure that our algorithm doesn't get locked into a local optimum: a small $!$\epsilon > 0$!$ defines the probability with which the agent chooses a random action instead of the one recommended by its current Q-Table (the $!$\epsilon$!$-greedy approach).

The step size $!$\alpha \in (0, 1\rbrack$!$ defines how quickly we move from our current value of $!$Q$!$ to a new value, and $!$\gamma$!$ is the discount rate.

Now let's apply the theory above to our beer game agents. Our "learning" consists of finding the right values for the function $!$Q(S,A)$!$, which tells the agent which action $!$A$!$ to take when it is in state $!$S$!$.

We set $!$\epsilon=0.1$!$ and $!$\alpha=0.2$!$. Because the beer game runs over a finite horizon of 24 rounds, we can set $!$\gamma=1$!$.

We store $!$Q$!$ in a *Q-Table*, which is essentially a matrix indexed by the state of the agent and the possible actions. For each entry, we store the value $!$Q(S,A)$!$, which is our approximation of the optimal value $!$q_*(s,a)$!$, i.e. the value we expect to achieve if we are in state $!$s$!$ and choose action $!$a$!$.

Because the number of states and actions is potentially unbounded (an agent can order any amount of beer), we use a sparse Q-Table, i.e. a table that starts empty and only stores values that differ from a default value. Our implementation of the Q-Table, `sparseQTable`, stores values in a dictionary whose key is the tuple $!$(s,a)$!$. For the beer game, we set the default value to 0, i.e. we expect zero value from an action we have never tried before.

Besides adding and reading values, the Q-Table provides two key methods:

**best_action(state)** – returns the best action for a given state, i.e. the action that maximizes $!$Q(S,A)$!$ for the given state $!$S$!$.

**max_action_value(state)** – returns the best value that can be achieved for a given state, i.e. $!$Q(S, best\_action(S))$!$.
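A minimal sketch of such a sparse Q-Table is shown below; the method names mirror the description above, but the class itself and its details are assumptions, not the notebook's actual `sparseQTable` implementation:

```python
class SparseQTable:
    """Q-Table backed by a dict; unseen (state, action) pairs default to 0."""

    def __init__(self, actions, default=0.0):
        self.actions = actions
        self.default = default
        self.values = {}  # {(state, action): value}

    def value(self, state, action):
        return self.values.get((state, action), self.default)

    def set_value(self, state, action, value):
        self.values[(state, action)] = value

    def best_action(self, state):
        """The action that maximizes Q(state, a) over all known actions."""
        return max(self.actions, key=lambda a: self.value(state, a))

    def max_action_value(self, state):
        """The best achievable value, i.e. Q(state, best_action(state))."""
        return self.value(state, self.best_action(state))

# Illustrative usage: actions here are made-up order quantities.
table = SparseQTable(actions=[0, 10, 20])
table.set_value(5, 10, 1.5)
print(table.best_action(5))       # 10
print(table.max_action_value(5))  # 1.5
```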

Each supply chain agent has its own Q-table and we capture the agent's state via a single value, namely the **order balance** introduced above.
