Reinforcement Learning is a machine learning technique that focuses on learning from interaction: agents interact with their environment according to their "action policy" and receive rewards for their actions. The objective of reinforcement learning algorithms is to learn the right action policies, i.e. to find policies that maximize the rewards the agents receive for their actions.

There are different approaches to reinforcement learning and we will use a simple one known as Q-Learning in this notebook.

In Q-Learning, each agent has a "Q-Table", which defines the expected total reward an agent will get for performing a particular action when in a particular state. Agents initially start with an "empty" Q-Table and then learn the right behavior through trial and error.

Agents start with an "empty" Q-Table and then fill it by learning through trial and error. At each step, an agent always chooses to perform the action that will lead to the highest reward.

The following paragraphs dive a little deeper into the mathematics behind reinforcement learning. This is quite illustrative because it shows how you can deal with uncertainty and also how to deal with (extrinsic) reward systems – two capabilities that are very important in any enterprise. Feel free to skip to the next chapter if you are not interested in mathematical formalism.

The following exposition of reinforcement learning closely follows that in the seminal book Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.

In order to make reinforcement learning computable, we need to find an appropriate formalism. One such formalism is to model such agent-environment interactions as *Markov Decision Processes* (MDPs):

At each timestep $t$the agent receives some representation of the environments state,$S_t \in \mathcal{S}$, and on that basis selects an action, $A_t \in \mathcal{A}$. One time step later, (in part) as a consequence of its action, the agent receives a reward $R_t \in \mathcal{R} \subset \mathbb{R}$.

The interaction then gives rise to a *trajectory* that looks like this:

In a finite MDP, the sets of states, actions and rewards ($!$\mathcal{S}$!$, $!$\mathcal{A}$!$ and $!$\mathcal{R}$!$) have a finite number of elements and the random variables $!$R_t$!$ and $!$S_t$!$ have well defined probability distributions that depend only on the preceding state and action:

this is embeddd content:

$p\left(s',r \mid s,a\right) = Pr\{S_t=s', R_t=r \mid S_{t-1}=s, A_{t-1}=a\}$these are the bellman equations:

$\begin{aligned} v_\pi (s) & \doteq \mathbb{E}_\pi \lbrack G_t \mid S_t=s \rbrack \\ & = \mathbb{E}_{\pi}\lbrack R_{t+1}+\gamma G_{t+1} \mid S_t=s \rbrack \\ & = \mathbb{E}_{\pi}\lbrack R_{t+1} \mid S_t=s \rbrack +\gamma \mathbb{E}_{\pi}\lbrack G_{t+1} \mid S_t=s \rbrack \end{aligned}$Services