Draft status

- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
- $S_1, A_1, R_2, \dots, S_T \sim \pi$

- The *return* is the total discounted reward:
  - $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$

- The value function is the expected return:
- $v_\pi (s) = \mathbb E_\pi[G_t \mid S_t=s]$

Monte Carlo policy evaluation uses the empirical mean return instead of the expected return.

- To evaluate state $s$, the **first** time $t$ that state $s$ is visited in an episode:
  - increment counter: $N(s) \leftarrow N(s) + 1$
  - increment total return: $S(s) \leftarrow S(s) + G_t$
  - the value is estimated by the mean return $V(s) = S(s)/N(s)$

- by law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$

- To evaluate state $s$, **every** time $t$ that state $s$ is visited in an episode:
  - increment counter: $N(s) \leftarrow N(s) + 1$
  - increment total return: $S(s) \leftarrow S(s) + G_t$
  - the value is estimated by the mean return $V(s) = S(s)/N(s)$

- by law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
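The two variants above can be sketched in plain Python. This is an illustrative sketch only, not the notebook's implementation: the `mc_evaluate` helper and the toy `episodes` data are made up for demonstration.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Tabular Monte Carlo policy evaluation (sketch).

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)    # visit counters N(s)
    S = defaultdict(float)  # accumulated returns S(s)
    for episode in episodes:
        G = 0.0
        returns = []  # (state, return) pairs, computed backwards
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # restore forward (time) order
        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence counts
            seen.add(state)
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}

# two hand-made episodes over states 'a' and 'b':
episodes = [[('a', 0), ('b', 1)], [('a', 1), ('a', 1), ('b', 0)]]
V = mc_evaluate(episodes, gamma=1.0, first_visit=True)
# first-visit estimates: V['a'] == 1.5, V['b'] == 0.5
```

With `first_visit=False` the second occurrence of `'a'` in the second episode also contributes, so the estimate for `'a'` changes to $4/3$.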

In [1]:

```
import numpy as np
```

Table of environments https://github.com/openai/gym/wiki/Table-of-environments

In [1]:

```
import gym
env = gym.make('Blackjack-v0')
env.action_space
```

[2017-05-22 15:01:13,988] Making new env: Blackjack-v0

Out[1]:

Discrete(2)

There are two actions:

- 0: stay
- 1: hit

In [96]:

```
#random action
env.action_space.sample()
```

Out[96]:

1

In [2]:

```
help(env)
```

In [98]:

```
observation_space = env.observation_space.spaces
observation_space
```

Out[98]:

(Discrete(32), Discrete(11), Discrete(2))

Observation tuple for Blackjack:

- sum of the player's cards (an ace counts as 11)
- the dealer's showing card
- whether the player has a usable ace

In [99]:

```
# Resets the state of the environment and returns an initial observation.
observation = env.reset()
observation
```

Out[99]:

(14, 6, False)

In [100]:

```
# env.step:
# Run one timestep of the environment's dynamics. When end of
# episode is reached, you are responsible for calling `reset()`
# to reset this environment's state.
action = 0
observation, reward, done, info = env.step(action)
observation, reward, done, info
```

Out[100]:

((14, 6, False), 1.0, True, {})

For each observation we need an action:

In [101]:

```
# The state space contains invalid states, but for convenience we ignore that.
state_space_size = (33, 12, 2)
```

In [102]:

```
# simple policy: always stay (action 0)
policy = np.zeros(state_space_size, dtype=int)
#policy[:9,:,:]=1
```

In [103]:

```
def observation_clean(observation):
    return (observation[0], observation[1], int(observation[2]))

observation = observation_clean(observation)
policy[observation]
```

Out[103]:

0

In [105]:

```
np.random.rand()
```

Out[105]:

0.6780772481806671

- Policy Evaluation
- Policy Improvement
- via $\pi(s) = \arg\max_a q(s,a)$
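The improvement step is just an argmax over the action values. As a sketch, with a hypothetical action-value table `q` (values made up):

```python
import numpy as np

# hypothetical action-value table: 3 states x 2 actions (values made up)
q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.4, 0.4]])

# greedy policy improvement: pi(s) = argmax_a q(s, a);
# np.argmax resolves ties to the first (lowest-index) action
policy = q.argmax(axis=1)  # -> array([1, 0, 0])
```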

Read: [SuBa] 5.1 Monte Carlo Policy Evaluation

Note that the states are different in the OpenAI gym (here) and in [SuBa]:

- here an ace always counts as 11 for the player, but a dealer showing '1' means an ace.
- in [SuBa] only if there is no usable ace.

In [106]:

```
def run_episode(policy, env=env):
    steps = []
    observation = observation_clean(env.reset())
    done = False
    steps.append((None, None) + (observation, 0))
    while not done:
        action = policy[observation]
        observation_action = (observation, action)
        observation, reward, done, info = env.step(action)
        observation = observation_clean(observation)
        steps.append(observation_action + (observation, int(reward)))
    return steps  # list of tuples: (s, a, s', R)
```

In [107]:

```
run_episode(policy)
```

Out[107]:

[(None, None, (13, 10, 1), 0), ((13, 10, 1), 0, (13, 10, 1), 1)]

In [108]:

```
gamma = 0.99
```

In [109]:

```
N = np.zeros(state_space_size, dtype=int)
S = np.zeros(state_space_size)
# every-visit Monte Carlo:
nb_of_episodes = 100000
for e in range(nb_of_episodes):
    observations_reward = run_episode(policy)
    G = 0.
    for o0, a, o, r in reversed(observations_reward):
        G = r + gamma * G
        N[o] += 1
        S[o] += G
```

In [110]:

```
Gs = np.zeros(state_space_size)
Gs[N!=0] = S[N!=0]/N[N!=0]
```

In [111]:

```
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
%matplotlib inline
```

In [112]:

```
Gs[10, 3, 1]
```

Out[112]:

0.0

In [113]:

```
A = np.arange(1, 11)
B = np.arange(4, 22)
A, B = np.meshgrid(A, B)
V = Gs[4:22, 1:11, 0]
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(A, B, V, rstride=1, cstride=1, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
ax.set_ylabel("Player sum")
ax.set_xlabel("Dealer showing")
ax.set_title("No usable ace")
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
```

In [114]:

```
A = np.arange(1, 11)
B = np.arange(12, 22)
A, B = np.meshgrid(A, B)
V = Gs[12:22, 1:11, 1]
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(A, B, V, rstride=1, cstride=1, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
ax.set_ylabel("Player sum")
ax.set_xlabel("Dealer showing")
ax.set_title("Usable ace")
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
```

- Try *Monte Carlo policy evaluation* with different numbers of episodes to get a feeling for the convergence properties. Take a look at the 3D plots.

Read: [SuBa]: 5.3 Monte Carlo Control

Two assumptions for finding the optimal policy:

- exploring starts ("first step of each episode starts at a state-action pair, and that every such pair has a nonzero probability of being selected as the start."[SuBa] 5.2 Monte Carlo Estimation of Action Values)
- infinite number of episodes

For Blackjack the "exploring starts" condition can be satisfied if we choose the first action randomly. So let's modify the `run_episode` function:

In [115]:

```
def run_episode_exploring_start(policy, env=env):
    steps = []
    observation = observation_clean(env.reset())
    done = False
    steps.append((None, None) + (observation, 0))
    start = True
    while not done:
        if start:
            action = np.random.binomial(n=1, p=0.5)
            start = False
        else:
            action = policy[observation]
        observation_action = (observation, action)
        observation, reward, done, info = env.step(action)
        observation = observation_clean(observation)
        steps.append(observation_action + (observation, int(reward)))
    return steps  # list of tuples: (s, a, s', R)
```

In [116]:

```
size = list(state_space_size) + [2]
Q = np.zeros(size)
Q.shape
```

Out[116]:

(33, 12, 2, 2)

In [117]:

```
from collections import defaultdict
returns = defaultdict(list)
```

In [118]:

```
nb_of_episodes = 500000

def monte_carlo_optimal_policy(nb_of_episodes,
                               policy=np.random.binomial(n=1, p=0.5, size=state_space_size),
                               run_episode=run_episode_exploring_start):
    for i in range(nb_of_episodes):
        # (a) generate an episode using exploring starts and pi
        observations_reward = run_episode(policy)
        G = 0.  # current return
        # use a dict for the first-visit condition (iterating in reverse
        # keeps the return of the first visit for each state):
        o_a = {}  # map from states to (action, return)-tuples
        for o0, a, o, r in reversed(observations_reward):
            G = r + gamma * G
            o_a[o0] = a, G
        # (b) for each pair (s, a) appearing in the episode
        for o, (a, G) in o_a.items():
            if o is not None:
                returns[(o, a)].append(G)
                re_mean = np.array(returns[(o, a)]).mean()
                Q[o + (a,)] = re_mean
                # for each s in the episode: improve the policy
                policy[o] = np.argmax(Q[o])
    return policy

policy = monte_carlo_optimal_policy(nb_of_episodes)
```

In [119]:

```
Q[5, 8, 0, 1]
```

Out[119]:

-0.15549859009756098

In [120]:

```
policy[5,5,0]
```

Out[120]:

1

In [121]:

```
import scipy.ndimage

def plot_policy_useable_ace(policy):
    # note: despite its name, this function plots the *no* usable ace slice (index 0)
    B = np.arange(4 - 0.5, 22 - 0.5, 0.2)
    A = np.arange(1 - 0.5, 11 - 0.5, 0.2)
    A, B = np.meshgrid(A, B)
    Po = scipy.ndimage.zoom(policy[4:22, 1:11, 0], 5)
    levels = range(-1, 2)
    plt.figure(figsize=(7, 6))
    CS = plt.contourf(A, B, Po, levels)
    cbar = plt.colorbar(CS)
    cbar.ax.set_ylabel('actions')
    plt.title('Policy for no usable ace')
    plt.xlabel("Dealer showing")
    plt.ylabel("Player sum")
    _ = plt.xticks(range(1, 11))
    _ = plt.yticks(range(4, 22))
```

In [122]:

```
# TODO improve plots
plot_policy_useable_ace(policy)
```

In [123]:

```
def plot_policy_no_useable_ace(policy):
    # note: despite its name, this function plots the usable ace slice (index 1)
    B = np.arange(12 - 0.5, 22 - 0.5, 0.2)
    A = np.arange(1 - 0.5, 11 - 0.5, 0.2)
    A, B = np.meshgrid(A, B)
    Po = scipy.ndimage.zoom(policy[12:22, 1:11, 1], 5, mode='nearest')
    levels = range(-1, 2)
    plt.figure(figsize=(7, 6))
    CS = plt.contourf(A, B, Po, levels)
    cbar = plt.colorbar(CS)
    cbar.ax.set_ylabel('actions')
    plt.title('Policy for usable ace')
    plt.xlabel("Dealer showing")
    plt.ylabel("Player sum")
    _ = plt.xticks(range(1, 11))
    _ = plt.yticks(range(12, 22))
```

In [124]:

```
plot_policy_no_useable_ace(policy)
```

If the *exploring starts* condition cannot be satisfied, because not every state $s \in \mathcal S$ can be a start state, we need another strategy for exploration.

Read [SuBa] 5.4 On-Policy Monte Carlo Control

A simple method is *$\epsilon$-greedy*: with probability $\epsilon$ we choose a random action.
So the policy we are following is non-deterministic: $\pi(a \mid s)$.

In the implementation the action stored in the array `policy` is chosen with probability $1-\epsilon/2$ and the alternative action with probability $\epsilon/2$.

Note that for each state $s \in \mathcal S$: $|\mathcal A(s)| = 2$.

Note that the final policy is not the deterministic policy stored in `policy`. It's a soft $\epsilon$-greedy policy: the actions should be taken from `policy` with probability $1-\epsilon/2$ and the alternative action with probability $\epsilon/2$.

In [125]:

```
def run_episode_epsilon_greedy(policy, epsilon=0.05, env=env):
    steps = []
    observation = observation_clean(env.reset())
    done = False
    steps.append((None, None) + (observation, 0))
    while not done:
        if np.random.rand() > epsilon:
            action = policy[observation]
        else:
            action = np.random.binomial(n=1, p=0.5)
        observation_action = (observation, action)
        observation, reward, done, info = env.step(action)
        observation = observation_clean(observation)
        steps.append(observation_action + (observation, int(reward)))
    return steps  # list of tuples: (s, a, s', R)
```

In [126]:

```
policy = monte_carlo_optimal_policy(nb_of_episodes, run_episode=run_episode_epsilon_greedy)
```

In [127]:

```
# action taken with probability 1 - epsilon/2
plot_policy_useable_ace(policy)
```

In [128]:

```
# action taken with probability 1 - epsilon/2
plot_policy_no_useable_ace(policy)
```

Goal:

- following a policy $\pi'$ (*behavior policy*)
- estimate $v^\pi$ resp. $q^\pi$ for a different policy $\pi$ (*estimation policy*)
- $\pi \neq \pi'$

Requirement:

- if $\pi(a \mid s) > 0$ then $\pi'(a \mid s) > 0$, so that we have episodes in which action $a$ is taken in state $s$.

Idea: "importance sampling of whole episodes", see [SuBa] 5.5 Evaluating One Policy While Following Another.

So the *greedy* policy (*estimation policy*) can be evaluated and optimized
while following the *soft $\epsilon$-greedy* policy (*behavior policy*),
see [SuBa] 5.6 Off-Policy Monte Carlo Control.
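As a sketch of the idea: the importance weight of a whole episode is the product of the per-step probability ratios $\pi(A_t \mid S_t)/\pi'(A_t \mid S_t)$. The helper `episode_importance_weight` and the toy policy tables `pi` and `pi_b` below are hypothetical, for illustration only:

```python
def episode_importance_weight(episode, pi, pi_b):
    """Product of pi(a|s) / pi_b(a|s) over an episode (sketch).

    episode: list of (state, action) pairs
    pi:   estimation policy, pi[s][a] = probability of action a in state s
    pi_b: behavior policy (must be > 0 wherever pi is > 0)
    """
    rho = 1.0
    for s, a in episode:
        rho *= pi[s][a] / pi_b[s][a]
    return rho

# toy example with 2 states and 2 actions:
pi = {0: [1.0, 0.0], 1: [0.0, 1.0]}    # deterministic estimation policy
pi_b = {0: [0.9, 0.1], 1: [0.1, 0.9]}  # soft behavior policy
rho = episode_importance_weight([(0, 0), (1, 1)], pi, pi_b)
# rho = (1.0/0.9) * (1.0/0.9), roughly 1.235
```

The episode's return is then weighted by `rho` before averaging; episodes containing an action the estimation policy would never take get weight 0.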

- Update $v(s)$ incrementally after episode $S_1, A_1, R_2, \dots, S_T$
- For each state $S_t$ with return $G_t$
- $N(S_t) \leftarrow N(S_t) +1$
- $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$

- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
- $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$
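A quick sketch (with made-up toy returns) confirming that the incremental update with step size $1/N$ reproduces the plain mean of all observed returns:

```python
import numpy as np

returns = [1.0, -1.0, 0.0, 1.0]  # toy sequence of observed returns G_t

# incremental implementation: V <- V + (1/N) * (G - V)
V, N = 0.0, 0
for G in returns:
    N += 1
    V += (G - V) / N

# V now equals the plain mean of the returns (0.25 here)
assert abs(V - np.mean(returns)) < 1e-12
```

Replacing `1/N` by a constant `alpha` yields the running mean of the last bullet, which exponentially forgets old episodes.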

- Modify the code for first-visit MC policy evaluation and control to use the incremental implementation.

- [SuBa] Sutton, Richard S., and Andrew G. Barto. *Reinforcement Learning: An Introduction*. Cambridge, MA: MIT Press, 1998.
- Szepesvári, Csaba. *Algorithms for Reinforcement Learning*. Synthesis Lectures on Artificial Intelligence and Machine Learning 4.1 (2010): 1-103.
- Silver, David. UCL Course on Reinforcement Learning.