• Stochastic Policy
• Python implementation for the openAI gym Cartpole Example (discrete action space)

(Parametrized) stochastic policy:

$$\pi_\theta(s, a) = p(a \mid s; \theta)$$

Goal: Maximization of the expected reward of the policy by learning of the parameters $\theta$.

$$\nabla_\theta \mathbb E(R)$$
• $R$: reward function

The expectation of a function $f(X)$ is defined as: $$\mathbb E_{X \sim p(X\mid \theta)} [f(X)] = \int_\mathcal X f(X) p(X\mid \theta) dX$$

If we want to compute the gradient of the expectation we can rearrange the formula: $$\nabla_\theta \mathbb E_{X\sim p(X\mid \theta)} [f(X)] = \nabla_\theta \left( \int_\mathcal X f(X) p(X\mid \theta) dX \right) =\\ \int_\mathcal X f(X) \nabla_\theta \left( p(X\mid \theta) \right) dX =\\ \int_\mathcal X f(X) p(X\mid \theta) \frac{\nabla_\theta \left( p(X\mid \theta) \right) }{p(X\mid \theta)}dX =\\ \int_\mathcal X f(X) p(X\mid \theta)\nabla_\theta \left(\log p(X\mid \theta) \right) dX =\\ \mathbb E_{X\sim p(X\mid \theta)} \left[f(X) \nabla_\theta \left(\log p(X\mid \theta) \right) \right]$$

#### Score function¶

$$\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)$$

The score function is: $\nabla_\theta \log \pi_\theta(s,a)$

The score function describes how sensitive the stocastic policy $\pi$ is to changes in $\theta$ for a fixed $\theta$.

#### Softmax Policy¶

• Discrete action space.

Linear model for th unnormalized log-probability: $\phi(s,a)^T \theta$,
i.e. weighting of the actions using a linear combination of features $\phi(s,a)$.

Exponentiation and normalization gives the probability for an action $a$ in state $s$:

$$\pi_\theta (s, a) = \frac{ e^{\phi(s,a)^T \theta }}{\sum_{a' \in \mathcal A} e^{\phi(s,a')^T \theta} }$$

The score function for softmax policy:

$$\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb E_{\pi_\theta}[\phi(s,\cdot )]$$

### One-Step MDP¶

• starting in state $s \sim d(s)$
• terminating after one step with reward $R \sim \mathcal R_{s,a}$
$$J(\theta) = \mathbb{E}[R] = \sum_{s\in \mathcal S} d(s)\sum_{a\in \mathcal A} \pi_\theta(s,a)\mathcal R_{s,a}$$$$\nabla_\theta J(\theta) = \sum_{s\in \mathcal S} d(s) \sum_{a\in \mathcal A} \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) \mathcal R_{s,a} = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) R]$$

• For multiple step MDPs the Policy Gradient Theorem generalizes the likelihood ratio approach.
• replace instantaneous reward $R$ with long-term value $Q^{\pi_\theta}(s,a)$

For any differentiable policy $\pi_\theta(s,a)$,
for any of the policy objective functions $J = J_1, J_{avgR}$, or $\frac{1}{1-\gamma}J_{avgV}$ the policy gradient is (proof see appendix of [Su00]):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]$$

Practical value of the theorem: the computation of the policy gradient reduces to a simple expectation.

### REINFORCE (Monte Carlo Policy Gradient)¶

Estimate the state-action value $Q^{\pi_\theta}(s, a)$ by the discounted reward (return) of the state $G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t} R_{T}$

The estimated gradient for one episode is $$\nabla_\theta J(\theta) \approx \sum_{t=1}^T G_t \nabla_\theta \log \pi_\theta(s, a)$$

In [9]:
import numpy as np
import numpy.testing as npt

import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
import gym
env = gym.make('CartPole-v0')

[2017-01-05 11:24:37,138] Making new env: CartPole-v0


Here we have just zwo actions "left"($a=0$) and "right"($a=1$), so we can use the sigmoid (instead of the softmax) for

$$\pi_\theta (s, a=1) = \text{sigmoid}(\vec \phi(\vec s) \vec \theta + b)$$

Here we don not use (special) features. We just use the states as features: $$\vec \phi(\vec s) = \vec s$$

In [11]:
def sigmoid(z):
return 1/(1 + np.exp(-z))

In [12]:
def augmented_states(states):
aug = np.zeros
if len(states.shape)==2:
return np.concatenate((aug(states.shape[0]).reshape(-1,1), states), axis=1)
else:
return np.concatenate((aug(1), states))

In [29]:
def get_p(theta, states):
if len(theta)==5:
states_ = augmented_states(states) # for bias b (bias trick)
else:
states_ = states
return sigmoid(np.dot(states_, theta))

In [14]:
def get_action(theta, observation):
p =  get_p(theta, observation)
return np.random.binomial(n=1, p=p, size=1)[0]

In [15]:
def roll_out(theta, max_episode_length=10000):
actions = []
states = []
observation = env.reset()
rewards = []
done = False
i = 0
while not done: # TODO max epoch length
action = get_action(theta, observation)
states.append(observation)
actions.append(action)
observation, reward, done, info = env.step(action)
rewards.append(reward)

i+=1
if i>max_episode_length:
break

return np.array(rewards), np.array(states), np.array(actions)

#roll_out(theta)

In [16]:
n=5
theta = np.random.randn(n)

In [17]:
rewards, states, actions = roll_out(theta)
#rewards, states, actions

In [18]:
def get_discounted_sum_of_rewards(rewards, gamma=.99):
c = 0.
for i, r in enumerate(rewards[::-1]):
c = r + gamma * c

get_discounted_sum_of_rewards([0., 0., 0., 1.])

Out[18]:
array([ 0.970299,  0.9801  ,  0.99    ,  1.      ])
$$J(\theta) \approx \sum_{t=1}^{T} \pi(a_t \mid s_t, \theta) A_t$$

with

• $A_t$: advantage

Derivative of the sigmoid:

$$\frac{d}{dx}\text{sigmoid}(x)=\text{sigmoid}(x)(1-\text{sigmoid}(x))$$

Here: $$\pi_\theta(\vec s, a=1) = \text{sigmoid}(\vec s^T \vec \theta)$$

$$\pi_\theta(\vec s, a=0) = 1- \text{sigmoid}(\vec s^T \vec \theta)$$$$\nabla_\theta \log \pi_\theta(\vec s, a=1) = \frac{1}{\text{sigmoid}(\vec s^T\vec \theta)}\text{sigmoid}(\vec s^T\vec \theta)(1-\text{sigmoid}(\vec s^T \vec \theta)) \vec s = (1-\pi_\theta(\vec s, a=1)) \vec s$$$$\nabla_\theta \log \pi_\theta(\vec s, a=0) = - \frac{1}{1- \text{sigmoid}(\vec s^T\vec \theta)} \text{sigmoid}(\vec s^T\vec \theta)(1-\text{sigmoid}(\vec s^T\vec \theta)) = - \pi_\theta(\vec s, a=1) \vec s$$
In [32]:
def get_delta_theta(actions, pi, states, advantages, alpha):
p = get_p(theta, states)
states_ = augmented_states(states)
grad = advantages * (-(1. - actions) * p + (actions * (1. - p) )) * states_.T

# clip=25.


In [60]:
theta = np.random.randn(n)

nb_rollouts = 10001

b_rewards = 0.
ii=0
alpha_ = alpha = 0.001

for i in range(nb_rollouts):
rewards, states, actions = roll_out(theta)

if i%10==0: # learning rate decay
alpha =  alpha_ * 0.9

delta_theta = get_delta_theta(actions, theta, states, adv, alpha)
theta += delta_theta

# just for progress printing
b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards)
ii+=1
if i%250==0:
print (i, theta, b_rewards)
b_rewards = 0.
ii=0

0 [-0.17344432  0.64746428  0.70026023  1.22031105  0.34207346] 18.0
250 [-0.17344432 -0.13518532  2.09379432  3.41064048  4.84379737] 124.816
500 [-0.17344432 -3.01573871  2.98143799  5.86882375  6.95513493] 298.428
750 [ -0.17344432   1.70650257   2.99633254  10.31686178  12.2067736 ] 728.516
1000 [ -0.17344432  -0.77927747   1.34002176  11.08589519  12.72645203] 296.244
1250 [ -0.17344432  -1.39062293   4.61079084  13.38255253  15.50965289] 2276.736
1500 [ -0.17344432   2.29439042   1.02429169  13.92594869  17.82305295] 620.688
1750 [ -0.17344432  -0.99215483   0.8042833   14.27091883  18.84466158] 313.4
2000 [ -0.17344432   1.86845107  -0.49193008  14.37816835  18.93449913] 358.656
2250 [ -0.17344432  -2.64385232   2.22809242  14.89598477  18.34140148] 315.512
2500 [ -0.17344432  -3.63304082   2.40439079  14.8589223   18.36400163] 232.54
2750 [ -0.17344432   1.19635986   7.71113422  15.54115281  19.31382173] 1786.332
3000 [ -0.17344432   0.40614209   6.66550218  18.37557547  20.83997478] 2913.464
3250 [ -0.17344432  -6.79415416   4.4958923   18.32622474  20.90250196] 349.152
3500 [ -0.17344432  -5.64888397   6.21307541  18.38975317  20.8351785 ] 255.148
3750 [ -0.17344432  -3.83104739   7.28885398  18.75303348  22.29447309] 450.168
4000 [ -0.17344432   1.56367768  11.59714601  18.82763179  20.53774896] 1609.012
4250 [ -0.17344432  -3.82586597  15.20375561  20.24393617  21.76375826] 7400.452
4500 [ -0.17344432   1.50381509   8.64879678  20.89032646  27.73195129] 4161.032
4750 [ -0.17344432   4.14080306  11.87169708  20.81760842  28.41168812] 1280.944
5000 [ -0.17344432   3.13668488  14.04795733  21.11576592  33.14025838] 6680.832
5250 [ -0.17344432   1.80384902  16.65673573  22.24396779  32.96400492] 8052.008
5500 [ -0.17344432   4.28514843  21.36812508  22.5697242   32.85923548] 9307.632
5750 [ -0.17344432   1.85217412  17.554174    24.13065606  39.51369869] 7623.588
6000 [ -0.17344432   1.99825464  23.45130849  24.60408862  37.42754604] 9850.584
6250 [ -0.17344432   0.59196073  35.33622866  26.41746464  28.19440256] 5888.216
6500 [ -0.17344432   0.66762881  25.02187502  26.30635793  42.16254313] 8902.704
6750 [ -0.17344432  -6.68060115  24.34457221  28.14226668  40.63795847] 4903.96
7000 [ -0.17344432   1.10764232  28.63616497  28.7667267   37.68566725] 4599.712
7250 [ -0.17344432   2.50274902  28.00571915  28.343935    39.24390679] 10001.0
7500 [ -0.17344432   4.3386294   31.28127639  28.61524992  39.02199538] 9650.452
7750 [ -0.17344432   4.36962477  27.82249491  28.84423521  40.98967973] 10001.0
8000 [ -0.17344432  -0.20095448  32.69302076  32.58432046  36.21854967] 7287.032
8250 [ -0.17344432   3.67461367  34.04059165  30.59776953  37.98667424] 10001.0
8500 [ -0.17344432   2.26556557  29.52227394  32.85573795  41.6915911 ] 9988.384
8750 [ -0.17344432   5.50582341  26.75623069  33.40331956  43.65194207] 10001.0
9000 [ -0.17344432   6.84509867  32.38648962  33.4425614   43.27093226] 8263.116
9250 [ -0.17344432   1.80252043  18.32346     35.26901269  52.02101994] 10001.0
9500 [ -0.17344432   2.15276574  17.90183915  35.88422216  51.82852184] 9741.276
9750 [ -0.17344432   2.81284184  10.58769449  36.5159464   53.94835712] 8047.124
10000 [ -0.17344432   2.69117313   2.45619903  36.86371001  54.41119968] 2177.88


After improvement the average accumulated reward drops sometimes suddenly probably due to high variance.

In [19]:
def render(theta, max_episode_length=5000):
observation  = env.reset()
for t in range(max_episode_length):
env.render()
action = get_action(theta, observation)
#print ("random action:", action)
observation, reward, done, info = env.step(action)
# observation =
#  [position of cart, velocity of cart, angle of pole, angular velocity of pole]
#print (observation, reward, done, info)
if done:
print("Episode finished after {} timesteps".format(t+1))
break

env.render(close=True)

render(theta)

Episode finished after 362 timesteps


### REINFORCE with Baseline¶

for reducing variance

REINFORCE has very high variance.

One recipe for reducing the variance is to substract a baseline (also called control variate) from the return. The baseline have to have zero mean, so it will not effect the expectation.

As baseline the state value $v(s_t)$ function is used (see [Su00]).

The estimated gradient for one episode is $$\nabla_\theta J(\theta) \approx \sum_{t=1}^T (G_t - v(s_t)) \nabla_\theta \log \pi_\theta(s, a)$$

This can be pluged in an optimization algorithm (e.g. stochastic gradient descent) for optimizing the policy.

$$A_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t} R_{T}- v(s_t)$$

As baseline we use the value function of the state $s_t$. So we must estimate the baseline, too.

#### Linear approximation of the value function $v(\vec s)$¶

$$v_w(\vec s) = \vec s \cdot \vec w_{1:} + w_0$$

or with $\vec s' = (1, s_1, s_2, \dots , s_n)$ and $\vec w^T = (w_0, w_{1}, \dots w_n)$

$$v_w(\vec s) = \vec s' \cdot \vec w$$

Note: The cartpole example also works without the bias term $w_0$, which would simplyfy the code.

In [61]:
def augmented_states_(states, aug = np.ones):
states = np.abs(states)
if len(states.shape)==2:
return np.concatenate((aug(states.shape[0]).reshape(-1,1), states), axis=1)
else:
return np.concatenate((aug(1), states))

In [21]:
def get_values(s, w):
s_ = augmented_states_(s)
return np.dot(s_, w)


### Update rule for the Q-parameters¶

(see [Sut00]):

##### Objective:¶
$$J(w) = (\hat Q^\pi(s_t,a_t) - v_w(\vec s))^2$$

with

• $\hat Q^\pi(s_t,a_t)$: (unbiased) estimate of the value function, here the discounted sum of rewards.
##### Update is proportional to the gradient of the objective:¶
$$\Delta \vec w_t \propto \nabla_w (\hat Q^\pi(s_t,a_t) - v_w(\vec s))^2 = (\hat Q^\pi(s_t,a_t) - v_w(\vec s)) \nabla_w v_w(\vec s)$$

with the linear approximation $v_\phi(\vec s)$:

$$\nabla_w v_w(\vec s) = \nabla_w ( \vec s' \cdot \vec w) = \vec s'$$
In [55]:
def get_delta_w(w, rewards, states, gamma, lr):
states_ = augmented_states_(states)
approximated_values = get_values(states, w)
unbiased_estimate_of_values = get_discounted_sum_of_rewards(rewards, gamma)
grad = ((unbiased_estimate_of_values - approximated_values) * states_.T).sum(axis=1)

In [59]:
nb_rollouts = 3001

n=5
theta = np.random.randn(n)
phi = np.zeros(n)

b_rewards = 0.
ii=0

gamma=0.99
lr = lr_ = 1.e-5
alpha = alpha_ = 0.001

for i in range(nb_rollouts):

rewards, states, actions = roll_out(theta)

delta_phi = get_delta_phi(phi, rewards, states, gamma, lr)
phi += delta_phi

adv = get_discounted_sum_of_rewards(rewards, gamma) - get_values(states, phi)

if i%10==0: # learning rate decay
alpha =  alpha_ * 0.95

delta_theta = get_delta_theta(actions, theta, states, adv, alpha)
theta += delta_theta

b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards)
ii+=1
if i%250==0:
print (i, theta, b_rewards)
b_rewards = 0.
ii=0
#print ("values", get_values(states, phi)[:10])

0 [ 0.01842218 -0.1359281   0.49786343 -0.12111486  1.06499871] 26.0
250 [ 0.01842218 -1.44532046  0.92049717  2.43303793  4.95868241] 137.128
500 [ 0.01842218 -2.85044215  0.99612818  4.41423014  7.44668203] 276.976
750 [  0.01842218   1.63759757   0.43418764   9.67250223  10.62859679] 1736.328
1000 [  0.01842218  -0.30857537   4.64979793  10.90679252  10.00355492] 6843.608
1250 [  0.01842218   0.51386849   6.06029376  12.53461807   9.64501004] 6610.196
1500 [  0.01842218   0.4720908    5.78730549  12.93569815   9.31714145] 9904.092
1750 [  0.01842218   0.40232301   4.22548648  12.96596296  10.49275065] 9974.996
2000 [  0.01842218   0.67517062   5.37792374  12.63457478   9.84750427] 10000.572
2250 [  0.01842218   0.70275676   3.93617859  12.85015721  10.92324032] 9978.772
2500 [  0.01842218   0.53282516   2.41369364  13.29403836  11.73746162] 9913.228
2750 [  0.01842218   1.15124911   2.83149324  13.99078576  11.6775455 ] 9272.64
3000 [  0.01842218   0.86806383   3.97548125  14.40496683  11.36183965] 9866.676


Note: The optimization is now faster and much more stable due to reduced variance.

In [56]:
render(theta)


#### Exercise:¶

• Use tensorflow to implement the REINFORCE algorithm (with baseline correction)

Solution:

From $$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta\pi(a_t \mid s_t, \theta) A_t$$

follows the optimization objective (maximization):

$$J(\theta) \approx \sum_{t=1}^{T} \pi(a_t \mid s_t, \theta) A_t$$

with

• $A_t$: advantage
In [910]:
import tensorflow as tf

In [911]:
def sigmoid_tf(z):
return 1./(1. + tf.exp(-z))

In [912]:
sess = tf.InteractiveSession()

In [913]:
states_tf = tf.placeholder("float64", shape=[None, 4]) # for one episode
roll_out_values_tf = tf.placeholder("float64", shape=[None])
actions_tf = tf.placeholder("float64", shape=[None])


#### Critic¶

In [931]:
w = np.zeros(n)
w_tf = tf.Variable(w.reshape(n, -1))
w_bias = tf.Variable(np.array([0.]))

In [932]:
values_tf = tf.matmul(states_tf, w_tf) + w_bias

##### Critic Objective:¶
$$J(w) = (\hat Q^\pi(s_t,a_t) - v_w(\vec s))^2$$
In [933]:
loss_w = tf.reduce_mean(tf.square(roll_out_values_tf - values_tf))

In [934]:
global_step_w = tf.Variable(0, trainable=False)
starter_learning_rate_w = 0.00001
learning_rate_w = tf.train.exponential_decay(starter_learning_rate, global_step,
10, 0.9, staircase=True)

In [935]:
alpha = 1.e-2
#optimizer_w = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.5, use_nesterov=False)
train_step_w = optimizer.minimize(loss_w, var_list=[w_tf, w_bias])#, global_step=global_step)


#### Actor¶

In [936]:
n=4
theta = np.zeros(n)
theta_tf = tf.Variable(theta.reshape(n, -1))

In [937]:
advantage_tf = roll_out_values_tf - tf.reshape(values_tf, [-1])

In [938]:
p_1 = sigmoid_tf(tf.matmul(states_tf, theta_tf))
p_0 = 1. - p_1

low = 1.e-40
log_pi_theta = (actions_tf) * tf.reshape(tf.log(p_1+low), [-1]) + ((1. - actions_tf) *  tf.reshape(tf.log(p_0+low), [-1]))
#log_pi_theta = (actions_tf) * tf.reshape(tf.log(p_1), [-1]) + ((1. - actions_tf) *  tf.reshape(tf.log(p_0), [-1]))
loss = - gain # maximization!

In [939]:
global_step_theta = tf.Variable(0, trainable=False)
starter_learning_rate_theta = 0.001
learning_rate_theta = tf.train.exponential_decay(starter_learning_rate_theta, global_step_theta,
10, 0.9, staircase=True)

In [940]:
#alpha = 0.1e-6
optimizer_theta = tf.train.MomentumOptimizer(learning_rate=learning_rate_theta, momentum=0.9, use_nesterov=False)
train_step_theta = optimizer_theta.minimize(loss, var_list=[theta_tf])#, global_step=global_step)

In [ ]:
#init_op = tf.initialize_all_variables()
init_op = tf.global_variables_initializer()
sess.run(init_op)

In [ ]:
nb_rollouts = 2001

b_rewards = 0.
ii=0

for i in range(nb_rollouts):
theta = theta_tf.eval().reshape(n)
rewards, states, actions = roll_out(theta)

roll_out_values = get_discounted_sum_of_rewards(rewards)

# critic update
train_step_w.run(feed_dict={states_tf: states, roll_out_values_tf: roll_out_values})

# actor update
train_step_theta.run(feed_dict={states_tf: states, roll_out_values_tf: roll_out_values, actions_tf: actions.astype('float64')})

b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards)
ii+=1
if i%100==0:
print (i, theta, w_bias.eval(), b_rewards)
b_rewards = 0.
ii=0

0 [ 0.  0.  0.  0.] [ 0.11608857] 11.0
100 [ -1.56049969   2.04243144   3.57670503  10.86164029] [ 40.24051115] 121.71
200 [  6.40160546   4.53159559   8.75059618  18.13448152] [ 50.92948672] 201.22
300 [ -6.53755688  12.2942023   13.30752632  20.52017335] [ 66.09958111] 568.5
400 [  0.12475472  18.92140922  23.02470601  18.58555769] [ 81.80337178] 2696.77
500 [  0.47523261  13.68496541  23.92864933  27.20355766] [ 96.52455321] 9720.4
600 [  4.42803294  16.05978826  24.02920465  28.12144214] [ 98.68147345] 10001.0
700 [  2.80347074  19.52179063  24.84072479  26.07660755] [ 98.9644464] 9962.47
800 [  2.84854492  13.92065274  24.608145    29.4804424 ] [ 99.00206231] 10001.0
900 [  4.55903587  16.56951481  24.12132129  28.15532598] [ 99.00965304] 10001.0
1000 [  2.77052225   7.7353129   26.73923236  33.19656078] [ 98.84525669] 9674.24
1100 [  2.35017984   7.97580121  29.17981888  34.42012505] [ 98.83918708] 9554.62
1200 [  3.53064081  11.30330201  28.89434916  34.10424319] [ 98.97820397] 9890.4
1300 [  3.71952565  14.26994971  28.43304329  33.18542673] [ 99.00557453] 10001.0
1400 [  0.77805241  13.3472082   28.90803097  33.85287792] [ 99.00646988] 10001.0
1500 [  2.72427873  16.97220724  29.46873887  31.62260521] [ 99.0076811] 9984.29
1600 [  1.32072646  21.49520033  29.79434063  28.02529668] [ 99.00990881] 10001.0