Policy_Gradient_Introduction slides

Policy Gradient¶

Stochastic Policy
Python implementation for the openAI gym Cartpole Example (discrete action space)

(Parametrized) stochastic policy:

$$ \pi_\theta(s, a) = p(a \mid s; \theta) $$

Goal: Maximization of the expected reward of the policy by learning of the parameters $\theta$.

Definition of the policy gradient (e.g. for gradient descent)¶

$$ \nabla_\theta \mathbb E(R) $$

$R$: reward function

Score function gradient estimator¶

The expectation of a function $f(X)$ is defined as: $$ \mathbb E_{X \sim p(X\mid \theta)} [f(X)] = \int_\mathcal X f(X) p(X\mid \theta) dX $$

If we want to compute the gradient of the expectation we can rearrange the formula: $$ \nabla_\theta \mathbb E_{X\sim p(X\mid \theta)} [f(X)] = \nabla_\theta \left( \int_\mathcal X f(X) p(X\mid \theta) dX \right) =\\ \int_\mathcal X f(X) \nabla_\theta \left( p(X\mid \theta) \right) dX =\\ \int_\mathcal X f(X) p(X\mid \theta) \frac{\nabla_\theta \left( p(X\mid \theta) \right) }{p(X\mid \theta)}dX =\\ \int_\mathcal X f(X) p(X\mid \theta)\nabla_\theta \left(\log p(X\mid \theta) \right) dX =\\ \mathbb E_{X\sim p(X\mid \theta)} \left[f(X) \nabla_\theta \left(\log p(X\mid \theta) \right) \right] $$

Score function¶

$$ \nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) $$

The score function is: $\nabla_\theta \log \pi_\theta(s,a)$

The score function describes how sensitive the stocastic policy $\pi$ is to changes in $\theta$ for a fixed $\theta$.

Softmax Policy¶

Discrete action space.

Linear model for th unnormalized log-probability: $\phi(s,a)^T \theta$,
i.e. weighting of the actions using a linear combination of features $\phi(s,a)$.

Exponentiation and normalization gives the probability for an action $a$ in state $s$:

$$ \pi_\theta (s, a) = \frac{ e^{\phi(s,a)^T \theta }}{\sum_{a' \in \mathcal A} e^{\phi(s,a')^T \theta} } $$

The score function for softmax policy:

$$ \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb E_{\pi_\theta}[\phi(s,\cdot )] $$

One-Step MDP¶

starting in state $s \sim d(s)$
terminating after one step with reward $R \sim \mathcal R_{s,a}$

$$ J(\theta) = \mathbb{E}[R] = \sum_{s\in \mathcal S} d(s)\sum_{a\in \mathcal A} \pi_\theta(s,a)\mathcal R_{s,a} $$$$ \nabla_\theta J(\theta) = \sum_{s\in \mathcal S} d(s) \sum_{a\in \mathcal A} \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) \mathcal R_{s,a} = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) R] $$

Stochastic Policy Gradient Theorem¶

For multiple step MDPs the Policy Gradient Theorem generalizes the likelihood ratio approach.
replace instantaneous reward $R$ with long-term value $Q^{\pi_\theta}(s,a)$

For any differentiable policy $\pi_\theta(s,a)$,
for any of the policy objective functions $J = J_1, J_{avgR}$, or $\frac{1}{1-\gamma}J_{avgV}$ the policy gradient is (proof see appendix of [Su00]):

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)] $$

Practical value of the theorem: the computation of the policy gradient reduces to a simple expectation.

REINFORCE (Monte Carlo Policy Gradient)¶

Estimate the state-action value $Q^{\pi_\theta}(s, a)$ by the discounted reward (return) of the state $G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t} R_{T}$

The estimated gradient for one episode is $$ \nabla_\theta J(\theta) \approx \sum_{t=1}^T G_t \nabla_\theta \log \pi_\theta(s, a) $$

In [9]:

import numpy as np
import numpy.testing as npt

import matplotlib.pyplot as plt
%matplotlib inline

In [10]:

import gym
env = gym.make('CartPole-v0')

[2017-01-05 11:24:37,138] Making new env: CartPole-v0

Here we have just zwo actions "left"($a=0$) and "right"($a=1$), so we can use the sigmoid (instead of the softmax) for

$$ \pi_\theta (s, a=1) = \text{sigmoid}(\vec \phi(\vec s) \vec \theta + b) $$

Here we don not use (special) features. We just use the states as features: $$ \vec \phi(\vec s) = \vec s $$

In [11]:

def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [12]:

def augmented_states(states):
    aug = np.zeros
    if len(states.shape)==2:
        return np.concatenate((aug(states.shape[0]).reshape(-1,1), states), axis=1)
    else:
        return np.concatenate((aug(1), states))

In [29]:

def get_p(theta, states):
    if len(theta)==5:
        states_ = augmented_states(states) # for bias b (bias trick)
    else:
        states_ = states
    return sigmoid(np.dot(states_, theta))

In [14]:

def get_action(theta, observation):
    p =  get_p(theta, observation)
    return np.random.binomial(n=1, p=p, size=1)[0]

In [15]:

def roll_out(theta, max_episode_length=10000):
    actions = []
    states = []
    observation = env.reset()
    rewards = []
    done = False
    i = 0
    while not done: # TODO max epoch length
        action = get_action(theta, observation)
        states.append(observation)
        actions.append(action)
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        
        i+=1
        if i>max_episode_length:
           break 
        
    return np.array(rewards), np.array(states), np.array(actions)

#roll_out(theta)

In [16]:

n=5
theta = np.random.randn(n)

In [17]:

rewards, states, actions = roll_out(theta)
#rewards, states, actions

In [18]:

def get_discounted_sum_of_rewards(rewards, gamma=.99):
    adv = np.zeros_like(rewards)
    c = 0.
    for i, r in enumerate(rewards[::-1]):
        c = r + gamma * c
        adv[i] = c
    return adv[::-1]

get_discounted_sum_of_rewards([0., 0., 0., 1.])

Out[18]:

array([ 0.970299,  0.9801  ,  0.99    ,  1.      ])

$$ J(\theta) \approx \sum_{t=1}^{T} \pi(a_t \mid s_t, \theta) A_t $$

with

$A_t$: advantage

Derivative of the sigmoid:

$$ \frac{d}{dx}\text{sigmoid}(x)=\text{sigmoid}(x)(1-\text{sigmoid}(x)) $$

Here: $$ \pi_\theta(\vec s, a=1) = \text{sigmoid}(\vec s^T \vec \theta) $$

$$ \pi_\theta(\vec s, a=0) = 1- \text{sigmoid}(\vec s^T \vec \theta) $$$$ \nabla_\theta \log \pi_\theta(\vec s, a=1) = \frac{1}{\text{sigmoid}(\vec s^T\vec \theta)}\text{sigmoid}(\vec s^T\vec \theta)(1-\text{sigmoid}(\vec s^T \vec \theta)) \vec s = (1-\pi_\theta(\vec s, a=1)) \vec s $$$$ \nabla_\theta \log \pi_\theta(\vec s, a=0) = - \frac{1}{1- \text{sigmoid}(\vec s^T\vec \theta)} \text{sigmoid}(\vec s^T\vec \theta)(1-\text{sigmoid}(\vec s^T\vec \theta)) = - \pi_\theta(\vec s, a=1) \vec s $$

In [32]:

def get_delta_theta(actions, pi, states, advantages, alpha):
    p = get_p(theta, states)
    states_ = augmented_states(states)
    grad = advantages * (-(1. - actions) * p + (actions * (1. - p) )) * states_.T
    
    # clip gradient
    # clip=25.
    #if np.any(np.abs(grad)>clip):
        #print (np.max(grad))
        #grad = clip * grad/np.max(grad) 
          
    return  alpha * np.sum(grad, axis=1)

In [60]:

theta = np.random.randn(n)  

nb_rollouts = 10001

b_rewards = 0.
ii=0
alpha_ = alpha = 0.001

for i in range(nb_rollouts):  
    rewards, states, actions = roll_out(theta)
    adv = get_discounted_sum_of_rewards(rewards, gamma=0.99)
    
    if i%10==0: # learning rate decay
        alpha =  alpha_ * 0.9
        
    delta_theta = get_delta_theta(actions, theta, states, adv, alpha)
    theta += delta_theta 
    
    # just for progress printing
    b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards) 
    ii+=1
    if i%250==0:
        print (i, theta, b_rewards)
        b_rewards = 0.
        ii=0

0 [-0.17344432  0.64746428  0.70026023  1.22031105  0.34207346] 18.0
250 [-0.17344432 -0.13518532  2.09379432  3.41064048  4.84379737] 124.816
500 [-0.17344432 -3.01573871  2.98143799  5.86882375  6.95513493] 298.428
750 [ -0.17344432   1.70650257   2.99633254  10.31686178  12.2067736 ] 728.516
1000 [ -0.17344432  -0.77927747   1.34002176  11.08589519  12.72645203] 296.244
1250 [ -0.17344432  -1.39062293   4.61079084  13.38255253  15.50965289] 2276.736
1500 [ -0.17344432   2.29439042   1.02429169  13.92594869  17.82305295] 620.688
1750 [ -0.17344432  -0.99215483   0.8042833   14.27091883  18.84466158] 313.4
2000 [ -0.17344432   1.86845107  -0.49193008  14.37816835  18.93449913] 358.656
2250 [ -0.17344432  -2.64385232   2.22809242  14.89598477  18.34140148] 315.512
2500 [ -0.17344432  -3.63304082   2.40439079  14.8589223   18.36400163] 232.54
2750 [ -0.17344432   1.19635986   7.71113422  15.54115281  19.31382173] 1786.332
3000 [ -0.17344432   0.40614209   6.66550218  18.37557547  20.83997478] 2913.464
3250 [ -0.17344432  -6.79415416   4.4958923   18.32622474  20.90250196] 349.152
3500 [ -0.17344432  -5.64888397   6.21307541  18.38975317  20.8351785 ] 255.148
3750 [ -0.17344432  -3.83104739   7.28885398  18.75303348  22.29447309] 450.168
4000 [ -0.17344432   1.56367768  11.59714601  18.82763179  20.53774896] 1609.012
4250 [ -0.17344432  -3.82586597  15.20375561  20.24393617  21.76375826] 7400.452
4500 [ -0.17344432   1.50381509   8.64879678  20.89032646  27.73195129] 4161.032
4750 [ -0.17344432   4.14080306  11.87169708  20.81760842  28.41168812] 1280.944
5000 [ -0.17344432   3.13668488  14.04795733  21.11576592  33.14025838] 6680.832
5250 [ -0.17344432   1.80384902  16.65673573  22.24396779  32.96400492] 8052.008
5500 [ -0.17344432   4.28514843  21.36812508  22.5697242   32.85923548] 9307.632
5750 [ -0.17344432   1.85217412  17.554174    24.13065606  39.51369869] 7623.588
6000 [ -0.17344432   1.99825464  23.45130849  24.60408862  37.42754604] 9850.584
6250 [ -0.17344432   0.59196073  35.33622866  26.41746464  28.19440256] 5888.216
6500 [ -0.17344432   0.66762881  25.02187502  26.30635793  42.16254313] 8902.704
6750 [ -0.17344432  -6.68060115  24.34457221  28.14226668  40.63795847] 4903.96
7000 [ -0.17344432   1.10764232  28.63616497  28.7667267   37.68566725] 4599.712
7250 [ -0.17344432   2.50274902  28.00571915  28.343935    39.24390679] 10001.0
7500 [ -0.17344432   4.3386294   31.28127639  28.61524992  39.02199538] 9650.452
7750 [ -0.17344432   4.36962477  27.82249491  28.84423521  40.98967973] 10001.0
8000 [ -0.17344432  -0.20095448  32.69302076  32.58432046  36.21854967] 7287.032
8250 [ -0.17344432   3.67461367  34.04059165  30.59776953  37.98667424] 10001.0
8500 [ -0.17344432   2.26556557  29.52227394  32.85573795  41.6915911 ] 9988.384
8750 [ -0.17344432   5.50582341  26.75623069  33.40331956  43.65194207] 10001.0
9000 [ -0.17344432   6.84509867  32.38648962  33.4425614   43.27093226] 8263.116
9250 [ -0.17344432   1.80252043  18.32346     35.26901269  52.02101994] 10001.0
9500 [ -0.17344432   2.15276574  17.90183915  35.88422216  51.82852184] 9741.276
9750 [ -0.17344432   2.81284184  10.58769449  36.5159464   53.94835712] 8047.124
10000 [ -0.17344432   2.69117313   2.45619903  36.86371001  54.41119968] 2177.88

After improvement the average accumulated reward drops sometimes suddenly probably due to high variance.

In [19]:

def render(theta, max_episode_length=5000):
    observation  = env.reset()
    for t in range(max_episode_length):
        env.render()
        action = get_action(theta, observation)
        #print ("random action:", action)
        observation, reward, done, info = env.step(action)
    # observation =  
    #  [position of cart, velocity of cart, angle of pole, angular velocity of pole]
    #print (observation, reward, done, info)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
        
    env.render(close=True)   

render(theta)

Episode finished after 362 timesteps

REINFORCE with Baseline¶

for reducing variance

REINFORCE has very high variance.

One recipe for reducing the variance is to substract a baseline (also called control variate) from the return. The baseline have to have zero mean, so it will not effect the expectation.

As baseline the state value $v(s_t)$ function is used (see [Su00]).

The estimated gradient for one episode is $$ \nabla_\theta J(\theta) \approx \sum_{t=1}^T (G_t - v(s_t)) \nabla_\theta \log \pi_\theta(s, a) $$

This can be pluged in an optimization algorithm (e.g. stochastic gradient descent) for optimizing the policy.

$$ A_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t} R_{T}- v(s_t) $$

As baseline we use the value function of the state $s_t$. So we must estimate the baseline, too.

Linear approximation of the value function $v(\vec s)$¶

$$ v_w(\vec s) = \vec s \cdot \vec w_{1:} + w_0 $$

or with $\vec s' = (1, s_1, s_2, \dots , s_n)$ and $\vec w^T = (w_0, w_{1}, \dots w_n)$

$$ v_w(\vec s) = \vec s' \cdot \vec w $$

Note: The cartpole example also works without the bias term $w_0$, which would simplyfy the code.

In [61]:

def augmented_states_(states, aug = np.ones):
    states = np.abs(states)
    if len(states.shape)==2:
        return np.concatenate((aug(states.shape[0]).reshape(-1,1), states), axis=1)
    else:
        return np.concatenate((aug(1), states))

In [21]:

def get_values(s, w):
    s_ = augmented_states_(s)
    return np.dot(s_, w)

Update rule for the Q-parameters¶

(see [Sut00]):

Objective:¶

$$ J(w) = (\hat Q^\pi(s_t,a_t) - v_w(\vec s))^2 $$

with

$\hat Q^\pi(s_t,a_t)$: (unbiased) estimate of the value function, here the discounted sum of rewards.

Update is proportional to the gradient of the objective:¶

$$ \Delta \vec w_t \propto \nabla_w (\hat Q^\pi(s_t,a_t) - v_w(\vec s))^2 = (\hat Q^\pi(s_t,a_t) - v_w(\vec s)) \nabla_w v_w(\vec s) $$

with the linear approximation $v_\phi(\vec s)$:

$$ \nabla_w v_w(\vec s) = \nabla_w ( \vec s' \cdot \vec w) = \vec s' $$

In [55]:

def get_delta_w(w, rewards, states, gamma, lr):
    states_ = augmented_states_(states)
    approximated_values = get_values(states, w)
    unbiased_estimate_of_values = get_discounted_sum_of_rewards(rewards, gamma)
    grad = ((unbiased_estimate_of_values - approximated_values) * states_.T).sum(axis=1)
    return lr * grad

In [59]:

nb_rollouts = 3001

n=5
theta = np.random.randn(n)  
phi = np.zeros(n) 

b_rewards = 0.
ii=0

gamma=0.99
lr = lr_ = 1.e-5
alpha = alpha_ = 0.001

for i in range(nb_rollouts):
    
    rewards, states, actions = roll_out(theta)
    adv = get_discounted_sum_of_rewards(rewards, gamma)
    
    delta_phi = get_delta_phi(phi, rewards, states, gamma, lr)
    phi += delta_phi
    
    adv = get_discounted_sum_of_rewards(rewards, gamma) - get_values(states, phi)
    
    if i%10==0: # learning rate decay
        alpha =  alpha_ * 0.95
        
    delta_theta = get_delta_theta(actions, theta, states, adv, alpha)
    theta += delta_theta 

    b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards) 
    ii+=1
    if i%250==0:
        print (i, theta, b_rewards)
        b_rewards = 0.
        ii=0
        #print ("values", get_values(states, phi)[:10])

0 [ 0.01842218 -0.1359281   0.49786343 -0.12111486  1.06499871] 26.0
250 [ 0.01842218 -1.44532046  0.92049717  2.43303793  4.95868241] 137.128
500 [ 0.01842218 -2.85044215  0.99612818  4.41423014  7.44668203] 276.976
750 [  0.01842218   1.63759757   0.43418764   9.67250223  10.62859679] 1736.328
1000 [  0.01842218  -0.30857537   4.64979793  10.90679252  10.00355492] 6843.608
1250 [  0.01842218   0.51386849   6.06029376  12.53461807   9.64501004] 6610.196
1500 [  0.01842218   0.4720908    5.78730549  12.93569815   9.31714145] 9904.092
1750 [  0.01842218   0.40232301   4.22548648  12.96596296  10.49275065] 9974.996
2000 [  0.01842218   0.67517062   5.37792374  12.63457478   9.84750427] 10000.572
2250 [  0.01842218   0.70275676   3.93617859  12.85015721  10.92324032] 9978.772
2500 [  0.01842218   0.53282516   2.41369364  13.29403836  11.73746162] 9913.228
2750 [  0.01842218   1.15124911   2.83149324  13.99078576  11.6775455 ] 9272.64
3000 [  0.01842218   0.86806383   3.97548125  14.40496683  11.36183965] 9866.676

Note: The optimization is now faster and much more stable due to reduced variance.

In [56]:

render(theta)

Exercise:¶

Use tensorflow to implement the REINFORCE algorithm (with baseline correction)

Solution:

From $$ \nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta\pi(a_t \mid s_t, \theta) A_t $$

follows the optimization objective (maximization):

$$ J(\theta) \approx \sum_{t=1}^{T} \pi(a_t \mid s_t, \theta) A_t $$

with

$A_t$: advantage

In [910]:

import tensorflow as tf

In [911]:

def sigmoid_tf(z):
    return 1./(1. + tf.exp(-z))

In [912]:

sess = tf.InteractiveSession()

In [913]:

states_tf = tf.placeholder("float64", shape=[None, 4]) # for one episode 
roll_out_values_tf = tf.placeholder("float64", shape=[None]) 
actions_tf = tf.placeholder("float64", shape=[None])

Critic¶

In [931]:

w = np.zeros(n)  
w_tf = tf.Variable(w.reshape(n, -1))
w_bias = tf.Variable(np.array([0.]))

In [932]:

values_tf = tf.matmul(states_tf, w_tf) + w_bias

Critic Objective:¶

$$ J(w) = (\hat Q^\pi(s_t,a_t) - v_w(\vec s))^2 $$

In [933]:

loss_w = tf.reduce_mean(tf.square(roll_out_values_tf - values_tf))

In [934]:

global_step_w = tf.Variable(0, trainable=False)
starter_learning_rate_w = 0.00001
learning_rate_w = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           10, 0.9, staircase=True)

In [935]:

alpha = 1.e-2
optimizer = tf.train.GradientDescentOptimizer(learning_rate=alpha)
#optimizer = tf.train.AdamOptimizer(alpha)
#optimizer_w = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.5, use_nesterov=False)
train_step_w = optimizer.minimize(loss_w, var_list=[w_tf, w_bias])#, global_step=global_step)

Actor¶

In [936]:

n=4 
theta = np.zeros(n)  
theta_tf = tf.Variable(theta.reshape(n, -1))

In [937]:

advantage_tf = roll_out_values_tf - tf.reshape(values_tf, [-1])

In [938]:

p_1 = sigmoid_tf(tf.matmul(states_tf, theta_tf))
p_0 = 1. - p_1

low = 1.e-40
log_pi_theta = (actions_tf) * tf.reshape(tf.log(p_1+low), [-1]) + ((1. - actions_tf) *  tf.reshape(tf.log(p_0+low), [-1]))
#log_pi_theta = (actions_tf) * tf.reshape(tf.log(p_1), [-1]) + ((1. - actions_tf) *  tf.reshape(tf.log(p_0), [-1]))
gain = tf.reduce_sum(advantage_tf * log_pi_theta)
loss = - gain # maximization!

In [939]:

global_step_theta = tf.Variable(0, trainable=False)
starter_learning_rate_theta = 0.001
learning_rate_theta = tf.train.exponential_decay(starter_learning_rate_theta, global_step_theta,
                                           10, 0.9, staircase=True)

In [940]:

#alpha = 0.1e-6
#optimizer_theta = tf.train.GradientDescentOptimizer(learning_rate=alpha)
#optimizer = tf.train.AdamOptimizer(alpha)
optimizer_theta = tf.train.MomentumOptimizer(learning_rate=learning_rate_theta, momentum=0.9, use_nesterov=False)
train_step_theta = optimizer_theta.minimize(loss, var_list=[theta_tf])#, global_step=global_step)

In [ ]:

#init_op = tf.initialize_all_variables()
init_op = tf.global_variables_initializer()
sess.run(init_op)

In [ ]:

nb_rollouts = 2001

b_rewards = 0.
ii=0

for i in range(nb_rollouts):
    theta = theta_tf.eval().reshape(n)
    rewards, states, actions = roll_out(theta)
    
    roll_out_values = get_discounted_sum_of_rewards(rewards)
    
    # critic update
    train_step_w.run(feed_dict={states_tf: states, roll_out_values_tf: roll_out_values})
    
    # actor update
    train_step_theta.run(feed_dict={states_tf: states, roll_out_values_tf: roll_out_values, actions_tf: actions.astype('float64')})
    
    b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards) 
    ii+=1
    if i%100==0:
        print (i, theta, w_bias.eval(), b_rewards)
        b_rewards = 0.
        ii=0

0 [ 0.  0.  0.  0.] [ 0.11608857] 11.0
100 [ -1.56049969   2.04243144   3.57670503  10.86164029] [ 40.24051115] 121.71
200 [  6.40160546   4.53159559   8.75059618  18.13448152] [ 50.92948672] 201.22
300 [ -6.53755688  12.2942023   13.30752632  20.52017335] [ 66.09958111] 568.5
400 [  0.12475472  18.92140922  23.02470601  18.58555769] [ 81.80337178] 2696.77
500 [  0.47523261  13.68496541  23.92864933  27.20355766] [ 96.52455321] 9720.4
600 [  4.42803294  16.05978826  24.02920465  28.12144214] [ 98.68147345] 10001.0
700 [  2.80347074  19.52179063  24.84072479  26.07660755] [ 98.9644464] 9962.47
800 [  2.84854492  13.92065274  24.608145    29.4804424 ] [ 99.00206231] 10001.0
900 [  4.55903587  16.56951481  24.12132129  28.15532598] [ 99.00965304] 10001.0
1000 [  2.77052225   7.7353129   26.73923236  33.19656078] [ 98.84525669] 9674.24
1100 [  2.35017984   7.97580121  29.17981888  34.42012505] [ 98.83918708] 9554.62
1200 [  3.53064081  11.30330201  28.89434916  34.10424319] [ 98.97820397] 9890.4
1300 [  3.71952565  14.26994971  28.43304329  33.18542673] [ 99.00557453] 10001.0
1400 [  0.77805241  13.3472082   28.90803097  33.85287792] [ 99.00646988] 10001.0
1500 [  2.72427873  16.97220724  29.46873887  31.62260521] [ 99.0076811] 9984.29
1600 [  1.32072646  21.49520033  29.79434063  28.02529668] [ 99.00990881] 10001.0

Literature and Links:¶

[Su00] Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour: Policy Gradient Methods for Reinforcement Learning with Function Approximation, Advances in Neural Information Processing Systems 12, pp. 1057-1063, MIT Press, 2000
R. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second edition, Chapter 13: Policy Gradient Methods, MIT press (in preparation)
Richard S. Sutton, Satinder Singh, David McAllester: Comparing Policy-Gradient Algorithms
Richard S. Sutton: Tutorial Introduction to Reinforcement Learning with Function Approximation, NIPS Dec 7, 2015
D. Silver. UCL Course on RL
Thomas Degris, Patrick M. Pilarski, Richard S. Sutton: Model-Free Reinforcement Learning with Continuous Action in Practice, 2012 American Control Conference (ACC). IEEE, 2012.
J. Schulman: Deep Reinforcement Learning: Video: Part 1 of 4, Slides, MLSS, 2016
Shakir: Machine Learning Trick of the Day (5): Log Derivative Trick