Black-Box-Policy-Optimization slides

Black Box Policy Gradient

Here we demonstrate how the policy can be directly be learned by the cross entropy method (some kind of evolutionary algorithm) as suggested in the Open AI Deep Reinforcement Learning Tutorial.

The core idea of the cross entropy method is similar to evolution strategy[Wie14]:

  • maintain a search distribution
  • sampling from the search distribution
  • and evaluate the samples for updating the search distribution, iteratively.

Here we use a search distribution a multivariate Gaussian with diagonal covariance matrix. To update the search distribution the best offsprings (elite) are used to calculate the new parameters of the Gaussian (the mean and the variance).

Cart Pole Problem

In [2]:
import numpy as np
import numpy.testing as npt

import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
import gym
env = gym.make('CartPole-v0')
[2017-05-12 09:25:17,431] Making new env: CartPole-v0
In [4]:
In [5]:
observation = env.reset()
#[position of cart, velocity of cart, angle of pole, rotation rate of pole]
array([-0.0258769 , -0.04715221,  0.01703425, -0.03715459])
In [6]:

The action space is one-dimensional with two discrete actions $a \in \{0, 1\}$.

We use (similar to logistic regression) a log-linear model (logistic regression) for the probability of action $a=1$ for state $\vec s$ parametrized by $\theta = \{ b, \vec w\}$

$$ p(a=1 \mid \vec s; \theta) = \frac{1}{1+\exp(-(\vec s \cdot \vec w)+b)} $$

The probability of action $a=0$ is $p(a=0 \mid \vec s; \theta) = 1-p(a=1 \mid \vec s; \theta)$

In [7]:
def logistic_function(z):
    return 1/(1 + np.exp(-z))
In [8]:
plt.figure(figsize=(3, 1.5))
z = np.arange(-5, 5, 0.1)
plt.plot(z, logistic_function(z), 'b-')
plt.title("logistic function")
plt.ylim(-0.1, 1.1)
(-0.1, 1.1)

Sampling the action for an observation (state) and a fixed $\theta$ from $p(a \mid \vec s; \theta)$:

In [9]:
def get_action(theta, observation):
    p =  logistic_function([1:], observation) + theta[0])
    return np.random.binomial(n=1, p=p, size=1)[0]

Implementation of Algorithm 1: Cross Entropy Method:

Note: The environment stops automatically after 200 steps.

In [13]:
def optimize_policy(nb_generations = 200, 
                    nb_offsprings = 250,
                    nb_elite = 50, 
    # start with a random mean and random diagonal covariance matrix
    mu = np.random.rand(n) * 10. - 5.
    cov = np.eye(n) * np.random.rand(n) * 10.
    for ii in range(nb_generations):
        # Draw random samples for the offsprings from a multivariate normal distribution
        offsprings = np.random.multivariate_normal(mean=mu, cov = cov, size=nb_offsprings)
        # the returns (fittness value) for each offspring
        returns = np.zeros(nb_offsprings)
        for i, theta in enumerate(offsprings):
            done = False
            observation  = env.reset()
            return_ = 0
            # run the agent in the environment to get the corresponding return (fitnesvalue)
            while not done:
                action = get_action(theta, observation)
                observation, reward, done, info = env.step(action)
                return_ += reward
            returns[i] = return_     

        # sort the offsprings according to the fitness values 
        # and compute the new gaussian parameters (mu and cov) from the 
        # best offsprings
        rang = np.argsort(returns)[::-1]
        mu = offsprings[rang[:nb_elite]].mean(axis=0)
        cov = np.eye(n) * offsprings[rang[:nb_elite]].var(axis=0)
        #print progress
        #if ii%10 == 0: 
        #    print (ii, returns[rang[:nb_elite]].mean())
        if returns[rang[:nb_elite]].mean() >= 200.:
            return mu

    return mu
theta = optimize_policy()
In [14]:
array([ 0.15929439, -0.59793475,  1.06206644,  4.38717968,  3.77750642])

Use the following code snippet for

In [15]:
observation  = env.reset()
for t in range(5000):
    action = get_action(theta, observation)
    #print ("random action:", action)
    observation, reward, done, info = env.step(action)
    # observation =  
    #  [position of cart, velocity of cart, angle of pole, angular velocity of pole]
    #print (observation, reward, done, info)
    if done:
        print("Episode finished after {} timesteps".format(t+1))
Episode finished after 116 timesteps

Open Questions / Exercises

  • Try other kinds of evolutionary algorithms, e.g. evolution stategy [Wie14] and compare the results.
  • Does it help to fit the full covariance matrix not only the diagonal elements?