Table of Contents

  • Basic Prerequisite Knowledge
  • Motivation for Policy Gradients
  • Outlining the Policy Gradient Exploration Environment
  • Deriving the Policy Gradient
  • So which is better, Q-Learning or Policy Gradients?
  • Reducing Variance Without Creating Bias
  • Relevant Topics
  • Sources

Basic Prerequisite Knowledge

  • States and Actions: Signified by $s_t$ and $a_t$ respectively. Imagine yourself trying to walk through a field to get somewhere. There are two things you need to do to avoid being a vegetable. One, have some knowledge of what’s going on around you. That’s your state. Two, you need to actually do stuff to get there. That’s your action.
  • Trajectory: A trajectory is simply the path you take to try to get to your goal. I also call it a series of “state-action pairs”: the trajectory grows each time you evaluate your state and then take an action (See, it’s so easy a caveman could do it. Rest in peace, old GEICO commercials).
  • Policy: The thing our RL agent uses to choose which actions to take, written $\pi_\theta$ later on.
  • Q-Learning: A method that approximates the value of each action. The policy here is simply to take whichever action has the higher estimated value once the algorithm converges. Unfortunately, this usually means the actions and observations have to be discrete (finite); otherwise you would have to estimate values for an infinite number of actions, which is infeasible (see the small sketch right after this list).
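
To make the Q-learning bullet concrete, here is a tiny tabular sketch. The state/action counts, the Q-table, and the update hyperparameters are all hypothetical illustrations of mine, not something from the original post:

```python
import numpy as np

# Hypothetical tabular setup: 5 discrete states, 3 discrete actions.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))  # table of estimated action values (zeros just for illustration)

def greedy_action(Q, state):
    """Pick the action whose estimated value is highest in this state."""
    return int(np.argmax(Q[state]))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update for a single (s, a, r, s') transition."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped estimate of the return
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward that target
```

Once `Q` has converged, the policy is just `greedy_action`; the catch, as noted above, is that this table only works for a finite set of states and actions.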

Motivation for Policy Gradients

Policy gradient methods perform direct gradient updates on the policy, which allows them to handle continuous or high-dimensional state and action spaces. Other methods, such as Q-learning approaches like DQN, instead try to approximate the optimal policy by using Q-values to represent how good each state-action pair is, which means they can only handle discrete, low-dimensional action spaces. This is why policy gradients are valuable in end-to-end control tasks such as robotics.

Outlining the Policy Gradient Exploration Environment

Since we are exploring some environment during which we take a series of steps, we define our agent’s trajectory $\tau$ using state, action, reward tuples:

$$\tau = (s_0, a_0, r_0,\; s_1, a_1, r_1,\; \ldots,\; s_T, a_T, r_T)$$

For a policy gradient system, an agent executes a trajectory by choosing actions via a policy $\pi_\theta$ and arriving in new states by performing each action under the conditions of the environment dynamics model $P$. In math language this means $a_t \sim \pi_\theta(a_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. State transitions are modeled by a distribution because taking action $a_t$ in the environment may not deterministically result in the expected state, due to noise and other environmental factors.
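
Here is a minimal rollout sketch of that loop. It assumes a Gymnasium-style `env` and a `sample_action` function standing in for the policy $\pi_\theta$; both names are placeholders of my own, not part of the original post:

```python
def collect_trajectory(env, sample_action, max_steps=1000):
    """Roll out one episode and return it as a list of (state, action, reward) tuples."""
    trajectory = []
    state, _ = env.reset()                      # s_0 ~ mu (initial state distribution)
    for _ in range(max_steps):
        action = sample_action(state)           # a_t ~ pi_theta(a_t | s_t)
        next_state, reward, terminated, truncated, _ = env.step(action)  # s_{t+1} ~ P(. | s_t, a_t)
        trajectory.append((state, action, reward))
        state = next_state
        if terminated or truncated:
            break
    return trajectory
```

Each tuple appended here is one $(s_t, a_t, r_t)$ entry of the trajectory $\tau$ defined above.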

Deriving the Policy Gradient

Objective Function

The whole point of this reinforcement learning method (and all RL methods, for that matter) is to optimize some objective function. In our case that means simply maximizing our expected reward $R(\tau)$ by optimizing our policy parameters $\theta$.

There are many types of reward functions, for example a Monte Carlo reward function would look like:

$$R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$$

Note: The discount factor $\gamma$ simply uses the geometric series convergence rule to enforce the reward to be finite under $0 \le \gamma < 1$ as $T \to \infty$.
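
To spell out that convergence argument (my own one-line illustration): if every per-step reward is bounded by some constant $r_{\max}$, the geometric series gives

$$\left|\sum_{t=0}^{\infty} \gamma^{t} r_t\right| \;\le\; r_{\max} \sum_{t=0}^{\infty} \gamma^{t} \;=\; \frac{r_{\max}}{1-\gamma} \quad \text{for } 0 \le \gamma < 1.$$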

However, we will use a generic reward function for our policy gradient derivation to keep it simple and universally applicable.

As with all gradient-based learning methods, in order to maximize our objective function, we need to find its gradient. Once we find the gradient, we can perform gradient ascent (if you want to maximize a reward function) or gradient descent (if you want to minimize a loss function) to achieve our goal. In our case we want to perform gradient ascent on our reward function, and to remind you, our objective function is:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big] = \int p_\theta(\tau)\, R(\tau)\, d\tau$$
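
Concretely, once we have $\nabla_\theta J(\theta)$, a gradient ascent step (with some learning rate $\alpha$, my notation) is just

$$\theta \;\leftarrow\; \theta + \alpha\, \nabla_\theta J(\theta)$$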

Log-Likelihood Trick

To find the gradient, we use the log-likelihood trick, also known as the log-derivative trick.

Leibniz Integral Rule [1] [2] lets us move the gradient inside the integral over trajectories. (This rule is tricky. If you don’t want to get into it, just assume it applies in this case; otherwise, feel free to dive deep.)

$\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta \log p_\theta(\tau)$ because $\nabla_x \log f(x) = \frac{1}{f(x)}\cdot \nabla_x f(x)$
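
Chaining the two facts above together, the standard derivation (reconstructed here) goes:

$$
\nabla_\theta J(\theta)
= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]
$$

The second equality is the Leibniz interchange of gradient and integral; the third is the log-derivative identity.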

Up to this point, this is what we have:

Objective Function

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]$$

Policy Gradient (unfinished)

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$

Deriving Log Probability of a Trajectory

We know the reward function $R(\tau)$ because it is hand-crafted, so all we now need to derive is the remaining unknown term, the log probability of an entire trajectory, $\log p_\theta(\tau)$. To do that we need the following prerequisite information (the derivation itself is sketched right after this list).

  1. Sample from the initial state distribution $\mu$, where the probability of getting some initial state $s_0$ is $\mu(s_0)$, i.e. $s_0 \sim \mu$.
  2. Reminder from earlier:
    • $\pi_\theta(a_t \mid s_t) =$ probability of taking action $a_t$, given state $s_t$, using policy $\pi_\theta$
    • $P(s_{t+1} \mid s_t, a_t)$ = probability of getting to state $s_{t+1}$ from state $s_t$ by taking the action $a_t$ (given by the policy), given the noisy dynamics of the environment.
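
Putting those pieces together (a standard derivation, reconstructed here to fill in the steps): the probability of a trajectory factors into the initial state distribution, the policy, and the dynamics, and every factor that does not depend on $\theta$ disappears once we take the gradient of the log:

$$
p_\theta(\tau) \;=\; \mu(s_0)\, \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)
$$

$$
\nabla_\theta \log p_\theta(\tau) \;=\; \underbrace{\nabla_\theta \log \mu(s_0)}_{0} \;+\; \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \;+\; \underbrace{\sum_{t=0}^{T} \nabla_\theta \log P(s_{t+1} \mid s_t, a_t)}_{0} \;=\; \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

This is the piece we plug back into the unfinished policy gradient from above.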

Final Policy Gradient Definition

Combining all of our information together, we finally have our policy gradient, defined as **the expectation over trajectories of the reward $R(\tau)$ weighted by the sum of the gradients of the log-probabilities of the actions our policy chose**:

Objective Function

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]$$

Final Policy Gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
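
As a concrete (and deliberately minimal) illustration, here is a single-trajectory REINFORCE-style update in PyTorch. The network shape, dimensions, optimizer, and hyperparameters are my own assumptions, not the post’s:

```python
import torch
import torch.nn as nn

# Hypothetical policy network for a discrete action space.
obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, rewards):
    """One gradient-ascent step on J(theta) estimated from a single trajectory.

    states:  tensor of shape (T, obs_dim)
    actions: tensor of shape (T,) with integer action indices
    rewards: tensor of shape (T,) with per-step rewards
    """
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    R = rewards.sum()                     # generic trajectory reward R(tau)
    loss = -(log_probs.sum() * R)         # minimizing -J(theta) == gradient ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Autograd differentiates the summed log-probabilities, so minimizing `-(log_probs.sum() * R)` performs ascent along $\big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big) R(\tau)$. This single-trajectory Monte Carlo estimate is exactly the high-variance estimator discussed next.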

So which is better, Q-Learning or Policy Gradients?

Conveniently for me, this was explained very well by blogger Felix Yu in his own blog post about DQN vs PG. Here’s his breakdown:

Policy Gradients is generally believed to be able to apply to a wider range of problems. For instance, on occasions when the Q function (i.e. reward function) is too complex to be learned, DQN will fail miserably. On the other hand, Policy Gradients is still capable of learning a good policy since it directly operates in the policy space. Furthermore, Policy Gradients usually show faster convergence rate than DQN, but has a tendency to converge to a local optimal. Since Policy Gradients model probabilities of actions, it is capable of learning stochastic policies, while DQN can’t. Also, Policy Gradients can be easily applied to model continuous action space since the policy network is designed to model probability distribution, on the other hand, DQN has to go through an expensive action discretization process which is undesirable.

You may wonder if there are so many benefits of using Policy Gradients, why don’t we just use Policy Gradients all the time and forget about Q Learning? It turns out that one of the biggest drawbacks of Policy Gradients is the high variance in estimating the gradient of $\mathbb{E}[R(\tau)]$. Essentially, each time we perform a gradient update, we are using an estimation of gradient generated by a series of data points accumulated through a single episode of game play. This is known as Monte Carlo method. Hence the estimation can be very noisy, and bad gradient estimate could adversely impact the stability of the learning algorithm. In contrast, when DQN does work, it usually shows a better sample efficiency and more stable performance.

Reducing Variance Without Creating Bias

As noted above, a recurring issue with reinforcement learning, particularly for policy gradients, is variance in our estimates of the gradient $\nabla_\theta J(\theta)$. Agents have difficulty exploring and learning in a smooth, well-defined manner without inducing a biased exploration path. To prevent this post from getting too long, I have moved variance reduction methods to different blog posts. Click below to move through them.

Relevant Topics

  • (DDPG) Deep Deterministic Policy Gradients

Sources