Description

Problem 1

We don’t know the true expectation in our objective function (the expected return under the current policy), so we estimate it by sampling trajectories using a Monte Carlo approach.
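A minimal sketch of what this looks like, assuming the objective is the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ and that we sample $N$ trajectories $\tau^{(1)}, \dots, \tau^{(N)}$ by running the current policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right] \approx \frac{1}{N}\sum_{i=1}^{N} R\!\left(\tau^{(i)}\right)$$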

Problem 2

Sampling trajectories and using the Monte Carlo return ($\sum_{t=0}^{T}\gamma^t r_{t+1}$) gives a high-variance estimator of the expected return: the policy and the environment dynamics are both stochastic, and chaining stochastic samples of states and actions into long trajectories produces returns that vary widely from rollout to rollout.
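A toy numerical sketch of why this matters (not from the original notes): treating each per-step reward as i.i.d. Gaussian noise is an assumption made purely for illustration, but it shows how the variance of the Monte Carlo return grows as trajectories get longer.

```python
import numpy as np

def monte_carlo_returns(horizon, num_rollouts=10_000, gamma=0.99, reward_std=1.0, seed=0):
    """Sample Monte Carlo returns (sum_t gamma^t * r_{t+1}) for a toy process in which
    every per-step reward is i.i.d. Gaussian noise, standing in for the randomness
    introduced by a stochastic policy and stochastic environment dynamics."""
    rng = np.random.default_rng(seed)
    # One row of rewards r_1 ... r_T per rollout.
    rewards = rng.normal(loc=0.0, scale=reward_std, size=(num_rollouts, horizon))
    discounts = gamma ** np.arange(horizon)  # gamma^0 ... gamma^(T-1)
    return rewards @ discounts               # one Monte Carlo return per rollout

# The empirical variance of the return grows as the horizon T gets longer.
for T in (10, 100, 1000):
    returns = monte_carlo_returns(horizon=T)
    print(f"T={T:5d}  variance of Monte Carlo return = {returns.var():.2f}")
```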

Using an Unbiased Baseline Function to Reduce Variance

  • Assume non-discounted rewards for simplicity ($R(\tau) = \sum_{t=0}^{T-1} r_t$)
  1. PG = the expected value of the cumulative (full) trajectory reward multiplied by the gradient of the log probability of taking that entire trajectory
  2. PG = the expected value of the sum of the reward at each timestep multiplied by the sum of the gradients of the log probabilities of the actions taken up to that timestep (both forms are written out below)
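Written out explicitly (a sketch consistent with the statements above, with $\pi_\theta(\tau)$ denoting the probability of the whole trajectory under the policy and environment dynamics):

$$\text{(1)}\quad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]$$

$$\text{(2)}\quad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$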

  a. $R(\tau)=\sum_{t=0}^{T-1}r_t$
  b. Linearity of expectation
  c. Same as the PG equation, but with the trajectory limited to the timestep of the current reward ($t’$) (reminder of the PG equation: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)]$)
  d. Linearity of expectation
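A sketch of the derivation these steps appear to annotate (reconstructed here, since the original equations are not shown), going from form (1) to form (2):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right] \overset{\text{(a)}}{=} \mathbb{E}_{\tau}\!\left[ \sum_{t'=0}^{T-1} r_{t'}\, \nabla_\theta \log \pi_\theta(\tau) \right] \overset{\text{(b)}}{=} \sum_{t'=0}^{T-1} \mathbb{E}_{\tau}\!\left[ r_{t'}\, \nabla_\theta \log \pi_\theta(\tau) \right]$$

$$\overset{\text{(c)}}{=} \sum_{t'=0}^{T-1} \mathbb{E}_{\tau}\!\left[ r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \overset{\text{(d)}}{=} \mathbb{E}_{\tau}\!\left[ \sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$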

Start from the log-likelihood trick
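Presumably the identity meant here is the score-function (log-likelihood) trick, stated for a generic function $f$ and a parameterized distribution $p_\theta$:

$$\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \right] = \mathbb{E}_{x \sim p_\theta}\!\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]$$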

Frame in terms of some trajectory $\tau$
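That is, the same identity with $x$ replaced by a trajectory $\tau$ sampled by running the policy; since the environment dynamics do not depend on $\theta$, the gradient of the trajectory log-probability reduces to a sum over action log-probabilities:

$$\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ f(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ f(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right], \qquad \nabla_\theta \log \pi_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$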

Frame function $f$ as a reward function $R$
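Substituting $R$ for $f$ gives the PG equation referred to throughout this section:

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]$$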

Key: $R$ is simply some scalar output computed from a given trajectory

  • Normally $R$ is the sum of rewards at each timestep over a trajectory (i.e. $\sum_{t=0}^{T-1}r_t$), which is intuitive
  • Less intuitively, it can also be a single reward at some timestep $t’$ during a given trajectory, where $R=(r_{t’} \mid \tau_{0:t’})$ (it is still a function of the trajectory, so it is a valid choice for $R$).

When $R=(r_{t’} \mid \tau_{0:t’})$:

  • $R$ is only affected by the trajectory taken up to timestep $t’$
  1. Remember the PG equation:
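Presumably the equation being recalled is the PG equation applied with $R=(r_{t'} \mid \tau_{0:t'})$; because $r_{t'}$ is unaffected by anything that happens after timestep $t'$, only the log-probabilities of the actions taken up to and including $t'$ contribute:

$$\nabla_\theta\, \mathbb{E}_{\tau}\!\left[ r_{t'} \right] = \mathbb{E}_{\tau}\!\left[ r_{t'}\, \nabla_\theta \log \pi_\theta(\tau_{0:t'}) \right] = \mathbb{E}_{\tau}\!\left[ r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$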

Let $f_t := \nabla_\theta \log \pi_\theta(a_t \mid s_t)$: