Table of Contents

Paper

Issue References

Objective and Introduction

This paper is basically concerned with a novel way to solve an issue called data correlation, which I described in Issue #1 of my Common Issues with Reinforcement Learning post and have quoted here for your convenience.

To understand data correlation and why it is a problem, we first have to understand that data is generally “uniformly (equally) important”. Even though states vary in their “reward value” (ex. a game agent like Sonic will try to avoid bad reward states like dangerous traps or beasts, and reach good reward states like collecting coins), each state holds equally important information that lets an agent learn what to do in any situation regardless of that state’s reward value (ex. if Sonic is standing in the middle of nowhere, this isn’t a particularly low or high reward state, but it should still be represented fairly in the data set used for updating the policy, because Sonic needs to learn what to do here regardless of how often it is visited). So, ideally you want the agent to go out there, observe as many things as possible, and learn from these experiences in an unbiased, “equal opportunity” kind of way. In reality, however, if an agent is trained “online”, meaning it accumulates reward along its trajectory and updates its policy using sequential data samples along the way, the data it gathers will be biased, because the states visited earlier in the trajectory affect where the agent ends up later. This is what I mean by data correlation.

Another terrible effect of data correlation is non-stationarity. Data correlation leads to high variance in the trajectories explored, so the probability distribution over state-action pairs is unstable: small updates to the policy function (ex. a neural network) can cause the agent to follow vastly different exploration paths. This makes it hard for learning to converge. If you’re using a neural network, you can imagine why this is the case. Gradient descent uses derivatives and the chain rule to update your network’s weights in the direction that improves your overall reward given the data samples used to update the network. Read that last part again: “given the data samples used to update the network”. You see the problem here? The policy is only updated to maximize reward given the states, actions, and rewards seen in the batch of trajectories the agent explored before the update. However, the resulting update will almost certainly cause the agent to explore an entirely different series of states and actions every time. This is why so many papers reference the issue of high variance in reinforcement learning: online training using consecutive data samples creates bias in exploration paths and reward collection, which in turn causes high-variance updates to the policy.
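To make this concrete, here is a minimal sketch in PyTorch of the kind of online update described above. The 4-dimensional state, 2 discrete actions, and REINFORCE-style loss are my own placeholder assumptions, not anything specific to the paper; the point is just that the gradient step only “sees” the consecutive samples from the current rollout, which is where the bias comes from.

# Minimal sketch (PyTorch) of an online policy update driven only by one
# sequential rollout. State size 4 and 2 actions are hypothetical choices.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # state -> action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def online_update(states, actions, returns):
    """One REINFORCE-style step using consecutive samples from the current trajectory.

    states:  float tensor of shape (T, 4)
    actions: long tensor of shape (T,)
    returns: float tensor of shape (T,)
    """
    logits = policy(states)                                  # (T, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                        # maximize reward on *these* samples only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

After this step, the updated policy will likely visit a different set of states on its next rollout, so the next batch of “training data” is drawn from a different distribution than the one just used.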

Sooo, anyways, back to the paper. A very common way to randomize data samples and break data correlations is experience replay. Experience replay works by storing (state, action, reward, resulting state) tuples in a memory buffer and letting the algorithm sample that buffer at random to perform unbiased updates to the policy. This has two key benefits: it breaks data correlations, and it lets us revisit possibly rare experiences that might be repeatedly useful.
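Here is a minimal sketch of what such a buffer looks like in Python. The capacity and the exact tuple layout (including a done flag) are placeholder choices on my part, not something prescribed by the paper.

# Minimal sketch of an experience replay buffer, assuming transitions are
# stored as (state, action, reward, next_state, done) tuples.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def push(self, state, action, reward, next_state, done):
        """Store one interaction with the environment."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a decorrelated batch for an off-policy update."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

Because sample() draws uniformly from many past trajectories (possibly generated by older versions of the policy), the batch is no longer a run of consecutive, correlated transitions, but it also means the update must be off-policy.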

However, experience replay also has its downsides. It uses a lot of memory and computation per interaction, and it requires an off-policy algorithm that can perform updates using data generated by an older policy. To solve these issues, Mnih et al. propose an alternative: asynchronously execute multiple agents in parallel on multiple instances of the environment. This has a similar effect to experience replay, because at any given time step the parallel agents experience a wide variety of different states, decorrelating the agents’ data into a more stationary process.

Because at any given timestep the parallel agents are already experiencing a wide variety of different states, we can now use on-policy algorithms; we no longer have to rely on data generated by older policies to update the current policy.
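Here is a toy sketch of that asynchronous idea, assuming a made-up ToyEnv and a single shared parameter standing in for the network weights. It is only meant to illustrate the decorrelation argument, not the full A3C algorithm (no advantage estimation, no n-step returns, and A3C applies its updates lock-free rather than under a lock).

# Toy sketch of asynchronous parallel workers updating shared parameters.
# ToyEnv and the "gradient" below are placeholders, not the paper's method.
import threading
import random

class ToyEnv:
    """Stand-in environment: the 'state' is just a counter, reward is random."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        self.state += 1
        return self.state, self.rng.random()  # next_state, reward

shared_params = {"w": 0.0}   # stands in for the shared network weights
lock = threading.Lock()

def worker(worker_id, steps=1000):
    env = ToyEnv(seed=worker_id)              # each worker has its own environment instance
    for _ in range(steps):
        next_state, reward = env.step(action=0)
        grad = reward - shared_params["w"]    # placeholder "gradient" from this worker's data
        with lock:                            # simplified; A3C updates asynchronously without locks
            shared_params["w"] += 0.01 * grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params["w"])

At any moment, the updates arriving at the shared parameters come from workers sitting at different points of different trajectories, which is what makes the stream of training data far less correlated than a single agent’s consecutive samples.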

Relevant Topics

  • Experience Replay
  • (A3C) Asynchronous Methods for Deep Reinforcement Learning

Sources