Double DQN
Table of Contents
- Credit
- Summary of the Abstract
- Improvements from Classic DQN
- So why is DQN still not good enough, and how does Double Q-Learning do better?
- TLDR
- Relevant Topics
- Sources
Credit
This blog post is simply a summary and overview of the paper listed below. My own opinions may be reflected here as well. For reference to the authors’ own work, please see the references at the very bottom of this page.
Paper: Deep Reinforcement Learning with Double Q-learning
Credit for all motivating work goes to the authors of the Double DQN paper listed below.
- Hado van Hasselt, Google DeepMind
- Arthur Guez, Google DeepMind
- David Silver, Google DeepMind
Summary of the Abstract
Generally, Q-Learning and its deep learning counterpart, Deep Q-Learning, have had issues with high variance results because of overestimates in their action values. The authors of this paper show that Double Q-Learning techniques can successfully generalize to Deep Q Networks (DQN) and prevent such overestimations. The resulting Double DQN algorithm reduces the observed overestimations and leads to much better performance on several Atari games.
Improvements from Classic DQN
The formulation for standard DQN is as follows. We want to learn an action-value function, $Q(s, a; \theta)$, parameterized by an online action-value network with weights $\theta$, that outputs how “good” a particular action is. To do that, we optimize the weights $\theta$ so that the network outputs a value that is close to a target value, $Y_t$, that we sample by exploring the environment. After taking a step in the environment, the target value is determined by adding the discounted expected future reward from the next state to the reward actually received. We use the classic Q-learning update to estimate the expected future reward, so we get

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta^-_t).$$
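To make the target concrete, here is a minimal sketch of this computation, assuming PyTorch and a hypothetical `target_net` module that maps a batch of states to a `[batch, num_actions]` tensor of Q-values (terminal-state masking via `dones` is a standard implementation detail not discussed above):

```python
import torch

def dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    """Y_t = R_{t+1} + gamma * max_a Q(S_{t+1}, a; theta^-)."""
    with torch.no_grad():                          # no gradients flow through the target
        next_q = target_net(next_states)           # Q(S_{t+1}, ., theta^-): [batch, num_actions]
        max_next_q = next_q.max(dim=1).values      # max over actions
        # dones zeroes out the bootstrap term at terminal states (standard detail).
        return rewards + gamma * (1.0 - dones) * max_next_q
```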
Notice that the Q-function in the target-value equation is parameterized by $\theta^-$ and not $\theta$ as in the action-value Q-function. This network, $\theta^-$, is a frozen version of the online action-value network, $\theta$. The target network is only occasionally copied from the action-value network, so that $\theta^- \leftarrow \theta$ after every $\tau$ timesteps. For every other timestep, the target network is not updated even though the action-value network continues to be optimized. If we look at the loss function, we see why this helps reduce variance:

$$L(\theta_t) = \mathbb{E}\big[\big(Y_t^{\text{DQN}} - Q(S_t, A_t; \theta_t)\big)^2\big].$$
With this loss function, our goal is to optimize $\theta$ such that $Q(s, a; \theta)$ will most accurately predict the action-value (a.k.a. the optimal expected reward) for a given state-action pair. Our best approximation for this expected reward is our empirical evaluation of $Y_t$. However, notice that if we always use the same $\theta$ in both the equation for our target value, so $Y_t = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t)$, and our action-value Q-function, $Q(s, a; \theta_t)$, then, for our next training iteration, we end up evaluating an action-value, $Q(s, a; \theta_{t+1})$, that is optimized to do well on the data collected from an old target-value network, $\theta_t$. However, in this new iteration, the target-value network is now generating expected future reward estimates using the new parameters, $\theta_{t+1}$, because we are always using the same parameters for target and action-value estimation. The exploration data generated with these new parameters can have a distribution that is very different from that of the previous timestep, even if the update to the action-value network $\theta$ was small. This has the effect of “chasing a moving target,” and it is one of the major issues in reinforcement learning. High-variance exploration and overestimated action-values commonly arise because we use a network to collect informative data samples and then use that data to update the same network. This data causes high-variance updates because the updated network is biased toward the exploration data distribution generated by the previous iteration of your network. This might be confusing at first, but hopefully if you read that sentence back a few times, it will make sense.
Phew.
As an exercise, now think of why freezing the network $\theta^-$ used in the target-value equation for several timesteps helps mitigate the “chasing a moving target” problem.
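As a hint, here is a minimal training-step sketch, again assuming PyTorch and hypothetical `q_net` / `target_net` / `optimizer` objects: the loss is always computed against the frozen $\theta^-$, and $\theta^-$ only catches up to $\theta$ every $\tau$ steps, so the target stays still between copies:

```python
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, target_net, optimizer, batch, step, tau=10_000, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Target value Y_t uses the frozen network theta^- (no gradient).
    with torch.no_grad():
        y = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values

    # Action-value Q(S_t, A_t; theta) from the online network.
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # L(theta) = E[(Y_t - Q(S_t, A_t; theta))^2]
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every tau timesteps, copy the online weights into the frozen target network.
    if step % tau == 0:
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```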
So why is DQN still not good enough, and how does Double Q-Learning do better?
So, we have mitigated (to an extent) the coupling between the action-value estimation and the target-value estimation in our DQN algorithm, meaning we are no longer trying to “hit a moving target” when we optimize $\theta$. However, if we look closer, there is still a parameter coupling going on in our target-value equation.
Let’s take a look at the Q-function inside the target-value equation, $\max_a Q(S_{t+1}, a; \theta^-_t)$. We can rewrite this term like so:

$$\max_a Q(S_{t+1}, a; \theta^-_t) = Q\big(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a; \theta^-_t); \theta^-_t\big).$$
If we read this back in English, the left side is “the maximum Q-value given the current state and all possible actions from this state,” and the right side is “the Q-value produced by the action from this state that would give the maximum Q-value given the current state and all possible actions from this state.” You can see that these are equivalent. However, we see a coupling on the right side between the evaluation of the Q-function (the outer Q-function) and the method we use to sample the action for our update (the inner Q-function). Because standard DQN uses the same parameters, in this case $\theta^-_t$, to both select and evaluate an action, it is more likely to select overestimated values, resulting in overoptimistic value estimates.
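This equivalence is easy to check numerically. A small sketch, assuming PyTorch and a random tensor standing in for $Q(S_{t+1}, \cdot\,; \theta^-)$ over a batch of states:

```python
import torch

torch.manual_seed(0)
next_q = torch.randn(4, 6)  # stand-in for Q(S_{t+1}, ., theta^-): 4 states, 6 actions

# Left side: max_a Q(S_{t+1}, a; theta^-)
max_q = next_q.max(dim=1).values

# Right side: Q(S_{t+1}, argmax_a Q(S_{t+1}, a; theta^-); theta^-)
greedy_actions = next_q.argmax(dim=1)
eval_q = next_q.gather(1, greedy_actions.unsqueeze(1)).squeeze(1)

assert torch.equal(max_q, eval_q)  # identical: the same parameters select and evaluate
```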
So, Double Q-Learning tells us to decouple the parameters into two independent sets of parameters, $\theta$ and $\theta'$, like so:

$$Y_t^{\text{DoubleQ}} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a; \theta_t); \theta'_t\big).$$
Using this formulation, the action-selection policy we use to predict future rewards is still parameterized by the weights $\theta$, so we are still estimating the value of the greedy policy using the online network weights. However, we use a second, independent set of weights $\theta'$ to fairly evaluate the value of this policy without being tied to the action-selection policy.
For Double Q-Learning, $\theta$ and $\theta'$ are just two independent networks that are learned by randomly assigning each experience to update one of the two networks. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For Double DQN, the easiest way to transition from standard DQN is to reuse the frozen target network, setting $\theta' = \theta^-$, where $\theta^-$ is updated in the same way as in the standard DQN algorithm (frozen and periodically copied from $\theta$).
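Under that choice, a minimal sketch of the Double DQN target (same hypothetical `q_net` / `target_net` as above): the online network selects the greedy action, and the frozen target network evaluates it.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Y_t = R_{t+1} + gamma * Q(S_{t+1}, argmax_a Q(S_{t+1}, a; theta); theta^-)."""
    with torch.no_grad():
        # Select the greedy action with the online weights theta ...
        greedy_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluate that action with the frozen target weights theta^-.
        next_q = target_net(next_states).gather(1, greedy_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```

Compared to the standard DQN target sketched earlier, the only change is which network sits inside the argmax.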
TLDR
In short, standard DQN uses a target-value network, $\theta^-$, that is different from the online action-value network, $\theta$, in that it is frozen for a set number of timesteps before being copied from $\theta$, which is updated every timestep. This decouples the action-value estimation, $Q(S_t, A_t; \theta_t)$, from the target-value estimation, $Y_t^{\text{DQN}}$.
However, Double DQN further iterates on this by decoupling the action-sampling policy (the inner Q-function) from the action-evaluation function (the outer Q-function) within the target-value equation. So,

$$Y_t^{\text{DQN}} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a; \theta^-_t); \theta^-_t\big)$$

and

$$Y_t^{\text{DoubleDQN}} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \operatorname*{argmax}_a Q(S_{t+1}, a; \theta_t); \theta^-_t\big).$$
Relevant Topics
- (DQN) Deep Q-Learning
- Double Q-Learning
Sources
- [Paper] Deep Reinforcement Learning with Double Q-learning
- Human-level control through deep reinforcement learning