RL Basics: Policy Gradients and A2C

RL Setup

The goal of Reinforcement Learning (RL) is to teach an agent to take decisions by learning from its own experience. The agent gains experience by interacting with its environment and updating its decision policy. The figure above shows the canonical RL setup. The agent gets the current state of the environment $s_t$ as input, uses the current decision policy ($\pi$) to take action $a_t$ which alters the environment. The environment transitions to state $s_{t+1}$ and the agent gets an immediate reward $r(s_t,a_t)$. The transition dynamics are assumed to be Markovian i.e. $P(s_{t+1})$ is dependent only on previous state $s_t$ and previous action $a_t$.

The agent’s objective is to learn a policy that maximizes not only instantaneous rewards but rather the sum of all future rewards. We formalize this by defining a trajectory ($\tau$) as the sequence of state and action ($s_t$ and $a_t$) random variables as the agent-environment interaction evolves. We assume that the episode ends after $T$ steps.

\[\tau = (s_0, a_0,\dots, s_T,a_T)\]

The policy, often represented by a neural net, is parameterized by $\theta$. The goal of RL is to maximize total future reward averaged over all trajectories:

\[\begin{align} J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)} [\sum_t r(s_t,a_t)] \label{eq:rl_obj} \end{align}\]

Solving the reward maximization problem

Above problem is an unconstrained non-convex optimization problem. A simple approach would be to iteratively take steps in the gradient direction. Let’s compute its gradient. Let $G(\tau) = \sum_t r(s_t,a_t)$ represent the reward along a trajectory $\tau$. We take the gradient with respect to policy neural net parameters $\theta$:

\[\begin{align*} \nabla_\theta J(\theta) = \nabla_\theta \int_{\tau} P(\tau|\theta) G(\tau) d\tau = \int_{\tau} \nabla_\theta P(\tau|\theta) G(\tau) d\tau \end{align*}\]

We can rewrite $\nabla_\theta P(\tau\vert\theta)$ as $P(\tau\vert\theta) \nabla_\theta \log P(\tau\vert\theta)$ using the Log-Derivative Trick ($\nabla x = x \nabla \log x$):

\[\begin{align} \nabla_\theta J(\theta) = \int P(\tau|\theta) \nabla_\theta \log P(\tau|\theta) G(\tau) d\tau = \mathbb{E}_{\tau} [\nabla_\theta \log P(\tau|\theta) G(\tau)] \label{eq:rl_obj_deriv1} \end{align}\]

Let’s look at the distribution of trajectories. The probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a joint distribution over all the states and actions in that path. We assume the Markov property for transitioned states, i.e. $s_{t+1}$ depends only on previous state $s_t$ and action $a_t$. Also, the action $a_t$ depends only on $s_t$. $\pi_\theta(a_t\vert s_t)$ represents the policy which is a distribution over actions given the state. Indeed, this distribution is what we are trying to model. Using the above, the path probability is expanded as:

\[\begin{align*} \label{eq:prob_tau} P(\tau|\theta) &= p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) p(s_{t+1}|s_t, a_t) \\ \log P(\tau|\theta) &= \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t) \end{align*}\]

Taking the gradient $\nabla_\theta$ makes the initial state and transition dynamics ($P$) vanish, as they do not depend on $\theta$. This implies that decision policy is independent of environment dynamics:

\[\begin{align*} \nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \end{align*}\]

Plugging this back into eq $\eqref{eq:rl_obj_deriv1}$:

\[\begin{align*} \nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ G(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \end{align*}\]

Inside the expectation, we see $G(\tau)$ serves as a weight, implying paths with higher rewards are more important for updating the policy.

Rewards-to-go: Since an action at time $t$ cannot affect past rewards, we can ignore the past reward terms in $G(\tau)$. We move $G(\tau)$ inside the summation and replace it with the future return $G_t$. It can be shown that the expectation of past terms is $0$ (shown by splitting $G(\tau)$ into two sums: till $t-1$ and rest of the steps i.e. $t$ and onward). We skip that algebra here.

\[\begin{align*} \nabla_\theta J(\theta) &= \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) G_\tau \right]\\ &= \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t' = t}^T r(s_t', a_t') \right] \\ &= \sum_{t=0}^{T} \mathbb{E}_{\tau} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) G_t \right] \end{align*}\]

Using the Law of Iterated Expectations, we further split the expectation into the “current state-action” and the “future trajectory” $\tau_{>t}$. First, the law of iterated expectations is:

\[\begin{align*} E_{X,Y}[f(X,Y)] &= \int_X \int_Y f(X,Y) p(X,Y) \\ &= \int_X \int_Y f(X,Y) p(X) p(Y|X) \\ &= \int_X p(X) \int_Y f(X,Y) p(Y|X) = p(X) E_{Y|X}[f(X,Y)]\\ E_{X,Y}[f(X,Y)] &= E_XE_{Y|X}[f(X,Y)] \end{align*}\] \[\begin{align*} \nabla_\theta J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \mathbb{E}_{\tau_{>t} | s_t, a_t} [G_t] \right] \label{eq:reinforce} \end{align*}\]

$E_{\tau_{>t} \vert s_t, a_t} [G_t]$ is the total future reward including reward at current state $r(s_t,a_t)$. Practically, when we run the algorithm, we need to estimate the expectation, we can run the policy $N$ times and compute future reward at each $(s_t, a_t)$ tuple. Using this we can compute the gradient and update the policy.

\[\begin{align*} \label{eq:grad_estimate} \nabla_\theta J(\theta) \approx \sum_{i=0}^{N} \sum_{t=0}^{T} \left[ \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) G_t \right] \end{align*}\]

Problems with gradient estimation

The biggest concern with the above gradient estimator is high variance. Let’s focus on a single term of the estimate and compute its variance. Since $G_t$ is the total reward, let’s assume, for the sake of argument, we can treat it like a constant, then we can write the variance as

\[\begin{align} \label{eq:var_policy_grad_1} Var(\nabla_\theta \log \pi_\theta(a_t|s_t) G_t) &= G_t^2 \left(E[(\nabla_\theta \log \pi_\theta(a_t|s_t))^2] - E[\nabla_\theta \log \pi_\theta(a_t|s_t)]^2\right) \end{align}\]

Notice how the variance blows up if the value of $G_t$ is large. Large total rewards magnify the variance in log policy gradients.

Further, $G_t$ is an RV and is itself a source of variance:

\[\begin{align*} \label{eq:var_policy_grad_2} Var(\nabla_\theta \log \pi_\theta(a_t|s_t) G_t) &\geq Var(G_t) Var\left(\nabla_\theta \log \pi_\theta(a_t|s_t)\right) \end{align*}\]

Intuitively we can understand that $G_t$ has high variance because even for the same action, the total reward can vary wildly. If at step $t$ you make a brilliant move but down the line you make a terrible move, your total reward can be lower than the case where you make a terrible move at $t$ but correct your mistake later. Now, in expectation, all this noise will clear out, but since we deal with finite samples, we get high variance in the gradient estimate values.

Credit Assignment Problem: We would like the gradient update to be able to discern between a good action and a bad action. Ideally a good action should increase the gradient and a bad action should reduce it i.e. they should have opposite signs. However, since the gradients are multiplied by high variance total rewards, even bad actions can lead to large positive total future reward. For example, a bad move corrected later will have high $G_t$ for bad action $a_t$. We can fix these issues to some extent by using different tricks. Before going there, let’s define Q-function.

Q-function

$E_{\tau_{>t} \vert s_t, a_t} [G_t]$ is the total future reward. This reward is represented as Q-function $Q^\pi(s_t, a_t)$. Thus:

\[\begin{align*} \nabla_\theta J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) Q^\pi(s_t, a_t) \right] \label{eq:qlearning} \end{align*}\]

Note, Q-function includes expected value of all future rewards plus the current reward. If we know the Q function, we can replace the sum of rewards along the trajectory with Q-value at that $s_t$, $a_t$. This will reduce the variance since we have replaced the estimate with the true expectation of future rewards.

Baseline: We can further reduce the variance of the gradient estimate by replacing $Q^\pi(s_t, a_t)$ with $Q^\pi(s_t, a_t) - b$ where $b$ is a baseline. The reason this reduces variance can be understood from eq $\eqref{eq:var_policy_grad_1}$. The term $G_t^2$ will now be replaced by $(G_t-b)^2$, which is a lower quantity if $b$ is wisely chosen. A good baseline is $b = V(s_t) = E_{a_t\vert s_t}[Q(s_t,a_t)]$ i.e., we integrate out the dependence on action. This means that V is the average reward we can get in state $s_t$ for all possible actions. $V(s_t)$ is called the value function. We get:

\[\begin{align} \label{eq:grad_q_v} \nabla_\theta J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) (Q^\pi(s_t, a_t) - V(s_t)) \right] \end{align}\]

Introducing baseline doesn’t alter the expected value of the gradient, the expectation of added term is $0$:

\[\begin{align*} \nabla_\theta J(\theta) &= \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) Q^\pi(s_t, a_t)\right] - \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) V(s_t)) \right] \end{align*}\] \[\begin{aligned} \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) V(s_t) \right] &= \sum_{t=0}^{T} \mathbb{E}_{s_t} \left[ \mathbb{E}_{a_t|s_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) V(s_t) \right] \right] \\ &= \sum_{t=0}^{T} \mathbb{E}_{s_t} \left[ V(s_t) \mathbb{E}_{a_t|s_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \right] \\ \\ \text{Inner term goes to zero:} \\ \mathbb{E}_{a_t|s_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] &= \int \pi_\theta(a_t|s_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \, da_t \\ &= \int \pi_\theta(a_t|s_t) \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)} \, da_t \\ &= \int \nabla_\theta \pi_\theta(a_t|s_t) \, da_t \\ &= \nabla_\theta \int \pi_\theta(a_t|s_t) \, da_t \\ &= \nabla_\theta (1) = 0 \end{aligned}\]

Advantage Function

The $Q$-function represents total average reward starting at a given state-action pair $s_t$, $a_t$. This includes the deterministic reward $r(s_t,a_t)$. A high $Q(s_t, a_t) $value means the action at state $s_t$ is a good action. If we further integrate out action $a_t$, we get how good the state $s_t$ itself is irrespective of action taken. This is the baseline $V^\pi(s_t)$ and it shows on average how much reward you can get out of this state. A high $V^\pi(s_t)$ implies, it’s a good state to be in on average to get good future reward.

The difference i.e. $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is called advantage and referred to as $A^\pi(s_t,a_t)$. Advantage shows how good an action $a_t$ is in a state $s_t$ compared to an average action in this state. This is a more intuitive way of normalizing rewards for different actions in a given state. Note, the superscript $\pi$ in all the above quantities. This indicates they depend on the policy, since the roll out of future states and actions are policy dependent. It can be shown that

\[\begin{align*} \label{eq:q_v_dependence} Q^\pi(s_t, a_t)&=r(s_t,a_t) + V^{\pi}(s_{t+1}) \end{align*}\] \[\begin{align*} Q^\pi(s_t, a_t)&= \mathbb{E}_{\tau_{>t} | s_t, a_t} [\sum_{t'=t}^T r(s_t',a_t')] \\ &= r(s_t,a_t) + \mathbb{E}_{s_{t+1},a_{t+1},...,s_{T},a_{T} | s_t, a_t} [\sum_{t'=t+1}^T r(s_t',a_t')] \\ &= r(s_t,a_t) + \mathbb{E}_{s_{t+1},a_{t+1},...,s_{T},a_{T}} [\sum_{t'=t+1}^T r(s_t',a_t')] \\ &= r(s_t,a_t) + V^{\pi}(s_{t+1}) \end{align*}\]

In step 3, we can drop the conditional since reward at $t+1$ is independent of state at $t$. The last equation follows from the definition of the value function as expected reward starting at state $t$. Using this in advantage formula:

\[\begin{align*} \label{eq:undisc_adv} A^\pi(s_t, a_t) = r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t) \end{align*}\]

The utility of the above equation is that by learning the value function alone, we can estimate $A^\pi(s_t, a_t)$. We can rewrite eq $\eqref{eq:grad_q_v}$ as:

\[\begin{align*} \nabla_\theta J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) (Q^\pi(s_t, a_t) - V^\pi(s_t)) \right]\\ \nabla_\theta J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{s_t, a_t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) (A^\pi(s_t, a_t)) \right] \end{align*}\]

Practically we will compute the gradient by rolling out the policy:

\[\begin{align} \label{eq:policy_grad} \nabla_\theta J(\theta) \approx \sum_{i=0}^{N} \sum_{t=0}^{T} \left[ \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) (A^\pi(s_{it}, a_{it})) \right] \end{align}\]

The better the estimate of the advantage function, the lower the variance of the gradient.

Discounted Rewards

For a long-running sequence (large $T$) and positive rewards, the sum of future reward can get arbitrarily high. To avoid this we can discount the future rewards, i.e. give more weight to current reward over future rewards. This is especially useful if there is an end state for the process (you die in a game, go bankrupt etc.), reward now is preferable to reward later. This is incorporated through a discounting factor ($\gamma$). Adding discounting doesn’t change the maths.

\[\begin{align*} \label{eq:dis_q_func} Q^\pi(s_t, a_t)&= \mathbb{E}_{\tau_{>t} | s_t, a_t} [\sum_{t'=t}^T \gamma^{t'-t} r(s_t',a_t')] \\ &= r(s_t,a_t) + \gamma V^{\pi}(s_{t+1})\\ A^\pi(s_t, a_t) &= r(s_t,a_t) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_{t}) \end{align*}\]

Learning Value Function

Value function is estimated by using a neural network that takes state $s$ as input and estimates value function at $s$. It has parameters $\phi$ ($\hat{V}_\phi^\pi$). The function can be learnt by rolling out the policy multiple times (Monte Carlo evaluations) and collecting training data.

Rolling out policy: Start from a state $s_t$, sample from the policy to get $a_t$ ($a_t \sim \pi_\theta(a\vert s_t)$). This gives the pair ${s_t, a_t}$. Pass $a_t$ to the environment (which is not in our control, this is like a black box). The environment through its state transition distribution $P(s_{t+1}\vert s_t, a_t)$ gives the next state $s_{t+1}$ and the reward $r_t$. This is called rolling out the policy. Roll out till completion (till step $T$).

Collecting training data: A single roll out gives training data of size $T$ i.e.

\[\{(s_1, r_1+r_2+\ldots+r_T), (s_2, r_2+\ldots+r_T),\ldots, (s_T,r_T)\}\]

Since the data is from a single roll out it’s correlated and the variance of the sum of rewards is high. This can be overcome by using bootstrapped estimates.

Bootstrapping of Rewards: We can compute the estimate by using the previous fitted value function i.e.

\[\hat{V}^\pi(s_t) \approx r(s_t,a_t) + \hat{V}^\pi(s_{t+1})\]

We use approx since ideally we need expectation over all actions $a_t$ to get value estimate at $s_t$, but since we have a roll out we use $r(s_t,a_t)$ as the point estimate. Also, we are using the previously fitted policy to get the training data to learn updated policy. If the new policy is very different from the previous policy we are learning with the wrong training data. Thus we need to take small steps in the gradient descent of the policy update. The idea that updated policy needs to be close to the previous policy is core to modern RL algorithms like TRPO and PPO. Our training data (including discounting) looks like this:

\[\{(s_1, r_1 + \gamma \hat{V}^\pi(s_{2}) ), (s_2, r_2 + \gamma \hat{V}^\pi(s_{3})),\ldots, (s_T,r_T)\}\]

Bootstrap estimates reduce variance but at the cost of introducing bias (because we are off from the true labels by using old-policy-based value function in data generation). For instance, when we start training from a randomly initialized network, the previous value function is going to be noise. We can reduce bias by introducing more terms in the labelled data, reducing bias but increasing variance.

\[\{(s_1, r_1 + \gamma r_2 + \gamma^2 \hat{V}^\pi(s_{3}) ), (s_2, r_2 + \gamma r_3 + \gamma^2 \hat{V}^\pi(s_{4})),\ldots, \}\]

Value Function Model Training: We can train a neural-network to learn on this data. Mean squared loss can be used as the loss function.

\[\begin{align*} \label{eq:a2c_vf_loss} L^{VF}(\phi) &= \frac{1}{T} \sum_T(V(\phi) - V_{target})^2 \end{align*}\]

Vanilla Policy Gradient (VPG) and Advantage Actor-Critic (A2C)

Learning the value function forms the backbone of A2C methods. In the batch A2C, we collect data via roll out of policy, fit the value function, and then update the policy as:

  1. For each training point estimate advantage: \(\begin{align*} \label{eq:disc_adv} \hat{A}^\pi(s_t, a_t) = r_t + \gamma \hat{V}^\pi(s_{t+1})-\hat{V}^\pi (s_t) \end{align*}\)
  2. The gradient estimate is computed using equation $\eqref{eq:policy_grad}$ where $N$ is multiple roll-outs of the policy
  3. Finally, update policy parameters $\theta$ using the gradient step.

Practically, we use the existing auto-diff libs like PyTorch for the above optimization. Note that the original cost function we used was the expected value of sum of future rewards (eq $\eqref{eq:rl_obj}$). For using auto-diff we need to formulate a cost function whose gradient estimate becomes eq $\eqref{eq:policy_grad}$. Such objective can simply be written as (-ve since we minimize objectives by default in ML):

\[\begin{align*} J(\theta) = -\sum_{i=0}^{T} \log \pi_\theta(a_t|s_t)\hat{A}^\pi(s_t, a_t) \end{align*}\]

We are down to solving the unconstrained optimization problem:

\[\begin{align*} \min_{\theta} J(\theta) \end{align*}\]

where policy $\pi_\theta(a_t\vert s_t)$ is represented by a neural network. We can now use SGD/ADAM to optimize it using PyTorch. I have found these two implementations instructional: SpinUpRL and baselines, both from OpenAI.

Difference between VPG and A2C: There is no single definition of these algorithms. VPG and A2C both spin up $N$ parallel agents, each independently running a policy rollout. Both synchronize the update step by combining the data from all the parallel rollouts. While VPG rolls out the policy till completion for each parallel worker, A2C runs a small number of steps (say 5 steps) per worker and then updates the actor and the critic. Next update step doesn’t reset the environment, but starts at the last state for each parallel worker. Total reward used for advantage estimation in each step are bootstrap estimates and use all available rewards from the next 5 steps.

Another difference is VPG uses two independent models, one for actor (i.e. policy) and one for critic (i.e. value function). A2C uses a single network i.e. shared weights but last layer is a fully connected layer with a single node for the critic, and distribution-parameter nodes for the actor (see the figure below). The loss function for such a common network is represented by a weighted loss:

\[\begin{align*} \min_{\theta} J(\theta) + c_{vf} L^{VF}(\theta) \end{align*}\]
Shared network for value function and action distribution

Both the above methods are examples of on-policy RL algorithms where the algorithm learns from its current policy. It’s a bit inefficient if you think about it — to make one update to the policy we throw away all the data we collected for previous updates. We can modify the above on-policy algorithm to an off-policy version but it’s not straightforward. In the next posts we will discuss some advanced policy gradient methods like TRPO and PPO, which overcome the challenges of A2C.