---
created: '2025-12-17'
date: '2025-12-17'
description: Class of reinforcement learning algorithms.
id: Policy gradient
modified: 2026-06-05 15:08:06 GMT-04:00
published: '2017-03-01'
seealso:
  - '[[thoughts/Reinforcement learning]]'
socials:
  lillog: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
  wikipedia: https://en.wikipedia.org/wiki/Policy_gradient_method
tags:
  - ml
  - rl
  - training
title: Policy gradient
pageLayout: default
slug: thoughts/Policy-gradient
permalink: https://aarnphm.xyz/thoughts/Policy-gradient.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---

a sub-class of policy optimization methods. Unlike value-based methods (which learn a value function to derive a policy), policy optimization methods directly learn a policy function $\pi$ that selects actions without consulting a value function. For policy gradient to apply, the policy function is parameterized as a differentiable $\pi_\theta$ with parameters $\theta$.

## overview

In policy-based RL, the **actor** is a parameterized policy function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes the state $s$ and produces a probability distribution

$$
\pi_\theta(\cdot\mid s)
$$

- Discrete actions: $\sum_a \pi_\theta(a\mid s)=1$
- Continuous actions: $\int_a \pi_\theta(a\mid s)\,\mathrm da=1$

The goal of policy optimization is to find $\theta$ that maximizes the expected episodic reward $J(\theta)=\mathbb E_{\pi_\theta}\left[\sum_{t\in 0:T}\gamma^t R_t\;\Big|\;S_0=s_0\right]$ where:

- $\gamma$ is the discount factor,
- $R_t$ is the reward at step $t$,
- $s_0$ is the starting state,
- $T$ is the time horizon (possibly infinite).

The **policy gradient** is $\nabla_\theta J(\theta)$. Different policy gradient methods stochastically estimate $\nabla_\theta J(\theta)$ in different ways.
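
To make the two objects above concrete, here is a minimal sketch of a discrete-action policy and of the discounted return inside $J(\theta)$. The tabular softmax parameterization, the dictionary layout of `theta`, and the names `softmax_policy` and `discounted_return` are assumptions made for illustration; the note itself does not fix a parameterization.

```python
import math

def softmax_policy(theta, s):
    """pi_theta(. | s) for a tabular softmax policy (an assumed parameterization).

    theta maps each state to a list of logits, one per action.
    """
    logits = theta[s]
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_return(rewards, gamma):
    """sum_t gamma^t * R_t for the rewards of one sampled episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

theta = {"s0": [0.0, 1.0, -1.0]}  # hypothetical parameters: one state, three actions
probs = softmax_policy(theta, "s0")
print(sum(probs))  # the discrete-action constraint: probabilities sum to 1
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2
```

Averaging `discounted_return` over many episodes sampled from the policy would give a Monte Carlo estimate of $J(\theta)$.
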
The goal is to iteratively maximize $J(\theta)$ by **gradient ascent**; since the key part is stochastic estimation, these are also studied under “Monte Carlo gradient estimation”.

Topics/papers:

- Deep reinforcement learning
- Actor-critic method
- \[@sutton1999policygradientmethodsreinforcement\]
- \[@mohamed2020montecarlogradientestimation\]
- \[@williams1992simplestatisticalgradientfollowing\]
- \[@stiennon2020learningsummarizehumanfeedback\]
- \[@shani2019adaptivetrustregionpolicy\]

---

## REINFORCE

The **REINFORCE algorithm** \[@williams1992simplestatisticalgradientfollowing\] was the first policy gradient method. It is based on the identity

$$
\nabla_\theta J(\theta)=\mathbb E_{\pi_\theta}\left[\sum_{t\in 0:T}\nabla_\theta\ln\pi_\theta(A_t\mid S_t)\;\sum_{\tau\in 0:T}(\gamma^\tau R_\tau)\;\Big|\;S_0=s_0\right],
$$

which can be improved via the **causality trick**, where each score term is weighted only by the rewards from step $t$ onward:

$$
\nabla_\theta J(\theta)=\mathbb E_{\pi_\theta}\left[\sum_{t\in 0:T}\nabla_\theta\ln\pi_\theta(A_t\mid S_t)\;\sum_{\tau\in t:T}(\gamma^\tau R_\tau)\;\Big|\;S_0=s_0\right].
$$

> \[!lemma\] Lemma 1.
>
> The expectation of the score function is zero, conditional on any present or past state. For any $0\le i\le j\le T$ and any state $s_i$,
>
> $$
> \mathbb E_{\pi_\theta}\left[\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_i=s_i\right]=0.
> $$
>
> Further, if $\Psi_i$ is a random variable independent of $A_i,S_{i+1},A_{i+1},\dots$, then
>
> $$
> \mathbb E_{\pi_\theta}\left[\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\cdot\Psi_i\;\Big|\;S_i=s_i\right]=0.
> $$

_Proof sketch (as stated)_: use the log-derivative trick (score function trick).
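
The lemma and the causality-trick weights can be checked on a toy case. The sketch below assumes a softmax policy over the logits of a single state, for which the score is `one_hot(action) - pi`; the helper names `score`, `expected_score`, and `reward_to_go` are hypothetical and chosen for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(logits, action):
    """Gradient of ln pi_theta(action) w.r.t. the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    return [(1.0 if k == action else 0.0) - pi[k] for k in range(len(logits))]

def expected_score(logits):
    """Exact E_{A ~ pi_theta}[grad ln pi_theta(A)], summing over every action."""
    pi = softmax(logits)
    total = [0.0] * len(logits)
    for a, p in enumerate(pi):
        for k, g in enumerate(score(logits, a)):
            total[k] += p * g
    return total

def reward_to_go(rewards, gamma):
    """Causality-trick weights: entry t is sum_{tau >= t} gamma^tau * R_tau."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += (gamma ** t) * rewards[t]
        out[t] = running
    return out

logits = [0.3, -1.2, 2.0]  # hypothetical parameters for a single state
print(expected_score(logits))  # each entry is zero up to rounding error
print(reward_to_go([1.0, 0.0, 2.0], gamma=0.5))
```

A single-episode REINFORCE estimate would then multiply `score(logits_t, a_t)` by `reward_to_go(...)[t]` and sum over $t$; that step is omitted here because it needs an environment to sample from.
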
Applying the log-derivative trick:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&=\nabla_\theta\,\mathbb E_{\pi_\theta}\left[\sum_{i\in 0:T}\gamma^i R_i\;\Big|\;S_0=s_0\right]\\
&=\mathbb E_{\pi_\theta}\left[\left(\sum_{i\in 0:T}\gamma^i R_i\right)\nabla_\theta\ln\big(\pi_\theta(A_0,\dots,A_T\mid S_0,\dots,S_T)\big)\;\Big|\;S_0=s_0\right]\\
&=\mathbb E_{\pi_\theta}\left[\left(\sum_{i\in 0:T}\gamma^i R_i\right)\sum_{j\in 0:T}\nabla_\theta\ln\big(\pi_\theta(A_j\mid S_j)\big)\;\Big|\;S_0=s_0\right]\\
&=\mathbb E_{\pi_\theta}\left[\sum_{i,j\in 0:T}(\gamma^i R_i)\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_0=s_0\right].
\end{aligned}
$$

By the lemma, for any $0\le i<j\le T$,

$$
\mathbb E_{\pi_\theta}\left[(\gamma^i R_i)\,\nabla_\theta\ln\pi_\theta(A_j\mid S_j)\;\Big|\;S_0=s_0\right]=0,
$$

so only the terms with $j\le i$ survive in the double sum, which is the causality-trick form of the gradient.

### Proximal Policy Optimization (PPO)

**PPO** updates the current parameters $\theta_t$ by maximizing a clipped surrogate objective, where $A^{\pi_{\theta_t}}$ is the advantage under the current policy and $\epsilon$ is the clipping threshold:

$$
\max_\theta\;\mathbb E_{s,a\sim \pi_{\theta_t}}\left[
\begin{cases}
\min\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_t}(a\mid s)},\,1+\epsilon\right)A^{\pi_{\theta_t}}(s,a) & \text{if }A^{\pi_{\theta_t}}(s,a)>0,\\
\max\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_t}(a\mid s)},\,1-\epsilon\right)A^{\pi_{\theta_t}}(s,a) & \text{if }A^{\pi_{\theta_t}}(s,a)<0.
\end{cases}
\right].
$$

PPO performs multiple optimization steps on the same batch, keeping $\theta$ proximal to $\theta_t$ to remain effectively on-policy.

If a reference policy $\pi_{\text{ref}}$ is used, an additional KL penalty may be added:

$$
-\beta\,\mathbb E_{s,a\sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_\theta(a\mid s)}{\pi_{\text{ref}}(a\mid s)}\right)\right],
$$

or, using an estimator whose extra term $\frac{\pi_{\text{ref}}}{\pi_\theta}-1$ has zero mean when actions are drawn from $\pi_\theta$:

$$
-\beta\,\mathbb E_{s,a\sim \pi_{\theta_t}}\left[\log\left(\frac{\pi_\theta(a\mid s)}{\pi_{\text{ref}}(a\mid s)}\right)+\frac{\pi_{\text{ref}}(a\mid s)}{\pi_\theta(a\mid s)}-1\right].
$$

### Group Relative Policy Optimization (GRPO)

**GRPO** is a PPO variant that omits the value function estimator $V$ \[@shao2024deepseekmathpushinglimitsmathematical\]. For each state $s$, sample $G$ actions $a_1,\dots,a_G\sim \pi_{\theta_t}$ and compute the group-relative advantage

$$
A^{\pi_{\theta_t}}(s,a_j)=\frac{r(s,a_j)-\mu}{\sigma},
$$

where $\mu,\sigma$ are the mean and standard deviation of $r(s,a_1),\dots,r(s,a_G)$.
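
The group-relative advantage is a per-group standardization of rewards, which the following minimal sketch illustrates. Two choices here are assumptions not stated in the note: the population standard deviation is used for $\sigma$, and a small `eps` is added to the denominator so that a group with identical rewards does not divide by zero. The function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """(r_j - mu) / sigma over the G rewards sampled for one state.

    Assumptions: population standard deviation, and eps added to sigma
    to keep the division defined when all rewards in the group are equal.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

group_rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for G = 4 sampled actions
advantages = group_relative_advantages(group_rewards)
print(advantages)  # above-average actions get positive advantage, the rest negative
```

Because the advantages are centered within the group, they sum to approximately zero, so no learned value function is needed as a baseline.
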
Then maximize the PPO objective averaged across actions:

$$
\max_\theta\;\frac1G\sum_{i=1}^G\mathbb E_{(s,a_1,\dots,a_G)\sim \pi_{\theta_t}}\left[
\begin{cases}
\min\left(\frac{\pi_\theta(a_i\mid s)}{\pi_{\theta_t}(a_i\mid s)},\,1+\epsilon\right)A^{\pi_{\theta_t}}(s,a_i) & \text{if }A^{\pi_{\theta_t}}(s,a_i)>0,\\
\max\left(\frac{\pi_\theta(a_i\mid s)}{\pi_{\theta_t}(a_i\mid s)},\,1-\epsilon\right)A^{\pi_{\theta_t}}(s,a_i) & \text{if }A^{\pi_{\theta_t}}(s,a_i)<0.
\end{cases}
\right].
$$

---

## Policy Optimization and the Mirror Descent perspective (MDPO)

TRPO, PPO, and natural policy gradient share the idea of updating along the policy gradient while keeping updates stable via a distance to the previous policy.

Mirror Descent (proximal optimization) \[@nemirovski1983problemcomplexity\] updates $\mathbf x_t$ via

$$
\mathbf x_{t+1}\in\arg\min_{\mathbf x\in\mathcal C}\;\nabla f(\mathbf x_t)^T(\mathbf x-\mathbf x_t)+\frac1{\eta_t}B_\omega(\mathbf x,\mathbf x_t),
$$

where $B_\omega$ is the Bregman divergence induced by $\omega$, $\eta_t$ is the step size, and $\mathcal C$ is the feasible set.

This motivates MDPO \[@tomar2020mirrordescentpolicyoptimization\]. With KL as the Bregman divergence:

$$
\pi_{t+1}\in\arg\max_\pi\;\mathbb E_{s,a\sim \pi}\big[A^{\pi_t}(s,a)\big]-\frac1{\eta_t}D_{\mathrm{KL}}(\pi\|\pi_t).
$$

With parameterized policy $\pi_\theta$:

$$
\max_\theta\;L(\theta,\theta_t)=\mathbb E_{s,a\sim \pi_{\theta_t}}\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_t}(a\mid s)}A^{\pi_{\theta_t}}(s,a)\right]-\frac1{\eta_t}D_{\mathrm{KL}}(\pi_\theta\|\pi_{\theta_t}).
$$

This objective can be used with techniques like PPO clipping; the KL penalty also appears in the original PPO paper.
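
As a rough illustration of the two ingredients mentioned above, the sketch below evaluates the clipped case expression for one sample and the parameterized MDPO objective for one state. It assumes categorical policies given as explicit probability lists and takes the expectation over actions exactly rather than by sampling; `clipped_term`, `kl`, and `mdpo_objective` are hypothetical names, and the default `eps=0.2` is an arbitrary illustrative value.

```python
import math

def clipped_term(ratio, advantage, eps=0.2):
    """The case expression of the clipped objective for a single (s, a) pair."""
    if advantage > 0:
        return min(ratio, 1.0 + eps) * advantage
    if advantage < 0:
        return max(ratio, 1.0 - eps) * advantage
    return 0.0

def kl(p, q):
    """D_KL(p || q) for categorical distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mdpo_objective(pi_new, pi_old, adv, eta):
    """L(theta, theta_t) at one state, with the action expectation taken exactly.

    The importance-weighted surrogate is sum_a pi_old(a) * (pi_new(a) / pi_old(a)) * A(a),
    and the KL term is scaled by 1 / eta.
    """
    surrogate = sum(po * (pn / po) * a for pn, po, a in zip(pi_new, pi_old, adv))
    return surrogate - kl(pi_new, pi_old) / eta

pi_old = [0.5, 0.5]   # hypothetical current policy at one state
adv = [1.0, -1.0]     # hypothetical advantages under pi_old
print(clipped_term(1.5, 2.0))                       # ratio is capped at 1 + eps
print(mdpo_objective([0.6, 0.4], pi_old, adv, 1.0)) # shifting mass to the better action helps
print(mdpo_objective(pi_old, pi_old, adv, 1.0))     # no change: surrogate and KL both vanish
```

A smaller `eta` weights the KL term more heavily, so the maximizer stays closer to `pi_old`; this is the sense in which the penalty keeps updates proximal.
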