A list of activation functions and optimization methods that can be used in ML training to reduce loss.

sigmoid

$$\text{sigmoid}(x) = \frac{1}{1+e^{-x}}$$
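
A minimal numpy sketch of the formula above (the function name and the choice of numpy are mine, not from the source):

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    # 1 / (1 + e^{-x}), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))
```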

ReLU

$$\text{ReLU}(x) = \max(0, x)$$

In a Transformer block, ReLU appears inside the position-wise feed-forward network (FFN):

$$\text{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0,\, xW_{1}+b_{1})W_{2} + b_{2}$$

T5 uses a version without bias terms:

$$\text{FFN}_\text{ReLU}(x, W_{1}, W_{2}) = \max(xW_{1}, 0)W_{2}$$
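
A minimal numpy sketch of both FFN forms, assuming a row-vector input `x` of shape `(d_model,)` with `W1` of shape `(d_model, d_ff)` and `W2` of shape `(d_ff, d_model)` (names and shapes are my assumptions):

```python
import numpy as np

def ffn_relu(x, W1, b1, W2, b2):
    # max(0, xW1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_relu_t5(x, W1, W2):
    # T5's bias-free variant: max(xW1, 0) W2
    return np.maximum(x @ W1, 0.0) @ W2
```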

Swish

Ramachandran et al. (2017) introduce Swish, an alternative to ReLU that performs better on deeper models across different tasks.

$$f(x) = x \cdot \text{sigmoid}(\beta x)$$

where $\beta$ is a constant parameter.
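
A minimal numpy sketch of Swish (the default $\beta = 1$ is my choice; it recovers SiLU, $x \cdot \text{sigmoid}(x)$):

```python
import numpy as np

def swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    # x * sigmoid(beta * x) == x / (1 + e^{-beta * x})
    return x / (1.0 + np.exp(-beta * x))
```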

Gated Linear Units and Variants

A GLU is the component-wise product of two linear transformations of the input, one of which is sigmoid-activated.

Shazeer (2020) introduces several GLU variants that yield improvements in the Transformer architecture.

$$\begin{aligned} \text{GLU}(x,W,V,b,c) &= \sigma(xW+b) \otimes (xV+c) \\ \text{Bilinear}(x,W,V,b,c) &= (xW+b) \otimes (xV+c) \end{aligned}$$

Other GLU variants:

$$\begin{aligned} \text{ReGLU}(x,W,V,b,c) &= \max(0, xW+b) \otimes (xV+c) \\ \text{GEGLU}(x,W,V,b,c) &= \text{GELU}(xW+b) \otimes (xV+c) \\ \text{SwiGLU}(x,W,V,b,c) &= \text{Swish}_\beta(xW+b) \otimes (xV+c) \end{aligned}$$

The Transformer FFN layer then becomes:

$$\begin{aligned} \text{FFN}_\text{GLU}(x,W,V,W_{2}) &= (\sigma (xW) \otimes xV)W_{2} \\ \text{FFN}_\text{Bilinear}(x,W,V,W_{2}) &= (xW \otimes xV)W_{2} \\ \text{FFN}_\text{ReGLU}(x,W,V,W_{2}) &= (\max(0, xW) \otimes xV)W_{2} \\ \text{FFN}_\text{GEGLU}(x,W,V,W_{2}) &= (\text{GELU}(xW) \otimes xV)W_{2} \\ \text{FFN}_\text{SwiGLU}(x,W,V,W_{2}) &= (\text{Swish}_\beta(xW) \otimes xV)W_{2} \end{aligned}$$

_note: reduce the number of hidden units $d_\text{ff}$ (the second dimension of $W$ and $V$ and the first dimension of $W_{2}$) by a factor of $\frac{2}{3}$ when comparing these layers, so the parameter count stays comparable to the original two-matrix FFN._
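
A minimal numpy sketch of the SwiGLU FFN above, bias-free as in the formulas (names, shapes, and the $\beta = 1$ default are my assumptions):

```python
import numpy as np

def ffn_swiglu(x, W, V, W2, beta=1.0):
    # (Swish_beta(xW) ⊗ xV) W2, where ⊗ is the element-wise product
    u = x @ W
    swish_u = u / (1.0 + np.exp(-beta * u))  # Swish_beta(xW)
    return (swish_u * (x @ V)) @ W2

# Per the note above: with d_model = 512 and a baseline d_ff = 2048,
# use d_ff = 2048 * 2 // 3 so that the three matrices W, V, W2 hold
# roughly the same number of parameters as the two-matrix FFN.
d_model, d_ff = 512, 2048 * 2 // 3
rng = np.random.default_rng(0)
W, V, W2 = (rng.normal(size=s) for s in
            [(d_model, d_ff), (d_model, d_ff), (d_ff, d_model)])
x = rng.normal(size=(d_model,))
y = ffn_swiglu(x, W, V, W2)  # shape (d_model,)
```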

momentum

See also Stochastic gradient descent

Nesterov momentum

See also the original paper

idea:

  • first take a step in the direction of the accumulated momentum,
  • compute the gradient at the “lookahead” position,
  • make the update using this gradient.

definition

For a parameter vector $\theta$, the update can be expressed as

$$\begin{aligned} v_t &= \mu v_{t-1} + \nabla L(\theta_t + \mu v_{t-1}) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{aligned}$$
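
A minimal numpy sketch of this update rule (the quadratic toy loss and the hyperparameter values are placeholders of my choosing):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, mu=0.9, alpha=0.01):
    # v_t = mu * v_{t-1} + grad L(theta_t + mu * v_{t-1})
    v = mu * v + grad_fn(theta + mu * v)  # gradient at the lookahead point
    # theta_{t+1} = theta_t - alpha * v_t
    return theta - alpha * v, v

# toy example: minimize L(theta) = ||theta||^2 / 2, so grad L(theta) = theta
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
print(theta)  # close to the optimum at 0
```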

It achieves better convergence rates than plain gradient descent:

| function type | gradient descent | Nesterov AG |
| --- | --- | --- |
| Smooth | $\Theta(\frac{1}{T})$ | $\Theta(\frac{1}{T^{2}})$ |
| Smooth & Strongly Convex | $\Theta(\exp(-\frac{T}{\kappa}))$ | $\Theta(\exp(-\frac{T}{\sqrt{\kappa}}))$ |

Polyak’s Momentum
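
For contrast, the classical (heavy-ball) update attributed to Polyak evaluates the gradient at the current parameters rather than at the lookahead point (standard formulation, stated here as a reminder rather than taken from a linked note):

$$\begin{aligned} v_t &= \mu v_{t-1} + \nabla L(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{aligned}$$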

References

  • Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint arXiv:2407.14435
  • Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941
  • Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202