
softmax

\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}

where $y \in \mathbb{R}^k$
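A minimal scalar sketch (my own illustration, not part of the original note) of the numerically stable way to compute this, subtracting $\max_i y_i$ before exponentiating so the largest exponent is $e^0 = 1$:

stable-softmax-sketch.cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax: subtract max(y) before exponentiating
// so nothing overflows; the result is unchanged mathematically.
std::vector<float> softmax(const std::vector<float>& y) {
    float max_y = *std::max_element(y.begin(), y.end());
    std::vector<float> out(y.size());
    float sum = 0.0f;
    for (size_t i = 0; i < y.size(); ++i) {
        out[i] = std::exp(y[i] - max_y);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}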

exp()

see also (Abdelkhalik et al., 2022) and the V_LDEXP_F32 instruction in the RDNA3 instruction set

Usually a lot better than computing 2**t directly, mainly for numerical-stability reasons

ARM even designed a dedicated instruction for this (FEXPA in SVE)!

pseudocode-exp-fexpa.cpp
// Pseudocode representing the computation flow (fexpa() and
// evaluate_polynomial() are placeholders, not real intrinsics):
float32x4_t exp_sve2(float32x4_t x) {
    // Step 1: Range reduction
    // N = round(x * log2(e)), r = x - N * ln(2)   [reduced argument]
    float32x4_t N = vrndnq_f32(vmulq_n_f32(x, 1.4426950409f));  // log2(e)
    float32x4_t r = vsubq_f32(x, vmulq_n_f32(N, 0.6931471806f)); // ln(2)

    // Step 2: FEXPA instruction provides the 2^N approximation
    // In hardware: FEXPA Z0.S, Z1.S
    float32x4_t exp_approx = fexpa(N);

    // Step 3: Polynomial evaluation for exp(r)
    // Typically Horner's method with only a few coefficients, since the
    // reduced argument satisfies |r| <= ln(2)/2
    float32x4_t exp_r = evaluate_polynomial(r);

    // Step 4: Combine results: exp(x) = 2^N * exp(r)
    return vmulq_f32(exp_approx, exp_r);
}

Advantages of FEXPA:

  • single instruction latency for initial approximation
  • vectorized ops for batch processing

On GPUs we can use a bit shift (1 << x) for integer powers of two, or CUDA’s exp2
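As a sketch of mine (using std::exp2 as a stand-in for CUDA’s exp2f), the identity $e^x = 2^{x \log_2 e}$ lets exp be built on top of the exp2 path:

exp-via-exp2-sketch.cpp
#include <cmath>

// exp(x) = 2^(x * log2(e)); on CUDA the same idea maps to exp2f,
// and scaling by an integer power of two can use ldexpf or an
// exponent-field bit shift instead of a full exponential.
float exp_via_exp2(float x) {
    return std::exp2(x * 1.44269504088896340736f); // log2(e)
}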

Optimization in llama.cpp: ggerganov/llama.cpp#7154

sigmoid

\text{sigmoid}(x) = \frac{1}{1+e^{-x}}
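A small sketch (mine, not from the note) of the numerically stable form, which branches on the sign of $x$ so the exponent is never positive:

stable-sigmoid-sketch.cpp
#include <cmath>

// For x >= 0 use 1 / (1 + e^{-x}); for x < 0 use e^{x} / (1 + e^{x}).
// Both branches keep the exponential argument <= 0, avoiding overflow.
float sigmoid(float x) {
    if (x >= 0.0f) {
        return 1.0f / (1.0f + std::exp(-x));
    }
    float e = std::exp(x);
    return e / (1.0f + e);
}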

ReLU

\text{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0, xW_{1}+b_{1})W_{2} + b_{2}

A version in T5 without bias:

\text{FFN}_\text{ReLU}(x,W_{1},W_{2}) = \max(xW_{1},0)W_{2}
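A naive reference sketch of this bias-free FFN (my own illustration; matrices are plain nested vectors, with no attempt at performance):

ffn-relu-sketch.cpp
#include <algorithm>
#include <vector>

using Mat = std::vector<std::vector<float>>; // row-major [rows][cols]

// y = x * W  (x: [n, d_in], W: [d_in, d_out])
Mat matmul(const Mat& x, const Mat& W) {
    Mat y(x.size(), std::vector<float>(W[0].size(), 0.0f));
    for (size_t i = 0; i < x.size(); ++i)
        for (size_t k = 0; k < W.size(); ++k)
            for (size_t j = 0; j < W[0].size(); ++j)
                y[i][j] += x[i][k] * W[k][j];
    return y;
}

// FFN_ReLU(x, W1, W2) = max(x W1, 0) W2
Mat ffn_relu(const Mat& x, const Mat& W1, const Mat& W2) {
    Mat h = matmul(x, W1);
    for (auto& row : h)
        for (auto& v : row) v = std::max(v, 0.0f);
    return matmul(h, W2);
}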

Swish

(Ramachandran et al., 2017) introduces an alternative to ReLU that works better on deeper models across different tasks.

f(x) = x \cdot \text{sigmoid}(\beta x), \quad \beta : \text{constant parameter}
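As a one-liner sketch (mine), using $x \cdot \text{sigmoid}(\beta x) = x / (1 + e^{-\beta x})$; $\beta = 1$ recovers SiLU:

swish-sketch.cpp
#include <cmath>

// Swish: f(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x)).
float swish(float x, float beta = 1.0f) {
    return x / (1.0f + std::exp(-beta * x));
}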

Gated Linear Units and Variants

The component-wise product of two linear transformations of the input, one of which is sigmoid-activated.

(Shazeer, 2020) introduces several GLU variants that yield improvements in the Transformer architecture.

\begin{aligned} \text{GLU}(x,W,V,b,c) &= \sigma(xW+b) \otimes (xV+c) \\ \text{Bilinear}(x,W,V,b,c) &= (xW+b) \otimes (xV+c) \end{aligned}

Other GLU variants:

\begin{aligned} \text{ReGLU}(x,W,V,b,c) &= \max(0, xW+b) \otimes (xV+c) \\ \text{GEGLU}(x,W,V,b,c) &= \text{GELU}(xW+b) \otimes (xV+c) \\ \text{SwiGLU}(x,W,V,b,c) &= \text{Swish}_\beta(xW+b) \otimes (xV+c) \end{aligned}

The FFN in Transformer layers would then become:

\begin{aligned} \text{FFN}_\text{GLU}(x,W,V,W_{2}) &= (\sigma (xW) \otimes xV)W_{2} \\ \text{FFN}_\text{Bilinear}(x,W,V,W_{2}) &= (xW \otimes xV)W_{2} \\ \text{FFN}_\text{ReGLU}(x,W,V,W_{2}) &= (\max(0, xW) \otimes xV)W_{2} \\ \text{FFN}_\text{GEGLU}(x,W,V,W_{2}) &= (\text{GELU}(xW) \otimes xV)W_{2} \\ \text{FFN}_\text{SwiGLU}(x,W,V,W_{2}) &= (\text{Swish}_\beta(xW) \otimes xV)W_{2} \end{aligned}

note: reduce the number of hidden units $d_\text{ff}$ (the second dimension of $W$ and $V$, and the first dimension of $W_{2}$) by a factor of $\frac{2}{3}$ when comparing these layers, so the parameter count matches the two-matrix FFN.
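A sketch of the SwiGLU gating step (my own illustration; it assumes the projections $xW$ and $xV$ have already been computed, e.g. with a matmul like the FFN sketch above, and the gated result would still be multiplied by $W_2$):

swiglu-gate-sketch.cpp
#include <cmath>
#include <vector>

// Element-wise SwiGLU gate: Swish_beta(xW) ⊗ xV.
std::vector<float> swiglu_gate(const std::vector<float>& xW,
                               const std::vector<float>& xV,
                               float beta = 1.0f) {
    std::vector<float> out(xW.size());
    for (size_t i = 0; i < xW.size(); ++i) {
        float swish = xW[i] / (1.0f + std::exp(-beta * xW[i]));
        out[i] = swish * xV[i];
    }
    return out;
}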

JumpReLU

(Erichson et al., 2019)

application: Gated SAE (Rajamanoharan et al., 2024)

J(z) \coloneqq z H(z - \kappa) = \begin{cases} 0 & \text{if } z \leq \kappa \\ z & \text{if } z > \kappa \end{cases}

where $H$ is the Heaviside step function and $\kappa$ is the threshold.
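As a sketch (mine), the piecewise definition is simply a thresholded identity:

jumprelu-sketch.cpp
// J(z) = z * H(z - kappa): pass z through only when it exceeds the
// threshold kappa, otherwise output zero.
float jump_relu(float z, float kappa) {
    return (z > kappa) ? z : 0.0f;
}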

momentum

See also SGD, Cornell’s CS6787, gradient descent

In the case of the quadratic function $f(x) = \frac{1}{2} x^2$, gradient descent gives $x_{t+1} = x_t - \alpha \nabla f(x_t) = x_t - \alpha x_t = (1-\alpha)x_t$

Think of the convergence rate:

\mid x_{t+1} - 0 \mid = \mid 1 - \alpha \mid \mid x_t - 0 \mid

If we use a different curvature, e.g. $f(x) = 2x^2$, then $x_{t+1} = x_t - 4 \alpha x_t = (1-4 \alpha)x_t$
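A tiny numerical check of these recurrences (my own sketch): gradient descent on $f(x) = \frac{c}{2}x^2$ contracts $x$ by $\lvert 1 - \alpha c \rvert$ per step, so a step size that works well for $c = 1$ stalls for $c = 4$.

gd-curvature-sketch.cpp
#include <cmath>
#include <cstdio>

// Gradient descent on f(x) = (c/2) x^2, whose gradient is c*x, contracts
// x by |1 - alpha*c| per step. c = 1 matches f(x) = x^2/2 above and
// c = 4 matches f(x) = 2x^2.
void run(float c, float alpha, int steps) {
    float x = 1.0f;
    for (int t = 0; t < steps; ++t) x -= alpha * c * x;
    std::printf("c=%.0f alpha=%.3f -> |x_%d| = %.3e (rate %.2f)\n",
                c, alpha, steps, std::fabs(x), std::fabs(1.0f - alpha * c));
}

int main() {
    run(1.0f, 0.5f, 20);   // rate |1 - 0.5| = 0.5: converges
    run(4.0f, 0.5f, 20);   // rate |1 - 2.0| = 1.0: stalls (oscillates)
    run(4.0f, 0.125f, 20); // smaller step restores rate 0.5
}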

step size

step size depends on curvature for one-dimensional quadratics

more curvature means smaller ideal step size

how does this play out for general quadratics?

for a symmetric PSD matrix $A$

f(x) = \frac{1}{2} x^T Ax

gradient descent has the update step

x_{t+1} = x_t - \alpha A x_t = (I - \alpha A)x_t

convergence rate would be

\begin{aligned} \max_{x} \frac{\|(I - \alpha A)x\|}{\|x\|} &= \max_{x} \frac{1}{\|x\|} \left\| \left( I - \alpha \sum_{i=1}^{n} \lambda_i u_i u_i^T \right) x \right\| \\[8pt] &= \max_{x} \frac{\|\sum_{i=1}^{n} (1- \alpha \lambda_i) u_i u_i^T x\|}{\|\sum_{i=1}^{n} u_i u_i^T x\|} \\ &= \max_i \mid 1- \alpha \lambda_i \mid \\ &= \max(1-\alpha \lambda_{\text{min}}, \alpha \lambda_{\text{max}} - 1) \end{aligned}

optimal convergence rate

optimal value occurs when

1 - \alpha \lambda_{\text{min}} = \alpha \lambda_{\text{max}} - 1 \Rightarrow \alpha = \frac{2}{\lambda_{\text{max}} + \lambda_{\text{min}}}

with rate

\frac{\lambda_{\text{max}} - \lambda_{\text{min}}}{\lambda_{\text{max}} + \lambda_{\text{min}}}

We denote $\kappa = \frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}$ as the condition number of the matrix $A$
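A quick worked example with made-up numbers: take $\lambda_{\text{max}} = 100$ and $\lambda_{\text{min}} = 1$, so $\kappa = 100$.

\alpha = \frac{2}{100 + 1} = \frac{2}{101}, \qquad \frac{\lambda_{\text{max}} - \lambda_{\text{min}}}{\lambda_{\text{max}} + \lambda_{\text{min}}} = \frac{99}{101} \approx 0.98

Even with the optimal fixed step size, each iteration only shrinks the error by about 2%.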

poorly conditioned

Problems with larger condition numbers converge slower.

Intuitively, these are problems that are highly curved in some directions but flat in others

Polyak

also known as the “heavy ball method”

idea: add an extra momentum term to gradient descent

x_{t+1} = x_t - \alpha \nabla f(x_t) + \beta (x_t - x_{t-1})

tl;dr: if the current gradient step is in the same direction as the previous step, move a little further in that direction
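A minimal sketch of one heavy-ball update (mine; plain vectors, with the gradient supplied by the caller):

heavy-ball-sketch.cpp
#include <vector>

// One step: x_{t+1} = x_t - alpha * grad + beta * (x_t - x_prev),
// where grad is evaluated at the current iterate x.
std::vector<float> polyak_step(const std::vector<float>& x,
                               const std::vector<float>& x_prev,
                               const std::vector<float>& grad,
                               float alpha, float beta) {
    std::vector<float> x_next(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        x_next[i] = x[i] - alpha * grad[i] + beta * (x[i] - x_prev[i]);
    return x_next;
}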

Nesterov

See also paper, momentum

idea:

  • first take a step in the direction of the accumulated momentum
  • compute the gradient at the “lookahead” position
  • make the update using this gradient

definition

For a parameter vector $\theta$, the update can be expressed as

\begin{aligned} v_t &= \mu v_{t-1} + \nabla L(\theta_t + \mu v_{t-1}) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{aligned}
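A minimal sketch of one Nesterov update matching the equations above (mine; the gradient callback and names are illustrative):

nesterov-sketch.cpp
#include <functional>
#include <vector>

using Vec = std::vector<float>;

// Evaluate the gradient at the lookahead point theta + mu * v,
// then update the velocity and the parameters in place.
void nesterov_step(Vec& theta, Vec& v,
                   const std::function<Vec(const Vec&)>& grad,
                   float alpha, float mu) {
    Vec lookahead(theta.size());
    for (size_t i = 0; i < theta.size(); ++i)
        lookahead[i] = theta[i] + mu * v[i];
    Vec g = grad(lookahead);
    for (size_t i = 0; i < theta.size(); ++i) {
        v[i] = mu * v[i] + g[i];   // v_t = mu * v_{t-1} + grad(theta + mu * v_{t-1})
        theta[i] -= alpha * v[i];  // theta_{t+1} = theta_t - alpha * v_t
    }
}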

Achieves better convergence rates

| function type | gradient descent | Nesterov AG |
| --- | --- | --- |
| Smooth | $\theta(\frac{1}{T})$ | $\theta(\frac{1}{T^{2}})$ |
| Smooth & Strongly Convex | $\theta(\exp(-\frac{T}{\kappa}))$ | $\theta(\exp(-\frac{T}{\sqrt{\kappa}}))$ |

optimal assignments for parameters

\alpha = \frac{1}{\lambda_{\text{max}}}, \quad \beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}

Bibliographie

  • Abdelkhalik, H., Arafa, Y., Santhi, N., & Badawy, A.-H. (2022). Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. arXiv preprint arXiv:2208.11174 [arxiv]
  • Erichson, N. B., Yao, Z., & Mahoney, M. W. (2019). JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks. arXiv preprint arXiv:1904.03750 [arxiv]
  • Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint arXiv:2407.14435 [arxiv]
  • Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941 [arxiv]
  • Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202 [arxiv]