ml optimization

softmax

\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}

where $y \in \mathbb{R}^k$
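The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation; subtracting `max(y)` before exponentiating is a standard stability trick that the formula does not spell out, and it leaves the result unchanged because the shift cancels in the ratio.

```python
import math

def softmax(y):
    """Softmax of a list of floats, shifted by max(y) for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]  # every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# Same relative gaps, much larger magnitude: a naive math.exp(1000.0) would overflow
shifted = softmax([1000.0, 1001.0, 1002.0])
```

Both calls return the same distribution, since softmax only depends on differences between the entries of $y$.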

`exp()`

`exp()` is usually a much better choice than computing `2**t` directly, mainly for numerical stability reasons.

ARM even provides a dedicated instruction for it (FEXPA)!

pseudocode-exp-fexpa.cpp

// Pseudocode representing the computation flow (not compilable as-is:
// fexpa() and evaluate_polynomial() are placeholders, and the arithmetic
// on vectors is written with scalar operators for readability).
float32x4_t exp_sve2(float32x4_t x) {
    // Step 1: Range reduction, so that x = N * ln(2) + r
    float32x4_t N = round(x * log2(e));
    float32x4_t r = x - N * ln(2);   // reduced argument, |r| <= ln(2) / 2

    // Step 2: FEXPA instruction provides 2^N approximation
    // In hardware: FEXPA Z0.S, Z1.S
    float32x4_t exp_approx = fexpa(N); // Result of FEXPA

    // Step 3: Polynomial evaluation for exp(r)
    // Typically uses Horner's method with reduced precision
    // coefficients since we're starting with a good approximation
    float32x4_t exp_r = evaluate_polynomial(r);

    // Step 4: Combine results, exp(x) = 2^N * exp(r)
    return exp_approx * exp_r;
}

Advantages of FEXPA:

single instruction latency for initial approximation
vectorized ops for batch processing
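The range-reduction flow in the pseudocode can be tried out in scalar Python. This is only a sketch under stated assumptions: the function name `exp_range_reduced`, the Taylor coefficients and the default degree are illustrative choices, and `math.ldexp` stands in for the power-of-two scaling that the pseudocode attributes to FEXPA.

```python
import math

LN_2 = math.log(2.0)
LOG2_E = 1.0 / LN_2

def exp_range_reduced(x, degree=6):
    """Approximate exp(x) as 2**n * p(r), where x = n * ln(2) + r."""
    n = round(x * LOG2_E)          # Step 1: nearest integer multiple of ln(2)
    r = x - n * LN_2               # reduced argument, |r| <= ln(2) / 2
    # Step 3: Taylor polynomial of exp(r), evaluated with Horner's method
    p = 1.0
    for k in range(degree, 0, -1):
        p = 1.0 + r * p / k
    # Steps 2 and 4: multiply by 2**n (exact in floating point via ldexp)
    return math.ldexp(p, n)

approx = exp_range_reduced(3.7)
exact = math.exp(3.7)
```

Because the reduced argument is small, a low-degree polynomial already gives a relative error far below what the same polynomial would achieve on the unreduced input.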

On GPU we can use a bit shift `1 << x` (integer `x` only) or CUDA’s `exp2`

Optimization in llama.cpp: ggerganov/llama.cpp#7154

sigmoid

\text{sigmoid}(x) = \frac{1}{1+e^{-x}}
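Evaluating the formula literally overflows for very negative inputs, since $e^{-x}$ becomes huge. A minimal sketch of the usual workaround, which branches on the sign so that only non-positive arguments are ever exponentiated (the branching is an implementation choice, not part of the definition):

```python
import math

def sigmoid(x):
    """Logistic sigmoid that never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z lies in [0, 1)
    return z / (1.0 + z)

values = [sigmoid(v) for v in (-1000.0, -2.0, 0.0, 2.0, 1000.0)]
```

Both branches compute the same function; the second is just the first with numerator and denominator multiplied by $e^{x}$.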

ReLU

\text{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0, xW_{1}+b_{1})W_{2} + b_{2}

A version in T5 without bias:

\text{FFN}_\text{ReLU}(x,W_{1},W_{2}) = \max(xW_{1},0)W_{2}

Swish

(Ramachandran et al., 2017) introduce an alternative to ReLU that works better on deeper models across different tasks.

f(x) = x \cdot \text{sigmoid}(\beta x)

where $\beta$ is a constant or trainable parameter
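A small sketch of the role of $\beta$: at $\beta = 0$ the function collapses to the linear map $x/2$, and for large $\beta$ it approaches ReLU. The helper name `swish` and the sample values are illustrative; this direct form is fine for moderate inputs but does not guard against overflow for very negative $\beta x$.

```python
import math

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

linear_limit = swish(3.0, beta=0.0)     # sigmoid(0) = 1/2, so this is 3.0 / 2
relu_like_pos = swish(2.0, beta=50.0)   # sigmoid(100) is ~1, so this is ~2.0
relu_like_neg = swish(-2.0, beta=50.0)  # sigmoid(-100) is ~0, so this is ~0.0
```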

Gated Linear Units and Variants

component-wise product of two linear transformations of the inputs, one of which is sigmoid-activated.

(Shazeer, 2020) introduces several GLU variants that yield improvements in the Transformer architecture.

\begin{aligned} \text{GLU}(x,W,V,b,c) &= \sigma(xW+b) \otimes (xV+c) \\ \text{Bilinear}(x,W,V,b,c) &= (xW+b) \otimes (xV+c) \end{aligned}

GLU in other variants:

\begin{aligned} \text{ReGLU}(x,W,V,b,c) &= \max(0, xW+b) \otimes (xV+c) \\ \text{GEGLU}(x,W,V,b,c) &= \text{GELU}(xW+b) \otimes (xV+c) \\ \text{SwiGLU}(x,W,V,b,c) &= \text{Swish}_\beta(xW+b) \otimes (xV+c) \end{aligned}

The FFN for Transformer layers then becomes:

\begin{aligned} \text{FFN}_\text{GLU}(x,W,V,W_{2}) &= (\sigma (xW) \otimes xV)W_{2} \\ \text{FFN}_\text{Bilinear}(x,W,V,W_{2}) &= (xW \otimes xV)W_{2} \\ \text{FFN}_\text{ReGLU}(x,W,V,W_{2}) &= (\max(0, xW) \otimes xV)W_{2} \\ \text{FFN}_\text{GEGLU}(x,W,V,W_{2}) &= (\text{GELU}(xW) \otimes xV)W_{2} \\ \text{FFN}_\text{SwiGLU}(x,W,V,W_{2}) &= (\text{Swish}_\beta(xW) \otimes xV)W_{2} \end{aligned}

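note: a GLU-style FFN has three weight matrices instead of two, so reduce the number of hidden units $d_\text{ff}$ (second dimension of $W$ and $V$ and the first dimension of $W_{2}$) by a factor of $\frac{2}{3}$ to keep the parameter count roughly constant when comparing these layers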
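To make the $\frac{2}{3}$ factor concrete, here is a rough sketch of $\text{FFN}_\text{SwiGLU}$ on plain Python lists, together with a parameter count. The sizes `d_model = 512` and `d_ff = 2048` are hypothetical, and the helpers `matvec`, `ffn_swiglu` and `ffn_params` are named only for this illustration.

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swish(v, beta=1.0):
    return v / (1.0 + math.exp(-beta * v))

def ffn_swiglu(x, W, V, W2):
    """(Swish(xW) * xV) W2 with a component-wise product, no biases."""
    gate = [swish(a) for a in matvec(x, W)]
    lin = matvec(x, V)
    hidden = [g * l for g, l in zip(gate, lin)]
    return matvec(hidden, W2)

def ffn_params(d_model, d_ff, gated):
    """Weight count: two matrices for the plain FFN, three for GLU variants."""
    return (3 if gated else 2) * d_model * d_ff

d_model, d_ff = 512, 2048
baseline = ffn_params(d_model, d_ff, gated=False)
gated = ffn_params(d_model, d_ff * 2 // 3, gated=True)

identity = [[1.0, 0.0], [0.0, 1.0]]
out = ffn_swiglu([1.0, 2.0], identity, identity, [[1.0], [1.0]])
```

With the hidden width scaled by $\frac{2}{3}$, `gated` and `baseline` differ only by the rounding of `d_ff * 2 // 3`.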

JumpReLU

(Erichson et al., 2019)

application: JumpReLU SAE (Rajamanoharan et al., 2024)

J(z) \coloneqq z H(z - \kappa) = \begin{cases} 0 & \text{if } z \leq \kappa \\ z & \text{if } z > \kappa \end{cases}

where $H$ is the Heaviside step function and $\kappa$ is the jump threshold
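A direct transcription of the piecewise definition, as a minimal sketch (the function name `jump_relu` is illustrative). Unlike ReLU, the output jumps from $0$ straight to values above $\kappa$, and with $\kappa = 0$ it reduces to ReLU.

```python
def jump_relu(z, kappa):
    """JumpReLU: pass z through unchanged only if it exceeds the threshold."""
    return z if z > kappa else 0.0

below = jump_relu(0.5, kappa=1.0)       # suppressed, although 0.5 is positive
at_threshold = jump_relu(1.0, kappa=1.0)
above = jump_relu(1.5, kappa=1.0)       # passed through unchanged
```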

momentum

For the quadratic $f(x) = \frac{1}{2} x^2$, gradient descent with step size $\alpha$ gives $x_{t+1} = x_t - \alpha x_t = (1-\alpha)x_t$

Think of the convergence rate towards the minimizer $x^{*} = 0$:

|x_{t+1} - 0| = |1 - \alpha| \, |x_t - 0|

With a different curvature, e.g. $f(x) = 2x^2$, the update becomes $x_{t+1} = x_t - 4 \alpha x_t = (1-4 \alpha)x_t$

step size

step size depends on curvature for one-dimensional quadratics

more curvature means smaller ideal step size
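A quick numerical check of the two points above, using the same one-dimensional quadratics. The step sizes `0.9` and `0.225` are arbitrary illustrative choices; the helper `gd_1d` is not from the original note.

```python
def gd_1d(curvature, alpha, x0=1.0, steps=20):
    """Gradient descent on f(x) = (curvature / 2) * x**2; returns the |x_t| history."""
    x = x0
    history = [abs(x)]
    for _ in range(steps):
        x = x - alpha * curvature * x   # the gradient of f is curvature * x
        history.append(abs(x))
    return history

flat = gd_1d(curvature=1.0, alpha=0.9)          # f = x^2 / 2: factor |1 - 0.9| = 0.1
steep_same = gd_1d(curvature=4.0, alpha=0.9)    # f = 2 x^2:   factor |1 - 3.6| = 2.6
steep_small = gd_1d(curvature=4.0, alpha=0.225) # f = 2 x^2:   factor |1 - 0.9| = 0.1
```

The step size that works well for the flat function makes the iterates diverge on the steeper one; dividing it by the curvature ratio restores the same contraction factor.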

how does this play out for general quadratics?

for PSD symmetric $A$ with eigendecomposition $A = \sum_{i=1}^{n} \lambda_i u_i u_i^T$ (orthonormal eigenvectors $u_i$)

f(x) = \frac{1}{2} x^T Ax

gradient descent has the update step

x_{t+1} = x_t - \alpha A x_t = (I - \alpha A)x_t

the convergence rate is then

\begin{aligned} \max_{x} \frac{\|(I - \alpha A)x\|}{\|x\|} &= \max_{x} \frac{1}{\|x\|} \left\| \left( I - \alpha \sum_{i=1}^{n} \lambda_i u_i u_i^T \right) x \right\| \\[8pt] &= \max_{x} \frac{\|\sum_{i=1}^{n} (1- \alpha \lambda_i) u_i u_i^T x\|}{\|\sum_{i=1}^{n} u_i u_i^T x\|} \\ &= \max_i |1- \alpha \lambda_i| \\ &= \max(1-\alpha \lambda_{\text{min}}, \alpha \lambda_{\text{max}} - 1) \end{aligned}

optimal convergence rate

optimal value occurs when
$1 - \alpha \lambda_{\text{min}} = \alpha \lambda_{\text{max}} - 1 \Rightarrow \alpha = \frac{2}{\lambda_{\text{max}} + \lambda_{\text{min}}}$
with rate
$\frac{\lambda_{\text{max}} - \lambda_{\text{min}}}{\lambda_{\text{max}} + \lambda_{\text{min}}}$

We denote $\kappa = \frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}$ as the condition number of the matrix $A$, so the optimal rate can be written as $\frac{\kappa - 1}{\kappa + 1}$
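The optimal step size can be sanity-checked numerically. The sketch below assumes a hypothetical spectrum `eigs` for $A$ and compares the closed-form $\alpha$ against a brute-force search over a grid of step sizes; only the eigenvalues matter, so no matrix is needed.

```python
def contraction(alpha, eigenvalues):
    """Worst-case per-step contraction factor max_i |1 - alpha * lambda_i|."""
    return max(abs(1.0 - alpha * lam) for lam in eigenvalues)

eigs = [1.0, 4.0, 10.0]                      # hypothetical eigenvalues of A
lam_min, lam_max = min(eigs), max(eigs)

alpha_opt = 2.0 / (lam_max + lam_min)
rate_opt = (lam_max - lam_min) / (lam_max + lam_min)
kappa = lam_max / lam_min

# Brute-force search over step sizes 0.0001, 0.0002, ..., 0.2999
grid = [i / 10000.0 for i in range(1, 3000)]
best_alpha = min(grid, key=lambda a: contraction(a, eigs))
```

The grid minimizer lands next to `alpha_opt`, and the contraction there equals both the eigenvalue expression and $\frac{\kappa - 1}{\kappa + 1}$ up to rounding.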

poorly conditioned

Problems with larger condition numbers converge slower.

Intuitively these are problems that are highly curved in some directions, but flat in others

Polyak

also known as: the “heavy ball method”

idea: add an extra momentum term to gradient descent

x_{t+1} = x_t - \alpha \nabla f(x_t) + \beta (x_t - x_{t-1})

tl;dr: if the current gradient step is in the same direction as the previous step, then move a little further in that direction
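The update rule above can be sketched for a scalar problem as follows. The test function $f(x) = \frac{\lambda}{2} x^2$ with $\lambda = 10$ and the values of $\alpha$ and $\beta$ are illustrative assumptions, as is the choice $x_{-1} = x_0$ for the first step.

```python
def heavy_ball(grad, x0, alpha, beta, steps):
    """Polyak momentum: x_{t+1} = x_t - alpha * grad(x_t) + beta * (x_t - x_{t-1})."""
    x_prev, x = x0, x0           # x_{-1} = x_0, so the first step is plain GD
    for _ in range(steps):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

lam = 10.0
grad = lambda x: lam * x         # gradient of f(x) = (lam / 2) * x**2

plain = heavy_ball(grad, 1.0, alpha=0.01, beta=0.0, steps=50)     # beta = 0: plain GD
momentum = heavy_ball(grad, 1.0, alpha=0.01, beta=0.5, steps=50)
```

With the same small step size, the momentum run ends several orders of magnitude closer to the minimizer than plain gradient descent.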

momentum for 1D quadratics

$f(x) = \frac{\lambda}{2} x^{2}$
momentum GD gives
$\begin{aligned} x_{t+1} &= x_t - \alpha \lambda x_t + \beta (x_t - x_{t-1}) \\ &= (1+\beta - \alpha \lambda) x_t - \beta x_{t-1} \end{aligned}$
characterizing momentum:

start with $x_{t+1} = (1+\beta -\alpha \lambda) x_t - \beta x_{t-1}$

trick: let $x_t = \beta^{t/2}z_t$

$z_{t+1} = \frac{1 + \beta - \alpha \lambda}{\sqrt{\beta}} z_t - z_{t-1}$
let $u = \frac{1+\beta -\alpha \lambda}{2 \sqrt{\beta}}$ , then
$z_{t+1} = 2 u z_t - z_{t-1}$
so $z_t$ is a degree-$t$ polynomial in $u$ (this is the three-term recurrence satisfied by the Chebyshev polynomials)
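The change of variables can be verified numerically by running both recurrences side by side. The constants and the starting values $x_0, x_1$ below are arbitrary illustrative choices, picked so that $|u| < 1$.

```python
import math

alpha, beta, lam = 0.05, 0.64, 4.0
c = 1.0 + beta - alpha * lam             # coefficient of x_t in the x-recurrence
u = c / (2.0 * math.sqrt(beta))

x = [1.0, 0.5]                           # arbitrary x_0, x_1
z = [x[0], x[1] / math.sqrt(beta)]       # z_t = x_t / beta**(t / 2)
for t in range(1, 30):
    x.append(c * x[t] - beta * x[t - 1])
    z.append(2.0 * u * z[t] - z[t - 1])

# Largest mismatch between x_t and beta**(t / 2) * z_t over all iterations
max_gap = max(abs(x[t] - beta ** (t / 2) * z[t]) for t in range(len(x)))
```

Since $|u| < 1$ here, $z_t$ stays bounded and $x_t$ shrinks like $\beta^{t/2}$, which is how the substitution exposes the convergence rate of momentum.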

Nesterov

See also paper, momentum

idea:

first take a step in the direction of the accumulated momentum

compute the gradient at the “lookahead” position

make the update using this gradient

definition

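For a parameter vector $\theta$ and a velocity $v$ that accumulates gradients, the update can be expressed as
$\begin{aligned} v_t &= \mu v_{t-1} + \nabla L(\theta_t - \alpha \mu v_{t-1}) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{aligned}$
where $\theta_t - \alpha \mu v_{t-1}$ is the lookahead position, i.e. where the momentum term alone would move $\theta_t$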

Achieves better convergence rates

| function type | gradient descent | Nesterov AG |
| --- | --- | --- |
| Smooth | $O(\frac{1}{T})$ | $O(\frac{1}{T^{2}})$ |
| Smooth & Strongly Convex | $O(\exp(-\frac{T}{\kappa}))$ | $O(\exp(-\frac{T}{\sqrt{\kappa}}))$ |

optimal parameter choices (here $\beta$ is the momentum coefficient, written $\mu$ in the update above)

$\alpha = \frac{1}{\lambda_{\text{max}}}, \beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$
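A minimal sketch of Nesterov's method with these parameter choices, on a hypothetical diagonal quadratic with eigenvalues $1$ and $100$. One assumption is worth flagging: the lookahead point is taken as $\theta_t - \alpha \mu v_{t-1}$, which is the position the momentum term alone would reach under the sign convention $\theta_{t+1} = \theta_t - \alpha v_t$. Setting `mu = 0` recovers plain gradient descent, which serves as the baseline.

```python
import math

def nesterov(grad, theta, alpha, mu, steps):
    """Nesterov momentum on a list-valued parameter vector."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Step along the accumulated momentum, then take the gradient there
        lookahead = [th - alpha * mu * vi for th, vi in zip(theta, v)]
        g = grad(lookahead)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        theta = [th - alpha * vi for th, vi in zip(theta, v)]
    return theta

def norm(vec):
    return math.sqrt(sum(t * t for t in vec))

eigs = [1.0, 100.0]                               # hypothetical diagonal A
grad = lambda th: [lam * t for lam, t in zip(eigs, th)]

kappa = max(eigs) / min(eigs)
alpha = 1.0 / max(eigs)
mu = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)

nag = nesterov(grad, [1.0, 1.0], alpha, mu, steps=100)
gd = nesterov(grad, [1.0, 1.0], alpha, 0.0, steps=100)   # mu = 0: plain GD
```

On this poorly conditioned problem the flat direction dominates: gradient descent has barely moved along it after 100 steps, while the accelerated iterate is already close to the minimizer.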


Bibliography

Abdelkhalik, H., Arafa, Y., Santhi, N., & Badawy, A.-H. (2022). Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. arXiv preprint arXiv:2208.11174
Erichson, N. B., Yao, Z., & Mahoney, M. W. (2019). JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks. arXiv preprint arXiv:1904.03750
Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint arXiv:2407.14435
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941
Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202
