A list of activation functions and optimization methods used in ML training to reduce the loss.
sigmoid
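Squashes its input into $(0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$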
ReLU
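Zeroes out negative inputs:

$$\mathrm{ReLU}(x) = \max(0, x)$$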
A bias-free version, as used in T5:
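$$\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)\,W_2$$

(notation as in Shazeer, 2020)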
Swish
(Ramachandran et al., 2017) introduces Swish as an alternative to ReLU that performs better in deeper models across different tasks.
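$$\mathrm{Swish}_\beta(x) = x \cdot \sigma(\beta x)$$

where $\beta$ is a constant or trainable parameter; $\beta = 1$ gives the SiLU.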
Gated Linear Units and Variants
The component-wise product of two linear transformations of the input, one of which is sigmoid-activated.
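$$\mathrm{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$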
(Shazeer, 2020) introduces several GLU variants (swapping the sigmoid gate for other activations such as ReLU, GELU, and Swish) that yield improvements in the Transformer architecture.
Other GLU variants:
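As defined in Shazeer (2020):

$$\begin{aligned}
\mathrm{Bilinear}(x, W, V, b, c) &= (xW + b) \otimes (xV + c) \\
\mathrm{ReGLU}(x, W, V, b, c) &= \max(0, xW + b) \otimes (xV + c) \\
\mathrm{GEGLU}(x, W, V, b, c) &= \mathrm{GELU}(xW + b) \otimes (xV + c) \\
\mathrm{SwiGLU}(x, W, V, b, c) &= \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)
\end{aligned}$$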
The FFN in Transformer layers then becomes:
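$$\begin{aligned}
\mathrm{FFN}_{\mathrm{GLU}}(x, W, V, W_2) &= (\sigma(xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) &= (\mathrm{GELU}(xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) &= (\mathrm{Swish}_1(xW) \otimes xV)\,W_2
\end{aligned}$$

(bias-free, as in Shazeer, 2020)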
_note_: reduce the number of hidden units $d_{ff}$ (the second dimension of $W$ and $V$ and the first dimension of $W_2$) by a factor of $2/3$ when comparing these layers to the original two-matrix FFN, to keep the parameter count constant.
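A minimal NumPy sketch of the bias-free $\mathrm{FFN}_{\mathrm{SwiGLU}}$ layer; the toy dimensions and random weights are illustrative only:

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish_beta(x) = x * sigmoid(beta * x); beta = 1 is the SiLU
    return x / (1.0 + np.exp(-beta * x))

def ffn_swiglu(x, W, V, W2):
    # FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2, bias-free
    return (swish(x @ W) * (x @ V)) @ W2

# toy usage: d_model = 8, d_ff = 16, batch of 4 token vectors
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W, V = rng.normal(size=(2, 8, 16))
W2 = rng.normal(size=(16, 8))
print(ffn_swiglu(x, W, V, W2).shape)  # (4, 8)
```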
JumpReLU
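Used in sparse autoencoders; zeroes out pre-activations below a learned threshold $\theta > 0$ (Rajamanoharan et al., 2024):

$$\mathrm{JumpReLU}_\theta(x) = x \cdot H(x - \theta)$$

where $H$ is the Heaviside step function.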
momentum
See also Stochastic gradient descent
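Accumulates an exponentially decaying velocity of past gradients and moves the parameters along it:

$$v_{t+1} = \mu v_t - \epsilon \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$

with momentum coefficient $\mu$ and learning rate $\epsilon$.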
Nesterov momentum
See also paper
idea:
- first, take a step in the direction of the accumulated momentum
- compute the gradient at the "lookahead" position
- make the update using this gradient
definition
For a parameter vector $\theta_t$ with velocity $v_t$, learning rate $\epsilon$, and momentum coefficient $\mu$, the update can be expressed as:

$$v_{t+1} = \mu v_t - \epsilon \nabla f(\theta_t + \mu v_t)$$
$$\theta_{t+1} = \theta_t + v_{t+1}$$
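A minimal NumPy sketch of one Nesterov step; the quadratic toy objective and hyperparameters are illustrative only:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, mu=0.9):
    # gradient is evaluated at the "lookahead" point theta + mu * v
    g = grad_fn(theta + mu * v)
    v_new = mu * v - lr * g
    return theta + v_new, v_new

# toy usage: minimize f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
print(theta)  # ~[0, 0], the minimizer
```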
Achieves better convergence rates than plain gradient descent:
| function type | gradient descent | Nesterov AG |
| --- | --- | --- |
| Smooth | $O(1/T)$ | $O(1/T^2)$ |
| Smooth & Strongly Convex | $O(\exp(-T/\kappa))$ | $O(\exp(-T/\sqrt{\kappa}))$ |
Polyak’s Momentum
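The heavy-ball method: add a momentum term based on the previous iterate, with the gradient taken at the current point rather than at a lookahead:

$$\theta_{t+1} = \theta_t - \epsilon \nabla f(\theta_t) + \mu (\theta_t - \theta_{t-1})$$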
References
- Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint arXiv:2407.14435.
- Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941.
- Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.