A list of activation functions that can be used in ML models trained to minimize a loss.

sigmoid
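For reference, the standard logistic sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$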

ReLU
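ReLU keeps only the positive part of its input:

$$\mathrm{ReLU}(x) = \max(0, x)$$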

The FFN layer in T5 uses ReLU without bias terms:
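As written in Shazeer (2020), with $W_1$ and $W_2$ the two projection matrices:

$$\mathrm{FFN}_{\mathrm{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)\,W_2$$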

Swish

(Ramachandran et al., 2017) introduces Swish, an alternative to ReLU that works better on deeper models across different tasks.
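The function itself, where $\sigma$ is the sigmoid and $\beta$ is either a constant or a trainable parameter:

$$\mathrm{Swish}(x) = x \cdot \sigma(\beta x)$$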

Gated Linear Units and Variants

GLU (Gated Linear Unit) is the component-wise product of two linear transformations of the input, one of which is sigmoid-activated.
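In the notation of Shazeer (2020), with $\otimes$ denoting the component-wise product:

$$\mathrm{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$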

(Shazeer, 2020) introduces several GLU variants, using different activations in place of the sigmoid, that yield quality improvements in the Transformer architecture.

The GLU variants:
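The variants listed in Shazeer (2020) swap the sigmoid for other activations, or drop it entirely:

$$\begin{aligned}
\mathrm{Bilinear}(x, W, V, b, c) &= (xW + b) \otimes (xV + c) \\
\mathrm{ReGLU}(x, W, V, b, c) &= \max(0, xW + b) \otimes (xV + c) \\
\mathrm{GEGLU}(x, W, V, b, c) &= \mathrm{GELU}(xW + b) \otimes (xV + c) \\
\mathrm{SwiGLU}(x, W, V, b, c, \beta) &= \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)
\end{aligned}$$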

The Transformer FFN layer would then become:
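Following the paper, the bias terms are omitted in the FFN versions:

$$\begin{aligned}
\mathrm{FFN}_{\mathrm{GLU}}(x, W, V, W_2) &= (\sigma(xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{Bilinear}}(x, W, V, W_2) &= (xW \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{ReGLU}}(x, W, V, W_2) &= (\max(0, xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{GEGLU}}(x, W, V, W_2) &= (\mathrm{GELU}(xW) \otimes xV)\,W_2 \\
\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) &= (\mathrm{Swish}_1(xW) \otimes xV)\,W_2
\end{aligned}$$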

_note: to keep the number of parameters and the amount of computation constant, reduce the number of hidden units $d_{ff}$ (the second dimension of $W$ and $V$, and the first dimension of $W_2$) by a factor of $2/3$ when comparing these layers to the original two-matrix FFN._
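As a concrete illustration, here is a minimal PyTorch sketch of the $\mathrm{FFN}_{\mathrm{SwiGLU}}$ layer above; the class name, shapes, and hyperparameters are illustrative, not taken from the papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Transformer FFN using SwiGLU: (Swish_1(xW) ⊗ xV) W2, with no bias terms."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Two parallel input projections (W and V) and one output projection (W2).
        self.w = nn.Linear(d_model, d_ff, bias=False)
        self.v = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish with beta = 1 is SiLU: x * sigmoid(x).
        return self.w2(F.silu(self.w(x)) * self.v(x))

# Example: with the 2/3 reduction noted above, a ReLU FFN with d_ff = 2048
# corresponds to a gated FFN with roughly 2048 * 2/3 ≈ 1365 hidden units.
ffn = SwiGLUFFN(d_model=512, d_ff=1365)
out = ffn(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
```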

Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941.
Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.