A list of activation functions commonly used in neural networks, and how they appear in Transformer feed-forward layers.
sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
ReLU

$$\text{ReLU}(x) = \max(0, x)$$

The position-wise feed-forward network (FFN) in the original Transformer applies ReLU between two linear transformations:

$$\text{FFN}(x, W_1, W_2, b_1, b_2) = \max(0, xW_1 + b_1)\,W_2 + b_2$$
A version in T5 without bias:

$$\text{FFN}_{\text{ReLU}}(x, W_1, W_2) = \max(xW_1, 0)\,W_2$$
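A minimal NumPy sketch of this bias-free ReLU FFN (the helper names and toy shapes below are illustrative, not taken from the T5 codebase):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(x, 0.0)

def ffn_relu(x, W1, W2):
    # Bias-free position-wise FFN: max(x W1, 0) W2
    return relu(x @ W1) @ W2

# Toy shapes: d_model = 4, d_ff = 8 (arbitrary, for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))       # (batch, d_model)
W1 = rng.normal(size=(4, 8))      # (d_model, d_ff)
W2 = rng.normal(size=(8, 4))      # (d_ff, d_model)
print(ffn_relu(x, W1, W2).shape)  # (2, 4)
```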
Swish

Swish (Ramachandran et al., 2017) is an alternative to ReLU that works better on deeper models across different tasks:

$$\text{Swish}_\beta(x) = x \cdot \sigma(\beta x)$$
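A minimal NumPy sketch of sigmoid and Swish as defined above (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish_beta(x) = x * sigma(beta * x); beta = 1 gives SiLU
    return x * sigmoid(beta * x)

x = np.linspace(-3.0, 3.0, 7)
print(swish(x))  # Swish values on a small grid
```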
Gated Linear Units and Variants
GLU (Gated Linear Unit) is the component-wise product of two linear transformations of the input, one of which is sigmoid-activated:

$$\text{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$
(Shazeer, 2020) introduces several GLU variants, which replace the sigmoid gate with other activation functions and yield improvements in the Transformer architecture.
Other GLU variants:

$$\text{ReGLU}(x, W, V, b, c) = \max(0, xW + b) \otimes (xV + c)$$

$$\text{GEGLU}(x, W, V, b, c) = \text{GELU}(xW + b) \otimes (xV + c)$$

$$\text{SwiGLU}(x, W, V, b, c, \beta) = \text{Swish}_\beta(xW + b) \otimes (xV + c)$$
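The gated units above can be sketched in NumPy as follows, assuming the common tanh approximation of GELU (all names here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu(x, W, V, b, c):
    # sigmoid-gated component-wise product of two linear transformations
    return sigmoid(x @ W + b) * (x @ V + c)

def reglu(x, W, V, b, c):
    return np.maximum(x @ W + b, 0.0) * (x @ V + c)

def geglu(x, W, V, b, c):
    return gelu(x @ W + b) * (x @ V + c)

def swiglu(x, W, V, b, c, beta=1.0):
    gate = x @ W + b
    return (gate * sigmoid(beta * gate)) * (x @ V + c)
```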
The Transformer FFN layer then becomes (biases omitted, following the T5 convention):

$$\text{FFN}_{\text{GLU}}(x, W, V, W_2) = (\sigma(xW) \otimes xV)\,W_2$$

$$\text{FFN}_{\text{ReGLU}}(x, W, V, W_2) = (\max(0, xW) \otimes xV)\,W_2$$

$$\text{FFN}_{\text{GEGLU}}(x, W, V, W_2) = (\text{GELU}(xW) \otimes xV)\,W_2$$

$$\text{FFN}_{\text{SwiGLU}}(x, W, V, W_2) = (\text{Swish}_1(xW) \otimes xV)\,W_2$$
_note: these layers have three weight matrices instead of two, so the number of hidden units $d_{\text{ff}}$ (the second dimension of $W$ and $V$ and the first dimension of $W_2$) is reduced by a factor of 2/3 when comparing these layers to the original two-matrix FFN, keeping the parameter count the same._
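Putting it together, a bias-free SwiGLU FFN with the 2/3 scaling might look like the sketch below; the shapes (d_model = 512, d_ff_relu = 2048) are illustrative assumptions rather than values from a specific model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ffn_swiglu(x, W, V, W2):
    # (Swish_1(x W) * x V) W2, where * is the component-wise product; biases omitted as in T5
    h = x @ W
    return ((h * sigmoid(h)) * (x @ V)) @ W2

d_model = 512
d_ff_relu = 2048                  # hidden size of a baseline ReLU FFN
d_ff = int(2 / 3 * d_ff_relu)     # 1365, scaled by 2/3

rng = np.random.default_rng(0)
x = rng.normal(size=(2, d_model))
W = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

# Parameter counts roughly match:
#   ReLU FFN:   2 * d_model * d_ff_relu = 2,097,152
#   SwiGLU FFN: 3 * d_model * d_ff      = 2,096,640
print(ffn_swiglu(x, W, V, W2).shape)  # (2, 512)
```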