See also: LLMs
ELI5: Mom often creates a food list consists of of items to buy. Your job is to guess what the last item on this list would be.
Most implementations are autoregressive. Most major SOTA are decoder-only, as encoder-decoder models has lack behind due to their expensive encoding phase.
See this amazing visualisation from Brendan Bycroft
Currently, there is a rise for state-space models which shows promise in information-dense data