See also: LLMs

ELI5: Mom often creates a food list consists of of items to buy. Your job is to guess what the last item on this list would be.

Most implementations are autoregressive. Most major SOTA are decoder-only, as encoder-decoder models has lack behind due to their expensive encoding phase.

See this amazing visualisation from Brendan Bycroft

Currently, there is a rise for state-space models which shows promise in information-dense data

inference.

embedding

next-token prediction.