See also: LLMs, embedding, visualisation from Brendan Bycroft
A multi-layer perceptron (MLP) architecture built on top of a multi-head attention mechanism (Vaswani et al., 2023), which signals high-entropy tokens to be amplified and less important tokens to be diminished.
ELI5: Mom often writes a grocery list of items to buy. Your job is to guess what the last item on this list would be.
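To make the "amplify some tokens, diminish others" idea concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. The toy `queries`, `keys` and `values` are made-up numbers, and masking, multiple heads and learned projections are left out on purpose.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention, no masking (illustrative only)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy inputs: one query, three key/value pairs.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]
outputs, weights = attention(queries, keys, values)
```

Keys that align with the query receive larger weights, so their values dominate the output; the orthogonal key is almost ignored.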
Most implementations are autoregressive. Most major SOTA models are decoder-only, as encoder-decoder models have lagged behind due to their expensive encoding phase.
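Autoregressive generation can be sketched with a toy stand-in for the model. The example below assumes a bigram count table in place of a transformer (a real model conditions on the whole prefix, not just the last token), and the shopping lists and the `<eos>` marker are made up for illustration.

```python
from collections import Counter, defaultdict

EOS = "<eos>"  # hypothetical end-of-sequence marker

def train_bigram(sequences):
    """Count which item follows which; a toy replacement for a trained model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = list(seq) + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens=10):
    """Greedy autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = counts.get(tokens[-1])
        if not candidates:
            break
        next_token = candidates.most_common(1)[0][0]
        if next_token == EOS:
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

shopping_lists = [
    ["milk", "eggs", "bread"],
    ["milk", "eggs", "butter"],
    ["milk", "eggs", "bread"],
]
model = train_bigram(shopping_lists)
completion = generate(model, ["milk"])  # ['milk', 'eggs', 'bread']
```

The key property is the feedback loop: each predicted token is appended to the context before the next prediction is made.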
state-space models which address transformers’ efficiency issues in attention layers within information-dense data
memory limitations.
excerpt from arxiv
"How is LLaMa.cpp possible?"
by Andrej Karpathy (@karpathy), 15 August 2023
great post by @finbarrtimbers https://t.co/yF43inlY87
llama.cpp surprised many people (myself included) with how quickly you can run large LLMs on small computers, e.g. 7B runs @ ~16 tok/s on a MacBook. Wait don't you need supercomputers to work… pic.twitter.com/EIp9iPkZ6x
inference.
Either compute-bound (batch inference, saturated usage) or memory-bound (latency)
speculative decoding ⇒ targets the memory-bound case (uses otherwise idle FLOPs to verify several drafted tokens per pass over the weights)
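The compute-bound versus memory-bound split can be estimated with back-of-the-envelope arithmetic. The sketch below assumes roughly 2 FLOPs per parameter per generated token and that every weight is read once per forward pass; it ignores the KV cache and attention cost. The hardware numbers are hypothetical placeholders, not measurements of any specific machine.

```python
def decode_tokens_per_second(n_params, bytes_per_param, memory_bandwidth_bytes_per_s):
    """Upper bound on batch-1 decoding speed if reading the weights is the bottleneck."""
    model_bytes = n_params * bytes_per_param
    return memory_bandwidth_bytes_per_s / model_bytes

def arithmetic_intensity(batch_size, bytes_per_param):
    """FLOPs per byte of weights read: ~2 FLOPs per parameter per sequence in the batch."""
    return 2 * batch_size / bytes_per_param

def is_compute_bound(batch_size, bytes_per_param, peak_flops, memory_bandwidth_bytes_per_s):
    """Compare the workload's intensity with the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / memory_bandwidth_bytes_per_s
    return arithmetic_intensity(batch_size, bytes_per_param) > ridge

# Hypothetical hardware: 100 GB/s memory bandwidth, 4 TFLOP/s peak compute.
BANDWIDTH = 100e9
PEAK_FLOPS = 4e12

# 7B parameters quantised to 4 bits (0.5 bytes per parameter).
tokens_per_s = decode_tokens_per_second(7e9, 0.5, BANDWIDTH)  # about 28.6

single_stream = is_compute_bound(1, 2, PEAK_FLOPS, BANDWIDTH)   # False: memory-bound
large_batch = is_compute_bound(64, 2, PEAK_FLOPS, BANDWIDTH)    # True: compute-bound
```

Under the same assumptions, verifying k drafted tokens in one forward pass has the arithmetic intensity of a batch of k, which is why speculative decoding mainly helps in the memory-bound regime.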
next-token prediction.
Sampling: we essentially look forward K-tokens, and then we sample from the distribution of the next token.
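One common way to sample from a next-token distribution is temperature scaling combined with top-k truncation. Whether the K in the note above refers to top-k truncation is an assumption; the sketch below uses made-up `logits` and the hypothetical helper name `sample_top_k`.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token index from the k highest logits after temperature scaling."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Unnormalised softmax weights (random.choices normalises them internally).
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is reproducible
logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical model output
samples = [sample_top_k(logits, k=2, rng=rng) for _ in range(1000)]
greedy = sample_top_k(logits, k=1, rng=rng)  # k=1 reduces to greedy decoding
```

Lower temperatures sharpen the distribution toward the highest logit, while a smaller k removes the low-probability tail entirely.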
Byte-Latent Transformer
idea: learn from raw bytes and skip the tokenizer/detokenizer step.
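As a small illustration of what "raw bytes" means as model input, the sketch below turns a string into UTF-8 byte values and back. It only shows the input representation (a fixed alphabet of 256 symbols, no learned vocabulary) and says nothing about the model architecture itself; the sample text is arbitrary.

```python
text = "naïve café"  # arbitrary example with non-ASCII characters

# Every string maps to a sequence of integers in [0, 255] without a tokenizer.
byte_ids = list(text.encode("utf-8"))

# The mapping is lossless: decoding the bytes recovers the original text.
decoded = bytes(byte_ids).decode("utf-8")

n_chars = len(text)      # 10 characters
n_bytes = len(byte_ids)  # 12 bytes, since 'ï' and 'é' take two bytes each
```

The trade-off is sequence length: byte sequences are longer than token sequences for the same text.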
Feynman-Kac
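Let $\mathcal{V}$ be the vocabulary of a given transformer model, and $\mathcal{S} = \mathcal{V}^{*}$ the set of multi-token strings. Assume $\mathcal{V}$ contains an end-of-sequence token EOS, and write $\mathcal{F} \subseteq \mathcal{S}$ for the set of EOS-terminated strings.
A Feynman-Kac Transformer model is a tuple $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$ where:
- $s_0 \in \mathcal{S}$ is an initial state, which we take to be the empty string $\epsilon$
- $M_t(s_t \mid s_{t-1}, f_\theta)$ is a Markov kernel from $s_{t-1} \in \mathcal{F}^c$ to $s_t \in \mathcal{S}$, parameterised by a transformer network $f_\theta: \mathcal{F}^c \to \mathbb{R}^{|\mathcal{V}|}$ mapping non-EOS-terminated strings to vectors of logits
- $G_t(s_{t-1}, s_t, f_\theta)$ is a potential function, mapping a pair $(s_{t-1}, s_t) \in \mathcal{F}^c \times \mathcal{S}$ to a real-valued non-negative score.
Goal: generate from the distribution $\mathbb{P}$ that reweights the Markov chain $M$ by the potential functions $G_t$. Writing $T$ for the first step at which the chain reaches an EOS-terminated string, we define the step-$t$ filtering posteriors:
$$
\mathbb{P}_t(s_t) = \frac{\mathbb{E}_M\left[\prod_{i=1}^{t \wedge T} G_i(S_{i-1}, S_i, f_\theta) \cdot [S_t = s_t]\right]}{\mathbb{E}_M\left[\prod_{i=1}^{t \wedge T} G_i(S_{i-1}, S_i, f_\theta)\right]}
$$
Given that $T$ is almost surely finite, we can then define the overall posterior $\mathbb{P}(s) = \lim_{t \to \infty} \mathbb{P}_t(s)$ (Lew et al., 2023, see 2.2 for examples)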
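A rough sketch of how sequential Monte Carlo can approximate such a reweighted distribution is given below. Everything in it is a toy assumption: the kernel is uniform over a three-token vocabulary instead of a transformer, the potential simply forbids the token "b", and resampling is plain multinomial. It is meant to show the extend, weight, resample loop, not the full algorithm from Lew et al.

```python
import random

EOS = "<eos>"
VOCAB = ["a", "b", EOS]  # hypothetical toy vocabulary

def markov_kernel(prefix, rng):
    """Toy stand-in for M_t: append a uniformly random token (no transformer here)."""
    return prefix + [rng.choice(VOCAB)]

def potential(prev, new):
    """Toy G_t: score 0 if the newly added token is 'b', otherwise 1."""
    return 0.0 if new[-1] == "b" else 1.0

def is_finished(s):
    return bool(s) and s[-1] == EOS

def smc(n_particles=50, max_steps=20, seed=0):
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]  # s_0 is the empty string
    for _ in range(max_steps):
        if all(is_finished(p) for p in particles):
            break
        proposals, weights = [], []
        for p in particles:
            if is_finished(p):
                # Past the stopping time the particle is left unchanged with weight 1.
                proposals.append(p)
                weights.append(1.0)
            else:
                new = markov_kernel(p, rng)
                proposals.append(new)
                weights.append(potential(p, new))
        if sum(weights) == 0:
            raise RuntimeError("all particles received zero weight")
        # Multinomial resampling proportional to the potentials.
        chosen = rng.choices(proposals, weights=weights, k=n_particles)
        particles = [list(p) for p in chosen]
    return particles

particles = smc()
```

Because zero-weight proposals are never resampled, every surviving particle satisfies the constraint encoded by the potential.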
Bibliography
- Lew, A. K., Zhi-Xuan, T., Grand, G., & Mansinghka, V. K. (2023). Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs. arXiv preprint arXiv:2306.03081 [arXiv]
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need. arXiv preprint arXiv:1706.03762 [arXiv]