See also: Paged Attention, constrained decoding.
speculative decoding
See slides
Speculative execution for LLMs is an excellent inference-time optimization.
— Andrej Karpathy (@karpathy) August 31, 2023
It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might… https://t.co/FiwTwqsfho
- not all parameters are required for generating a token
- many tokens are heavily constrained by context and carry low information density, so a cheap model can predict them
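The core observation can be checked with a quick toy-scale timing sketch (a sketch, not a benchmark: the model size, sequence lengths, and iteration count below are arbitrary, and the near-equal timings are clearest on a GPU, where small-batch decoding is memory-bandwidth-bound rather than compute-bound):

```python
# Toy timing sketch: forwarding over 1 token vs. K tokens costs about the same,
# because the dominant cost at small batch sizes is moving the weights, not math.
import time
import torch
from torch import nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=12,
)
model.eval()

@torch.no_grad()
def time_forward(seq_len: int, iters: int = 20) -> float:
    x = torch.randn(1, seq_len, 1024)  # seq_len token embeddings in one batch
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters

print(f"1 token : {time_forward(1) * 1e3:.1f} ms")
print(f"8 tokens: {time_forward(8) * 1e3:.1f} ms")  # roughly the same wall time
```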
Ideas
- Use a small, cheap "draft model" to generate K candidate tokens ⇒ feed them back to the large model in a single batched forward pass
- a sampling/verification step: the large model's one batched forward pass yields next-token probabilities at every draft position, so each draft token can be checked (accepted or rejected) in parallel; see the sketch below
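A minimal sketch of the draft-then-verify loop. `draft_model` and `target_model` are hypothetical callables mapping token ids of shape `(1, seq_len)` to logits of shape `(1, seq_len, vocab)`; the greedy acceptance rule here is a simplification of the original method, which uses rejection sampling so the output distribution exactly matches the target model's:

```python
# Sketch of greedy speculative decoding (hypothetical models; greedy acceptance
# instead of the exact rejection-sampling rule from the papers).
import torch

@torch.no_grad()
def speculative_decode(target_model, draft_model, tokens: torch.Tensor,
                       K: int = 4, max_len: int = 64) -> torch.Tensor:
    while tokens.shape[1] < max_len:
        # 1) Draft: the cheap model proposes K tokens autoregressively.
        draft = tokens
        for _ in range(K):
            logits = draft_model(draft)[:, -1, :]
            draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=1)

        # 2) Verify: ONE forward pass of the large model over all K draft
        #    tokens, which costs about the same as a single-token forward pass.
        target_next = target_model(draft).argmax(-1)  # target's pick per position

        # 3) Accept the longest prefix where draft and target agree.
        n = tokens.shape[1]
        accepted = 0
        for i in range(K):
            if draft[0, n + i] == target_next[0, n + i - 1]:
                accepted += 1
            else:
                break

        # Keep the accepted draft tokens, then append the target's own next
        # token, so every large-model pass yields at least one new token.
        tokens = draft[:, : n + accepted]
        tokens = torch.cat(
            [tokens, target_next[:, n + accepted - 1 : n + accepted]], dim=1
        )
    return tokens
```

When the draft model agrees with the target on most positions, each expensive forward pass emits several tokens instead of one, which is where the speedup comes from.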