See also Paged Attention

constrained decoding.

speculative decoding

See slides

  • not all parameters are required to generate every token
  • many tokens have low information density (they are easy to predict)

Ideas

Uses a small, cheap “draft model” to generate K candidate tokens, which are then fed back to the large model in a single batch

  • a sampling rule uses the probability of each next token to decide whether to keep it; the large model then runs one forward pass over all the later tokens.
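
A minimal sketch of one draft-then-verify step, to make the idea above concrete. The notes do not say which sampling rule is used, so this assumes the accept/reject rule commonly described for speculative sampling: keep a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. Everything named here (`VOCAB`, `draft_probs`, `target_probs`, `speculative_step`) is hypothetical and stands in for real models; the two "models" are toy functions over a 4-token vocabulary.

```python
import random

VOCAB = 4  # toy vocabulary size (assumption, for illustration only)


def _normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def draft_probs(context):
    # Hypothetical cheap draft model: mildly prefers repeating the last token.
    last = context[-1] if context else 0
    return _normalize([2.0 if t == last else 1.0 for t in range(VOCAB)])


def target_probs(context):
    # Hypothetical large model: strongly prefers (last + 1) mod VOCAB.
    last = context[-1] if context else 0
    nxt = (last + 1) % VOCAB
    return _normalize([5.0 if t == nxt else 1.0 for t in range(VOCAB)])


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(context, k, rng):
    """Return between 1 and k + 1 new tokens for the given context."""
    # 1) The draft model proposes k tokens autoregressively.
    drafted, q_list = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_list.append(q)
        ctx.append(tok)

    # 2) The large model scores all k + 1 positions. A real system would do
    #    this in one batched forward pass; this sketch just loops.
    p_list = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]

    # 3) Accept or reject the drafted tokens from left to right.
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = p_list[i], q_list[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the positive part of (p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(sample(_normalize(residual), rng))
        return accepted

    # 4) Every draft was accepted: take one extra token from the large model.
    accepted.append(sample(p_list[k], rng))
    return accepted
```

Possible usage, calling the step repeatedly until enough tokens exist:

```python
rng = random.Random(0)
tokens = [0]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens, k=4, rng=rng))
print(tokens)
```

Under this rule each step emits at least one token, and the emitted tokens follow the large model's distribution, so the draft model only affects speed, not output quality.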