Paper: Attention Is All You Need

See also Transformers, or distraction

Self-attention

Multi-head Attention

RingAttention

Paged Attention

paper

Used in conjunction with Continuous batching

Reduces the memory usage of the attention mechanism by allocating the KV cache in fixed-size blocks and swapping blocks in and out of memory as needed. The block manager is analogous to virtual memory management in an OS.
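A minimal sketch of what such a block manager could look like (hypothetical Python, not vLLM's actual API): physical KV-cache blocks come from a free pool, and each sequence keeps a block table mapping its logical blocks to physical ones, much like an OS page table.

```python
# Hypothetical block manager sketch (names and structure are assumptions, not vLLM's API).
class BlockManager:
    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids, in logical order

    def append_token(self, seq_id: int, seq_len: int) -> None:
        """Ensure the sequence has a physical block mapped for its newest token."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len > len(table) * self.block_size:  # last block is full (or none allocated yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; preempt or swap out a sequence")
            table.append(self.free_blocks.pop())    # map a new logical block to a physical one

    def free(self, seq_id: int) -> None:
        """Return all physical blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```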

Essentially, it’s a form of paging, so the KV cache no longer needs to be stored in contiguous memory. It partitions the KV cache of each sequence into KV blocks.

Given:

  • each block contains the key and value vectors for a fixed number of tokens, denoted as the block size $B$.
  • Key block $K_j = (k_{(j-1)B+1}, \ldots, k_{jB})$
  • Value block $V_j = (v_{(j-1)B+1}, \ldots, v_{jB})$

The attention computation then proceeds block by block:

$$A_{ij} = \frac{\exp(q_i^\top K_j / \sqrt{d})}{\sum_{t=1}^{\lceil i/B \rceil} \exp(q_i^\top K_t \mathbf{1} / \sqrt{d})}, \qquad o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j A_{ij}^\top$$

where $A_{ij} = (a_{i,(j-1)B+1}, \ldots, a_{i,jB})$ is the row vector of attention scores on the j-th KV block.
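As a rough illustration of the block-wise computation above, here is a NumPy sketch (assuming a single query vector, all blocks full, and no masking or batching; all names are made up):

```python
import numpy as np

def paged_attention(q, key_blocks, value_blocks):
    """Toy block-wise attention: q is (d,), each K_j / V_j block is (B, d)."""
    d = q.shape[-1]
    # per-block scores q^T K_j / sqrt(d), each of shape (B,)
    scores = [kb @ q / np.sqrt(d) for kb in key_blocks]
    flat = np.concatenate(scores)
    weights = np.exp(flat - flat.max())         # softmax over all cached tokens
    weights /= weights.sum()
    A = np.split(weights, len(key_blocks))      # A_ij: attention-score row vector per KV block
    # o_i = sum_j A_ij V_j  (row-vector form of V_j A_ij^T above)
    return sum(a @ vb for a, vb in zip(A, value_blocks))

# toy usage: 2 KV blocks of block size 4, head dim 8
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
Ks = [rng.standard_normal((4, 8)) for _ in range(2)]
Vs = [rng.standard_normal((4, 8)) for _ in range(2)]
print(paged_attention(q, Ks, Vs).shape)         # -> (8,)
```

The point of accumulating one block at a time is that each $K_j$, $V_j$ can live in its own (non-contiguous) physical block and be fetched via the block table.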